Journal get last execution’s log

Today I needed the correct journalctl incantation to get the logs exclusively from the last execution of a systemd service. It seems that Google is not what it used to be: it did not infer from my Google profile that I was likely looking for technical computer information, not “Botched Executions” and the death penalty. Oh well, here is a screenshot of the results 🙂

PS: The actual solution is quite a bit more complicated than I wanted. It seems one needs to get the last InvocationID and then pass it to journalctl. So much hassle.

journalctl _SYSTEMD_INVOCATION_ID=$(systemctl show -p InvocationID --value my.service)
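If you need this more than once, a small shell wrapper may save some typing. This is only a sketch: last_run_logs is a name I made up, not a systemd command, and the empty-ID check is my own addition for units that have no InvocationID (for example because they never ran).

```shell
# last_run_logs is a hypothetical helper name, not part of systemd.
# Usage: last_run_logs my.service [extra journalctl options...]
last_run_logs() {
    if [ "$#" -lt 1 ]; then
        echo "usage: last_run_logs UNIT [journalctl options...]" >&2
        return 2
    fi
    unit="$1"
    shift
    # Ask systemd for the ID of the unit's most recent invocation.
    id="$(systemctl show -p InvocationID --value "$unit")"
    # The property can be empty, for example if the unit never ran.
    if [ -z "$id" ]; then
        echo "no InvocationID known for $unit" >&2
        return 1
    fi
    # Match only journal entries tagged with that invocation.
    journalctl "_SYSTEMD_INVOCATION_ID=$id" "$@"
}
```

Called as last_run_logs my.service --no-pager, it runs the same query as the one-liner above and passes any extra options through to journalctl.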

Infinite loops on crashing systemd units.

When using systemd it is often wise to set mission-specific “global defaults”. In this article we will look at the case of “how often a unit can fail in a given amount of time” and its impact.

systemd experts will create their units with properties set to the correct values for their specific units, but often these properties are not obvious or are forgotten by the unit’s creators. systemd comes with defaults, as explained in the systemd-system.conf manual, but that does not mean they should not be evaluated for suitability to the system’s constraints and mission.

Let’s approach the requirement that a crashing service must not respawn infinitely fast and thereby compromise the whole system. This requirement is valid for any type of computing, be it embedded, server or even desktop.

systemd gives us quite a few options to achieve that requirement, but let’s focus on DefaultRestartSec, DefaultStartLimitIntervalSec and DefaultStartLimitBurst. These system-wide properties can be overridden in individual units; in the unit files the Default prefix is simply not used (for example RestartSec instead of DefaultRestartSec). From the manual:

DefaultStartLimitIntervalSec=, DefaultStartLimitBurst=

Configure the default unit start rate limiting, as configured per-service by StartLimitIntervalSec= and StartLimitBurst=. See systemd.service(5) for details on the per-service settings. DefaultStartLimitIntervalSec= defaults to 10s. DefaultStartLimitBurst= defaults to 5.

This means that a unit will be started at most DefaultStartLimitBurst times per DefaultStartLimitIntervalSec seconds. If a unit keeps failing and hits that limit, it is set as failed and is not restarted automatically anymore. Failed is not disabled: the unit can still be started again, manually or by a timer or socket, once the interval has passed. DefaultRestartSec states how long systemd waits before restarting the service inside the period defined by DefaultStartLimitIntervalSec.

With good values one can make sure the system will never be starved of resources due to an infinite crash loop. The “factory” default that comes from the systemd code is 5 (times) / 10 (seconds). This default by itself may be fine, but I personally find the “factory” DefaultRestartSec=100ms not very reasonable. In a worst-case scenario, this means the unit will fail 5 times in half a second, with a “resting” period of 10 seconds.

Worst-case scenario timeline for the systemd restart factory defaults. Each letter F is a failure that takes 0.1s. Every 10 seconds (DefaultStartLimitIntervalSec) the failure pattern illustrated can repeat itself.
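The numbers behind that timeline can be checked with a few lines of shell arithmetic. This is just a back-of-the-envelope sketch that assumes, like the timeline, that every failed start costs about 100 ms in total; the variable names are mine.

```shell
# Factory defaults as quoted in the text above.
burst=5            # DefaultStartLimitBurst
restart_ms=100     # DefaultRestartSec, in milliseconds
interval_s=10      # DefaultStartLimitIntervalSec, in seconds

# Time needed to use up the whole burst if every start fails right away.
burst_ms=$((burst * restart_ms))
# Average spacing between starts that the limit allows over one interval.
avg_ms=$((interval_s * 1000 / burst))

echo "burst exhausted after about ${burst_ms} ms"
echo "average allowed rate: one start every ${avg_ms} ms"
```

So the whole burst is gone in roughly half a second, while averaged over the interval the limit only permits one start every 2 seconds.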

If the unit’s start has side effects on the system, the system can become unusable. From a security point of view this can lead to a denial of service or infrastructure damage. Damage scenarios are:

  • A unit start side effect taking more than 2 seconds to recover from, leaving the system in a death spiral.
  • IO exhaustion, disk space exhaustion or extra infrastructure charges

To sum up, system architects should have a look at the worst-case side effect cleanup/handling of the units running. With this data, define good defaults that allow the system to continue operating degraded even when an infinite crash loop exists. Degraded here means that the service is still operating or has capacity to execute remedying measures.
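As a starting point, the system-wide defaults can be set in a drop-in for system.conf. The sketch below only writes the file into a local directory so it can be tried without root; on a real system it would typically go into /etc/systemd/system.conf.d/ and take effect once systemd re-reads its configuration (for example after a reboot). The file name and all three values are illustrative assumptions, not recommendations; pick them from your own worst-case cleanup times.

```shell
# Target directory is an assumption for this sketch; the real location
# would typically be /etc/systemd/system.conf.d (writing there needs root).
conf_dir="${CONF_DIR:-./system.conf.d}"
mkdir -p "$conf_dir"

# Illustrative values only: wait 2 s between restarts and allow
# at most 5 starts per 60 s window.
cat > "$conf_dir/10-restart-limits.conf" <<'EOF'
[Manager]
DefaultRestartSec=2s
DefaultStartLimitIntervalSec=60s
DefaultStartLimitBurst=5
EOF
```

Individual units can still override these with RestartSec, StartLimitIntervalSec and StartLimitBurst in their own unit files.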

systemd falls back to Google NTP servers. Pay attention!

As the title suggests, Google NTP servers are compiled into systemd by default. For common user desktops and even some servers this is harmless. For embedded or critical computing networks this is a little-known phone-home mechanism.

I put “pay attention” in the title and decided to write about this topic because, more than once in my career, customers ran security assessments and found devices that had no business connecting to the internet trying to reach Google servers.

There are several factors that can lead to the phone-Google scenario:

  • By default, systemd’s build system has an ntp-servers option that points to Google NTP servers. This means systemd will have Google servers hard-coded as a fallback. Most people do not know that Google is hard-coded into the binaries. After all, how many people know meson and inspect the many options of systemd manually?
  • Most DHCP leases do not offer NTP servers, so systemd tries to use any NTP server it knows. Often this means the hard-coded one. In my opinion this is the most common reason the fallback is triggered.
  • Also, running networkctl status -a will not display any NTP server information.
  • Most people do not configure timesyncd services explicitly, and likely many people do not know that NTP servers are relevant to their machines.
  • timedatectl status -a states that the NTP service is active but does not display what NTP servers were used.

With all that said, if you want to check which NTP fallback servers are currently configured, you need to run:

$ timedatectl show-timesync
PollIntervalMaxUSec=34min 8s
PollIntervalUSec=34min 8s
NTPMessage={ Leap=0, Version=4, Mode=4, Stratum=2, Precision=-24, RootDelay=46.966ms, RootDispersion=22.445ms, Reference=84A36001, OriginateTimestamp=Thu 2021-07-29 15:36:17 CEST, ReceiveTimestamp=Thu 2021-07-29 15:36:17 CEST, TransmitTimestamp=Thu 2021-07-29 15:36:17 CEST, DestinationTimestamp=Thu 2021-07-29 15:36:17 CEST, Ignored=no PacketCount=100, Jitter=9.101ms }
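The output is rather noisy, so a small filter that keeps only the lines naming NTP servers may help. This is a sketch: ntp_server_lines is a made-up helper name, and the property names in the pattern are the ones I would expect in show-timesync output; adjust them if your systemd version prints different ones.

```shell
# ntp_server_lines is a hypothetical helper name.
# Reads "timedatectl show-timesync" output on stdin and keeps only the
# properties that name NTP servers (property list is an assumption).
ntp_server_lines() {
    grep -E '^(SystemNTPServers|LinkNTPServers|FallbackNTPServers|ServerName|ServerAddress)='
}
```

Use it as timedatectl show-timesync | ntp_server_lines. If nothing matches, grep prints nothing and returns a non-zero status.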

As you can see above, the Ubuntu distribution is careful to change the default. Good on Canonical.
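If your distribution or build does not do this for you, the servers can be set explicitly in a timesyncd.conf drop-in. The sketch below writes the file into a local directory so it can be tried without root; on a real system it would typically live in /etc/systemd/timesyncd.conf.d/, followed by a restart of systemd-timesyncd. The host names are placeholders I invented and must be replaced with servers you actually operate.

```shell
# Target directory is an assumption for this sketch; the real location
# would typically be /etc/systemd/timesyncd.conf.d (needs root).
ntp_conf_dir="${TIMESYNCD_CONF_DIR:-./timesyncd.conf.d}"
mkdir -p "$ntp_conf_dir"

cat > "$ntp_conf_dir/10-local-ntp.conf" <<'EOF'
[Time]
# Placeholder host names; replace with your own NTP servers.
NTP=ntp1.example.internal ntp2.example.internal
# Setting this overrides the compiled-in fallback list.
FallbackNTP=ntp1.example.internal
EOF
```

Afterwards, timedatectl show-timesync should list the configured servers instead of the compiled-in ones.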