When using systemd it is often wise to set mission-specific “global defaults”. In this article we will look at the case of “how often a unit can fail in a given amount of time” and its impact.
systemd experts will create their units with properties set to the correct values for those specific units, but often these properties are not obvious or are forgotten by the unit’s creators. systemd comes with defaults, as explained in the systemd-system.conf manual, but that does not mean they should not be evaluated for suitability to the system’s constraints and mission.
Let’s approach the requirement that a crashing service must not respawn infinitely fast and thus compromise the whole system. This requirement is valid for any type of computing, be it embedded, server or even desktop.
systemd gives us quite a few options to meet that requirement, but let’s focus on DefaultRestartSec, DefaultStartLimitIntervalSec and DefaultStartLimitBurst. These system-wide properties can be overridden in individual units; in unit files the same settings are written without the Default prefix. From the manual:
Configure the default unit start rate limiting, as configured per-service by StartLimitIntervalSec= and StartLimitBurst=. See systemd.service(5) for details on the per-service settings. DefaultStartLimitIntervalSec= defaults to 10s. DefaultStartLimitBurst= defaults to 5.
This means that a unit will be started at most DefaultStartLimitBurst times per DefaultStartLimitIntervalSec interval. If a unit keeps failing to start and hits this limit, it is put in the failed state and will not be started again during the defined period. Note that failed is not the same as disabled. DefaultRestartSec states how long systemd waits before restarting the service, and therefore how quickly the allowed starts can be used up inside the period defined by DefaultStartLimitIntervalSec.
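To make the three settings concrete, here is a minimal sketch of a drop-in that simply spells out the factory values discussed above. The drop-in file name is hypothetical; the [Manager] section and the option names are the ones documented in the systemd-system.conf manual.

```ini
# /etc/systemd/system.conf.d/10-start-limits.conf  (hypothetical file name)
# These values only restate the upstream defaults; nothing changes yet.
[Manager]
# Delay before a service with Restart= configured is started again
DefaultRestartSec=100ms
# Length of the window in which starts are counted
DefaultStartLimitIntervalSec=10s
# Number of starts allowed inside that window
DefaultStartLimitBurst=5
```

Keep in mind that the restart delay only comes into play for services that actually request automatic restarts through Restart= in their unit file.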
With good values one can make sure the system will never be starved of resources by an infinite crash loop. The “factory” default that comes from the systemd code is 5 starts per 10 seconds. This default by itself may be fine, but I personally find the “factory” DefaultRestartSec=100ms not very reasonable. In a worst-case scenario, this means the unit will fail 5 times in about half a second, followed by a “resting” period of 10 seconds.
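As an illustration of slower system-wide defaults, a drop-in along these lines could be used. The values and the file name are assumptions chosen for the example, not recommendations; the right numbers depend on the system’s mission.

```ini
# /etc/systemd/system.conf.d/20-slower-restarts.conf  (hypothetical file name)
# Illustrative values only: tune them to the system's constraints.
[Manager]
# Wait one second between restart attempts instead of 100ms
DefaultRestartSec=1s
# Count starts over a one-minute window
DefaultStartLimitIntervalSec=60s
# Allow at most 5 starts inside that window
DefaultStartLimitBurst=5
```

With values like these, five immediately crashing starts are spread over roughly four to five seconds instead of half a second. One thing to watch when picking numbers: the restart delay multiplied by the burst count should stay well below the interval, otherwise the restarts are spaced so far apart that the limit may never be reached and the unit keeps restarting indefinitely.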
If the unit’s start has side effects on the system, the system can become unusable. From a security point of view this can lead to a denial of service or infrastructure damage. Damage scenarios are:
- A unit start side effect taking more than 2 seconds to recover from, leaving the system in a death spiral.
- I/O exhaustion, disk space exhaustion or extra infrastructure charges.
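For a single unit whose start is known to have expensive side effects, the system-wide values can be overridden per unit, using the same settings without the Default prefix. The sketch below assumes a hypothetical example.service and made-up values; StartLimitIntervalSec= and StartLimitBurst= belong in the [Unit] section on current systemd versions, while Restart= and RestartSec= belong in [Service].

```ini
# /etc/systemd/system/example.service.d/override.conf  (hypothetical unit)
[Unit]
# Allow at most 3 starts in any 5-minute window
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
# Restart only after a failure, and give cleanup 30 seconds to finish
Restart=on-failure
RestartSec=30s
```

Here three failed starts take about a minute, which is well inside the five-minute window, so the limit is reached and the unit ends up in the failed state instead of looping.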
To sum up, system architects should look at the worst-case cleanup and handling of the side effects caused by the units they run. With this data, they can define good defaults that allow the system to keep operating in a degraded state even when an infinite crash loop exists. Degraded here means that the service is still operating or has the capacity to execute remedial measures.