Development and Production gaps in Yocto

(The original article had typos because Grammarly did not work correctly and disabled my browser’s dictionary. My apologies.)

Yocto is great for building customized Linux distributions from source code. It is a very complex tool, and it needs to build a huge variety of software projects from source. It often happens that a Linux distribution image needs to be configured differently for a developer build than for production. The production version is usually built in an automated way by a CI (continuous integration) server. This workflow gap between production and development has been a major driver of developer dissatisfaction. I am going to share my insights into the parts and drivers of this dissatisfaction, but also provide some hints on what works, from my experience.

The usual scenario triggering conflict is: “It works on my machine, but the verification step in the CI fails“. This scenario can have multiple origins:

  • Build time:
    • The developer is building with a different configuration than the CI
    • A given recipe produces different package contents every time it is built (see the buildhistory sketch after this list)
  • Run time:
    • A program has inherently non-deterministic behavior, even with the exact same binary: a garden-variety bug
    • The verification step is broken
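
For the non-deterministic package contents case, Yocto’s buildhistory class is a useful first line of defense: it records the contents and metadata of every package and image into a git repository, so a recipe that does not build reproducibly shows up as a diff between otherwise identical builds. A minimal local.conf sketch:

    # conf/local.conf (or a shared configuration fragment)
    # Record package and image contents of every build in a git repository
    INHERIT += "buildhistory"
    BUILDHISTORY_COMMIT = "1"

    # After rebuilding the same recipe, inspect what changed, for example:
    #   buildhistory-diff HEAD~1 HEAD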

Divergent configurations at build time

This situation is mostly unavoidable in the real world. The most common reasons for different configurations are listed below, with a configuration sketch after the list:

  • Debug symbols
  • Debug tools
  • Builds with debug flags enabled
  • Secrets or security features enabled only in production
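
One way to keep these differences explicit and reviewable is to isolate them in small configuration fragments that developer and CI builds pull in on top of a shared base, so a production build can never silently inherit development settings. A minimal sketch, with hypothetical file and distro names:

    # conf/base.conf - settings shared by developer and CI builds
    DISTRO = "my-distro"            # hypothetical distro name

    # conf/dev.conf - required only from a developer's local.conf
    require conf/base.conf
    EXTRA_IMAGE_FEATURES += "debug-tweaks tools-debug dbg-pkgs"

    # conf/release.conf - required only from the CI's local.conf
    require conf/base.conf
    # no debug features here; production signing and secret handling live here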

Debug symbols

The presence of debug symbols is mostly benign and rarely leads to technical issues. Even so, it can result in:

  • Intellectual property leakage. This is not relevant from the point of view of product operation, but it can have devastating legal and business impacts
  • Images that are too large. Some embedded devices do not have sufficient space, and insufficient space can manifest as boot failures that are hard to explain and very hard to debug, due to bootloader or flash physical constraints.
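
Yocto already splits debug symbols out of the runtime packages into separate -dbg packages, so the usual mitigation is to keep those packages out of the production image and, if the symbols are still wanted for post-mortem analysis, emit them as a separate companion archive instead. A local.conf sketch of the knobs involved (variable names may vary slightly between releases):

    # Production: do not add 'dbg-pkgs' to IMAGE_FEATURES, so the -dbg
    # packages stay in the package feed instead of the shipped rootfs.

    # Optionally generate a companion debug filesystem next to the image,
    # archiving the symbols for crash analysis without shipping them:
    IMAGE_GEN_DEBUGFS = "1"
    IMAGE_FSTYPES_DEBUGFS = "tar.bz2"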

Debug tools

Images with debug tools are also pretty common. Unless the debug tools are part of a critical function, they very rarely lead to verification issues. They deserve consideration, though, as they can affect key performance indicators (KPIs) of development, deployment, and physical device unit cost.

The main issues associated with including debug tools (or not) can be summarized as follows; a sketch of a separate development image follows the list:

  • Excessively long build times. The more tools you put into your image, the more recipes and associated packages need to be built and installed. Iteration speed can slow to a crawl
  • Very difficult real-world diagnostics due to a lack of tools/APIs. This is especially true for embedded devices, due to their single-purpose nature enforced by cryptographically verified read-only flash. It means a developer needs to regenerate a complete image containing the required diagnostic tools
  • Increased security risks due to a bigger system tampering surface
  • Increased costs due to redundant storage. At volume, extra storage can lead to massive expenses
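
A pattern that works well in practice is to define a development image recipe that extends the production image, so the extra tooling is purely additive and the production image stays minimal and auditable. A minimal sketch, with hypothetical image names (the extra packages must be provided by your layers):

    # my-image-dev.bb - development flavor of the production image
    require my-image.bb

    # Interactive debugging aids that must never reach production
    IMAGE_FEATURES += "tools-debug ssh-server-openssh"
    IMAGE_INSTALL += "strace"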

Builds with debug flags enabled

Debug build flags can significantly impact system and application behavior, resulting in side effects like the stereotypical “but it works on my machine” (a local.conf sketch follows the list):

  • Increased logging and diagnostics tend to change concurrent execution dynamics, often masking issues in concurrent code that only become visible in production
  • Release functionality depending on diagnostic flags, which in turn introduces unvetted and untested states. Such unvetted states can disable or bypass security features
  • Different optimization levels exposing or hiding latent bugs, resulting in a mismatch between verification and developer builds
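
In Yocto terms, the main global switch for this is DEBUG_BUILD, which replaces the normal optimization flags with debug-friendly ones; knowing exactly which builds have it set explains a surprising number of “works on my machine” differences. A local.conf sketch:

    # conf/local.conf (developer builds only)
    # Compile with the debug optimization flags instead of the normal
    # full-optimization flags; this applies globally to the build.
    DEBUG_BUILD = "1"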

This situation is the most damaging from the organizational point of view. In the end, software is made by people, and the feeling that testing is unreliable may lead to the following negative outcomes:

  • QA vs. development vs. customer support blame games, leading to increased bug-fix turnaround times
  • Decreased morale in the developer team, due to the belief that the system is rigged against their productivity
  • Decreased confidence that the QA team is doing its job correctly and that quality checks are meaningful

Secrets only known in production

Generally, the mismatch of secrets between developer and release builds does not cause issues, although I had the misfortune of a situation where it impacted me. In development, it is normal for security systems to be disabled. Unfortunately, in production, the addition of an actual key to some bootloader security internals led to a buffer overflow that was, very luckily, detected in time. My changes were in a completely unrelated part, so I did not even know I would trigger it.

Divergent behavior at run time

The following is not necessarily Yocto-specific and is valid for software engineering in general.

Normal bugs visible only at release

Often it is hard to know whether a given behavior seen for the first time in a production release is due to the different configuration used for release, or whether it was just bad luck that it became evident in the release. These are the lessons I learned:

  • A bug’s manifestation is not the end. Fixing the bug is.
  • The identification of the bug often starts with analyzing the manifestation, which can be made difficult by not knowing what triggers the bug. This is where the development/release mismatch adds another layer of uncertainty and effort.

Incorrect verification/test steps

This one highlights the old “who tests the tests?” question. In my experience, no one does, unless:

  • They fail to catch a bug
  • They fail when they should succeed
  • They fail in the release verification flow and succeed in the developer’s flow (or the other way around)

I am only going to write about the last point, where a flow mismatch leads to different test and verification outcomes.

There should be a single command that does all the setup, building, and testing, and that owns the solutions to all mismatches between developer and release/CI (a Yocto-flavored sketch follows the list below). There are two reasons for this:

  • A single command means there is a technically enforced agreement and connection between release owners and developers.
  • There is only one way to run verification.
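
In a Yocto build, one way to approach such a single command is to drive on-target tests through the stock testimage machinery, so “build and verify” is the same bitbake invocation for a developer and for the CI. A minimal sketch, assuming a QEMU-bootable machine or a configured hardware target; the image and test suite names are placeholders:

    # shared configuration fragment used by both developers and CI
    INHERIT += "testimage"
    TEST_SUITES = "ping ssh"

    # the single build-and-verify command, identical everywhere:
    #   bitbake my-image -c testimage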

In the real world, the whole verification flow is often too time-consuming for developers, but this fact already provides insights and avenues for improvement:

  • The single verification command logic could be smart enough to detect which tests are required. For example, there is no need to build any software if only the testing repository changed
  • Logical steps of the whole verification flow should be visible on the CI dashboard as well as callable locally.
  • A tool where the whole organization can give feedback:
    • The release team giving feedback on verification gaps or reliability
    • The developer team giving feedback on improved modularization

Final notes

This post became quite a bit bigger than initially intended. It also turned out to be a much broader topic than I wanted to write about, so if something is confusing or you have suggestions, let me know, as I would like to improve it.

Also, my experience is mostly in embedded devices, but I have a feeling that most of this is relevant to other areas of software engineering. It was not easy to write, as it collects many disconnected experiences from my career and projects.
