Scanning and removing secrets from git repositories

Recently I had a situation where I wanted to turn a private git repository, public. Immediately the problem of secrets in the repository’s history was raised. I know that secrets should not be committed but, due to practical reasons it often happens. This means that before publishing a repository its secrets need to be sanitized from all commits.

Finding secrets in a repository with thousands of commits is unpractical if done by hand. To my pleasant surprise, there is an automated tool called GitGuardian which offers the service of scanning repositories for secrets. I found it successfully discovered API keys and PEM keys in a variety of files. I was really amazed, especially as some of the keys were in obscure formats. Even inside Javascript it successfully found API keys.

An example of the dashboard for the scan results of a repository

There are some false positives but the interface is really friendly and one can easily mark some issues to be ignored.

The next step is to either revoke the keys or expunge all secrets from history. Expunging the keys requires the rewriting of history and is potentially destructive but that is the point. To achieve that, I found a StackOverflow post that suggested the use of a tool called git-filter-repo.

To install the tool just run:

# Make sure that user specific binaries are in your PATH
# In my case the following command put git-filter-repo in
# $HOME/.local/bin/
# This means you might need to expand your PATH like
# export PATH=$HOME/.local/bin:PATH
python3 -m pip install --user git-filter-repo

Due to the history modifying purpose of git-filter-repo I advise that the operation be done on a fresh clone set up specifically for the purpose.

git-filter-repo takes a very simple file format even with glob support. I leave an example for glob and non-glob:

glob:Key: "*"==>Key: "@@PLACEHOLDER@@"
9yli7hjr8y0swt1==>XXXX

The first line will replace any occurrence of the string Key: “*” with Key: “@@PLACEHOLDER@@” where the * can be any text.

The second line just turns any literal entry 9yli7hjr8y0swt1 into XXXX

As you can surmise the ==> is the delimiting string.

Finally, keep in mind the following aspects:

  • GitGuardian is not infallible. If letting a secret slip is something you cannot afford then:
    • Do not publish anything at all.
    • Squash all your history into one commit. This will help for manual inspection.
  • You are rewriting history. If you make a mistake in the replace you may permanently damage your repository for your purposes.

Forcing gerrit to regenerate a new review

Often in my work I have been asked how to detach a commit from an existing review in gerrit. The usual use-case is that somebody picked up some commit and then re-worked it significantly, making the original review, and review owner not applicable for follow-up.

Most regular git users will be confused because a git commit –amend will still end on the old review. The reason is that gerrit tracks changes not by commit hash but by metadata text called Change-Id. Think of it. If it did not have another mechanism besides the commit hash it would not know how to track a patchset review across rebases or amends

commit 71fbee6a7a41cbf8f59444fc48294e51c7cf9613 (HEAD)
Author: Paulo Neves <johny@gmail.com>
Date:   Mon Sep 6 14:07:14 2021 +0200

    My reworked stuff
    
    Change-Id: Ifdbfff8fbd8ee217b07b7b053c8927ae2a1126f0

Most people just manually change a random character in the Change-Id and this will lead to a unique Change-Id. Personally the way i do, given I often have hooks that automatically add the Change-Id, is to just edit the commit message to delete the line completely. A post-commit hook will just re-generate a new Change-Id line in the commit message and when i push it to gerrit a fresh review will be generated.