Detecting credentials in source code: choosing between open-source and commercial solutions

In modern software development we rely on hundreds, sometimes thousands, of different building blocks. The glue that connects these building blocks is collectively known as secrets: typically API keys, credentials, security certificates and URIs. These are the modern-day master keys. They can provide access to cloud infrastructure, payment systems, internal messaging and user information, to name a few. Once an attacker has a secret, they can move laterally between systems to uncover additional information and secrets, and because they are authenticated, they look like valid users, which makes the intrusion extremely difficult to detect.

But even having established how sensitive these secrets are and why they should be tightly wrapped, this next statement may surprise you:

These secrets are sprawled all over the internet, sitting in code repositories in public view.

For the proprietor of the code, these secrets are difficult to identify, but malevolent actors out to find them have developed simple and effective tools to uncover secrets deeply buried and long forgotten in git history.

There are plenty of articles, whitepapers and blog posts on the importance of protecting secrets; Hashicorp and GitGuardian, for example, have great resources on this topic. Instead, I want to focus on the different tools available for detecting secrets, as well as their pros and cons. Of course, it is up to you, the reader, to decide which tools will best protect your secrets.

Three options for secrets detection

When it comes to secrets detection, you can choose between 3 different approaches:

Building a custom solution in house
Using open-source projects
Using commercial products

Let's run through a few examples.

Building in house detection

For some of us, secret sprawl is a perfect problem to unpack. I would be lying if I said I hadn't played around with building some fun regular expression (regex) scripts to detect sensitive strings inside code. But building a comprehensive, reliable secrets detection script is a huge task.

First, you need to decide how to detect secrets. There are two main options: using regex to detect credentials with a fixed format (like Stripe API keys, which begin with the same characters), or implementing high-entropy detection, which casts a large net but brings back a huge volume of results.

Method	Pros	Cons
Entropy: look for strings that appear random	Good for penetration testing, open sourcing a project or bug bounties because it brings a lot of results. These results must be reviewed manually.	Lots of false alerts (URLs, file paths, database IDs and other hashes frequently have high entropy), which makes it impossible to use this method alone in an automated pipeline. Some keys are inevitably missed because the entropy threshold to apply depends on the length of the key and the charset used to generate it.
Regular expressions: match known, distinct patterns	Low number of false alerts. Known patterns make it easier to later check whether the secret is valid, or whether it is an example or test key (see Step 2).	Unknown key types will be missed. Credentials without a distinct pattern will also be missed, which means lots of missed credentials! Think about passwords, which can be virtually any string in many possible contexts, or APIs that don’t have a distinct key format.
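
To make the two methods in the table concrete, here is a minimal Python sketch of each. Everything in it is an assumption made for illustration: the `sk_live_` pattern is a simplified stand-in for a fixed-prefix key format, and the token pattern, the threshold of 4.0 bits per character and the function names are hypothetical choices, not taken from any particular tool.

```python
import math
import re
from collections import Counter

# Assumed format for illustration: a fixed prefix followed by 24+ alphanumerics.
FIXED_PREFIX_KEY = re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b")

# Candidate tokens for the entropy check: long runs of base64-style characters.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_by_regex(line: str) -> list[str]:
    """Regex method: return strings matching the known, distinct pattern."""
    return FIXED_PREFIX_KEY.findall(line)


def find_by_entropy(line: str, threshold: float = 4.0) -> list[str]:
    """Entropy method: return long tokens whose entropy meets the threshold."""
    return [t for t in TOKEN.findall(line) if shannon_entropy(t) >= threshold]
```

The threshold illustrates the charset problem mentioned in the table: a hexadecimal string uses 16 symbols, so it can never exceed log2(16) = 4 bits per character, and a threshold tuned for base64 keys would silently miss it.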

When using regular expressions, you can only detect a very limited scope of secrets, which leaves you exposed. Using the high-entropy method, you cast a wider net but also need to sort through more false positives. Of course, in an ideal world you want to use both, but then you'd need to build post-validators that sift through the results to exclude likely false positives.
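
As a rough sketch of what combining both methods with post-validators could look like, the Python below runs a regex detector and an entropy detector, then drops candidates that look like file paths or placeholder values. The key pattern, the placeholder word list, the path heuristic and the threshold are all assumptions chosen for illustration; a real pipeline would need far more validators than these two.

```python
import math
import re
from collections import Counter

KNOWN_PATTERN = re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b")  # assumed key format
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")
PLACEHOLDER_WORDS = ("example", "sample", "dummy", "test", "xxxx")  # assumed list


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())


def looks_like_path(candidate: str) -> bool:
    """Post-validator: file paths are a common source of high-entropy false alerts."""
    return candidate.count("/") >= 2


def looks_like_placeholder(candidate: str) -> bool:
    """Post-validator: example and test keys usually say so in the value itself."""
    lowered = candidate.lower()
    return any(word in lowered for word in PLACEHOLDER_WORDS)


def detect(line: str, threshold: float = 4.0) -> list[tuple[str, str]]:
    """Return (detector, candidate) pairs that survive the post-validators."""
    candidates = [("regex", m) for m in KNOWN_PATTERN.findall(line)]
    seen = {c for _, c in candidates}
    for token in TOKEN.findall(line):
        # Skip tokens the regex detector already reported.
        if token not in seen and shannon_entropy(token) >= threshold:
            candidates.append(("entropy", token))
    return [
        (name, c)
        for name, c in candidates
        if not looks_like_path(c) and not looks_like_placeholder(c)
    ]
```

Note that each validator trades false positives for possible misses: a real secret that happens to contain the substring "test" would be discarded by this sketch.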

If you are building this as an experiment for your personal projects, this can be a fun and exciting challenge. But when you bring in the challenges of detection at scale, you have to consider resources, alerting and mitigation. The challenge can quickly spiral into a huge project.

It is always best to first learn from a real-life example. I would encourage anyone going down the path of building a secrets detection solution to first read about how SAP built its internal secrets detection solution.

If you are fixed on building a personal solution, I would have to advocate for beginning with one of the many open-source projects available to build upon. I know this can be less exciting than a personal challenge, but when you begin to unpack the scope of the problem, it will save you a ton of work.

Using open-source tools

Open-source tools are not just a good starting point for building your own custom detection patterns; there are also great projects available that provide immediate value with minimal setup.

Popular open-source tools

There is a huge list of open source detection tools available on GitHub. Below are a few that are both popular and well-maintained.

Tool	Description
Truffle Hog (github.com/dxa4481/truffleHog)	One of the most popular utilities for finding secrets everywhere, including branches and commit history. Truffle Hog searches using regex and entropy, and prints the results on the screen.
Git Secrets (github.com/awslabs/git-secrets)	Released by AWS Labs, and as you can guess by the name, it scans for secrets. Git Secrets helps prevent committing AWS keys by adding a pattern.
Gitrob (github.com/michenriksen/gitrob)	Gitrob makes it easy for you to analyze the findings on a web interface. It’s based on Go, so that’s a prerequisite.
Git Hound	A git plugin based on Go, Git Hound helps prevent sensitive data from being committed to a repository by checking changes against PCRE (Perl Compatible Regular Expressions) patterns. It’s available as a binary for Windows, Linux, Darwin, etc., which is useful if you don’t have Go installed.
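
Several of these tools work by matching patterns against the lines a commit adds. The Python below is a minimal sketch of that idea, not the implementation of any tool listed above: it walks a unified diff and reports added lines that match a pattern. The `scan_diff` function and the `PATTERNS` table are hypothetical names, and the AWS access key ID pattern (`AKIA` followed by 16 uppercase alphanumerics) is a simplified assumption about the key format.

```python
import re

# Assumed, simplified pattern; real tools ship many more detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scan_diff(diff_text: str) -> list[tuple[str, int, str]]:
    """Return (file, line_number, pattern_name) for each match on an added line."""
    findings = []
    current_file = None
    new_line_no = 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            # File header for the new version, e.g. "+++ b/config.py".
            path = raw[4:]
            current_file = path[2:] if path.startswith("b/") else path
        elif raw.startswith("@@"):
            # Hunk header: @@ -old_start,old_len +new_start,new_len @@
            match = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", raw)
            new_line_no = int(match.group(1)) if match else 0
        elif raw.startswith("+"):
            # Added line: check it against every pattern.
            for name, pattern in PATTERNS.items():
                if pattern.search(raw[1:]):
                    findings.append((current_file, new_line_no, name))
            new_line_no += 1
        elif raw.startswith(" "):
            # Context line: present in the new file, so it advances the counter.
            new_line_no += 1
        # Removed lines ("-") do not exist in the new file and are skipped.
    return findings
```

In a pre-commit hook the diff text could come from `git diff --cached`, and for a history scan from `git log -p --all`; in both cases a non-empty result would be the signal to block the commit or raise an alert.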

Pros and cons of open-source tools

Pros	Cons
Ability to define custom detectors	Slow performance at scale
Can be installed locally, removing the need to involve a third party	Large number of false positives can result in disruptions to the workflow
Typically supported and maintained by interested parties (Yelp for example)	Hard to enforce throughout an organization
	No alerting features and cannot be integrated into a SIEM
	Does not allow for team collaboration and incident investigation

While the detection reliability and efficiency vary between solutions, the detection systems all lack enterprise features such as alerting, audit trails and in-depth investigation.

Open-source solutions, in my opinion, are best used for bug bounty and one-off pen testing exercises where high volumes of positive results can be sorted through and evaluated. When these systems are put in place in regular production, particularly within an organization, the results can be overwhelming and extremely restrictive to the workflow. That being said, they still have some clear advantages over commercial systems in some situations.

Using commercial tools

Following many high-profile cases of secrets being discovered inside git repositories, including Uber's, many vendors have come to the party with solutions to combat the problem.

From the many conversations around secrets detection, the biggest concern is vendor trust. You are essentially allowing a third party to find and detect the most sensitive information that you or your organization own.

Many vendors, including the big players like GitGuardian, do offer an on-premises version of their products. But this usually comes with an enterprise license, which is costly for developers and smaller companies.

The idea of allowing a third party to scan for secrets inside source code can be concerning, and there are definitely some considerations to take into account. The first is that secrets inside git repositories, private and public, should already be considered compromised. Git provides the perfect platform to facilitate secret sprawl, because code is a leaky asset and git provides no audit log of who has access to it or where it has been cloned. So if secrets exist in code repositories, using a third-party application to scan for them does not really increase the risk.

Commercial vendors also have larger teams and more time dedicated to detecting secrets, which makes them more reliable at large scale. They also offer additional enterprise features such as alerting and dashboards for investigation and remediation, as well as much easier setup. All this means that the tool will fit into your workflow much better.

The best example of a comparison between the two most predominant open-source and commercial vendors can be found here.

Commercial secrets detection solutions

While there are additional vendors in the market, below are the four core competitors in the space that are the current market leaders.

Tool	Description
GitGuardian	The most predominant detection solution, with both private and public monitoring. With over 200 secret types supported, it has the largest detection capabilities on the market. GitGuardian is a developer-first company with excellent support, as well as a product that is completely free for developers and open-source organizations.
GitHub Token Scanning	GitHub offers a commercial token scanning solution that covers 25 different secrets. This is currently only available for commercial clients with an advanced security license at a cost of $80 per developer, but it does come with other security features.
GitLab secrets detection	GitLab also offers limited secrets detection capabilities with their gold ultimate license, again at a cost of $80 per developer. Unfortunately, the current detection only supports 12 common secret types and tokens, which makes its capabilities very limited.
Nightfall (nightfall.ai)	AI-powered scanner to detect API keys, secrets and sensitive information. Watchtower Radar API lets you integrate with GitHub public or private repositories, AWS, GitLab, Twilio, etc. The scan results are available on a web interface or as CLI output.

Pros and cons of commercial secrets detection

Pros	Cons
Sophisticated detection algorithms with a greater number of supported secrets	Third party access to source code
Real time detection	Closed source (with the exception of some GitGuardian products)
Alerting mechanisms built in
Credential validity checks (limited to GitGuardian)
Contextual analysis of the code to reduce false positives
Audit trail of secrets and remediation steps
Role based authentication

Wrap up

Implementing secrets detection should always be part of the threat mitigation strategy of all developers and organisations. There are many solutions available, both open-source and commercial, all with their own considerations. While commercial vendors offer more sophisticated detection, they come with the consideration of needing to provide third-party access to source code. Open-source solutions are cost-effective because they require no commercial license, but they can produce such a large number of false positives that they become prohibitive to workflow. Or you can build your own, but beware of the big task ahead of you. In the end, it comes down to what works best for you and your organisation.