Introducing Radar API: Detect Credentials & Secrets in Code via Machine Learning

Problem: Leaking Sensitive Credentials

In 2016, hackers gained access to Uber’s private code repositories and used hard-coded credentials to exfiltrate the records of 57 million riders and drivers from an AWS S3 bucket. As a result of this breach, and its subsequent cover-up, Uber was fined $148 million. Could they have prevented such an incident? If this can happen to Uber, it can happen to any other company.

Source code hosting providers such as GitHub and GitLab have made software development social, making it easier to collaborate on projects & ship code quickly. But this hasn’t been without consequences: it has led to an increase in the accidental publication of sensitive information such as SSH keys, API keys, and passwords. After analyzing millions of commits, our research team found that this problem is widespread and occurs daily as new commits are pushed as part of the software development lifecycle. Abuse of these leaked credentials creates serious security & compliance risks, such as the catastrophic loss of sensitive customer data, which harms an organization both financially and reputationally while putting consumers at risk.

When asked about the Uber leak, GitHub had the following comment:

Our recommendation is to never store access tokens, passwords, or other authentication or encryption keys in the code. If the developer must include them in the code, we recommend they implement additional operational safeguards to prevent unauthorized access or misuse.

Even AWS, the largest cloud provider by market share, has been financially impacted by leaked AWS keys and was motivated to create Git-secrets to help combat this problem. However, they acknowledge the difficulty of this problem, and even they cannot guarantee fool-proof detection of their own AWS keys. Many organizations have become hyper-aware of the potential risks of credential leakage, so related open source tools are gaining popularity on platforms like GitHub despite their low accuracy rates.

Our team has been evaluating existing tools, and we've built a solution that leverages machine learning to dramatically improve the chances of accurately detecting credentials in your source code. Read on to learn more.

Evaluating Popular Detection Tools

We evaluated several popular tools used today to protect and secure GitHub repositories:

  • truffleHog, 3k+ stars on GitHub - Searches through git repositories for high entropy strings and secrets, digging deep into commit history

  • Gitrob, 3k+ stars on GitHub - Reconnaissance tool for GitHub organizations

  • Gitleaks, 4k+ stars on GitHub - Audit git repos for secrets

  • Git-secrets, 5k+ stars on GitHub - Prevents you from committing secrets and credentials into git repositories

It’s worth noting that we focused on the most popular open source projects - there are other projects like Yelp’s detect-secrets and Auth0’s repo-supervisor that serve a similar purpose, though we didn’t include them in this analysis to avoid redundancy. We’ll provide a brief introduction of each tool, an analysis of their pros & cons, followed by a more detailed comparison with examples. If you’re already familiar with common tools and are curious about our approach, feel free to skip ahead to the section titled Our Approach: Radar below.

The two main algorithms used in these tools are entropy and regular expressions:

  • Entropy: Shannon entropy, or information entropy, is the average rate at which information is produced by a stochastic source of data. A high entropy score is the result of high variability of information. For example, a string made up of one repeated character has zero entropy, since there is only one state and every next character is fully predictable. On the other hand, a string with a diverse set of characters that appears highly random, such as an API key, requires more bits per character to transmit and therefore has much higher entropy. See here for more info.

  • Regular Expressions (Regex): A regular expression, or regex, is a sequence of characters that defines a search pattern. In these tools, regexes perform lexical searches that match a variable name or an API key pattern. See here for more info.
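To make the entropy definition concrete, here is a minimal Python sketch of per-character Shannon entropy. The function name `shannon_entropy` is ours, chosen for illustration; it is not taken from any of the tools discussed.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    # Sum -p * log2(p) over the relative frequency p of each distinct character.
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


# One repeated character is fully predictable: zero bits per character.
print(shannon_entropy("a" * 32))
# Sixteen distinct characters used once each: log2(16) = 4 bits per character.
print(shannon_entropy("0123456789abcdef"))
```

A real API key usually falls between these two extremes and close to the upper bound for its alphabet, which is why entropy is used as a signal for randomly generated secrets.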

truffleHog

truffleHog is an open-source tool written in Python that searches through git commit history for credentials/secrets that are accidentally committed.

Pros:

  1. truffleHog scans the whole history of branches and commits, so secrets that were committed and later removed are still examined.

  2. truffleHog allows for both regex based and high entropy based flagging.

  3. Users can provide custom regexes to suit their needs accordingly.

Cons:

  1. High entropy is a commonly used method to detect randomly generated API keys in many tools. However, because of the lack of contextual awareness, this method tends to be noisy. For example, it can be hard to distinguish between long variable names and credentials, e.g. “TestScrapeLoopRunReportsTargetDownOnInvalidUTF8” which is a high-entropy string and would likely get flagged as a false positive.  

  2. Regular expressions are a powerful but limited method that searches for generic patterns. Thus, they only work well on finding keys with explicitly defined & repeatable patterns, e.g. starting with some fixed characters, or very lengthy keys. Requiring these unique characteristics dramatically reduces the pool of potential tokens that can be flagged accurately.
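The first limitation can be reproduced with a small sketch of entropy-based flagging. This is an illustration, not truffleHog's implementation: the token pattern, the 20-character minimum, the 4.0-bit threshold, and the sample key are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Assumed for illustration: candidate tokens are runs of 20+ key-like characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")
# Assumed threshold in bits per character; real tools tune their own values.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character."""
    total = len(text)
    return sum(
        -(n / total) * math.log2(n / total) for n in Counter(text).values()
    )


def flag_high_entropy(lines):
    """Return every long token whose entropy exceeds the threshold."""
    flagged = []
    for line in lines:
        for token in TOKEN_PATTERN.findall(line):
            if shannon_entropy(token) > ENTROPY_THRESHOLD:
                flagged.append(token)
    return flagged


source = [
    'api_key = "q8ZxT3vLm1Nc7RbYp5KsWd2HfGj9Ua4E"',  # made-up key
    "func TestScrapeLoopRunReportsTargetDownOnInvalidUTF8(t *testing.T) {",
    'greeting = "hello world"',
]

# Both the made-up key and the harmless test name are reported.
for token in flag_high_entropy(source):
    print(token)
```

Because the check looks only at the characters inside each token, it has no way to tell that the second hit is a function name rather than a credential.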

Gitrob

Gitrob is an open-source tool written in Go that helps find potentially sensitive files. Unlike the other tools listed, it detects a broader range of sensitive objects than just API keys.

Pros:

  1. Similar to truffleHog, it drills deep into the commit history of a repository, and the user can adjust how far back in the commit history to scan.

  2. The UI is very friendly for users to manipulate and analyze results.

  3. It also provides a regex for finding generic keys, e.g. /([a-f0-9\-\$\/])/gmi.

Cons:

  1. The search mechanisms are fairly simple - mainly keyword searching, which yields many false positives. Users have to manually go through all detected files and tokens to check if they contain valid sensitive info or not, which can be very time-consuming.

  2. The regex for API keys places no cap on key length, which can lead to a high number of false positives, and its character set is not comprehensive enough to capture keys with more diverse characters.

Gitleaks

Gitleaks is an open-source tool written in Go that provides a way to find unwanted data types in code checked into git. Loosely speaking, it's a Go version of truffleHog, although algorithm-wise, there are unique aspects.

Pros:

  1. It combines a regex list similar to truffleHog and a high entropy method to do credential detection.

  2. It offers the user options to adjust their regex list and entropy range.

Cons:

Similar to truffleHog.

Git-secrets

Git-secrets is a tool written in bash to prevent users from committing AWS API keys.

Pros:

  1. The tool’s main methods only occupy one file, so the code is easy to understand, use, and extend.

Cons:

  1. The searching domain is limited to several regexes for AWS keys.

  2. It risks missing real AWS API keys if the user’s variable name does not match Git-secrets’ regexes, which is a common problem for all regex based searching methods. In other words, Git-secrets’ detection requires the AWS secret key to have ‘key’ in the variable name, so changing the variable name to ‘aws_secret’ or ‘secret_token’ would not trigger the regex, leading to a false negative.

While all the tools mentioned have their respective differentiators, they share common limitations in their search mechanisms that severely limit their accuracy. On the one hand, searching for keys using regexes can provide high signal for distinctive API key patterns such as Google (AIza*) or Slack (xoxp-*) keys. However, other patterns, such as MongoDB’s, are indistinguishable from a universally unique identifier (UUID). The regex results can also be quite noisy in that they match a lot of hashes/SHAs as well. On the other hand, Shannon entropy is a more comprehensive search method, but due to the sheer quantity of its output, it yields a large number of false positives, making it untenable to use at scale.

Our Approach: Radar

Since most approaches in this domain have mainly been a mixture of regexes and Shannon entropy, which each have their respective shortcomings, our team sought to develop a novel method leveraging deep learning to overcome the limitations of these methods.

Here we’ll introduce our deep learning based approach, called Radar, trained on features extracted from a broad set of API key patterns and their surrounding context in code. Utilizing contextual information is not a novel idea - even regex based methods attempt this by looking backward for high-value words. However, our model approaches this problem in a more comprehensive way. Other solutions are based upon detection techniques that leverage heuristics around variable naming conventions, but this approach is rigid and brittle.

For example, consider the generic API key regex in truffleHog:

/[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].*['|\"][0-9a-zA-Z]{32,45}['|\"]/

Now, consider the following sample code:

api_key = '12345678901234567890123456789012'
api_token = '12345678901234567890123456789012'

In this code example, when applying truffleHog’s regex, the API key in the first line will be captured, while the second will not. This demonstrates that regex searching is limited to the lexical format of the code. Because variable naming conventions differ from developer to developer, a regex can easily fail to match the variable names that actually appear in a codebase. Abiding by naming conventions to comply with a regex’s rules would not be too difficult, but as a prerequisite, developers would need to read the tool’s source code to identify all the situations in which it works well.
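The behavior described above can be checked with Python's `re` module. This sketch assumes the pattern carries the 32 to 45 character quantifier mentioned in the comparison section, and it writes the sample assignments with straight quotes so that they are valid code.

```python
import re

# Generic API key pattern quoted above, with the assumed {32,45} quantifier.
GENERIC_API_KEY = re.compile(
    r"""[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].*['|\"][0-9a-zA-Z]{32,45}['|\"]"""
)

sample = [
    "api_key = '12345678901234567890123456789012'",
    "api_token = '12345678901234567890123456789012'",
]

# The token is identical on both lines; only the variable name differs.
for line in sample:
    verdict = "flagged" if GENERIC_API_KEY.search(line) else "missed"
    print(f"{verdict}: {line}")
```

The second line is missed purely because `api_token` does not contain the literal sequence `api_key` or `apikey`, even though the value assigned is the same.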

We don’t believe tools should dictate and constrain how developers work - instead the optimal tool should fit their existing workflow. The way that we deal with the problem of naming variation is that we don’t require situations to be lexically similar - they only need to be semantically similar. In other words, it shouldn’t matter if the variable name is access_token or access_key since they have the same meaning.

A deeper technical dive of our model will be done in a subsequent post. If you’re eager to try out Radar on GitHub repos, feel free to jump ahead to Announcing the Radar API below.

Comparing Radar to Popular Tools

In this section, we evaluate the performance of our model against the tools mentioned above by scanning a sample code repository that mimics the potential occurrence of API keys in the real world. This was done to confirm our understanding of the algorithm in each tool, and to present visual examples of their advantages and limitations.

We copied over ten real-world examples that we collected during the scanning process and also added nine ambiguous examples, which represent different misleading situations that can yield false positive detections.

We highlighted the results found from each detector. True positives are highlighted in green. False positives are highlighted in red. We evaluated precision, recall and F1 score to quantify & compare model performance.

For reference, precision is the fraction of correctly predicted positive instances among all predicted positive instances. Recall is the fraction of correctly predicted positive instances among all actual positive instances. F1 score is the harmonic mean of precision and recall, giving a single overall measure of a model’s accuracy.
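As a quick reference, the three metrics can be computed from raw counts of true positives, false positives, and false negatives. The helper below is a minimal sketch; its name and signature are ours, and the example counts correspond to a detector that reports 3 keys, all correct, out of 10 real keys.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: 3 detections, all correct, out of 10 real keys (7 missed).
p, r, f1 = precision_recall_f1(tp=3, fp=0, fn=7)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note how perfect precision does not rescue the F1 score when recall is low: the harmonic mean is pulled toward the weaker of the two values.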

truffleHog

First, we scanned the repo with truffleHog’s regex method.

Command: truffleHog https://github.com/watchtowerdlp/demo.git --entropy=False --regex

Precision: 3/3, Recall: 3/10.
F1: 46%

As mentioned, truffleHog has two regex patterns for capturing generic, application-agnostic API keys. These patterns capture an alphanumeric token between 32 and 45 characters long whose variable name contains “apikey” or “secret”.

We next scanned the repo with truffleHog’s Entropy method.

Command: truffleHog https://github.com/watchtowerdlp/demo.git --entropy=True

Precision: 10/17, Recall: 10/10.
F1: 74%

truffleHog’s high entropy method doesn’t check variable names. It only calculates the Shannon entropy of each token and is therefore quite noisy. Given the low ratio of real API keys to false positives in real-world scenarios (<1:100), the number of results from this algorithm can be overwhelming to review.

Gitleaks

Command: gitleaks --repo-path='./' --verbose

Precision: 1/1, Recall: 1/10.
F1: 18%

Gitleaks is similar to truffleHog in that it mainly uses regexes to detect credentials; however, its regexes are slightly non-traditional. For example, Gitleaks only captures api_key_github, which is an uncommon variable naming convention. The name github_api_key is the more common variable name people would assign to a GitHub API key. Nonetheless, almost all available tools support adding custom regexes to their regex list, so this will not be a major issue if users adjust or add regexes to meet their needs.

Gitrob

Precision: 8/14, Recall: 8/10.
F1: 67%

It’s hard to directly compare Gitrob to other tools because it effectively captures all strings longer than 20 characters, within the character set [a-fA-F0-9], and requires users to manually go through the results to check if they are valid. Gitrob provides a nice UI so that the user can easily go through each file.

Watchtower Radar

Precision: 9/10, Recall: 9/10.
F1: 90%

Because Radar incorporates more comprehensive features as noted above, the false negative rate and false positive rate are both quite low, enhancing the user experience when verifying results.

Announcing the Radar API

It’s important to note that this test repository was created internally, and results will vary across repositories that are scanned. There are trade-offs to any of the approaches above, each with their own merits.

As such, we have released Radar as an API that you can use to scan GitHub repositories for sensitive credentials. The service is free to use for the first 5 scans, so please feel free to try it here - radar.watchtower.ai - by logging in with GitHub. Note that the service scans both public and private repos, and does not store or track sensitive findings.

Radar scans the full commit history for repos that have 1000 unique commits or fewer. For larger repos, Radar scans the current working directory (i.e. the latest version of the files in the default branch). To scan the full commit history of repos larger than 1000 commits, or to increase your scan limit, please contact us via email at support@watchtower.ai. You can also schedule a demo via our website at www.watchtower.ai.

Start a Scan

For example, after logging in, you can start a new scan of a public GitHub repo via the command line:

curl https://radar.watchtower.ai/api/v1/scans/new \
-u API_KEY: \
-d 'public_github_url=https://github.com/watchtowerdlp/sample'
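The same request can be prepared from Python using only the standard library. This is a sketch of what the curl command above expresses, not an official client: the function name `build_scan_request` is hypothetical, and the request is only constructed here, not sent.

```python
import base64
import urllib.parse
import urllib.request

SCAN_URL = "https://radar.watchtower.ai/api/v1/scans/new"


def build_scan_request(api_key: str, repo_url: str) -> urllib.request.Request:
    """Build the POST request that the curl command above describes."""
    # curl -d sends a form body and switches the method to POST.
    # urlencode percent-encodes the URL value, which a form parser decodes back.
    body = urllib.parse.urlencode({"public_github_url": repo_url}).encode()
    # curl -u "API_KEY:" is HTTP Basic auth: the key as user, empty password.
    credentials = base64.b64encode(f"{api_key}:".encode()).decode()
    request = urllib.request.Request(SCAN_URL, data=body, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    return request


request = build_scan_request("API_KEY", "https://github.com/watchtowerdlp/sample")
print(request.get_method(), request.full_url)
# To actually submit the scan: urllib.request.urlopen(request)
```

Replace the placeholder `API_KEY` with your own key before submitting the request.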

This will run the scan asynchronously, and you’ll be notified when the results are ready to view:

{  
    "id": "0b6b08cf-f1ff-436b-a69f-7a1cb0d06e44",  
    "url": "https://github.com/watchtowerdlp/sample",  
    "duration": "0.822404097",  
    "created_at": "2019-04-28 22:51:10 UTC",  
    "scanned_files": 24,  
    "status_code": 200,  
    "results_count": 1,  
    "results": [    
        {      
            "result_id": "db2d2f85-8c06-41ef-85bf-2ea2dc786ca3",
            "repo_path": "sample",      
            "file_path": "sample.rb",      
            "branch": "origin/master",      
            "commit_hash": "3a07b08d2461b6906376081d1f13b303215bf55d",     
            "author_email": "49463194+watchtowerdlp@users.noreply.github.com",      
            "context": "a4300836696c47b4f2d7c'\n+\n+auth_token = '",
            "token": "db▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪",
            "token_length": 32,
            "permalink": "https://github.com/watchtowerdlp/sample/blob/3a07b08d2461b6906376081d1f13b303215bf55d/sample.rb#L7",
            "created_at": "2019-04-28 22:51:13 UTC"
         }  
     ]
}
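Once the results arrive, they can be processed like any other JSON payload. The sketch below uses a trimmed copy of the sample response above and only the field names that appear in it; the helper `summarize_findings` is hypothetical and not part of the Radar API.

```python
import json

# Trimmed copy of the sample response above, keeping only the fields used below.
response_text = """
{
  "id": "0b6b08cf-f1ff-436b-a69f-7a1cb0d06e44",
  "results_count": 1,
  "results": [
    {
      "file_path": "sample.rb",
      "commit_hash": "3a07b08d2461b6906376081d1f13b303215bf55d",
      "token_length": 32,
      "permalink": "https://github.com/watchtowerdlp/sample/blob/3a07b08d2461b6906376081d1f13b303215bf55d/sample.rb#L7"
    }
  ]
}
"""


def summarize_findings(scan: dict) -> list:
    """Return one human-readable line per finding in a scan response."""
    return [
        f"{r['file_path']} @ {r['commit_hash'][:7]}: "
        f"{r['token_length']}-char token, see {r['permalink']}"
        for r in scan.get("results", [])
    ]


for line in summarize_findings(json.loads(response_text)):
    print(line)
```

Each summary line points at the commit and file where the token was found, which is usually enough to start revoking and rotating the credential.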

Likewise, you can configure a webhook endpoint to be programmatically notified when the scan is complete and results are available for review. If you prefer, you can also view scan results in the dashboard, screenshots below. You can read the complete API docs here: radar.watchtower.ai.

View scans in the dashboard:

As well as their results:

It’s worth noting that most of the keys we found during our research were stored deep in the commit history of a repo or were included in a now-deleted file. In git, deleting a file that contains sensitive data, or deleting the token itself, only removes it at the surface level; the secret can still be dug up from the commit history. The only effective way to remove such a key from the repo is to rewrite the history, either by deleting the commit that introduced the key or by deleting the entire commit history and starting a brand new one. A detection tool like Radar could therefore be applied as part of an engineer's CI/CD workflow, before any code gets pushed to a remote repo, to prevent the need to edit the git history. This proactive approach would be instrumental in reducing the accidental exposure of secrets. Contact us if you’re interested in leveraging Radar in this way.

To increase your scan limit, or scan the full commit history of larger repos, please email us at support@watchtower.ai. Likewise, we would welcome your feedback on the API - please don’t hesitate to reach out via email with your thoughts.

About Watchtower

Watchtower is a data security platform that uses machine learning to identify business-critical data, like customer PII, across SaaS and data infrastructure. Our team, based in San Francisco & Palo Alto and backed by leading Silicon Valley investors, has experience building cloud services & APIs — and the software to protect the data flowing into & through those systems — at some of the fastest growing platform companies in the world. We're hiring!


"I've been in the security industry for a while and was looking for a strong, automated solution for data discovery, classification, and protection. I was very impressed with the accuracy of the classification on my unstructured data — nothing on the market comes close to this."
Shahar Ben-Hador
CIO, Exabeam


"To ensure I had the proper context to reduce risk at our company, I needed a solution that was cloud first and able to provide visibility into my SaaS providers & infrastructure without slowing down the business. There were plenty of proxy solutions out on the market but none that were able to provide the same visibility and frictionless experience for our team members, enabling us to combat data spray and classification issues. That was until we started leveraging Watchtower — their API-driven solution accurately monitored and was able to take immediate action, which was a game changer for us."
CISO
Hyper-Growth Tech Company


"Watchtower saves us substantial time and is more effective than doing this manually... We need this capability and I’m grateful that Watchtower does such a good job. It is better than I expected and I typically have expectations that are high."
CISO
Fortune 100 Company
