An academic study of GitHub found that more than 100,000 of the web service's code repositories contain publicly accessible authentication secrets such as API and cryptographic keys, and that thousands of new secrets are leaked each day.
North Carolina State University researchers Michael Meli, Matthew McNiece (also from Cisco Systems) and Bradley Reaves detail their findings in a paper published last month in conjunction with the 2019 Network and Distributed System Security Symposium.
The researchers combined two approaches to identify the leaked secrets. The first involved querying a GitHub repository search engine API for nearly six months, from Oct. 31, 2017 through April 20, 2018. The second involved using BigQuery, a web service that enables analysis of massive datasets, to query a weekly snapshot of GitHub activity on April 4, 2018. According to the paper, the former method is a real-time means of discovering 99 percent of newly "committed" (i.e. saved on GitHub) files that contain secrets, while the latter produces a snapshot covering 13 percent of all public GitHub repositories.
The paper said the study examined "millions of repositories and billions of files to recover hundreds of thousands of secrets targeting 11 different platforms, five of which are in the Alexa Top 50."
The most commonly exposed secrets were Google API keys. The researchers found 212,892 such instances in total -- 85,311 of which were unique. Next most common were RSA Private Keys (158,011 total, 37,781 unique) and Google OAuth IDs (106,909 total, 47,814 unique).
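To put the gap between total and unique counts in perspective, the short Python sketch below computes what share of each reported total was unique and how many times, on average, each unique secret appeared. The figures are the ones quoted above; the function names are made up for illustration and do not come from the paper.

```python
# Counts as quoted in the article: (total occurrences, unique values).
REPORTED = {
    "Google API key": (212_892, 85_311),
    "RSA private key": (158_011, 37_781),
    "Google OAuth ID": (106_909, 47_814),
}


def unique_share(total, unique):
    """Percentage of all occurrences that are distinct secrets."""
    return 100.0 * unique / total


def duplication_factor(total, unique):
    """Average number of times each distinct secret appeared."""
    return total / unique


if __name__ == "__main__":
    for name, (total, unique) in REPORTED.items():
        print(
            f"{name}: {unique_share(total, unique):.1f}% unique, "
            f"about {duplication_factor(total, unique):.1f} copies per secret"
        )
```

By this arithmetic, roughly 40 percent of the Google API key occurrences were unique, compared with about 24 percent for RSA private keys, suggesting the same private keys were copied across files more often.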
The researchers claim their method of filtering and identifying secrets was intentionally conservative to reduce the odds of false positives -- meaning the actual number of exposed secrets is probably even greater than what they reported.
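The article does not spell out the filters involved, but one heuristic commonly used by secret scanners to cut false positives is to discard candidate strings that do not look random enough. The Python sketch below illustrates that idea with Shannon entropy. It is an assumption for illustration only: the threshold value and function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    length = len(text)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def looks_random(candidate, threshold=3.0):
    """Keep a candidate only if its entropy meets a (hypothetical) threshold.

    A higher threshold is more conservative: fewer false positives,
    but more genuine secrets are missed.
    """
    return shannon_entropy(candidate) >= threshold
```

A filter like this would reject placeholder values such as a long run of the same character, at the cost of occasionally discarding a real key, which is consistent with the reported totals being an undercount.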
Additionally, the researchers estimated that 89.10 percent of the secrets they discovered were sensitive in nature, and thus could put users at risk.
Using April 4 as a starting point, the researchers also found that 81 percent of the secrets they discovered took two weeks or longer to be removed, or were never removed at all. "It is likely that the developers for this 81 percent either do not know the secrets are being committed or are underestimating the risk of compromise," the study said.
Of the secrets that GitHub users removed within two weeks of April 4, most were erased in the first 24 hours. In these cases, users who had exposed secrets by committing them in their code typically disposed of them quickly by overwriting them with new commits.
"GitHub has become the most popular platform for collaboratively editing software, yet this collaboration often conflicts with the need for software to use secret information. This conflict creates the potential for public secret leakage," the paper states. Ultimately, "We have shown that an attacker with minimal resources could compromise many GitHub users by stealing leaked keys," the paper concluded.
In October 2018, GitHub introduced version 2.0 of its GitHub Token Scanning service, which scans repositories for known token formats and alerts the appropriate party, who can then revoke the token. In their paper, the researchers claim their findings could be used to further improve this feature and help mitigate damage from exposed secrets.
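Scanning for known token formats generally comes down to matching text against patterns with a distinctive shape. The Python sketch below shows the idea for the three secret types named earlier in this article. The regular expressions are widely cited approximations of those formats and are assumptions here. They are not GitHub's actual rules, and the `scan_text` helper is hypothetical.

```python
import re

# Illustrative patterns only (assumed formats, not GitHub's real rule set).
PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "google_oauth_id": re.compile(
        r"[0-9]+-[0-9A-Za-z_]{32}\.apps\.googleusercontent\.com"
    ),
}


def scan_text(text):
    """Return a sorted list of (kind, matched_string) pairs found in text."""
    hits = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.add((kind, match.group(0)))
    return sorted(hits)


if __name__ == "__main__":
    # A fabricated, non-functional key used purely as sample input.
    sample = 'API_KEY = "AIza' + "A" * 35 + '"'
    for kind, value in scan_text(sample):
        print(kind, value)
```

A production scanner would also need to walk a repository's commit history, since a secret overwritten by a later commit can remain retrievable from earlier ones.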