Top Organizations on GitHub Vulnerable to Dependency Confusion Attacks – Wave 3

This study follows up on our previous blog on dependency confusion attacks, where we discussed how severe such attacks are and how much damage they can do to an organisation’s security. We analyzed the top 1,000 GitHub organizations for such issues, scanning 38,691 GitHub repositories containing Ruby, Python, JavaScript, Go, and PHP code.

The Idea Behind It

With the recent buzz around dependency confusion attacks, and Microsoft releasing an official advisory on how to avoid them, we studied the repositories of top GitHub organizations and published a blog covering some popular package managers. During that research last month, we found that this attack is much more serious and prevalent than we had initially estimated. We decided to follow up with an analysis of both the density and the plausibility of dependency confusion attacks in the wild, scanning repositories from the top 1,000 organizations on GitHub to look for such issues in top-tier repositories that are widely used by the developer community.

The Approach We Took

The first step was to define the ‘most popular’ organizations on GitHub. As GitHub doesn’t provide public statistics such as ‘total downloads’ to rank by, we came up with a different metric: the number of stars and the level of activity.

The next step involved downloading the repository archive provided by GitHub for each repository. We chose this over cloning because the archive contains only the source code from the main branch, with no history or metadata, which reduces the size significantly. We also did not want to comment on historical branches where this issue did not exist.
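As a rough illustration of the archive step, the sketch below fetches a history-free snapshot through GitHub's REST zipball endpoint, which serves the default branch when no ref is given. The helper names, the User-Agent string, and the destination layout are assumptions made for this example, not details of the tool used in the study.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def archive_url(owner, repo, ref=None):
    """Build the zipball URL; without a ref GitHub serves the default branch."""
    base = f"https://api.github.com/repos/{owner}/{repo}/zipball"
    return f"{base}/{ref}" if ref else base


def download_snapshot(owner, repo, dest, ref=None):
    """Download and unpack one source snapshot (no git history) under dest."""
    request = urllib.request.Request(
        archive_url(owner, repo, ref),
        headers={"User-Agent": "dep-scan-sketch"},  # hypothetical client name
    )
    # urlopen follows the redirect to the actual archive download.
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = response.read()
    target = Path(dest) / f"{owner}__{repo}"  # assumed folder naming
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target)
    return target
```

Calling `download_snapshot("some-org", "some-repo", "snapshots")` would unpack the default branch into `snapshots/some-org__some-repo`. Unauthenticated API requests are rate limited, so a scan of this size would likely need an access token.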

Once we had the entire source code available locally, we wrote a tool to look for respective dependency files in these downloaded repositories and check them for issues. In the process of optimizing the code, we decided to cache the dependencies and sources as they were scanned to minimize the number of requests made to package repositories.
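The discovery and caching steps might look roughly like the following sketch. The manifest names come from the language sections of this post; the skipped directory names, the `CachedLookup` class, and the shape of the lookup callable are assumptions for illustration only.

```python
import os
from pathlib import Path

# Dependency files covered by this study, mapped to their ecosystem.
MANIFESTS = {
    "package.json": "javascript",
    "requirements.txt": "python",
    "Gemfile": "ruby",
    "composer.json": "php",
    "go.mod": "go",
}
# Assumption: vendored third-party trees are not worth scanning.
SKIP_DIRS = {"node_modules", "vendor"}


def find_manifests(root):
    """Yield (ecosystem, path) for every dependency file below root."""
    for current, dirs, files in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for name in files:
            if name in MANIFESTS:
                yield MANIFESTS[name], Path(current) / name


class CachedLookup:
    """Wrap a (ecosystem, name) -> result callable and remember answers."""

    def __init__(self, lookup):
        self._lookup = lookup
        self._seen = {}

    def __call__(self, ecosystem, name):
        key = (ecosystem, name)
        if key not in self._seen:
            # Only the first sighting of a package triggers a real request.
            self._seen[key] = self._lookup(ecosystem, name)
        return self._seen[key]
```

Because popular packages recur across thousands of repositories, even this simple in-memory cache removes most of the repeated registry requests.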

We then manually checked all the results for false positives and derived statistics from them. As mentioned above, the results were quite surprising.

The Issues We Looked For

We mainly looked for the following issues in the repositories we scanned:

1. Publicly unavailable packages

We looked for packages that are not registered/available publicly. It covers three types of issues:

  1. Packages that have been deleted
  2. Private packages which can be registered publicly
  3. Packages that have been mistyped

As we discussed in our previous blog, such packages can pose a risk for package managers and cause a dependency confusion attack.
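One way to flag a publicly unavailable package is to ask the public index for its metadata and treat a 404 as a candidate finding. The sketch below uses the public metadata endpoints of PyPI, npm, and RubyGems; the function names are hypothetical, and a 404 is only a hint that still needs the manual review described above (for example, a scoped npm name can only be claimed if the scope itself is unowned).

```python
import urllib.error
import urllib.parse
import urllib.request

# Public metadata endpoints; an unknown package name answers with 404.
REGISTRY_URLS = {
    "python": "https://pypi.org/pypi/{name}/json",
    "javascript": "https://registry.npmjs.org/{name}",
    "ruby": "https://rubygems.org/api/v1/gems/{name}.json",
}


def http_status(url):
    """Return the HTTP status code for a GET request to url."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "dep-scan-sketch"}  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=15) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def looks_unregistered(ecosystem, name, fetch_status=http_status):
    """True when the public index reports no package under this name."""
    # Scoped npm names such as @scope/pkg are addressed as @scope%2Fpkg.
    quoted = urllib.parse.quote(name, safe="@")
    url = REGISTRY_URLS[ecosystem].format(name=quoted)
    return fetch_status(url) == 404
```

The `fetch_status` parameter exists so the network call can be swapped for a cached or stubbed lookup.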

2. Unreachable Sources

We looked for package sources that are unreachable and may be hijacked by malicious sources. It includes:

  1. Non-existent GitHub/GitLab profiles
  2. Expired domains

In the case of GitHub profiles, we went one step further to check if the username could be registered on GitHub.

If a package source is hijackable, it can be used to host malicious packages and compromise all of its users.
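A minimal sketch of the source checks, under stated assumptions: `split_source` pulls the host and, for GitHub/GitLab, the account name out of a dependency source URL, and `resolves` tests whether the host still has a DNS record. Both function names are hypothetical, and a failed DNS lookup only suggests an expired domain; it does not prove the domain can be registered.

```python
import re
import socket
from urllib.parse import urlparse

FORGES = {"github.com", "gitlab.com"}
# scp-style git addresses such as git@github.com:owner/repo.git
SCP_STYLE = re.compile(r"^(?:[\w.-]+@)?(?P<host>[\w.-]+):(?P<path>[^/].*)$")


def split_source(source):
    """Return (host, owner); owner is None unless the host is a known forge."""
    if source.startswith("git+"):
        source = source[4:]
    match = SCP_STYLE.match(source)
    if match:
        host, path = match["host"], match["path"]
    else:
        parsed = urlparse(source)
        host, path = parsed.hostname or "", parsed.path
    host = host.lower()
    segments = [part for part in path.split("/") if part]
    owner = segments[0] if host in FORGES and segments else None
    return host, owner


def resolves(host, resolver=socket.getaddrinfo):
    """True when DNS returns at least one address for host."""
    try:
        resolver(host, 443)
        return True
    except socket.gaierror:
        return False
```

For forge-hosted sources, the extracted owner can then be checked against the platform, for example by treating a 404 on the profile as a finding to review by hand.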

The Results

After the recent developments around such attacks, we assumed that, among the issues we were looking for, unavailable public packages would have the most victims, but it turns out that unreachable package sources are more common.

Out of the 38,691 repositories we scanned, 20,220 had a dependency file and the results we are going to talk about now belong to those repositories.

The Issues We Unearthed

On analyzing these repositories, we found that 93 repositories across the top 1,000 GitHub organizations use a package that doesn’t exist on a public package index and could be claimed by an attacker to mount a supply chain attack. Similarly, 169 repositories were installing dependencies from a host that isn’t reachable over the internet, and 126 repositories were installing packages owned by a GitHub/GitLab user that doesn’t exist.

Now, let’s break these results down per language and look at the impact of this attack on each programming language.

1. JavaScript

JavaScript’s package managers, npm and yarn, use a package.json file to store dependencies. We found that out of 17,496 JavaScript repositories, 12,212 contained a package.json file and referenced packages from public/private sources.

No. of Unavailable Packages: 72
No. of Unreachable Sources: 345

JavaScript contributed the most to the issue statistics, potentially because npm packages tend to have a lot of dependencies.
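To make the package.json case concrete, here is a sketch that splits dependencies into names resolved through the registry and entries that point at an external source (Git URLs, tarball URLs, or the `owner/repo` GitHub shorthand that npm accepts). The function names and the exact prefix lists are assumptions for illustration; local `file:`, `link:`, and `workspace:` entries are ignored.

```python
import json
import re

SECTIONS = ("dependencies", "devDependencies",
            "optionalDependencies", "peerDependencies")
REMOTE_PREFIXES = ("git+", "git://", "http://", "https://",
                   "github:", "gitlab:", "bitbucket:")
LOCAL_PREFIXES = ("file:", "link:", "workspace:", ".", "/", "~")
# npm treats a bare "owner/repo" value as a GitHub repository.
GITHUB_SHORTHAND = re.compile(r"^[A-Za-z0-9][\w.-]*/[\w.-]+(#.+)?$")


def read_package_json(text):
    """Return (registry_names, external_sources) for one package.json."""
    manifest = json.loads(text)
    registry_names, external_sources = set(), {}
    for section in SECTIONS:
        for name, spec in (manifest.get(section) or {}).items():
            if not isinstance(spec, str) or spec.startswith(LOCAL_PREFIXES):
                continue  # local paths cannot be hijacked remotely
            if spec.startswith(REMOTE_PREFIXES) or GITHUB_SHORTHAND.match(spec):
                external_sources[name] = spec
            else:
                registry_names.add(name)
    return registry_names, external_sources
```

Registry names would feed the unavailable-package check, while external sources would feed the unreachable-source check.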

2. Python

Python’s package manager pip doesn’t mandate a specific dependency file, but requirements.txt has become a community standard. We found that out of 8,614 Python repositories, 2,906 had a requirements.txt file.

No. of Unavailable Packages: 40
No. of Unreachable Sources: 7

While the numbers for Python were lower in the top GitHub organizations, unidentified Python packages are a challenge faced widely in public.
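A simplified reader for requirements.txt could look like the sketch below. It collects package names (normalized the way PyPI compares them) and any index or URL sources. It is an assumption-laden approximation, not pip's real parser: it ignores line continuations and environment markers, and only recognizes direct references written with spaces around `@`.

```python
import re

NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")
INDEX_OPTIONS = ("--index-url", "--extra-index-url", "--find-links", "-i", "-f")


def normalize(name):
    """PyPI treats runs of '-', '_' and '.' as equal and ignores case."""
    return re.sub(r"[-_.]+", "-", name).lower()


def read_requirements(text):
    """Return (package_names, sources) found in a requirements file."""
    names, sources = set(), []
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        head = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if head in INDEX_OPTIONS:
            if rest:
                sources.append(rest)  # alternative index or link page
            continue
        if head in ("-e", "--editable"):
            line = rest  # keep the target of an editable install
        elif head.startswith("-"):
            continue  # other options such as -r or -c
        _, _, direct_url = line.partition(" @ ")
        if direct_url:
            sources.append(direct_url.strip())  # "name @ url" reference
        elif "://" in line:
            sources.append(line)  # bare VCS or archive URL
        else:
            match = NAME.match(line)
            if match:
                names.add(normalize(match.group()))
    return names, sources
```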

3. Ruby

Ruby’s package manager Bundler uses a Gemfile to store dependencies. We found that out of 4,538 Ruby repositories, 3,044 had a Gemfile. While none of these had unavailable packages, there were 7 sources that were not publicly reachable.

No. of Unavailable Packages: 0
No. of Unreachable Sources: 7

4. PHP

PHP’s package manager Composer uses a composer.json file to store dependencies. During our analysis, we found that only 33 PHP repositories had a composer.json file. The majority of these were safe, except for the following findings:

No. of Unavailable Packages: 1
No. of Unreachable Sources: 2

5. Golang

Go manages dependencies using Go Modules, viz. a go.mod file (contains the list of dependencies) and go.sum (contains checksums of the content of specific module versions). We found that out of 4,198 Go repositories, 2,052 had a go.mod file for managing their dependencies.

No. of Unavailable Packages: 0
No. of Unreachable Sources: 69

In Go’s case, third-party libraries are imported very frequently, so the number of unreachable sources identified is higher.
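Because a Go module path begins with the host that serves it, the sources to check can be read directly from go.mod. The sketch below handles single-line and block `require` directives only; `replace` directives and other edge cases are left out, and both function names are hypothetical.

```python
def read_go_mod(text):
    """Return the module paths listed in the require directives of a go.mod."""
    modules, in_block = [], False
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop "// indirect" comments
        if not line:
            continue
        if in_block:
            if line == ")":
                in_block = False
            else:
                modules.append(line.split()[0])
        elif line.startswith("require") and line.endswith("("):
            in_block = True  # start of a multi-line require block
        elif line.startswith("require "):
            modules.append(line.split()[1])
    return modules


def source_of(module):
    """Split a module path into (host, owner); owner only for known forges."""
    parts = module.split("/")
    host = parts[0]
    known_forge = host in ("github.com", "gitlab.com")
    owner = parts[1] if known_forge and len(parts) > 1 else None
    return host, owner
```

Each `(host, owner)` pair can then go through the same reachability and account-existence checks used for the other ecosystems.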

The Impact We Drove

Out of the top 1000 organizations we scanned, 212 had at least one dependency confusion-related misconfiguration in their codebase. This is quite a serious concern because a major part of the open-source ecosystem depends on these giants and these repositories have a good number of users. Hence, if any of their projects get affected, there’s a high probability that millions of users will be at risk. 

To give you an idea of the impact, the projects in which we found issues had a total of over 124,937 stars on GitHub. GitHub stars represent ‘endorsement’ from users, and while they may not be a measure of a project’s downloads, they still say a lot about its influence.

We are reporting the issues we discovered to the relevant authorities and will update this blog as necessary.

The Datasets We Are Releasing

Similar to our prior project resonance waves, we are releasing data from this study, while making sure that we don’t reveal any sensitive / confidential / vulnerability details.

  1. Top Organizations – [Link]
  2. Repositories we scanned, tagged with languages – [Link]

Conclusion

Dependency confusion attacks are here to stay, at least in the form of typosquatting and abandoned legacy assets/accounts, which is a fundamental problem. They pose a risk to most entities, no matter how security-aware they are, and this analysis clearly shows it. On top of that, the impact of installing a package from an unknown origin is very high, as it essentially gives the package owner a way to execute arbitrary code on the machine.

With that being said, check out the recommendations in our blog Dependency Attacks: What, Why, How? for best practices that help minimize the risk of such attacks.

Our proprietary SaaS-based Attack Surface Management solution, NVADR, continuously keeps track of your organization’s external digital footprint by identifying and profiling assets as they surface on the internet. The assets we identify go well beyond IP addresses and subdomains, and cover a wide variety including Docker containers, mobile applications, code repositories, and a lot more. Once these assets are identified, we find security misconfigurations across all asset classes (including dependency confusion issues).

To understand how NVADR can help your organization improve its external digital footprint and security posture, Request a Demo.