Late last year, the Trellix Advanced Research Center team uncovered a vulnerability in Python’s tarfile module. As we dug in, we realized this was CVE-2007-4559, a 15-year-old path traversal vulnerability that could allow an attacker to overwrite arbitrary files. CVE-2007-4559 was reported to the Python project in 2007 and, left unchecked, had been unintentionally added to an estimated 350,000 open-source projects and was prevalent in closed-source projects.
Today, we’re excited to share an update on this work. Through GitHub, our vulnerability research team has patched 61,895 open-source projects previously susceptible to the vulnerability. This work was led by Kasimir Schulz and Charles McFarland, and concluded earlier this month.
Phased approach to patching at scale
Open-source developer tools, like Python, are necessary to advance computing and innovation, and protection from known vulnerabilities requires industry collaboration, especially since many open-source projects lack dedicated staff and resources. To effectively minimize the vulnerability surface area, Trellix Advanced Research Center executed a months-long automated effort to patch open-source projects known to use the vulnerable code.
Through GitHub, developers and community members can submit code to projects or repositories on the platform via a process called a pull request. Once a request is opened, the project maintainers review the suggested code, request collaboration or clarification if needed, and accept the new code. In our case, each pull request delivered a unique patch to one of the vulnerable GitHub projects.
As we outlined a process to automate patching, our team took inspiration from Jonathan Leitschuh’s DEFCON 2022 talk on fixing vulnerabilities at scale. Our Advanced Research Center vulnerability team was able to automate the process, with the exception of quality control. We broke the process into two steps: the patching phase and the pull request phase. Both were automated and simply needed to be executed.
Patching phase
GitHub was a great partner in this process, and after receiving a list of repositories and files that contained the keyword, “import tarfile,” our team was able to compile a unique list of repositories to scan. We could not have executed this large-scale effort without quick delivery of actionable data from GitHub.
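As a rough illustration of this de-duplication step, the sketch below collapses a list of search hits into a unique list of repositories. The input shape (pairs of repository name and file path) and the function name `unique_repositories` are assumptions made for this example; the format of the data GitHub actually delivered is not described here.

```python
def unique_repositories(hits):
    """Group (repository, file path) hits by repository.

    `hits` is a hypothetical input format: an iterable of
    (repository name, path of a file containing "import tarfile").
    Returns a dict mapping each repository to its matching files,
    in first-seen order.
    """
    repos = {}
    for repo, path in hits:
        repos.setdefault(repo, []).append(path)
    return repos
```

The keys of the returned dict form the unique list of repositories to clone, and the values record which files in each one are worth scanning.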
Once the list was delivered, we cloned and scanned each repository using Creosote, a free tool we built for developers to check if their applications are vulnerable, to determine which repositories needed to be patched. If a repository was determined to contain the vulnerability, we patched the file and created a local patch diff containing the patched file and the original file, so users can easily compare the two, along with some metadata about the repository. The repository was then deleted to conserve space.
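To give a feel for what such a scan involves, here is a deliberately simplified sketch. It is not how Creosote works; it is a stand-in heuristic that uses Python's `ast` module to flag files that import tarfile and call a method named `extract` or `extractall`. The function name `flag_candidates` is hypothetical, and a real scanner would need to do much more to avoid false positives, such as confirming the call is made on a tarfile object and checking whether members are validated first.

```python
import ast

EXTRACT_METHODS = {"extract", "extractall"}


def flag_candidates(source):
    """Return sorted line numbers of .extract()/.extractall() calls.

    Only files that import tarfile are considered. This is a coarse
    heuristic: it does not verify that the call is made on a tarfile
    object, nor whether the members were validated beforehand.
    """
    tree = ast.parse(source)
    uses_tarfile = False
    call_lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name == "tarfile" for alias in node.names):
                uses_tarfile = True
        elif isinstance(node, ast.ImportFrom) and node.module == "tarfile":
            uses_tarfile = True
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in EXTRACT_METHODS
        ):
            call_lines.append(node.lineno)
    return sorted(call_lines) if uses_tarfile else []
```

Every flagged line would still need review, which is consistent with quality control being the one part of the effort that was not automated.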
Pull request phase
Once patches were ready to go live, we reviewed the list of local patch diffs and, for each repository, did the following: created a fork of the repository on GitHub, cloned the fork, then replaced the original file with the patched file if the original file had not changed. We checked whether the original file had changed between our original clone and our fork so that we didn’t overwrite any new changes made to the file in the meantime. We then committed the changes to the fork and created a pull request from our forked repository back to the original repository, along with a message detailing who we were and why we were opening the pull request. At this point, it was up to the owner of the repository to accept or reject our changes.
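The "replace only if unchanged" check can be sketched as a hash comparison between the saved original and the file in the fresh clone. This is an assumption about one reasonable way to do it, not a description of our tooling; the function names and the choice of SHA-256 are illustrative.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def replace_if_unchanged(saved_original, current_file, patched_file):
    """Copy patched_file over current_file only if it is safe to do so.

    saved_original: copy of the file taken when the patch was created.
    current_file:   the same file in the freshly cloned fork.
    Returns True if the file was replaced, False if the upstream file
    changed in the meantime and was therefore left untouched.
    """
    if sha256_of(saved_original) != sha256_of(current_file):
        return False
    shutil.copyfile(patched_file, current_file)
    return True
```

When the function returns False, the safer course is to skip the repository or regenerate the patch against the new version of the file rather than overwrite someone else's changes.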
Others looking to do this kind of work should not overlook the importance of managing the servers the automated process runs on and of keeping an eye on feedback from the patched repositories. Monitoring both closely enabled us to move quickly to address questions from pull request recipients and to fix server and network issues expeditiously.
Conclusion
The vulnerable tarfile module is included in the base Python package and is a readily available solution for a common problem. Without a direct fix from Python, it is also firmly embedded in the supply chain of many projects. Its permanence, along with the fact that nearly all the learning material on the tarfile module teaches developers to use it improperly, creates a broad attack surface. Through these efforts to automate and patch vulnerable projects, the software supply chain attack surface is narrowed.
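For readers wondering what more careful use can look like, the sketch below validates every member name before extracting anything. It is a minimal illustration under stated assumptions, not a complete defense: the helper names are our own for this example, and the check covers only member names, not symbolic or hard links inside the archive that point elsewhere.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar, dest="."):
    """Extract all members of an open TarFile, refusing path traversal.

    Every member name is checked first; if any would land outside
    dest (for example "../evil.txt" or an absolute path), nothing is
    extracted and ValueError is raised.
    """
    for member in tar.getmembers():
        member_path = os.path.join(dest, member.name)
        if not is_within_directory(dest, member_path):
            raise ValueError(f"blocked path traversal attempt: {member.name!r}")
    tar.extractall(dest)
```

A call such as `safe_extract(tarfile.open("archive.tar"), "output")` would replace a bare `extractall`. Recent Python releases also offer extraction filters through the `filter` argument of `extractall`, which are worth preferring where available.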
This work to narrow the attack surface cannot be done without collaboration across our industry. As an industry we cannot afford to ignore the need to seek out and eradicate foundational vulnerabilities. Mass patching of open-source projects can be done, even if it takes a lot of time, and it can deliver benefits to organizations of all sizes, across sectors and regions.
To properly prevent the reintroduction of past attack surfaces, it’s critical that every organization using code libraries and frameworks in their applications have proper checks and evaluation measures in place to ensure full transparency into their software supply chain, while also making sure their developers are educated on all layers of the technology stack.