This is a curated list of relevant datasets, data sources, and empirical research in the space of Open Source Software development. We prioritize sources in which (1) the raw data is publicly accessible or (2) the published metrics are derived from public sources. We also include data sources for which only high-level insights are available.
An excellent list of datasets used for empirical software engineering / mining software repositories exists at dspinellis/awesome-msr. Several relevant data sources from this list are included here.
- GHTorrent
- Offline mirror of historical data offered by GitHub's REST API
- GitHub org: website code, tutorial
- GH Archive
- Records GitHub's public timeline of activity
- GitHub REST API and GraphQL API
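As a rough illustration of working with GH Archive, the sketch below builds the URL of one hourly archive file and tallies event types from its gzipped, newline-delimited JSON. The `https://data.gharchive.org/` URL pattern follows GH Archive's published examples; the function names are hypothetical, and the download helper is defined but not called.

```python
import gzip
import json
import urllib.request
from collections import Counter
from datetime import datetime


def gharchive_url(hour: datetime) -> str:
    """Build the URL of one hourly archive file.

    GH Archive names files YYYY-MM-DD-H.json.gz, with the hour not zero-padded.
    """
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in a gzipped JSON-lines payload."""
    counts = Counter()
    for line in gzip.decompress(gz_bytes).splitlines():
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts


def fetch_hour(hour: datetime) -> bytes:
    """Download one hourly file (network access required; files can be large)."""
    with urllib.request.urlopen(gharchive_url(hour), timeout=60) as resp:
        return resp.read()


# Example (not run here):
#   counts = count_event_types(fetch_hour(datetime(2015, 1, 1, 15)))
#   print(counts.most_common(5))
```

For analyses spanning months or years, the BigQuery copy of the same data is usually more practical than downloading hourly files one by one.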
- Ecosyste.ms
- Tools and open datasets to support, sustain, and secure critical digital infrastructure
- Software Heritage
- Historical archive of source code
- Common Crawl
- Raw page data, metadata, and extracted text from publicly accessible segments of the internet
- Timeframe: 2008 - present, monthly since March 2014
- Data hosted on Amazon S3: getting started docs
- Internet Archive
- Less systematic crawls with a longer history
- Access via the Wayback Machine or its API
- The Mail Archive
- Catalogs a number of public mailing lists for collaborative projects
- FAQ
- Mailing list ARChives
- Apache Mail Archives
- GNU Mail Archives
- NIST National Vulnerability Database
- A Common Vulnerabilities and Exposures (CVE) database
- CVE and severity metrics (CVSS)
- Timeframe: October 1988 - present
- GitHub Advisory Database
- A database of CVEs and security issues affecting GitHub packages
- Timeframe: October 2017 through present
- Drawn from a variety of sources and recorded using Open Source Vulnerability Format
- Open Source Vulnerability (OSV) Database
- Draws from a variety of sources across ecosystems. Note: encompasses GitHub Advisory Database
- GCS bucket: https://osv-vulnerabilities.storage.googleapis.com/
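Besides the bulk export in the GCS bucket, the OSV project exposes a query API. The sketch below shows one way to ask which known vulnerabilities affect a specific package version. The endpoint `https://api.osv.dev/v1/query` and the request shape follow the OSV API documentation as I understand it; the helper names are hypothetical, and the network call is defined but not executed.

```python
import json
import urllib.request

OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def build_query(name: str, ecosystem: str, version: str) -> dict:
    """Build the JSON body for a package-version query."""
    return {"package": {"name": name, "ecosystem": ecosystem}, "version": version}


def vulnerability_ids(response: dict) -> list:
    """Extract advisory IDs from a query response.

    A response with no matches may omit the "vulns" key entirely.
    """
    return [vuln["id"] for vuln in response.get("vulns", [])]


def query_osv(name: str, ecosystem: str, version: str) -> dict:
    """POST the query to OSV and return the decoded JSON (network required)."""
    request = urllib.request.Request(
        OSV_QUERY_URL,
        data=json.dumps(build_query(name, ecosystem, version)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


# Example (not run here):
#   print(vulnerability_ids(query_osv("jinja2", "PyPI", "2.4.1")))
```

Ecosystem names are case-sensitive identifiers defined by the OSV schema (for example `PyPI`, `npm`, `crates.io`).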
- Bhandari, Guru, Naseer, Amara, Moonen, Leon, 2021. CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software. https://doi.org/10.5281/zenodo.4476564
- Anas Nadeem, 2021. GitHub Issue Dataset From Top Repositories of Top Languages. https://doi.org/10.5281/zenodo.5048542
- NIST National Software Reference Library
- NSRL Reference Data Set: a collection of hashes and metadata used to uniquely identify individual files across a set of software projects
- Forensic use cases include identifying software, or flagging malicious elements, based solely on file contents
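The hash-matching idea behind the NSRL Reference Data Set can be sketched in a few lines: hash a file's contents and look the digest up in a set of known hashes. This is a minimal illustration, not the NSRL tooling; the in-memory `known_hashes` set is a hypothetical stand-in for hashes loaded from an RDS release, and SHA-1 is used here as one of the hash types the RDS publishes.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the upper-case hexadecimal SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest().upper()


def is_known_file(data: bytes, known_hashes: set) -> bool:
    """Check whether the contents match any digest in the reference set."""
    return sha1_hex(data) in known_hashes


# Hypothetical reference set; a real one would be loaded from an RDS release.
known_hashes = {sha1_hex(b"contents of a known library file")}

print(is_known_file(b"contents of a known library file", known_hashes))
print(is_known_file(b"unrecognized contents", known_hashes))
```

Because the lookup depends only on file contents, renaming or moving a file does not affect the match, while changing a single byte does.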
- BigQuery introduction for GitHub data https://github.com/dinalav/Data-Science-Slides-and-Notebooks
- deps.dev - Open Source Insights
- A Google project to develop a software dependency graph across ecosystems. Versioning and vulnerability information included.
- general docs, API docs, docs for BigQuery access
- Libraries.io
- Data on software package dependency relationships over time. Sourced from a number of different ecosystems.
- Data releases
- Repology
- Monitors software package vintages (i.e. versioning) across a number of ecosystems
- PyPI - Python package index download statistics
- CRAN logs - R download statistics
- RubyGems - Ruby package traffic statistics
- Tons of information created by Honeycomb
- Julia - Julia download statistics since October 2021
- npm - Node.js download statistics API
- NuGet - Historical .NET/C# download numbers
- PECL and PEAR - PHP download statistics
- crates.io - Rust download statistics
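Several of the registries above expose download counts over HTTP. As one example, the sketch below targets the npm downloads API, building a point-count URL and reading the total from the JSON response. The `https://api.npmjs.org/downloads/point/{period}/{package}` pattern follows npm's public API documentation as I understand it; the function names are hypothetical, and the network call is defined but not executed.

```python
import json
import urllib.request

NPM_DOWNLOADS_BASE = "https://api.npmjs.org/downloads/point"


def npm_point_url(package: str, period: str = "last-week") -> str:
    """Build the URL for a total download count over a period.

    The period may be a keyword such as "last-day", "last-week", or
    "last-month", or a date range written as "YYYY-MM-DD:YYYY-MM-DD".
    """
    return f"{NPM_DOWNLOADS_BASE}/{period}/{package}"


def parse_downloads(payload: str) -> int:
    """Extract the download total from the API's JSON response body."""
    return int(json.loads(payload)["downloads"])


def fetch_downloads(package: str, period: str = "last-week") -> int:
    """Fetch the download count for a package (network required)."""
    with urllib.request.urlopen(npm_point_url(package, period), timeout=30) as resp:
        return parse_downloads(resp.read().decode("utf-8"))


# Example (not run here):
#   print(fetch_downloads("express", "last-month"))
```

Download counts include automated traffic such as CI builds and mirrors, so they are better read as a relative popularity signal than as a count of users.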
- GitHub Innovation Graph
- Census II of Free and Open Source Software
- survey of OSS library production usage at the application library level
- report and data appendix
- Hoffmann, Manuel and Nagle, Frank and Zhou, Yanuo. 2024. The Value of Open Source Software. Harvard Business School Strategy Unit Working Paper No. 24-038. Available at SSRN: https://ssrn.com/abstract=4693148 or http://dx.doi.org/10.2139/ssrn.4693148
- Bayoán Santiago Calderón, Robbins, Guci, Korkmaz, and Kramer. 2022. Measuring the Cost of Open-Source Software Innovation on GitHub. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-05-26. https://doi.org/10.3886/E158823V2
- CHAOSS
- Linux Foundation project to establish OSS community health metrics
- Metric definitions
- OpenSSF Best Practices Badge Program
- OpenSSF Criticality Scores
- An effort by OpenSSF Securing Critical Projects WG
- Algorithm: "Quantifying Criticality" by Rob Pike
- Data repository: https://github.com/ossf/criticality_score
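The scoring idea in "Quantifying Criticality" can be sketched as a weighted average of log-scaled signals, where each signal S_i has a weight alpha_i and a saturation threshold T_i: the score is the sum of alpha_i * log(1 + S_i) / log(1 + max(S_i, T_i)), divided by the sum of the weights. The signal values, weights, and thresholds below are made up for illustration and are not the parameters used by the OpenSSF project.

```python
import math


def criticality_score(signals):
    """Compute a criticality-style score in [0, 1] for non-negative weights.

    `signals` is an iterable of (value, weight, threshold) tuples. Each
    signal contributes weight * log(1 + value) / log(1 + max(value, threshold)),
    so a signal at or above its threshold contributes its full weight.
    """
    signals = list(signals)
    for value, _, threshold in signals:
        if value < 0 or threshold <= 0:
            raise ValueError("values must be >= 0 and thresholds must be > 0")
    total_weight = sum(weight for _, weight, _ in signals)
    weighted_sum = sum(
        weight * math.log(1 + value) / math.log(1 + max(value, threshold))
        for value, weight, threshold in signals
    )
    return weighted_sum / total_weight


# Hypothetical signals: (value, weight, threshold)
example = [
    (150, 2.0, 5000),   # e.g. contributor count
    (1200, 1.0, 1000),  # e.g. commits in the last year (saturated)
    (40, 0.5, 500),     # e.g. dependent projects
]
print(round(criticality_score(example), 3))
```

The logarithm dampens very large raw counts, and the threshold caps each signal's contribution, so no single metric can dominate the score.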
- isitmaintained.com
- quick status checks for public GitHub repositories (e.g. median issue resolution time, percentage of open issues)
- source code repository for backend service
- Goggins, S., Lumbard, K. and Germonprez, M., 2021, May. Open source community health: Analytical metrics and their corresponding narratives. In 2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal) (pp. 25-33). IEEE.
- Denivan Campos, Luana Martins, & Ivan Machado. (2022). An empirical study on the influence of developers' experience on software test code quality [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.7110141
- Perez, Quentin, Urtado, Christelle, Vauttier, Sylvain, 2022. Dataset of Open-Source Software Developers Labeled by their Experience Level in the Project and their Associated Software Metrics. https://doi.org/10.5281/zenodo.6966195
- Munaiah, N., Kroh, S., Cabrey, C. and Nagappan, M., 2017. Curating github for engineered software projects. Empirical Software Engineering, 22(6), pp.3219-3253. project website
- Dabic, Ozren, Aghajani, Emad, Bavota, Gabriele, 2021. GHS (GitHub Search): Sampling Projects in GitHub for MSR Studies. https://doi.org/10.5281/zenodo.4588464
- Choudhary, Samridhi; Bogart, Christopher; Rose, Carolyn; Herbsleb, James (2020): Modeling Productivity in Open Source GitHub Projects: A Dataset and Codebase. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/6397013.v1
- Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Marco Tonelli, 2020. Dataset - How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. https://doi.org/10.5281/zenodo.3825044
- Champion, K. and Hill, B.M., 2021. Underproduction: An approach for measuring risk in open source software. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 388-399). IEEE.
- Wachs, J., Nitecki, M., Schueller, W. and Polleres, A., 2022. The geography of open source software: Evidence from github. Technological Forecasting and Social Change, 176, p.121478.
- Open Source Contributor Index (OSCI)
- Tracks GitHub contribution by commercial firms
- Measures active and total contributors
- Drawn from GH Archive (events from GitHub's public timeline)
- Spinellis, Diomidis, Kotti, Zoe, Kravvaritis, Konstantinos, Theodorou, Georgios, & Louridas, Panos. (2020). Enterprise-Driven Open Source Software (1.1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3742962
- Angermeir, F., Voggenreiter, M., Moyón, F. and Mendez, D., 2021, May. Enterprise-driven open source software: a case study on security automation. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 278-287). IEEE.
- Shimels Garomssa, Rathimala Kannan, Ian Chai, Dirk Riehle, 2022. How Software Quality Mediates the Impact of Intellectual Capital on Commercial Open Source Software Company Success. Available at: https://dx.doi.org/10.21227/3rwb-vg72.
- CSIS Government Open Source Software Policies
- Dataset of various public policy and legislation dealing with open source software from governments around the world.
- IssueHunt.io
- Bountysource
- boss.dev: Bounties for Open Source
- GitHub Sponsors
- List of dependencies for projects owned by the currently authenticated user (i.e. you). CSV export available.
- Kivach
- "cascading funding": donations to a project redistributed upstream
- Ko-fi
- Liberapay
- Open Collective
- transparent budgeting
- oss.fund
- aggregator for OSS funding opportunities, programs, and platforms
- StackAid
- donations redistributed evenly across project's dependencies
- simulation of funding allocation
- Secure Open Source Rewards (sos.dev)
- ralphtheninja/open-funding - guide to OSS funding options
- Wikipedia (open crowd sourcing platform)
- data dumps: https://dumps.wikimedia.org/enwiki/
- SQL access: https://quarry.wmcloud.org/
- Ekaterina Levitskaya, Gizem Korkmaz, Daniel Mietchen, Lane Rasberry, 2022. Analysis of Linked GitHub and Wikidata https://doi.org/10.5281/zenodo.7443339
- StackExchange
- Public Q&A data across the StackExchange network
- SE's Data Explorer
- (latest) data dump hosted by Internet Archive
- Older vintages can be tracked down
- Linux Foundation surveys
- 2020 FOSS Contributor Survey
- access: high level insights public
- timeframe: 2020
- 2021 Diversity, Equity, and Inclusion in Open Source
- access: high level insights and survey data
- timeframe: 2021
- 2022 State of Open Source Security
- In collaboration with Snyk.io
- access: high level insights public
- timeframe: 2022
- Annual Jobs Survey
- GitHub State of the Octoverse
- access: high level insights public
- timeframe: ?-2022 (annual)
- GitHub Open Source Survey
- access: (anonymized) individual responses public
- timeframe: 2017
- Stack Overflow Annual Developer Survey
- access: (anonymized) individual responses public
- timeframe: 2011-2022 (annual)
- GitLab Global Developer Survey
- access: high level insights public
- timeframe: 2016-2020 (annual)
- Tidelift
- access: high level insights
- timeframe: 2018-2022 (annual, varying topics)
- SlashData Developer Nation Survey
- access: not distributed publicly
- timeframe: ?
- O'Reilly: "The Value of Open Source in the Cloud Era"
- access: high level insights public
- timeframe:
- PDF copy of report
- FINOS / Linux Foundation State of Open Source in Financial Services
- access: high level insights public
- timeframe: 2021-2022 (annual)
- Open Source Programs Survey by TODO Group
- access: firm-level responses public
- timeframe: 2018-2022 (annual)
- additional resources on OSPOs
- Open Source Initiative and OpenLogic: State of Open Source survey
- access: high level insights public
- timeframe: 2022
- Hertel, G., Niedner, S. and Herrmann, S., 2003. Motivation of software developers in Open Source projects: an Internet-based survey of contributors to the Linux kernel. Research policy, 32(7), pp.1159-1177.
- Lakhani, K.R. and Wolf, R.G., 2003. Why hackers do what they do: Understanding motivation and effort in free/open source software projects. Open Source Software Projects (September 2003).
- Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German, Lefteris Angelis, 2020. Dataset from "What do developers talk about open source software licensing? " - SEAA2020. https://doi.org/10.5281/zenodo.3871565
- Feitelson, Dror. (2021). Survey on Developer and Researcher Views on the Ethics of Experiments on Open-Source Projects [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5752053