sboysel/awesome-oss-research-data

Awesome Open Source Software Research Data

This is a curated list of relevant datasets, data sources, and empirical research in the space of Open Source Software development. We prioritize sources in which (1) the raw data is made publicly accessible or (2) the published metrics are derived from public sources. We also include data sources for which only high-level insights are available.

An excellent list of datasets used for empirical software engineering / mining software repositories exists at dspinellis/awesome-msr. Several relevant data sources from this list are included here.

Contributions are welcome and greatly appreciated! Please open an issue or pull request if you have suggestions for new data sources and research.

Topics

Development activity

Datasets

Source Description
GHTorrent Offline mirror of historical data offered by GitHub's REST API
GH Archive Records GitHub's public timeline of activity
Ecosyste.ms Tools and open datasets to support, sustain, and secure critical digital infrastructure
GitHub REST API and GraphQL API GitHub's APIs for accessing data
GitHub Innovation Graph High-level insights on worldwide GitHub activity over time. Blog post and repo
Census II of Free and Open Source Software Survey of OSS library production usage at the application library level. Report and data appendix
Census III of Free and Open Source Software Report and open data
OSSRank Ranking that provides useful mappings between top projects, project types, and contributors (individuals and private companies)
Open Source Contributor Index (OSCI) Measures active and total GitHub contributors by private organizations. Drawn from GH Archive (events from GitHub's public timeline)
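As a minimal sketch of working with GH Archive, the snippet below builds the URL of one hourly archive and tallies event types from its contents. It assumes the hourly file naming `https://data.gharchive.org/YYYY-MM-DD-H.json.gz` (hour not zero-padded) and that each line of the file is one JSON event with a `type` field; check the GH Archive documentation before relying on either. The helper names `gharchive_url` and `count_event_types` are hypothetical, and downloading is left to the caller.

```python
import gzip
import json
from collections import Counter
from datetime import datetime


def gharchive_url(hour: datetime) -> str:
    """Build the URL of one hourly archive (assumed naming scheme, hour unpadded)."""
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their `type` field in a gzipped, newline-delimited JSON file."""
    counts = Counter()
    for line in gzip.decompress(gz_bytes).splitlines():
        if line.strip():  # skip blank lines
            counts[json.loads(line)["type"]] += 1
    return counts
```

To use it, fetch the bytes at `gharchive_url(datetime(2015, 1, 1, 15))` with any HTTP client and pass them to `count_event_types`.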

Research

General contribution patterns

  • Choudhary, Samridhi; Bogart, Christopher; Rose, Carolyn; Herbsleb, James (2020): Modeling Productivity in Open Source GitHub Projects: A Dataset and Codebase. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/6397013.v1
  • Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Marco Tonelli, 2020. Dataset - How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. https://doi.org/10.5281/zenodo.3825044
  • Champion, K. and Hill, B.M., 2021. Underproduction: An approach for measuring risk in open source software. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 388-399). IEEE.
  • Wachs, J., Nitecki, M., Schueller, W. and Polleres, A., 2022. The geography of open source software: Evidence from github. Technological Forecasting and Social Change, 176, p.121478.
  • Ekaterina Levitskaya, Gizem Korkmaz, Daniel Mietchen, Lane Rasberry, 2022. Analysis of Linked GitHub and Wikidata https://doi.org/10.5281/zenodo.7443339

Enterprise driven contribution

  • Spinellis, Diomidis, Kotti, Zoe, Kravvaritis, Konstantinos, Theodorou, Georgios, & Louridas, Panos. (2020). Enterprise-Driven Open Source Software (1.1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3742962
  • Angermeir, F., Voggenreiter, M., Moyón, F. and Mendez, D., 2021, May. Enterprise-driven open source software: a case study on security automation. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 278-287). IEEE.
  • Shimels Garomssa, Rathimala Kannan, Ian Chai, Dirk Riehle, 2022. How Software Quality Mediates the Impact of Intellectual Capital on Commercial Open Source Software Company Success. Available at: https://dx.doi.org/10.21227/3rwb-vg72.

Contributor experience

  • Denivan Campos, Luana Martins, & Ivan Machado. (2022). An empirical study on the influence of developers' experience on software test code quality [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.7110141
  • Perez, Quentin, Urtado, Christelle, Vauttier, Sylvain, 2022. Dataset of Open-Source Software Developers Labeled by their Experience Level in the Project and their Associated Software Metrics. https://doi.org/10.5281/zenodo.6966195

Project characteristics

  • Munaiah, N., Kroh, S., Cabrey, C. and Nagappan, M., 2017. Curating github for engineered software projects. Empirical Software Engineering, 22(6), pp.3219-3253. project website
  • Dabic, Ozren, Aghajani, Emad, Bavota, Gabriele, 2021. GHS (GitHub Search): Sampling Projects in GitHub for MSR Studies. https://doi.org/10.5281/zenodo.4588464

Dependency networks

Datasets

Source Description
ecosyste.ms: Dependency parser for repositories An open API service to parse dependency metadata from the manifest files of many open source software ecosystems.
ecosyste.ms: Dependency resolver for packages An open API service to resolve dependency trees of packages for many open source software ecosystems.
Libraries.io Data on software package dependency relationships over time. Sourced from a number of different ecosystems.
Open Source Insights / deps.dev A Google project to develop a software dependency graph across ecosystems. Versioning and vulnerability information included.
Repology Monitors software package vintages (i.e. versioning) across a number of ecosystem repositories.
Data for Software Ecosystem Analysis (DaSEA) A continuously updated dataset of software dependencies covering various package manager ecosystems.

Package download statistics

Datasets

Source Description
PyPI Python package index download statistics. Accessible via BigQuery dataset or simple interface
CRAN R package download statistics. CRAN logs
RubyGems Ruby package traffic statistics. Tons of information created by Honeycomb
Julia Julia download statistics since October 2021
npm Node.js download statistics API
NuGet Historical .NET/C# download numbers. GitHub project
PECL and PEAR PHP download statistics
crates.io Rust download statistics
Clojars API Clojure package download statistics.
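As one illustration of these interfaces, here is a minimal sketch for the npm download statistics API. It assumes the point-count endpoint `https://api.npmjs.org/downloads/point/{period}/{package}` and a JSON response carrying `package` and `downloads` fields; confirm both against the npm registry documentation. The function names are hypothetical, scoped package names are not handled, and the HTTP request itself is left to the caller.

```python
import json

# Assumed base URL of npm's point-count download endpoint.
NPM_DOWNLOADS_BASE = "https://api.npmjs.org/downloads/point"


def npm_point_url(package: str, period: str = "last-week") -> str:
    """Build the URL for a package's total downloads over a period."""
    return f"{NPM_DOWNLOADS_BASE}/{period}/{package}"


def parse_point_response(body: str) -> tuple[str, int]:
    """Extract (package, downloads) from the JSON body of a point-count response."""
    data = json.loads(body)
    return data["package"], int(data["downloads"])
```

For example, `npm_point_url("express")` gives the URL to request, and the response text can be passed straight to `parse_point_response`.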

Security

Datasets

Source Description
CVE Program CVE list bulk downloads available in CVEProject/cvelistV5
NIST National Vulnerability Database A Common Vulnerabilities and Exposures (CVE) database. Timeframe: October 1988 - present.
CISA Known Exploited Vulnerabilities Catalog CISA-maintained database of vulnerabilities known to be exploited in the wild. Each KEV is linked to a CVE
GitHub Advisory Database A database of CVEs and security issues affecting GitHub packages. Drawn from a variety of sources and recorded using the Open Source Vulnerability (OSV) format. Timeframe: October 2017 - present.
Open Source Vulnerability (OSV) Database Draws from a variety of sources across ecosystems. GCS bucket: https://osv-vulnerabilities.storage.googleapis.com/. Note: encompasses GitHub Advisory Database
CVEfixes Dataset Bhandari, Guru, Naseer, Amara, Moonen, Leon, 2021. CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software. DOI
GitHub Issue Dataset Anas Nadeem, 2021. GitHub Issue Dataset From Top Repositories of Top Languages. DOI
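Besides the bulk GCS bucket, the OSV database can be queried per package. The sketch below assumes the query endpoint `https://api.osv.dev/v1/query`, which accepts a POSTed JSON body naming a package, ecosystem, and optional version, and returns matches under a `vulns` key (absent when there are none); verify the details against the OSV API documentation. The helper names are hypothetical, and sending the request is left to the caller.

```python
import json
from typing import Optional

# Assumed OSV query endpoint; the payload below is sent as an HTTP POST body.
OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def osv_query_payload(name: str, ecosystem: str, version: Optional[str] = None) -> bytes:
    """Serialize a query for vulnerabilities affecting one package (optionally one version)."""
    query = {"package": {"name": name, "ecosystem": ecosystem}}
    if version is not None:
        query["version"] = version
    return json.dumps(query).encode("utf-8")


def vulnerability_ids(response_body: str) -> list[str]:
    """List vulnerability IDs in a query response; an empty object means no matches."""
    return [vuln["id"] for vuln in json.loads(response_body).get("vulns", [])]
```

A call such as `osv_query_payload("jinja2", "PyPI", "2.4.1")` produces the request body, and `vulnerability_ids` reduces the response to a list of identifiers that can be joined against the other sources in this table.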

Community and project health

Datasets

Source Description
CHAOSS Linux Foundation project to establish OSS community health metrics. Metric definitions
OpenSSF Best Practices Badge Program Listing of projects, high-level project statistics, high-level criteria statistics
OpenSSF Criticality Scores An effort by OpenSSF Securing Critical Projects WG. Algorithm: "Quantifying Criticality" by Rob Pike and data repository

Tools

Source Description
isitmaintained.com Quick status checks for public GitHub repositories (e.g. median issue resolution time, percentage of open issues).
GitWhois High-level glance into GitHub repositories

Research

  • Goggins, S., Lumbard, K. and Germonprez, M., 2021, May. Open source community health: Analytical metrics and their corresponding narratives. In 2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal) (pp. 25-33). IEEE.

Community discourse

Source Description
StackExchange Public Q&A data across the StackExchange network. SO's Data Explorer and latest data dump hosted by Internet Archive. Older vintages can be tracked down.
Linux Kernel Mailing List The Linux kernel mailing list.
Apache Mail Archives Mailing list archives for Apache projects
GNU Mail Archives Mailing lists used by various GNU projects
Python Mailing Lists Mailing lists used by various Python projects
The Mail Archive Catalogs a number of public mailing lists for collaborative projects. FAQ
Mailing list ARChives

Valuation

Research

  • Blind, K., Böhm, M., Grzegorzewska, P., Katz, A., Muto, S., Pätsch, S. and Schubert, T., 2021. The impact of Open Source Software and Hardware on technological independence, competitiveness and innovation in the EU economy. Final Study Report. European Commission, Brussels. doi: 10.2759/430161.
  • Bayoán Santiago Calderón, Robbins, Guci, Korkmaz, and Kramer. 2022. Measuring the Cost of Open-Source Software Innovation on GitHub. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-05-26. https://doi.org/10.3886/E158823V2
  • Wright, N.L., Nagle, F. and Greenstein, S., 2023. Open source software and global entrepreneurship. Research Policy, 52(9), p.104846.
  • Hoffmann, Manuel and Nagle, Frank and Zhou, Yanuo. 2024. The Value of Open Source Software. Harvard Business School Strategy Unit Working Paper No. 24-038. http://dx.doi.org/10.2139/ssrn.4693148. Available at SSRN: https://ssrn.com/abstract=4693148

Funding

Datasets

Source Description
GitHub Sponsors List of dependencies for projects owned by the currently authenticated user (i.e. you). CSV export available.
jonathimer/awesome-oss-investors List of VCs investing in commercial open-source startups
Kivach "cascading funding": donations to a project redistributed upstream
Ko-fi
Liberapay
Open Collective transparent budgeting
oss.fund Aggregator for OSS funding opportunities, programs, and platforms
StackAid Donations redistributed evenly across a project's dependencies. Provides a helpful simulation of funding allocation
Secure Open Source Rewards (sos.dev)
ralphtheninja/open-funding Guide to OSS funding options

Research

  • Boysel, S., Nagle, F., Carter, H., Hermansen, A., Crosby, K., Luszcz, J., Lincoln, S., Yue, D., Hoffmann, M., Staub, A., 2024. Open Source Software Funding Report. https://opensourcefundingsurvey2024.com/
  • Conti, A., Peukert, C. and Roche, M., 2025. Beefing IT Up for Your Investor? Engagement with Open Source Communities, Innovation, and Startup Funding: Evidence from GitHub. Organization Science.

Surveys

Source code analysis

Datasets

Source Description
Software Heritage Historical archive of source code
NIST National Software Reference Library A collection of hashes and metadata to uniquely identify individual files across a set of software projects. Forensic use cases include identifying software, or malicious elements, based solely on file contents.

Bounty platforms

Source Description
IssueHunt A platform for funding open source projects.
Bountysource A platform for funding open source projects.
boss.dev A platform for funding open source projects.

Public policy

General web archives

Source Description
Common Crawl Raw page data, metadata, and extracted text from publicly accessible segments of the internet. Timeframe: 2008 - present, monthly since March 2014. Data hosted on Amazon S3: getting started docs
Internet Archive Less systematic crawls with a longer history. Access via the Wayback Machine or its API
Archive Team A group of volunteers that archives web pages and other content. Data is available via the Wayback Machine or its API
Wikipedia data dumps and SQL access
Wikidata A free and open knowledge base that can be read and edited by both humans and machines. Data is available via the Wikidata Query Service or its API
Wikimedia Commons A free media repository. Data is available via the Commons API

Other Resources
