DLR-SC/GitLab-Corpus

GitLab-Corpus

This tool creates a corpus of the accessible repositories in a GitLab instance. The corpus primarily contains information about software projects.

Relevant information includes, for example:

  • number of authors or commits
  • merge requests
  • programming languages used
  • CI usage

The output corpus is written in JSON format, which is widely used and compatible with Neo4j.

Requirements

  • Git client >= 2.1.0
  • Python >= 3.12 with pip and venv
  • Optionally, a modern package manager (uv (recommended), poetry, or similar)

Running the corpus CLI tool

If you use uv, all you need to do is clone this repository and run the corpus command:

git clone <URL of this Git repository> corpus
cd corpus
uv run corpus

Otherwise, you need to install the dependencies and package first:

git clone <URL of this Git repository> corpus
cd corpus
python -m venv .venv  # Create a virtual environment
source .venv/bin/activate  # Activate the environment
pip install .  # Install the package and the dependencies declared in pyproject.toml
corpus  # Run the corpus CLI, should display a help message

Usage

  1. Create a configuration file in resources/gitlab.cfg with information about the GitLab instance you want to run this tool on:
[global]
# Sets the default GitLab instance
default = gitlab-1
# Whether SSL certificates should be validated.
# If the value is a string, it is the path to a CA file used for certificate validation.
ssl_verify = true
# Timeout for API requests
timeout = 15

# A GitLab instance
[gitlab-1]
# The instance's base URL
url = https://gitlab.example.com
# A user private token to authenticate with the GitLab API,
# needs at least `read_api` privileges!
private_token = 123abc
# The version of the GitLab API to use (the python-gitlab package supports '4' only) 
api_version = 4
  2. Run the corpus tool:
Usage: corpus [OPTIONS] COMMAND [ARGS]...

  Entry point to the corpus cli.

Options:
  -g, --gl-config TEXT     Path to the GitLab config file  [default: resources/gitlab.cfg]
  -n, --neo4j-config TEXT  Path to the Neo4J config file  [default: resources/neo4j.cfg]
  -s, --source TEXT        Name of the GitLab instance, you want to analyze, if not the default value of your configuration
  -v, --verbose BOOLEAN    Prints more output during execution
  --help                   Show this message and exit.

Commands:
  build    Run the pipeline extract -> filter -> export in one command.
  export   Export a previously extracted (and maybe filtered) corpus to...
  extract  Extract projects from the specified GitLab instance and write...
  filter   Apply filters on a previously extracted corpus.
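
Before running the tool, it can help to check that the configuration file from step 1 is well-formed. The following is a minimal sketch using only Python's standard `configparser` module; it is not part of the corpus tool. The function name `load_instance`, the `REQUIRED_KEYS` tuple, and the choice of which keys to treat as required are assumptions made for illustration.

```python
import configparser

# Keys this sketch treats as required for an instance section (an assumption,
# based on the example configuration above).
REQUIRED_KEYS = ("url", "private_token", "api_version")

SAMPLE = """
[global]
default = gitlab-1
ssl_verify = true
timeout = 15

[gitlab-1]
url = https://gitlab.example.com
private_token = 123abc
api_version = 4
"""


def load_instance(text, source=None):
    """Return the settings of one GitLab instance as a dict of strings."""
    # Disable interpolation so '%' characters in tokens or URLs are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Without an explicit name, fall back to the default set in [global].
    name = source or parser.get("global", "default")
    if not parser.has_section(name):
        raise KeyError(f"no section [{name}] in configuration")
    missing = [key for key in REQUIRED_KEYS if not parser.has_option(name, key)]
    if missing:
        raise KeyError(f"section [{name}] lacks: {', '.join(missing)}")
    return dict(parser[name])


if __name__ == "__main__":
    settings = load_instance(SAMPLE)
    print(settings["url"])
```

To check a real file, read its contents first, for example `load_instance(open("resources/gitlab.cfg").read())`. All values come back as strings, so `api_version` is `"4"` rather than the integer `4`.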

Documentation

The documentation is available in the docs directory.
