OpenCitations Meta contains bibliographic metadata associated with the documents involved in the citations stored in the OpenCitations infrastructure. The OpenCitations Meta Software performs several key functions:
- Data curation of provided CSV files
- Generation of RDF files compliant with the OpenCitations Data Model
- Provenance tracking and management
- Data validation and fixing utilities
An example of a raw CSV input file can be found in example.csv.
- OpenCitations Meta Software
- Meta Production Workflow
- Analysing the Dataset
- Running Tests
- Creating Releases
The Meta production process involves several steps to process bibliographic metadata. An optional but recommended preprocessing step is available to optimize the input data before the main processing.
The preprocess_input.py script helps filter and optimize CSV files before they are processed by the main Meta workflow. This preprocessing step is particularly useful for large datasets, as it:
- Removes duplicate entries across all input files
- Filters out entries that already exist in the database (using either Redis or SPARQL)
- Splits large input files into smaller, more manageable chunks
To run the preprocessing script:
# Using Redis (default)
poetry run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> [--redis-db <DB_NUMBER>]
# Using SPARQL endpoint
poetry run python -m oc_meta.run.meta.preprocess_input <INPUT_DIR> <OUTPUT_DIR> --storage-type sparql --sparql-endpoint <SPARQL_ENDPOINT_URL>
Parameters:
- <INPUT_DIR>: Directory containing the input CSV files to process
- <OUTPUT_DIR>: Directory where the filtered and optimized CSV files will be saved
- --storage-type: Type of storage to check IDs against (redis or sparql, default: redis)
- --redis-db: Redis database number to use if the storage type is redis (default: 10)
- --sparql-endpoint: SPARQL endpoint URL if the storage type is set to sparql
The script will generate a detailed report showing:
- Total number of input rows processed
- Number of duplicate rows removed
- Number of rows with IDs that already exist in the database
- Number of rows that passed the filtering and were written to output files
- Redis: Faster option for ID checking with lower memory overhead. Ideal for rapid preprocessing of large datasets.
- SPARQL: Directly checks against the triplestore where the data will be stored. Useful when you don't have a Redis cache of existing IDs.
After preprocessing, you can use the optimized files in the output directory as input for the main Meta process.
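For reference, the ID filtering described above amounts to an existence check against one of the two storage backends. The snippet below is a minimal, illustrative sketch of such a check, not the actual logic inside preprocess_input.py; the Redis key layout and the OCDM literal-reification pattern used in the ASK query are assumptions made for the example.

```python
import redis
from SPARQLWrapper import SPARQLWrapper, JSON

def id_in_redis(identifier: str, db: int = 10) -> bool:
    """Check whether an identifier is already cached in Redis.
    Assumes identifiers are stored as plain keys (illustrative only)."""
    r = redis.Redis(host="localhost", port=6379, db=db)
    return r.exists(identifier) > 0

def id_in_triplestore(identifier: str, endpoint: str) -> bool:
    """Check whether an identifier literal already exists in the triplestore.
    The literal-reification predicate is an assumption for the example."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
        ASK {{ ?id literal:hasLiteralValue "{identifier}" }}
    """)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

if __name__ == "__main__":
    print(id_in_redis("doi:10.1162/qss_a_00292"))
    print(id_in_triplestore("doi:10.1162/qss_a_00292",
                            "http://127.0.0.1:8805/sparql"))
```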
The main Meta processing is executed through the meta_process.py file, which orchestrates the entire data processing workflow:
poetry run python -m oc_meta.run.meta_process -c <CONFIG_PATH>
Parameters:
- -c, --config: Path to the configuration YAML file
The Meta process performs the following key operations:
1. Preparation:
   - Sets up the required directory structure
   - Initializes connections to Redis and the triplestore
   - Loads configuration settings
2. Data Curation:
   - Processes input CSV files containing bibliographic metadata
   - Validates and normalizes the data
   - Handles duplicate entries and invalid data
3. RDF Creation:
   - Converts the curated data into RDF format following the OpenCitations Data Model
   - Generates entity identifiers and establishes relationships
   - Creates provenance information for tracking data lineage
4. Storage and Triplestore Upload:
   - Directly generates SPARQL queries for triplestore updates
   - Loads RDF data directly into the configured triplestore via the SPARQL endpoint
   - Executes the necessary SPARQL updates
   - Ensures data is properly indexed for querying
The Meta process requires a YAML configuration file that specifies various settings for the processing workflow. Here's an example of the configuration structure with explanations:
# Endpoint URLs for data and provenance storage
triplestore_url: "http://127.0.0.1:8805/sparql"
provenance_triplestore_url: "http://127.0.0.1:8806/sparql"
# Base IRI for RDF entities
base_iri: "https://w3id.org/oc/meta/"
# JSON-LD context file
context_path: "https://w3id.org/oc/corpus/context.json"
# Responsible agent for provenance
resp_agent: "https://w3id.org/oc/meta/prov/pa/1"
# Source information for provenance
source: "https://api.crossref.org/"
# Redis configuration for counter handling
redis_host: "localhost"
redis_port: 6379
redis_db: 0
redis_cache_db: 1
# Processing settings
supplier_prefix: "060"
workers_number: 16
dir_split_number: 10000
items_per_file: 1000
default_dir: "_"
# Output control
generate_rdf_files: false
zip_output_rdf: true
output_rdf_dir: "/path/to/output"
# Data processing options
silencer: ["author", "editor", "publisher"]
normalize_titles: true
use_doi_api_service: false
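Before launching the process, it can help to sanity-check the configuration file. The following is a minimal sketch (not part of oc_meta itself) that loads the YAML and verifies that the keys shown above are present; the set of keys treated as required here simply mirrors the example and is an assumption, not the definitive schema.

```python
import yaml

# Keys taken from the example configuration above (assumed, not exhaustive)
REQUIRED_KEYS = {
    "triplestore_url", "provenance_triplestore_url", "base_iri",
    "context_path", "resp_agent", "redis_host", "redis_port", "redis_db",
    "supplier_prefix", "workers_number", "output_rdf_dir",
}

def check_config(path: str) -> dict:
    """Load the Meta configuration and report any missing keys."""
    with open(path, encoding="utf-8") as f:
        settings = yaml.safe_load(f)
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise ValueError(f"Missing configuration keys: {sorted(missing)}")
    return settings

if __name__ == "__main__":
    config = check_config("config.yaml")
    print(f"Configuration OK, {config['workers_number']} workers configured")
```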
Occasionally, the automatic upload process during Meta execution might fail due to connection issues, timeout errors, or other problems. In such cases, you can use the on_triplestore.py script to manually upload the generated SPARQL files to the triplestore.
poetry run python -m oc_meta.run.upload.on_triplestore <ENDPOINT_URL> <SPARQL_FOLDER> [OPTIONS]
Parameters:
- <ENDPOINT_URL>: The SPARQL endpoint URL of the triplestore
- <SPARQL_FOLDER>: Path to the folder containing the SPARQL update query files (.sparql)
Options:
- --batch_size: Number of quadruples to include in each batch (default: 10)
- --cache_file: Path to the cache file tracking processed files (default: "ts_upload_cache.json")
- --failed_file: Path to the file recording failed queries (default: "failed_queries.txt")
- --stop_file: Path to the stop file used to gracefully interrupt the process (default: ".stop_upload")
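If you need to script a one-off retry yourself (for instance, re-sending a single failed file), the SPARQL 1.1 Update protocol can also be used directly. The sketch below posts each .sparql file in a folder to the endpoint with requests; it is only an illustration of the protocol, not a replacement for on_triplestore.py and its batching, caching, and stop-file handling, and the folder path is just an example.

```python
from pathlib import Path
import requests

def upload_sparql_folder(endpoint: str, folder: str) -> None:
    """POST each SPARQL update file to the endpoint (SPARQL 1.1 Update protocol)."""
    for sparql_file in sorted(Path(folder).glob("*.sparql")):
        update = sparql_file.read_text(encoding="utf-8")
        response = requests.post(endpoint, data={"update": update}, timeout=300)
        if response.status_code >= 400:
            print(f"Failed: {sparql_file.name} ({response.status_code})")
        else:
            print(f"Uploaded: {sparql_file.name}")

if __name__ == "__main__":
    # Example values: adjust the endpoint and folder to your setup
    upload_sparql_folder("http://127.0.0.1:8805/sparql", "rdf_output/to_be_uploaded")
```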
To gather statistics on the dataset, you can use the provided analysis tools.
For most statistics, such as counting bibliographic resources (--br) or agent roles (--ar), the sparql_analyser.py script is the recommended tool. It queries the SPARQL endpoint directly.
poetry run python -m oc_meta.run.analyser.sparql_analyser <SPARQL_ENDPOINT_URL> --br --ar
Warning: Using the SPARQL analyser for venue statistics (--venues) against an OpenLink Virtuoso endpoint is not recommended. The complex query required for venue disambiguation can exhaust Virtuoso's RAM, causing it to return partial, and therefore incorrect, results. Until this query is optimized for Virtuoso, the venue count it reports will be wrong.
For reliable venue statistics, use the meta_analyser.py script to process the raw CSV output files directly.
To count the disambiguated venues, run the following command:
poetry run python -m oc_meta.run.analyser.meta_analyser -c <PATH_TO_CSV_DUMP> -w venues
The script will save the result in a file named venues_count.txt.
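As an illustration of the kind of query sparql_analyser.py runs, the sketch below counts bibliographic resources directly over the SPARQL endpoint. The fabio:Expression class follows the OpenCitations Data Model, but treat the exact query shape as an assumption rather than the analyser's internal query.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def count_bibliographic_resources(endpoint: str) -> int:
    """Count entities typed as fabio:Expression (bibliographic resources in OCDM)."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX fabio: <http://purl.org/spar/fabio/>
        SELECT (COUNT(DISTINCT ?br) AS ?count)
        WHERE { ?br a fabio:Expression }
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["count"]["value"])

if __name__ == "__main__":
    print(count_bibliographic_resources("https://opencitations.net/meta/sparql"))
```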
The test suite is automatically executed via GitHub Actions on pushes and pull requests. The workflow is defined in .github/workflows/run_tests.yml and handles the setup of the necessary services (Redis, Virtuoso) using Docker.
To run the test suite locally, follow these steps:
1. Install Dependencies: Ensure you have Poetry and Docker installed. Then, install the project dependencies:
   poetry install
2. Start Services: Use the provided script to start the required Redis and Virtuoso Docker containers:
   chmod +x test/start-test-databases.sh
   ./test/start-test-databases.sh
   Wait for the script to confirm that the services are ready. The Virtuoso SPARQL endpoint will be available at http://localhost:8805/sparql and ISQL on port 1105; Redis will be available at localhost:6379, using database 0 for some tests and database 5 for most test cases, including counter handling and caching. A sketch for verifying that both services are reachable is shown after this list.
3. Execute Tests: Run the tests using the following command, which also generates a coverage report:
   poetry run coverage run --rcfile=test/coverage/.coveragerc
   To view the coverage report in the console:
   poetry run coverage report
   To generate an HTML coverage report (saved in the htmlcov/ directory):
   poetry run coverage html -d htmlcov
4. Stop Services: Once finished, stop the Docker containers:
   chmod +x test/stop-test-databases.sh
   ./test/stop-test-databases.sh
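As referenced in step 2, the following sketch checks that both test services are reachable before running the suite. It assumes the default ports set up by start-test-databases.sh (Redis on 6379, Virtuoso SPARQL on 8805) and is only a convenience for local debugging, not part of the test suite itself.

```python
import redis
import requests

def services_ready() -> bool:
    """Ping Redis and run a trivial ASK query against the Virtuoso endpoint."""
    try:
        redis.Redis(host="localhost", port=6379, db=5).ping()
    except redis.ConnectionError:
        print("Redis is not reachable on localhost:6379")
        return False
    try:
        response = requests.get(
            "http://localhost:8805/sparql",
            params={"query": "ASK { ?s ?p ?o }",
                    "format": "application/sparql-results+json"},
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException:
        print("Virtuoso SPARQL endpoint is not reachable on localhost:8805")
        return False
    return True

if __name__ == "__main__":
    print("Services ready" if services_ready() else "Services not ready")
```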
The project uses semantic-release for versioning and publishing releases to PyPI. To create a new release:
1. Commit Changes: Make your changes and commit them with a message that includes [release] to trigger the release workflow. For details on how to structure semantic commit messages, see the Semantic Commits Guide.
2. Push to Master: Push your changes to the master branch. This will trigger the test workflow first.
3. Automatic Release Process: If the tests pass, the release workflow will:
   - Create a new semantic version based on commit messages
   - Generate a changelog
   - Create a GitHub release
   - Build and publish the package to PyPI
The release workflow is configured in .github/workflows/release.yml and is triggered automatically when:
- The commit message contains [release]
- The tests workflow completes successfully
- The changes are on the master branch
If you have used OpenCitations Meta in your research, please cite the following paper:
Arcangelo Massari, Fabio Mariani, Ivan Heibi, Silvio Peroni, David Shotton; OpenCitations Meta. Quantitative Science Studies 2024; 5 (1): 50–75. doi: https://doi.org/10.1162/qss_a_00292