This script converts the yearly ORCID summaries XML dump (ORCID_YYYY_MM_summaries.tar.gz) into CSV. The activities files are not processed.
Further information about the dump is available in the Bulk data integration guide. The 2022 file is missing from the list there, but it is available in the FAQ section How do I get the public data file.
The XML files use the same structure as the ORCID API, so the XSD files available on ORCID GitHub are the primary reference for the data structure.
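Since the dump is a gzip-compressed tar archive, its XML files can be read one at a time without unpacking everything to disk. The following is a minimal sketch of that idea using only the standard library; the function name `iter_xml_members` is hypothetical and is not taken from the script.

```python
import tarfile
from typing import Iterator, Tuple


def iter_xml_members(archive_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (member name, XML text) for every .xml file in a tar.gz archive."""
    # Open the gzip-compressed tar and walk its members one at a time.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not an XML file.
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Assumption: the XML files are UTF-8 encoded.
            yield member.name, handle.read().decode("utf-8")
```

Because the function is a generator, only one XML file is held in memory at a time, which matters for a dump of this size.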
The script relies on XSLT transformations of the XML files (find more about XSLT on W3S): each input XML file is converted into a string with delimiters and then turned into a 'normalized' CSV file. The script generates one CSV file per XSLT file as a way to deal with the nested structure of the XML.
- line separator: $end_line$
- column separator: ¬
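As an illustration of how a delimited string like this can be turned into CSV, here is a minimal sketch. The function names `delimited_to_rows` and `rows_to_csv` are hypothetical, and the actual script may handle whitespace and quoting differently.

```python
import csv
import io

LINE_SEP = "$end_line$"  # row delimiter emitted by the XSLT
COL_SEP = "¬"            # column delimiter emitted by the XSLT


def delimited_to_rows(text):
    """Split the delimited string into a list of rows (lists of cell strings)."""
    rows = []
    for line in text.split(LINE_SEP):
        # Skip the empty remainder after the final row delimiter.
        if not line.strip():
            continue
        # Cells are kept as-is; no trimming is applied in this sketch.
        rows.append(line.split(COL_SEP))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; the csv module quotes cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Using unusual delimiters in the intermediate string means commas, quotes and newlines inside values do not break the row structure; proper CSV quoting is applied only in the final step.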
To speed up the process, XML files are first concatenated in groups whose size is defined by chunk_size. In tests, the best chunk size was between 10 and 25, and chunk_size=15 is used in the script. This approach reduces the runtime by 50% compared to transforming each XML file individually. Concatenation is done with string operations on the source XML files, and a new tag records is introduced to wrap the list of ORCID record:record elements.
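The concatenation step can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the helper names `chunked` and `concat_records` are hypothetical, and the sketch assumes that each source file starts with an optional XML declaration and that every record:record element declares its own namespaces.

```python
import re
from itertools import islice

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def chunked(items, chunk_size=15):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def concat_records(xml_documents):
    """Join several XML documents under a single <records> root element."""
    # A document may contain only one XML declaration, so strip it from each part.
    bodies = [XML_DECLARATION.sub("", doc, count=1) for doc in xml_documents]
    return "<records>" + "".join(bodies) + "</records>"
```

Each chunk then needs only one XSLT run instead of one run per file, which is where the time saving comes from.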
All XSLT files are in the xslt folder. If you want to experiment, you can use a web XSLT editor, for example .NET XSLT Fiddle, together with an XML file from the dump.
For this experimentation, it is recommended to use a newline character so that different rows are visually separated in the XSLT Fiddle.
Replace $end_line$ in the XSLT:
<xsl:text>$end_line$</xsl:text> <!-- newline character -->
with the newline character reference &#xa; to separate rows:
<xsl:text>&#xa;</xsl:text> <!-- newline character -->
It is recommended to create a virtual environment:
python3 -m venv .venv
Install packages
pip3 install -r requirements.txt
The script assumes that the 2022 summaries dump file ORCID_2022_10_summaries.tar.gz is stored in the same folder as the script and that the default output folder is data. Both can be changed with arguments.
If you need to download the file first, run the download script, which fetches the 2022 dump into the script folder:
python3 download.py
- Because of the dump size, the script runs for a long time (20+ hours).
- Make sure you have enough free disk space (50 GB).
Run the script using default settings
python3 dump_to_csv.py
Run the script with file path and output directory arguments
python3 dump_to_csv.py --file ORCID_2022_10_summaries.tar.gz --outdir csv
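For orientation, command-line options like these are typically declared with argparse. The sketch below is an assumption about how that could look, using the option names and defaults described above; the function name `build_parser` and the help texts are hypothetical, and the parser in dump_to_csv.py may differ.

```python
import argparse


def build_parser():
    """Build a parser with the two options described in this README."""
    parser = argparse.ArgumentParser(
        description="Convert an ORCID summaries dump to CSV."
    )
    # Path to the summaries archive; defaults to the 2022 dump in the script folder.
    parser.add_argument(
        "--file",
        default="ORCID_2022_10_summaries.tar.gz",
        help="path to the ORCID summaries tar.gz file",
    )
    # Folder where the generated CSV files are written.
    parser.add_argument(
        "--outdir",
        default="data",
        help="output folder for the CSV files",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.outdir)
```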