pavelzbornik/orcid-dump-csv

ORCID dump to CSV

This script converts the yearly ORCID summaries XML dump (ORCID_YYYY_MM_summaries.tar.gz) into CSV. Activities files are not processed.

Further information about the dump is available in the Bulk data integration guide. The 2022 file is missing from the list there, but it is available in the FAQ section How do I get the public data file.

XSD documentation of the XML files in the dump

The XML files use the same structure as the ORCID API, so the XSD files available on the ORCID GitHub are the primary reference.

How the script works

The script relies on XSLT transformations of the XML files (find more about XSLT on W3S). Each input XML file is converted into a delimited string, which is then turned into a 'normalized' CSV file. The script generates one CSV file per XSLT file as a way to deal with the nested structure of the XML.

  • line separator: $end_line$
  • column separator: ¬
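
As a rough illustration of the second step, the delimited string can be split on the two separators and written out with Python's csv module. This is a minimal sketch, not the code from dump_to_csv.py: the helper names delimited_to_rows and rows_to_csv are hypothetical, and the sample identifiers and names are made-up placeholders.

```python
import csv
import io

LINE_SEP = "$end_line$"  # row separator emitted by the XSLT templates
COL_SEP = "¬"            # column separator emitted by the XSLT templates


def delimited_to_rows(text):
    """Split XSLT output into rows of column values, skipping empty rows.

    Whitespace around each row is stripped (hypothetical helper).
    """
    rows = []
    for raw in text.split(LINE_SEP):
        raw = raw.strip()
        if raw:
            rows.append(raw.split(COL_SEP))
    return rows


def rows_to_csv(rows):
    """Serialize rows with the csv module so that quoting is handled for us."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Placeholder data, not taken from the dump.
sample = "0000-0001-2345-6789¬Doe, Jane$end_line$0000-0002-3456-7890¬Roe$end_line$"
csv_text = rows_to_csv(delimited_to_rows(sample))
print(csv_text)
```

Using an uncommon character such as ¬ as the column separator keeps commas and quotes inside the values intact until the csv module applies proper quoting.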

Speed improvements

To speed up the process, XML files are first concatenated in groups whose size is defined by chunk_size. In tests, the best chunk size was between 10 and 25, and chunk_size=15 is used in the script. This approach reduces the runtime by 50% compared to transforming each XML file individually. Concatenation is done with string operations on the source XML files, and a new wrapping tag records is introduced to hold the list of ORCID record:record elements.
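
The concatenation idea can be sketched as follows. This is an assumption-laden illustration rather than the script's own code: the helpers chunked and concatenate are hypothetical, the XML declaration is removed with a simple regular expression, and the sample documents and their namespace URI are placeholders.

```python
import re
import xml.etree.ElementTree as ET

CHUNK_SIZE = 15  # value the README reports using

# Matches a leading XML declaration such as <?xml version="1.0" ...?>
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def concatenate(xml_docs):
    """Wrap several record documents in a single <records> root element.

    Each document's own XML declaration is dropped, because a declaration
    is only allowed at the very start of the combined document.
    """
    bodies = [XML_DECL.sub("", doc, count=1) for doc in xml_docs]
    header = '<?xml version="1.0" encoding="UTF-8"?>\n<records>\n'
    return header + "\n".join(bodies) + "\n</records>"


# Placeholder documents standing in for files from the dump.
docs = [
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<record:record xmlns:record="http://www.orcid.org/ns/record" n="%d"/>' % i
    for i in range(32)
]
chunks = list(chunked(docs, CHUNK_SIZE))
combined = concatenate(chunks[0])
root = ET.fromstring(combined.encode("utf-8"))  # parses, so the result is well-formed
print(len(chunks), root.tag, len(root))
```

One transformation then processes a whole chunk, which is where the reported runtime saving would come from.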

XSLT tinkering

All XSLT files are in the xslt folder. If you want to experiment, you can use a web XSLT editor, for example .NET XSLT Fiddle, together with an XML file from the dump.

For this kind of experimentation, it is recommended to use a newline character so that rows are visually separated in the XSLT Fiddle.

Replace $end_line$ in the XSLT

<xsl:text>$end_line$</xsl:text> <!-- line separator -->

with the newline character &#10; so that each row is printed on its own line

<xsl:text>&#10;</xsl:text> <!-- newline character -->

Running the script

1. Create a virtual environment and install packages

Creating a virtual environment is recommended

python3 -m venv .venv

Install packages

pip3 install -r requirements.txt

2. Download the dump

The script assumes that the 2022 summaries dump file ORCID_2022_10_summaries.tar.gz is stored in the same folder as the script, and it writes to the default output folder data. Both can be changed with arguments.
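
For orientation, a .tar.gz archive of this kind can be read member by member without unpacking it to disk. The following is a minimal sketch using the standard tarfile module. It is not necessarily how dump_to_csv.py reads the dump, and the helper name iter_xml_members is hypothetical.

```python
import tarfile


def iter_xml_members(tar_path):
    """Yield (name, text) for every .xml member without unpacking to disk."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not an XML file.
            if member.isfile() and member.name.endswith(".xml"):
                handle = archive.extractfile(member)
                yield member.name, handle.read().decode("utf-8")
```

A caller would loop over iter_xml_members("ORCID_2022_10_summaries.tar.gz") and pass each text on to the transformation step.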

If you need to download the file first, run the download script, which fetches the 2022 dump into the script folder

python3 download.py

3. Run the script

Notice

  • Because of the dump size, the script runs for a long time (20+ hours)
  • Make sure you have enough free disk space (50 GB)
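
Before starting a run of that length, it may be worth checking the free space up front. A minimal sketch with the standard library follows. The helper has_enough_space is hypothetical and not part of the repository, and the 50 GB threshold is taken from the notice above.

```python
import shutil

REQUIRED_GB = 50  # figure quoted in the notice above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has the required free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if not has_enough_space("."):
    print("Warning: less than %d GB free in the current folder" % REQUIRED_GB)
```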

Run the script using default settings

python3 dump_to_csv.py

Run the script with custom file path and output directory arguments

python3 dump_to_csv.py --file ORCID_2022_10_summaries.tar.gz --outdir csv
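
The two options could be declared with argparse roughly as below. This is a sketch under assumptions rather than the actual source of dump_to_csv.py: only the flag names and default values come from this README, while build_parser and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the two options described in this README."""
    parser = argparse.ArgumentParser(
        description="Convert an ORCID summaries dump to CSV"
    )
    parser.add_argument(
        "--file",
        default="ORCID_2022_10_summaries.tar.gz",
        help="path to the summaries dump archive",
    )
    parser.add_argument(
        "--outdir",
        default="data",
        help="folder where the CSV files are written",
    )
    return parser


# Equivalent to the command line shown above.
args = build_parser().parse_args(
    ["--file", "ORCID_2022_10_summaries.tar.gz", "--outdir", "csv"]
)
print(args.file, args.outdir)
```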
