PubMed XML File Processor

This repository contains programs to process PubMed XML files with the naming pattern pubmed_x0_to_x1.xml, where x0 and x1 are generated from the range range(0, 3600, 100).

Files

1. `read_pubmed_xml_simple.py` (Basic file checking)

A simple program that:

Generates file names based on the pattern pubmed_x0_to_x1.xml where x0 ranges from 0 to 3600 in steps of 100
Checks which files exist in the current directory
Counts the number of articles in each XML file
Provides a summary of all files and total article count
Saves results to xml_file_summary.txt

2. `extract_article_ids.py` (ArticleIdList extraction - NEW)

A program that extracts ArticleIdList information from XML files:

Processes all XML files with the pattern pubmed_x0_to_x1.xml
Extracts ArticleIdList blocks from each article
Creates a CSV file with one row per ArticleIdList entry
Columns include: pmid, file_source, pubmed, mid, pmc, doi, pii, pmcid, and other ID types
Saves results to article_ids.csv

3. `read_pubmed_xml.py` (Comprehensive version)

A more comprehensive program that:

Finds all XML files matching the pattern pubmed_*_to_*.xml
Parses each XML file and extracts detailed article information including:
- PMID (PubMed ID)
- Article title
- Journal name
- Publication date
- Authors
- DOI
- Abstract
Provides detailed summaries and sample articles
Saves comprehensive results to pubmed_summary.txt

Usage

Basic Usage (File checking)

python3 read_pubmed_xml_simple.py

This will:

Check for 36 files: pubmed_0_to_99.xml through pubmed_3500_to_3599.xml
Display which files exist and how many articles each contains
Save a summary to xml_file_summary.txt

ArticleIdList Extraction (Recommended for ID analysis)

python3 extract_article_ids.py

This will:

Process all 36 XML files
Extract ArticleIdList information from each article
Create a CSV file with 3,590 rows (one per ArticleIdList)
Columns include all ID types found: pubmed, mid, pmc, doi, pii, pmcid, etc.
Save results to article_ids.csv

Comprehensive Usage

python3 read_pubmed_xml.py

This will:

Process all XML files in the directory
Extract detailed article information
Display sample articles
Save comprehensive results to pubmed_summary.txt

Expected File Pattern

The programs expect XML files with the following naming pattern:

pubmed_0_to_99.xml (articles 0-99)
pubmed_100_to_199.xml (articles 100-199)
pubmed_200_to_299.xml (articles 200-299)
...
pubmed_3500_to_3599.xml (articles 3500-3599)

Output Files

`xml_file_summary.txt` (from simple version)

Contains:

Number of files found vs missing
Total article count
List of existing files with article counts

`article_ids.csv` (from extract_article_ids.py) - NEW

Contains:

One row per ArticleIdList entry (3,590 total rows)
Columns: pmid, file_source, pubmed, mid, pmc, doi, pii, pmcid, other_medline, other_sici
All ID types found in the ArticleIdList blocks
File size: ~333KB

`pubmed_summary.txt` (from comprehensive version)

Contains:

Detailed information about each article
PMID, title, journal, authors, abstract, etc.
Complete article data for analysis

ArticleIdList CSV Structure

The article_ids.csv file contains the following columns:

Column	Description	Example
pmid	PubMed ID from the article	19622817
file_source	Source XML file	pubmed_0_to_99.xml
pubmed	PubMed ID from ArticleIdList	19622817
mid	Manuscript ID	NIHMS146886
pmc	PubMed Central ID	PMC2763631
doi	Digital Object Identifier	10.1001/jama.2009.1064
pii	Publisher Item Identifier	302/4/385
pmcid	PubMed Central ID (alternative)	PMC123456
other_medline	Other MEDLINE IDs	(if present)
other_sici	SICI identifiers	(if present)

ID Type Distribution (from actual data)

PUBMED: 3,590 entries (100%)
DOI: 3,565 entries (99.3%)
PII: 3,024 entries (84.2%)
PMC: 2,199 entries (61.2%)
MID: 945 entries (26.3%)
Other types: 10 entries (0.3%)

Requirements

Python 3.x
Standard library modules: xml.etree.ElementTree, os, csv, typing

Example Output

PubMed ArticleIdList Extractor
========================================
Extracting ArticleIdList information from XML files
with pattern: pubmed_x0_to_x1.xml
where x0 ranges from 0 to 3600 in steps of 100
Generated 36 file names to check
Processing: pubmed_0_to_99.xml
  - Found 91 ArticleIdList entries
...

Total files processed: 36
Total ArticleIdList entries found: 3590

================================================================================
ARTICLE ID LIST SUMMARY
================================================================================

Total ArticleIdList entries: 3590

ID Type Distribution:
  PUBMED: 3590
  MID: 945
  PMC: 2199
  DOI: 3565
  PII: 3024
  OTHER: 10

ArticleIdList data saved to: article_ids.csv
CSV contains 3590 rows and 10 columns

Notes

The first file (pubmed_0_to_99.xml) contains 91 articles instead of 100, which is normal
One file (pubmed_2300_to_2399.xml) contains 99 articles instead of 100
All other files contain exactly 100 articles each
Total articles across all files: 3,590
The ArticleIdList CSV contains exactly 3,590 rows (one per article)
Most articles have multiple ID types (PubMed ID, DOI, PII, etc.)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
alzforum_recommends_pmid_list_20250617.csv		alzforum_recommends_pmid_list_20250617.csv
article_ids.csv		article_ids.csv
extract_article_ids.py		extract_article_ids.py
fetch_pubmed_simple2.py		fetch_pubmed_simple2.py
pubmed_0_to_99.xml		pubmed_0_to_99.xml
pubmed_1000_to_1099.xml		pubmed_1000_to_1099.xml
pubmed_100_to_199.xml		pubmed_100_to_199.xml
pubmed_1100_to_1199.xml		pubmed_1100_to_1199.xml
pubmed_1200_to_1299.xml		pubmed_1200_to_1299.xml
pubmed_1300_to_1399.xml		pubmed_1300_to_1399.xml
pubmed_1400_to_1499.xml		pubmed_1400_to_1499.xml
pubmed_1500_to_1599.xml		pubmed_1500_to_1599.xml
pubmed_1600_to_1699.xml		pubmed_1600_to_1699.xml
pubmed_1700_to_1799.xml		pubmed_1700_to_1799.xml
pubmed_1800_to_1899.xml		pubmed_1800_to_1899.xml
pubmed_1900_to_1999.xml		pubmed_1900_to_1999.xml
pubmed_2000_to_2099.xml		pubmed_2000_to_2099.xml
pubmed_200_to_299.xml		pubmed_200_to_299.xml
pubmed_2100_to_2199.xml		pubmed_2100_to_2199.xml
pubmed_2200_to_2299.xml		pubmed_2200_to_2299.xml
pubmed_2300_to_2399.xml		pubmed_2300_to_2399.xml
pubmed_2400_to_2499.xml		pubmed_2400_to_2499.xml
pubmed_2500_to_2599.xml		pubmed_2500_to_2599.xml
pubmed_2600_to_2699.xml		pubmed_2600_to_2699.xml
pubmed_2700_to_2799.xml		pubmed_2700_to_2799.xml
pubmed_2800_to_2899.xml		pubmed_2800_to_2899.xml
pubmed_2900_to_2999.xml		pubmed_2900_to_2999.xml
pubmed_3000_to_3099.xml		pubmed_3000_to_3099.xml
pubmed_300_to_399.xml		pubmed_300_to_399.xml
pubmed_3100_to_3199.xml		pubmed_3100_to_3199.xml
pubmed_3200_to_3299.xml		pubmed_3200_to_3299.xml
pubmed_3300_to_3399.xml		pubmed_3300_to_3399.xml
pubmed_3400_to_3499.xml		pubmed_3400_to_3499.xml
pubmed_3500_to_3599.xml		pubmed_3500_to_3599.xml
pubmed_3600_to_3699.xml		pubmed_3600_to_3699.xml
pubmed_400_to_499.xml		pubmed_400_to_499.xml
pubmed_500_to_599.xml		pubmed_500_to_599.xml
pubmed_600_to_699.xml		pubmed_600_to_699.xml
pubmed_700_to_799.xml		pubmed_700_to_799.xml
pubmed_800_to_899.xml		pubmed_800_to_899.xml
pubmed_900_to_999.xml		pubmed_900_to_999.xml
pubmed_summary.txt		pubmed_summary.txt
read_pubmed_xml.py		read_pubmed_xml.py
read_pubmed_xml_simple.py		read_pubmed_xml_simple.py
sort_and_deduplicate_csv.py		sort_and_deduplicate_csv.py
sorted_article_ids.csv		sorted_article_ids.csv
xml_file_summary.txt		xml_file_summary.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PubMed XML File Processor

Files

1. `read_pubmed_xml_simple.py` (Basic file checking)

2. `extract_article_ids.py` (ArticleIdList extraction - NEW)

3. `read_pubmed_xml.py` (Comprehensive version)

Usage

Basic Usage (File checking)

ArticleIdList Extraction (Recommended for ID analysis)

Comprehensive Usage

Expected File Pattern

Output Files

`xml_file_summary.txt` (from simple version)

`article_ids.csv` (from extract_article_ids.py) - NEW

`pubmed_summary.txt` (from comprehensive version)

ArticleIdList CSV Structure

ID Type Distribution (from actual data)

Requirements

Example Output

Notes

About

Uh oh!

Releases

Packages

Languages

License

elbert5770/Alzheimer-s_Disease_PubMed_Data

Folders and files

Latest commit

History

Repository files navigation

PubMed XML File Processor

Files

1. read_pubmed_xml_simple.py (Basic file checking)

2. extract_article_ids.py (ArticleIdList extraction - NEW)

3. read_pubmed_xml.py (Comprehensive version)

Usage

Basic Usage (File checking)

ArticleIdList Extraction (Recommended for ID analysis)

Comprehensive Usage

Expected File Pattern

Output Files

xml_file_summary.txt (from simple version)

article_ids.csv (from extract_article_ids.py) - NEW

pubmed_summary.txt (from comprehensive version)

ArticleIdList CSV Structure

ID Type Distribution (from actual data)

Requirements

Example Output

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. `read_pubmed_xml_simple.py` (Basic file checking)

2. `extract_article_ids.py` (ArticleIdList extraction - NEW)

3. `read_pubmed_xml.py` (Comprehensive version)

`xml_file_summary.txt` (from simple version)

`article_ids.csv` (from extract_article_ids.py) - NEW

`pubmed_summary.txt` (from comprehensive version)

Packages