borb-pdf-corpus

This repository contains a curated corpus of PDF documents and their extracted content, organized to support document analysis, processing, and duplicate-detection workflows. Each PDF (pdf/) is accompanied by its full text (txt/), a first-page extract in both PDF and text form (first-page-pdf/ and first-page-txt/), and a SHA-256 digest (digest/) for efficient duplicate checks.
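As an illustration of how a SHA-256 digest can support duplicate checks, here is a minimal sketch. It is not taken from this repository's scripts. The function names `sha256_of_file` and `is_duplicate` are hypothetical, and the layout it assumes for digest/ (one text file per PDF, holding a hex digest) is also an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large PDFs are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_duplicate(candidate: Path, digest_dir: Path) -> bool:
    """Return True if the candidate's digest matches a stored digest.

    Assumption: digest_dir holds one text file per PDF, each containing
    a hex-encoded SHA-256 digest. The real layout may differ.
    """
    known = {
        entry.read_text().strip()
        for entry in digest_dir.iterdir()
        if entry.is_file()
    }
    return sha256_of_file(candidate) in known
```

Comparing digests avoids byte-by-byte comparison of every pair of PDFs: a new file is hashed once and then looked up in a set of known digests.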

---
config:
theme: default
---
graph TD
pdf
pdf --> txt
pdf --> digest
pdf --> first-page
first-page --> first-page-pdf
first-page --> first-page-txt

%% Define classes
classDef gray fill:#ccc,stroke:#999,stroke-width:1px;
classDef highlight fill:#F1CD2E,stroke:#999,stroke-width:2px;

%% Assign classes
class pdf highlight;
class txt,digest,first-page,first-page-pdf,first-page-txt gray;

The repository also includes automated metrics to help understand the overall structure, size, and temporal distribution of the documents.

1. File Size

Property	Value
Smallest PDF	2.00 KB
Average PDF	1.48 MB
Largest PDF	55.19 MB

2. Creation Year

Property	Value
Newest PDF	2025
Average PDF	2015
Oldest PDF	1999

3. Word Count

Property	Value (words)
Largest PDF	346,574
Average PDF	6,682
Smallest PDF	14
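Summary metrics like those above could be derived along the following lines. This is a sketch for the file-size table only, and it is not the repository's own code. The helper names `human_size` and `size_summary` are hypothetical, and the use of 1024-based units is an assumption.

```python
from pathlib import Path
from statistics import mean


def human_size(num_bytes: float) -> str:
    """Format a byte count with two decimals (assumes 1024-based units)."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024


def size_summary(pdf_dir: Path) -> dict[str, str]:
    """Return smallest, average, and largest size of the PDFs in pdf_dir."""
    sizes = [path.stat().st_size for path in pdf_dir.glob("*.pdf")]
    if not sizes:
        raise ValueError(f"no PDF files found in {pdf_dir}")
    return {
        "Smallest PDF": human_size(min(sizes)),
        "Average PDF": human_size(mean(sizes)),
        "Largest PDF": human_size(max(sizes)),
    }
```

Calling `size_summary(Path("pdf"))` from the repository root would return a dictionary with the same three rows as the file-size table. The creation-year and word-count tables would need metadata and text extraction that this sketch does not cover.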

