Legal Natural Language Processing

🗂 Datasets

Legal Judgement Prediction (LJP)

Dataset	Links	Domain	Language	Size
FSCS (Niklaus et al., 2021)	📄 🤗 💻	Swiss court judgments	🇩🇪 🇫🇷 🇮🇹	85K cases w/ 2 outcomes
ECtHR (Chalkidis et al., 2021)	📄 🤗	European Court of Human Rights judgments	🇬🇧	11K cases w/ 11 outcomes
ECHR (Aletras et al., 2019)	📄 💾	European Court of Human Rights judgments	🇬🇧	11.5K cases w/ 11 outcomes
CAIL (Xiao et al., 2018)	📄 💻	Chinese court judgements	🇨🇳	2.6M cases w/ 6 outcomes
AnnoCaseLaw (2025)	📄 💻	US Appeals Court negligence cases	🇺🇸	471 annotated cases with expert labels
IndianBailJudgments-1200 (2025)	📄 🤗 💻	Indian court bail decisions	🇮🇳	1.2K judgments with 20+ structured attributes
CaseSumm (2025)	📄 🤗	US Supreme Court opinions	🇺🇸	25.6K opinions with official syllabuses
JUSTICE (2022)	📄 💻	US Supreme Court cases	🇺🇸	Benchmark for judgment prediction
Cambridge Law Corpus (CLC) (2023)	📄	UK court cases	🇬🇧	258K+ cases (16th century–present)
Super-SCOTUS (2025)	📄 💻	US Supreme Court decisions	🇺🇸	Decision direction and related tasks

Legal Text Classification (LTC)

Dataset	Links	Domain	Language	Size
GLC (Papaloukas et al., 2021)	📄 💻	Greek legislation	🇬🇷	47.5K laws w/ 2.7K labels
CUAD (Hendrycks et al., 2021)	📄 🤗 💻	Contracts	🇬🇧	510 contracts w/ 41 classes
MultiEURLEX (Chalkidis et al., 2021)	📄 🤗 💻	EU legislation	🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇪🇸 (18+)	65K laws w/ 4.5K labels
LEDGAR (Tuggener et al., 2020)	📄 💾	Contracts	🇬🇧	60.5K contracts w/ 12.6K labels
Contract Discovery (Borchmann et al., 2020)	📄 💻	Contracts	🇬🇧	2.6K clauses w/ 21 classes
EURLEX-57K (Chalkidis et al., 2019)	📄 💾	EU legislation	🇬🇧	57K laws w/ 4.3K labels
Unfair-ToS (Lippi et al., 2018)	📄 💾	Contracts	🇬🇧	9.4K sentences w/ 9 classes
Contract Elements (Chalkidis et al., 2017)	📄 💾	Contracts	🇬🇧	2.4K contracts w/ 10 classes
OPP-115 (Wilson et al., 2016)	📄 💾	Privacy policies	🇬🇧	115 policies w/ 23K labels
FairLex (2022)	📄 🤗 💻	Multi-jurisdictional legal texts	🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳	Fairness-focused classification datasets
Legal Case Document Summarization (Kaggle)	📄	Legal case summaries	Various	Large-scale dataset
Legal Citation Text Classification Dataset (Kaggle)	📄	General legal documents	🇬🇧	25K cases with catchphrases and citations

Legal Information Retrieval (LIR)

Dataset	Links	Domain	Language	Size
BSARD (Louis et al., 2022)	📄 🤗 💻	Belgian legislation	🇫🇷	1.1K questions w/ 22.6K candidate statutory articles
EU2UK (Chalkidis et al., 2021)	📄 💾	EU & UK legislation	🇬🇧	2K query documents w/ 52.5K candidate documents
UK2EU (Chalkidis et al., 2021)	📄 💾	EU & UK legislation	🇬🇧	2.1K query documents w/ 3.9K candidate documents
COLIEE-Case-Law-Retrieval (Rabelo et al., 2020)	📄 💾	Canadian precedents	🇬🇧	650 query cases w/ 128K candidate cases
COLIEE-Statute-Law-Retrieval (Rabelo et al., 2020)	📄 💾	Japanese legislation	🇬🇧 🇯🇵	808 questions w/ 768 candidate statutory articles
CAIL2019-SCM (Xiao et al., 2019)	📄 💻	Chinese court judgements	🇨🇳	8.9K triplets of cases
CLERC (2024)	📄 🤗 💻	Legal case retrieval	🇬🇧	Large corpus for retrieval and RAG
LEAD (2024)	📄 💻	Legal case retrieval	Various	100K+ pairs of similar legal cases
Legal IR Philippines (2024)	📄	Philippine legal documents	🇵🇭	Datasets with synthetic queries
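
The retrieval datasets above pair each query with a pool of candidates (BSARD, for example, has 1.1K questions against 22.6K candidate statutory articles). One common way to score a retriever on data of this shape is recall@k. The following is a minimal Python sketch under stated assumptions: the query ids, article ids, and rankings are made up for illustration, and the code is not taken from any listed dataset's official evaluation script.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear among the top-k ranked ids."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(rankings.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two questions with invented article ids.
gold = {"q1": {"art_12", "art_40"}, "q2": {"art_7"}}
rankings = {
    "q1": ["art_12", "art_3", "art_40", "art_9"],
    "q2": ["art_1", "art_2", "art_7"],
}

print(mean_recall_at_k(rankings, gold, k=2))  # 0.25
print(mean_recall_at_k(rankings, gold, k=3))  # 1.0
```

Queries missing from `rankings` count as zero recall here, so an incomplete run is penalised rather than silently skipped.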

Legal Question Answering (LQA)

Dataset	Links	Domain	Language	Size
CaseHOLD (Zheng et al., 2021)	📄 💻	US case holdings	🇬🇧	53.1K multiple-choice questions
JEC-QA (Zhong et al., 2019)	📄 💾	Chinese law	🇨🇳	26.3K multiple-choice questions
CJRC (Duan et al., 2019)	📄 💻	Chinese court judgements	🇨🇳	50K question-answers from 10K documents
PrivacyQA (Ravichander et al., 2019)	📄 💻	Privacy policies	🇬🇧	1.7K question-answers from 35 documents
LLeQA (2024)	📄 🤗 💻	French-Belgian statutes	🇫🇷	1,868 expert-annotated long-form QA
IndicLegalQA (2025)	📄	Indian Supreme Court judgments	🇮🇳	10K QA pairs from 1,256 judgments
GerLayQA (2024)	📄 💻	German civil law	🇩🇪	21K laymen legal Qs with lawyer answers
LEGAL-UQA (2024)	📄	Legal questions	🇵🇰	619 parallel Urdu–English QA pairs

Legal Textual Entailment (LTE)

Dataset	Links	Domain	Language	Size
COLIEE-Case-Law-Entailment (Rabelo et al., 2020)	📄 💾	Canadian precedents	🇬🇧	425 cases w/ related case
COLIEE-Statute-Law-Entailment (Rabelo et al., 2020)	📄 💾	Japanese legislation	🇬🇧 🇯🇵	808 questions w/ related statutory article
LAR-ECHR (2024)	📄	European Court of Human Rights	🇬🇧	Legal argument reasoning task dataset
δ-Stance (2025)	📄	US legal argumentation	🇺🇸	Large-scale stances and arguments

Legal Text Summarization (LTS)

Dataset	Links	Domain	Language	Size
UK-Abs (Shukla et al., 2022)	📄 💻 💾	UK court cases	🇬🇧	793 pairs of (case, abstractive summary) from the UK Supreme Court
IN-Abs (Shukla et al., 2022)	📄 💻 💾	Indian court cases	🇬🇧	7.1K pairs of (case, abstractive summary) from the Indian Supreme Court
IN-Ext (Shukla et al., 2022)	📄 💻 💾	Indian court cases	🇬🇧	50 pairs of (case, extractive summary) from the Indian Supreme Court
TOS;DR (Keymanesh et al., 2020)	📄 💻	Terms of service	🇬🇧	1.6K pairs of (agreement text, summary) from data privacy policies
BillSum (Kornilova et al., 2019)	📄 💻 💾	US Congressional bills	🇬🇧	22.2K pairs of (bill, summary)
TL;DRLegal (Manor et al., 2019)	📄 💻	Terms of service	🇬🇧	84 pairs of (agreement text, summary) from software licenses
TOS;DR (Manor et al., 2019)	📄 💻	Terms of service	🇬🇧	421 pairs of (agreement text, summary) from data privacy policies
BVA Cases (Zhong et al., 2019)	📄 💻	US court cases	🇬🇧	92 pairs of (case, summary) from the US Board of Veterans' Appeals
LCR (Galgani et al., 2012)	📄 💾	Australian court cases	🇬🇧	3.9K pairs of (case, catchphrases)
EurLexSummarization (2022)	📄 🤗 💻	EU legislation	🌍	Multilingual summarization across 24 languages
Multi-LexSum (2025)	📄	Legal documents	🇬🇧	40K+ documents with 9K+ expert summaries
CaseSumm (2025)	📄 🤗	US Supreme Court opinions	🇬🇧	25.6K opinions with official syllabuses
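
The summarization datasets above all provide (document, reference summary) pairs, and system summaries are often compared to the reference with unigram-overlap metrics such as ROUGE-1. Below is a simplified Python sketch of ROUGE-1 F1, assuming lowercase alphanumeric tokenization and no stemming. It illustrates the idea only and will not reproduce the scores of an official ROUGE implementation exactly. The example sentences are invented.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and system summaries.
reference = "The court dismissed the appeal"
candidate = "The court allowed the appeal"
score = rouge1_f1(reference, candidate)
```

In this toy pair four of the five unigrams match on each side, so precision and recall are both 0.8 even though the two sentences state opposite outcomes, which is a reminder that overlap metrics say little about legal correctness.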

Legal Language Modeling (LLM)

Dataset	Links	Language	Size
Pile of Law (Henderson et al., 2022)	📄 🤗 💻	🇬🇧	~256GB of legal and administrative text
MultiLegalPile (2024)	📄 🤗	🌍	689GB multilingual legal corpus from 17 jurisdictions

Benchmarks

Dataset	Links	Language	Tasks
FairLex (Chalkidis et al., 2022)	📄 🤗 💻	🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳	Classification (x1), legal judgement prediction (x3)
LexGLUE (Chalkidis et al., 2022)	📄 🤗 💻	🇬🇧	Classification (x6), multiple-choice QA (x1)
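
Classification-heavy benchmarks like these are usually reported with micro- and macro-averaged F1 (LexGLUE, for instance, reports both). The sketch below shows how the two averages differ for single-label predictions, using only the Python standard library. The label names and predictions are hypothetical, and multi-label tasks would need a per-label variant that this sketch does not cover.

```python
from collections import Counter
from typing import List, Tuple


def f1_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for single-label classification."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    labels = sorted(set(gold) | set(pred))

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro pools the counts over labels; macro averages per-label F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro


# Hypothetical gold labels and predictions for four cases.
gold = ["violation", "violation", "no_violation", "no_violation"]
pred = ["violation", "no_violation", "no_violation", "no_violation"]
micro, macro = f1_scores(gold, pred)
```

On this toy input micro-F1 is 0.75 while macro-F1 is lower (about 0.733), because the single missed "violation" case weighs more heavily once every label counts equally.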

🔥 Models

Model	Links	Language	Size
Legal-HeBERT (Chriqui et al., 2022)	📄 🤗 💻	🇮🇱	110M
PoL-BERT-Large (Henderson et al., 2022)	📄 🤗 💻	🇬🇧	336M
Italian-LEGAL-BERT (Licari and Comandè, 2022)	📄 🤗	🇮🇹	110M
JuriBERT (Douka et al., 2021)	📄 💾	🇫🇷	{6M, 15M, 42M, 110M}
Custom-LEGAL-BERT (Zheng et al., 2021)	📄 🤗 💻	🇬🇧	110M
LEGAL-BERT (Chalkidis et al., 2020)	📄 🤗	🇬🇧	{35M, 110M}
LEGAL-GPT-{1,2} (Borchmann et al., 2020)	📄 💻	🇬🇧	{117M, 1.5B}
MultiLegalPile Models (2024-2025)	📄 🤗	🌍	RoBERTa (multilingual + 24 monolingual), Longformer
Legal-BERT Fine-tuned (2024)	📄	🇬🇧	Domain-adapted classification models
LegalCore Models (2025)	📄	🌍	Event coreference resolution for legal texts
Legal LLaMA (2025)	📄	🇨🇳	Chinese legal domain adaptations
FairLex Domain Models (2024-2025)	🤗	🌍	Domain-specific BERT models for 4 jurisdictions

📚 Books

[2017] Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age, K. Ashley. [link]
[2024] Large Language Models and International Law, Chicago Journal of International Law [🌐]
[2024] Computational Legal Studies Comes of Age, SSRN [📄]

📄 Surveys

[2020-05] How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence, H. Zhong et al. [pdf]
[2019-09] A Brief History of the Changing Roles of Case Prediction in AI and Law, K. Ashley [pdf]
[2018-12] Deep learning in law: early adaptation and legal word embeddings trained on large corpora, I. Chalkidis et al. [pdf]
[2024] Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models and Challenges, F. Ariai et al. [📄]
[2025] Computational Law: Datasets, Benchmarks, and Ontologies, D. Küçük & F. Can [📄]
[2025] A Comprehensive Survey on Legal Summarization, arXiv [📄]
[2024] Large Language Models in Law: A Survey, J. Lai et al. [📄]
[2025] Large Language Models in Argument Mining: A Survey, arXiv [📄]
[2024] When Large Language Models Meet Law: Dual-Lens Survey, arXiv [📄]

🎙 Talks

[2019-06] Law as Data: The Promise and Challenges of Natural Language Processing for Legal Research, A. Dyevre. [slides]
[2019-04] Artificial Intelligence and Law – An Overview and History, H. Surden. [video]

🗓 Conferences & Workshops

The Natural Legal Language Processing (NLLP) Workshop [website]
The International Conference on Artificial Intelligence and Law (ICAIL) [website]
The International Conference on Legal Knowledge and Information Systems (JURIX) [website]
The EXplainable AI in Law (XAILA) Workshop [website]
The International Workshop on Juris-informatics (JURISIN) [website]
The Competition on Legal Information Extraction/Entailment (COLIEE) [website]
The International Workshop on Legal Information Retrieval [website]

2025 Conferences

NLLP 2025 - Natural Legal Language Processing Workshop (EMNLP 2025, Suzhou) [🌐]
RegNLP 2025 - Regulatory Natural Language Processing Workshop (COLING 2025) [🌐]
JURIX 2025 - 38th International Conference on Legal Knowledge and Information Systems (Turin, December 9-11, 2025) [🌐]
ICAIL 2025 - 20th International Conference on Artificial Intelligence and Law (Chicago, June 16-20, 2025) [🌐]
MWAiL 2025 - Multilingual Workshop on AI & Law Research (Chicago, June 20, 2025) [🌐]
LLMFinLegal 2025 - Workshop on Large Language Models for Finance and Legal (COLING 2025) [🌐]
8th World Legal Tech and AI Summit (Berlin, September 18-19, 2025) [🌐]

Industry & Professional Events

AI Legal Summit 2025 - Various industry conferences on AI in legal practice [🌐]
Legal AI Conferences Online Platform - Centralized platform for legal AI events [🌐]

🧰 Tools & Evaluation

Evaluation Tools

Embedding Benchmarking Tools: MTEB, Hugging Face evaluate, LegalBench, COLIEE [🌐]
Legal Argument Mining Tools: RMU:ECHR corpus and mining models [💻]
Multilingual Legal Processing: Evaluation pipelines for multilingual legal LLMs [📄]

Quality Assessment Frameworks

LegalEval-Q: Quality evaluation for LLM-generated legal text [📄]
FairLex Evaluation: Bias and fairness assessment [🌐]

Last Updated: 2025-09-30
Research Coverage: 2024-01 to 2025-09
Sources: 180+ academic papers, datasets, and conference proceedings

maastrichtlawtech/awesome-legal-nlp
