maastrichtlawtech/awesome-legal-nlp

Legal Natural Language Processing

🗂 Datasets

Legal Judgement Prediction (LJP)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| FSCS (Niklaus et al., 2021) | 📄 🤗 💻 | Swiss court judgments | 🇩🇪 🇫🇷 🇮🇹 | 85K cases w/ 2 outcomes |
| ECtHR (Chalkidis et al., 2021) | 📄 🤗 | European Court of Human Rights judgments | 🇬🇧 | 11K cases w/ 11 outcomes |
| ECHR (Aletras et al., 2019) | 📄 💾 | European Court of Human Rights judgments | 🇬🇧 | 11.5K cases w/ 11 outcomes |
| CAIL (Xiao et al., 2018) | 📄 💻 | Chinese court judgements | 🇨🇳 | 2.6M cases w/ 6 outcomes |
| AnnoCaseLaw (2025) | 📄 💻 | US Appeals Court negligence cases | 🇺🇸 | 471 annotated cases with expert labels |
| IndianBailJudgments-1200 (2025) | 📄 🤗 💻 | Indian court bail decisions | 🇮🇳 | 1.2K judgments with 20+ structured attributes |
| CaseSumm (2025) | 📄 🤗 | US Supreme Court opinions | 🇺🇸 | 25.6K opinions with official syllabuses |
| JUSTICE (2022) | 📄 💻 | US Supreme Court cases | 🇺🇸 | Benchmark for judgment prediction |
| Cambridge Law Corpus (CLC) (2023) | 📄 | UK court cases | 🇬🇧 | 258K+ cases (16th century–present) |
| Super-SCOTUS (2025) | 📄 💻 | US Supreme Court decisions | 🇺🇸 | Decision direction and related tasks |

Legal Text Classification (LTC)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| GLC (Papaloukas et al., 2021) | 📄 💻 | Greek legislation | 🇬🇷 | 47.5K laws w/ 2.7K labels |
| CUAD (Hendrycks et al., 2021) | 📄 🤗 💻 | Contracts | 🇬🇧 | 510 contracts w/ 41 classes |
| MultiEURLEX (Chalkidis et al., 2021) | 📄 🤗 💻 | EU legislation | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇪🇸 (18+) | 65K laws w/ 4.5K labels |
| LEDGAR (Tuggener et al., 2020) | 📄 💾 | Contracts | 🇬🇧 | 60.5K contracts w/ 12.6K labels |
| Contract Discovery (Borchmann et al., 2020) | 📄 💻 | Contracts | 🇬🇧 | 2.6K clauses w/ 21 classes |
| EURLEX-57K (Chalkidis et al., 2019) | 📄 💾 | EU legislation | 🇬🇧 | 57K laws w/ 4.3K labels |
| Unfair-ToS (Lippi et al., 2018) | 📄 💾 | Contracts | 🇬🇧 | 9.4K sentences w/ 9 classes |
| Contract Elements (Chalkidis et al., 2017) | 📄 💾 | Contracts | 🇬🇧 | 2.4K contracts w/ 10 classes |
| OPP-115 (Wilson et al., 2016) | 📄 💾 | Privacy policies | 🇬🇧 | 115 policies w/ 23K labels |
| FairLex (2022) | 📄 🤗 💻 | Multi-jurisdictional legal texts | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳 | Fairness-focused classification datasets |
| Legal Case Document Summarization (Kaggle) | 📄 | Legal case summaries | Various | Large-scale dataset |
| Legal Citation Text Classification Dataset (Kaggle) | 📄 | General legal documents | 🇬🇧 | 25K cases with catchphrases and citations |

Legal Information Retrieval (LIR)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| BSARD (Louis et al., 2022) | 📄 🤗 💻 | Belgian legislation | 🇫🇷 | 1.1K questions w/ 22.6K candidate statutory articles |
| EU2UK (Chalkidis et al., 2021) | 📄 💾 | EU & UK legislation | 🇬🇧 | 2K query documents w/ 52.5K candidate documents |
| UK2EU (Chalkidis et al., 2021) | 📄 💾 | EU & UK legislation | 🇬🇧 | 2.1K query documents w/ 3.9K candidate documents |
| COLIEE-Case-Law-Retrieval (Rabelo et al., 2020) | 📄 💾 | Canadian precedents | 🇬🇧 | 650 query cases w/ 128K candidate cases |
| COLIEE-Statute-Law-Retrieval (Rabelo et al., 2020) | 📄 💾 | Japanese legislation | 🇬🇧 🇯🇵 | 808 questions w/ 768 candidate statutory articles |
| CAIL2019-SCM (Xiao et al., 2019) | 📄 💻 | Chinese court judgements | 🇨🇳 | 8.9K triplets of cases |
| CLERC (2024) | 📄 🤗 💻 | Legal case retrieval | 🇬🇧 | Large corpus for retrieval and RAG |
| LEAD (2024) | 📄 💻 | Legal case retrieval | Various | 100K+ pairs of similar legal cases |
| Legal IR Philippines (2024) | 📄 | Philippine legal documents | 🇵🇭 | Datasets with synthetic queries |
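
The retrieval datasets above pair each query with a pool of candidate documents, so a system is usually scored by how many of the relevant candidates it ranks near the top. This list does not prescribe a metric; the sketch below assumes recall@k, a common choice. The query and article IDs (`q1`, `art_12`, and so on) are made up for illustration, and the function names are hypothetical rather than part of any toolkit listed here.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k of a ranking."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    # Use a set so a duplicated ID in the ranking is not counted twice.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over all queries; a missing run counts as zero."""
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): gold relevance labels and system rankings.
qrels = {"q1": {"art_12", "art_40"}, "q2": {"art_7"}}
runs = {
    "q1": ["art_12", "art_3", "art_40"],
    "q2": ["art_9", "art_7", "art_1"],
}

print(mean_recall_at_k(runs, qrels, k=2))  # 0.75
```

With these toy rankings, `q1` retrieves one of its two relevant articles in the top 2 and `q2` retrieves its only relevant article, which gives the average of 0.75. Real evaluations on the datasets above would load the published query and candidate files instead of hand-written dictionaries.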

Legal Question Answering (LQA)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| CaseHOLD (Zheng et al., 2021) | 📄 💻 | US case holdings | 🇬🇧 | 53.1K multiple-choice questions |
| JEC-QA (Zhong et al., 2019) | 📄 💾 | Chinese law | 🇨🇳 | 26.3K multiple-choice questions |
| CJRC (Duan et al., 2019) | 📄 💻 | Chinese court judgements | 🇨🇳 | 50K question-answers from 10K documents |
| PrivacyQA (Ravichander et al., 2019) | 📄 💻 | Privacy policies | 🇬🇧 | 1.7K question-answers from 35 documents |
| LLeQA (2024) | 📄 🤗 💻 | French-Belgian statutes | 🇫🇷 | 1,868 expert-annotated long-form QA |
| IndicLegalQA (2025) | 📄 | Indian Supreme Court judgments | 🇮🇳 | 10K QA pairs from 1,256 judgments |
| GerLayQA (2024) | 📄 💻 | German civil law | 🇩🇪 | 21K laymen legal Qs with lawyer answers |
| LEGAL-UQA (2024) | 📄 | Legal questions | 🇵🇰 | 619 parallel Urdu–English QA pairs |

Legal Textual Entailment (LTE)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| COLIEE-Case-Law-Entailment (Rabelo et al., 2020) | 📄 💾 | Canadian precedents | 🇬🇧 | 425 cases w/ related case |
| COLIEE-Statute-Law-Entailment (Rabelo et al., 2020) | 📄 💾 | Japanese legislation | 🇬🇧 🇯🇵 | 808 questions w/ related statutory article |
| LAR-ECHR (2024) | 📄 | European Court of Human Rights | 🇬🇧 | Legal argument reasoning task dataset |
| δ-Stance (2025) | 📄 | US legal argumentation | 🇺🇸 | Large-scale stances and arguments |

Legal Text Summarization (LTS)

| Dataset | Links | Domain | Language | Size |
|---|---|---|---|---|
| UK-Abs (Shukla et al., 2022) | 📄 💻 💾 | UK court cases | 🇬🇧 | 793 pairs of (case, abstractive summary) from the UK Supreme Court |
| IN-Abs (Shukla et al., 2022) | 📄 💻 💾 | Indian court cases | 🇬🇧 | 7.1K pairs of (case, abstractive summary) from the Indian Supreme Court |
| IN-Ext (Shukla et al., 2022) | 📄 💻 💾 | Indian court cases | 🇬🇧 | 50 pairs of (case, extractive summary) from the Indian Supreme Court |
| TOS;DR (Keymanesh et al., 2020) | 📄 💻 | Terms of service | 🇬🇧 | 1.6K pairs of (agreement text, summary) from data privacy policies |
| BillSum (Kornilova et al., 2019) | 📄 💻 💾 | US Congressional bills | 🇬🇧 | 22.2K pairs of (bill, summary) |
| TL;DRLegal (Manor et al., 2019) | 📄 💻 | Terms of service | 🇬🇧 | 84 pairs of (agreement text, summary) from software licenses |
| TOS;DR (Manor et al., 2019) | 📄 💻 | Terms of service | 🇬🇧 | 421 pairs of (agreement text, summary) from data privacy policies |
| BVA Cases (Zhong et al., 2019) | 📄 💻 | US court cases | 🇬🇧 | 92 pairs of (case, summary) from the US Board of Veterans' Appeals |
| LCR (Galgani et al., 2012) | 📄 💾 | Australian court cases | 🇬🇧 | 3.9K pairs of (case, catchphrases) |
| EurLexSummarization (2022) | 📄 🤗 💻 | EU legislation | 🌐 | Multilingual summarization across 24 languages |
| Multi-LexSum (2025) | 📄 | Legal documents | 🇬🇧 | 40K+ documents with 9K+ expert summaries |
| CaseSumm (2025) | 📄 🤗 | US Supreme Court opinions | 🇬🇧 | 25.6K opinions with official syllabuses |

Legal Language Modeling (LLM)

| Dataset | Links | Language | Size |
|---|---|---|---|
| Pile of Law (Henderson et al., 2022) | 📄 🤗 💻 | 🇬🇧 | ~256GB of legal and administrative text |
| MultiLegalPile (2024) | 📄 🤗 | 🌐 | 689GB multilingual legal corpus from 17 jurisdictions |

Benchmarks

| Dataset | Links | Language | Tasks |
|---|---|---|---|
| FairLex (Chalkidis et al., 2022) | 📄 🤗 💻 | 🇬🇧 🇩🇪 🇫🇷 🇮🇹 🇨🇳 | Classification (x1), legal judgement prediction (x3) |
| LexGLUE (Chalkidis et al., 2022) | 📄 🤗 💻 | 🇬🇧 | Classification (x6), multiple-choice QA (x1) |

🔥 Models

| Model | Links | Language | Size |
|---|---|---|---|
| Legal-HeBERT (Chriqui et al., 2022) | 📄 🤗 💻 | 🇮🇱 | 110M |
| PoL-BERT-Large (Henderson et al., 2022) | 📄 🤗 💻 | 🇬🇧 | 336M |
| Italian-LEGAL-BERT (Licari and Comande, 2022) | 📄 🤗 | 🇮🇹 | 110M |
| JuriBERT (Douka et al., 2021) | 📄 💾 | 🇫🇷 | {6M, 15M, 42M, 110M} |
| Custom-LEGAL-BERT (Zheng et al., 2021) | 📄 🤗 💻 | 🇬🇧 | 110M |
| LEGAL-BERT (Chalkidis et al., 2020) | 📄 🤗 | 🇬🇧 | {35M, 110M} |
| LEGAL-GPT-{1,2} (Borchmann et al., 2020) | 📄 💻 | 🇬🇧 | {117M, 1.5B} |
| MultiLegalPile Models (2024-2025) | 📄 🤗 | 🌐 | RoBERTa (multilingual + 24 monolingual), Longformer |
| Legal-BERT Fine-tuned (2024) | 📄 | 🇬🇧 | Domain-adapted classification models |
| LegalCore Models (2025) | 📄 | 🌐 | Event coreference resolution for legal texts |
| Legal LLaMA (2025) | 📄 | 🇨🇳 | Chinese legal domain adaptations |
| FairLex Domain Models (2024-2025) | 🤗 | 🌐 | Domain-specific BERT models for 4 jurisdictions |

📚 Books

  • [2017] Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age, K. Ashley. [link]

  • [2024] Large Language Models and International Law, Chicago Journal of International Law [🌐]

  • [2024] Computational Legal Studies Comes of Age, SSRN [📄]

📄 Surveys

  • [2020-05] How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence, H. Zhong et al. [pdf]

  • [2019-09] A Brief History of the Changing Roles of Case Prediction in AI and Law, K. Ashley [pdf]

  • [2018-12] Deep learning in law: early adaptation and legal word embeddings trained on large corpora, I. Chalkidis et al. [pdf]

  • [2024] Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models and Challenges, F. Ariai et al. [📄]

  • [2025] Computational Law: Datasets, Benchmarks, and Ontologies, D. Küçük & F. Can [📄]

  • [2025] A Comprehensive Survey on Legal Summarization, arXiv [📄]

  • [2024] Large Language Models in Law: A Survey, J. Lai et al. [📄]

  • [2025] Large Language Models in Argument Mining: A Survey, arXiv [📄]

  • [2024] When Large Language Models Meet Law: Dual-Lens Survey, arXiv [📄]

🎙 Talks

  • [2019-06] Law as Data: The Promise and Challenges of Natural Language Processing for Legal Research, A. Dyevre. [slides]
  • [2019-04] Artificial Intelligence and Law – An Overview and History, H. Surden. [video]

🗓 Conferences & Workshops

  • The Natural Legal Language Processing (NLLP) Workshop [website]
  • The International Conference on Artificial Intelligence and Law (ICAIL) [website]
  • The International Conference on Legal Knowledge and Information Systems (JURIX) [website]
  • The EXplainable AI in Law (XAILA) Workshop [website]
  • The International Workshop on Juris-informatics (JURISIN) [website]
  • The Competition on Legal Information Extraction/Entailment (COLIEE) [website]
  • The International Workshop on Legal Information Retrieval [website]

2025 Conferences

  • NLLP 2025 - Natural Legal Language Processing Workshop (EMNLP 2025, Suzhou) [🌐]
  • RegNLP 2025 - Regulatory Natural Language Processing Workshop (COLING 2025) [🌐]
  • JURIX 2025 - 38th International Conference on Legal Knowledge and Information Systems (Turin, December 9-11, 2025) [🌐]
  • ICAIL 2025 - 20th International Conference on Artificial Intelligence and Law (Chicago, June 16-20, 2025) [🌐]
  • MWAiL 2025 - Multilingual Workshop on AI & Law Research (Chicago, June 20, 2025) [🌐]
  • LLMFinLegal 2025 - Workshop on Large Language Models for Finance and Legal (COLING 2025) [🌐]
  • 8th World Legal Tech and AI Summit (Berlin, September 18-19, 2025) [🌐]

Industry & Professional Events

  • AI Legal Summit 2025 - Various industry conferences on AI in legal practice [🌐]
  • Legal AI Conferences Online Platform - Centralized platform for legal AI events [🌐]

🧰 Tools & Evaluation

Evaluation Tools

  • Embedding Benchmarking Tools: MTEB, Hugging Face evaluate, LegalBench, COLIEE [🌐]
  • Legal Argument Mining Tools: RMU:ECHR corpus and mining models [💻]
  • Multilingual Legal Processing: Evaluation pipelines for multilingual legal LLMs [📄]

Quality Assessment Frameworks

  • LegalEval-Q: Quality evaluation for LLM-generated legal text [📄]
  • FairLex Evaluation: Bias and fairness assessment [🌐]

Last Updated: 2025-09-30
Research Coverage: 2024-01 to 2025-09
Sources: 180+ academic papers, datasets, and conference proceedings

About

📖 A curated list of LegalNLP resources from all around the web.
