American Indian Language Dictionaries in Digital Archives: Adaptive Learning Models for Archival Description
This repository explores the methodologies and frameworks used in creating American Indian language dictionaries for digital archives. It presents an improved model for automated archival processing, comparing adaptive learning models with traditional manual processing to enhance archival descriptions. The broader implications for large-scale digital projects and the future of archival science are also discussed.
With advancements in Natural Language Processing (NLP) and machine learning, the ability to process large volumes of typewritten and handwritten text within minutes—rather than days or months—represents a transformative leap in digital archival science. By refining text analysis, entity recognition, and terminology control, this model accelerates metadata standardization, making historical records more accessible and meaningful.
See the American Indian Language Working List and Resources.
This project aims to:
- Develop and improve dictionaries for American Indian languages within archival frameworks.
- Implement adaptive learning models to process and interpret archival descriptions.
- Standardize terminology across digital archives while preserving unique tribal distinctions.
- Enhance entity recognition, linking historical events, policies, and individuals in American Indian history.
- Improve metadata accuracy through feedback loops that refine machine learning predictions over time.
Traditional archival description and metadata creation rely heavily on manual human input, which can be inconsistent, biased, and time-consuming. Adaptive learning models enhance this process by:
- Accelerating Text Processing: Automating the recognition of entities, subjects, policies, and people.
- Refining Controlled Vocabularies: Standardizing terminology (e.g., ‘American Indian’ vs. ‘Native American’ vs. ‘Muscogee’).
- Enhancing Metadata Linkage: Strengthening connections between records and ensuring semantic relationships between terms.
- Detecting Patterns in Archival Texts: Identifying key themes, dates, and historical context through NLP techniques.
Key challenges:
- Inconsistent terminology across archival collections.
- Difficulty in recognizing text in handwritten or typewritten documents.
- Inaccurate metadata due to human cognitive biases.
- Challenges in identifying ceremonial, legal, or political references in tribal history.
Solution: Our model improves accuracy by integrating Named Entity Recognition (NER), sentiment analysis, and entity-linking techniques to detect, categorize, and interconnect important archival information.
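As a minimal illustration of the NER step, the sketch below runs spaCy's off-the-shelf English pipeline over a sample sentence. The model name, sample text, and label set are illustrative assumptions, not this project's production configuration.

```python
# Minimal NER sketch using spaCy's general-purpose English pipeline.
# Assumptions: en_core_web_sm is installed and the sentence is a toy
# example (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("In 1978 Congress debated federal policy affecting the "
        "Muscogee (Creek) Nation in Oklahoma.")

# Each recognized entity carries a surface form and a predicted label
# (e.g., DATE, ORG, GPE) that can seed archival metadata fields.
for ent in nlp(text).ents:
    print(ent.text, ent.label_)
```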
Dictionary standardization:
- Standardized dictionaries ensure consistent data annotation.
- Language models are trained to recognize, classify, and translate diverse terms.
- Example: ‘Mvskoke’ vs. ‘Muscogee Creek’—ensuring all related documents are linked under the same classification.
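A minimal sketch of that linking step, assuming a hypothetical variant-to-canonical mapping (the repository's actual JSON mappings are the authoritative source):

```python
# Hypothetical variant-to-canonical mapping for illustration only;
# the project's real mappings live in its JSON data files.
VARIANTS = {
    "mvskoke": "Muscogee (Creek)",
    "muscogee creek": "Muscogee (Creek)",
    "muscogee (creek)": "Muscogee (Creek)",
}

def normalize_term(term: str) -> str:
    """Return the canonical form of a term, or the term unchanged."""
    return VARIANTS.get(term.strip().lower(), term)

# Both variants resolve to the same classification.
assert normalize_term("Mvskoke") == normalize_term("Muscogee Creek")
```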
Core NLP techniques:
- Named Entity Recognition (NER): Identifies names, organizations, locations, and historical terms.
- Topic Modeling: Groups related documents by themes (e.g., treaties, sovereignty, land policies).
- Text Classification: Maps documents to predefined controlled vocabularies for better searchability.
- Sentiment Analysis: Detects contextual tone (e.g., legal proceedings vs. personal correspondence).
- Entity Linking: Connects references within texts to historical events, people, and organizations.
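To make one of these steps concrete, here is a hedged topic-modeling sketch using scikit-learn's LDA; the toy corpus, topic count, and top-word printout are assumptions for demonstration only.

```python
# Topic-modeling sketch with scikit-learn's LDA. The four-document
# corpus and two-topic setting are toy assumptions; real runs would
# use full archival texts and a tuned topic count.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "treaty negotiations over tribal land and sovereignty",
    "congressional hearing on land allotment policy",
    "personal correspondence about family and ceremonies",
    "legal brief on treaty rights and federal policy",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words per topic so an archivist can label the themes.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```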
Human-in-the-loop feedback:
- Ensures continuous model improvement through automated re-training.
- Allows human reviewers to refine machine predictions.
- Differentiates between literal and figurative language (e.g., ‘Chief’ as a tribal leader vs. a government title).
Example: If a satirical remark appears in congressional records, the system flags it for review to prevent misinterpretation.
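One plausible shape for that review gate is sketched below; the prediction dictionary, confidence threshold, and queue name are hypothetical choices for illustration, not the project's actual interface.

```python
# Sketch of a human-review gate: low-confidence or figurative-language
# predictions are queued for an archivist instead of being auto-applied.
REVIEW_QUEUE = []

def accept_or_flag(prediction: dict, threshold: float = 0.85) -> bool:
    """Auto-accept confident predictions; queue the rest for review."""
    if prediction["confidence"] < threshold or prediction.get("figurative"):
        REVIEW_QUEUE.append(prediction)  # a human reviewer refines this later
        return False
    return True

accept_or_flag({"term": "Chief", "label": "TRIBAL_LEADER",
                "confidence": 0.62, "figurative": True})
print(len(REVIEW_QUEUE))  # -> 1
```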
To ensure archival consistency, this project creates standardized metadata feedback loops:
Figure 2: Metadata Feedback Loops – Terminology, Language
Process Flow:
- Extract terms from historical documents.
- Compare against controlled vocabularies.
- Identify relationships across archival records.
- Normalize metadata while preserving original text.
- Validate accuracy using human feedback.
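A compact sketch of this five-step flow, assuming a toy controlled vocabulary and a simple phrase-scanning extractor; note that the original text is preserved alongside the normalized metadata rather than overwritten.

```python
# End-to-end sketch of the process flow above. The vocabulary, record
# layout, and extraction method are illustrative assumptions.
CONTROLLED_VOCAB = {
    "mvskoke": "Muscogee (Creek)",
    "muscogee creek": "Muscogee (Creek)",
}

def process_record(text: str) -> dict:
    lowered = text.lower()
    # Steps 1-3: extract terms by scanning for known vocabulary variants
    # and map them onto their canonical forms.
    normalized = sorted({canonical
                         for variant, canonical in CONTROLLED_VOCAB.items()
                         if variant in lowered})
    # Step 4: normalize the metadata while keeping the original wording.
    record = {"original_text": text, "normalized_terms": normalized}
    # Step 5: records with no vocabulary hit are routed to human review.
    record["needs_review"] = not normalized
    return record

print(process_record("Letter concerning the Mvskoke delegation"))
# -> normalized_terms: ['Muscogee (Creek)'], needs_review: False
```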
Improved Record Linkage:
By unifying terminology, the system prevents fragmentation and strengthens archival metadata. For example:
- Muscogee Creek → Mvskoke (Linked under a standardized term).
- Indian Affairs Act (1978) → Referenced in multiple congressional records.
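A hedged sketch of that linkage, with hypothetical record IDs and descriptions, showing how variant spellings collapse onto a single index key so a search for either form retrieves all related records:

```python
# Linkage sketch: records described with different variant terms are
# indexed under one canonical key. Record IDs are hypothetical.
from collections import defaultdict

VARIANTS = {"mvskoke": "Muscogee (Creek)",
            "muscogee creek": "Muscogee (Creek)"}

records = [
    ("rec-001", "Mvskoke language materials"),
    ("rec-002", "Muscogee Creek council minutes"),
]

index = defaultdict(list)
for rec_id, description in records:
    for variant, canonical in VARIANTS.items():
        if variant in description.lower():
            index[canonical].append(rec_id)

print(index["Muscogee (Creek)"])  # -> ['rec-001', 'rec-002']
```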
Why this matters:
- Many Indigenous languages remain underrepresented in digital archives.
- Adaptive models can preserve, translate, and classify historical linguistic data.
- Recognizing tribal sovereignty through accurate archival description is vital for historical justice.
- Automated archival processing ensures faster access to cultural records for Indigenous communities, researchers, and educators.
Future directions:
- Expanding NLP models to cover additional tribal languages and dialects.
- Integrating AI-powered handwriting recognition to process handwritten American Indian documents (see the sketch after this list).
- Building interactive digital dictionaries for Indigenous language preservation.
- Collaborating with Indigenous scholars and communities to refine data representation.
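As a hedged sketch of the handwriting-recognition direction, the snippet below calls Amazon Textract, the service named in the Textract pilot listed under related projects. The file path, region, and surrounding workflow are placeholders, not a confirmed integration.

```python
# Handwriting/typescript OCR sketch using Amazon Textract via boto3.
# Assumptions: AWS credentials are configured and the image path is a
# placeholder standing in for a scanned archival document.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

with open("scanned_letter.png", "rb") as f:  # placeholder document
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Each LINE block carries recognized text plus a confidence score that
# could feed the human-review loop described earlier.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(f'{block["Confidence"]:.1f}  {block["Text"]}')
```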
Repository contents:
- Data & Scripts: Contains Python scripts, JSON mappings, and NLP models for language processing.
- Metadata Guidelines: Best practices for integrating standardized American Indian terminology in digital archives.
- Research Papers: Publications on adaptive learning in archival processing.
This project is part of ongoing NEH and NHPRC-funded research dedicated to American Indian sovereignty, policymaking, and historical documentation. We acknowledge the contributions of tribal historians, language experts, and archivists working towards equitable representation of Indigenous knowledge in digital archives.
Relevant Projects & Grants:
- American Congress Digital Archives Portal
- *Historical Collection of Political Campaign Advertisements*
- Congressional Correspondence Handwriting Textract Pilot
Contact:
Email: japryse@ou.edu
GitHub: https://prys0000.github.io/