Skip to content

hkumar747/llm-json

Repository files navigation

Extracting JSON data from text with LLMs

Atlas-USACE project

Python Jupyter This repository hosts a series of Jupyter notebooks that demonstrate the extraction and analysis of environmental impact data from textual descriptions of USACE (U.S. Army Corps of Engineers) wetland projects. Leveraging both traditional NLP techniques and advanced machine learning models from OpenAI, these notebooks provide a comprehensive guide to understanding and predicting the environmental impacts of proposed projects.

The core task we want to accomplish here is to convert a text passage into a structured dictionary of key-value pairs:

Input:

'This project would place approximately 855 cubic yards of commercially obtained fill material within 23,087 square feet (0. 53 acres) of herbaceous wetlands for >>phase II of a single-family housing subdivision.

Output:

{
  "wetlands": [
    {
      "wetland_type": herbaceous "wetlands",
      "impact_quantity": "0.53",
      "impact_unit": "acres",
      "impact_duration": "unknown",
      "impact_type": "fill",
      "project_type:" "residential",
    }
  ]
}

Overview of Notebooks

In this notebook, we focus on extracting key information from PDF documents related to wetland projects. Techniques include abstractive summarization, regular expressions for data extraction, and querying OpenAI's API for structured data extraction.

Explore the process of fine-tuning OpenAI's GPT model with domain-specific data and using the model to extract structured information from project descriptions.

This notebook shows how to perform fine-tuning on the AnyScale platform for open LLMs like Llama-13b, Mistral-7b and Mistral-8x7B.

This notebook uses HuggingFace's trl library perform parameter efficient fine-tuning (PEFT) on Mistral-7b for our JSON generation task.

Getting Started

To use these notebooks, you'll need to install the required Python packages and set up your environment.

  • Python 3.8 or higher
  • Jupyter Notebook or JupyterLab
  • An OpenAI API key

Clone the repository and install the required dependencies:

git clone https://github.com/hkumar747/llm-json.git
cd llm-json
pip install -r requirements.txt

Make sure to configure your environment with your OpenAI API key:

export OPENAI_API_KEY='your_api_key_here'

Contributing

Contributions are welcome! If you have improvements or bug fixes, please open a pull request.

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments:

For the full `wetland-tracker' repository and application, check out this repo by Atlas Public Policy.

About

Gallery of OpenAI and HuggingFace LLM generation and fine-tuning scripts.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published