AEC Hackathon Munich: MOD Smart Prefab challenge, Team MOD-2.
Our goal is to extract structured data from unstructured PDFs containing information about prefabricated elements. The main challenge is to produce reliable results in JSON format that downstream applications can consume.
Our strategy uses two models acting as agents, one based on OpenAI and one on Claude, that improve each other's output. One model generates the initial JSON from the prompt, while the other checks it and corrects any mistakes in the previous output.
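The generate-and-check loop can be sketched as below. This is a minimal illustration, not the actual project code: the two stub functions stand in for the OpenAI and Claude API calls, and the sample data is invented.

```python
import json

# Hypothetical stand-in for the generating model (OpenAI in the project).
def generator_model(prompt: str) -> str:
    # Returns a first-draft JSON string; note "width" comes back as a
    # string rather than a number, a typical LLM formatting slip.
    return '{"element": "wall panel", "width": "2500"}'

# Hypothetical stand-in for the checking model (Claude in the project).
def checker_model(draft: str) -> str:
    # Reviews the draft and returns a corrected JSON string.
    data = json.loads(draft)
    data["width"] = int(data["width"])  # coerce numeric fields to numbers
    return json.dumps(data)

def extract_structured_data(prompt: str, rounds: int = 2) -> dict:
    """Generate JSON with one model, then let the other verify and correct it."""
    draft = generator_model(prompt)
    for _ in range(rounds):
        draft = checker_model(draft)
    return json.loads(draft)

print(extract_structured_data("Extract the prefab element as JSON."))
# {'element': 'wall panel', 'width': 2500}
```

In the real system each stub would be an API call, and the checker's feedback would be fed back into the generator's prompt for the next round.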
There are some typical approaches for this challenge. Prompt engineering refers to designing effective prompts that instruct the LLM to complete the task. RAG applies semantic embedding to retrieve the relevant chunks of documents, and the LLM then uses the retrieved content to generate answers that are more informed and accurate. Fine-tuning customizes an LLM on a specific dataset to adjust its behavior or optimize it for specific tasks. In this project we did not fine-tune; instead, we tried a strategy called verbal reinforcement learning, in which feedback from evaluators is used to iteratively improve how the LLM responds.
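For context, the retrieval step of RAG can be illustrated with a toy bag-of-words similarity instead of real semantic embeddings. This is a simplified sketch (the example chunks are invented), not what the project ships:

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Toy tokenizer: lowercase word counts stand in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k document chunks most similar to the query."""
    qv = tokens(query)
    return sorted(chunks, key=lambda c: cosine(qv, tokens(c)), reverse=True)[:k]

chunks = [
    "Panel P-101: precast concrete wall, width 2500 mm, height 3200 mm.",
    "Delivery schedule and site logistics for phase two.",
]
print(retrieve("wall panel width mm", chunks)[0])
# Panel P-101: precast concrete wall, width 2500 mm, height 3200 mm.
```

A production RAG pipeline would replace the word counts with vector embeddings and the list scan with a vector index, but the retrieve-then-generate flow is the same.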
PDFtoDLM is composed of two LLM backends, a frontend user interface, and some additional tools. It offers the following features:
- Upload multiple PDF files via the web interface.
- Immediate visualization of each uploaded PDF.
- Asynchronous generation of structured JSON data from each PDF.
- Interactive JSON schema editor with live syntax highlighting.
- Options to save edited JSON and download it locally.
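Getting reliable JSON out of an LLM reply usually requires a defensive parsing step, since models often wrap the JSON in prose or markdown fences. A minimal sketch of such a helper (the function name and sample reply are hypothetical, not the project's actual code):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Extract and parse the first JSON object in an LLM reply.

    Locates the outermost curly braces before parsing, so fences and
    surrounding prose are tolerated.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

reply = 'Here is the data:\n```json\n{"element": "slab", "thickness_mm": 200}\n```'
print(parse_model_json(reply))
# {'element': 'slab', 'thickness_mm': 200}
```

If parsing fails, the error can be sent back to the checking model as feedback, which fits the iterative-correction strategy described above.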
OpenAI
- Frontend: React.js
- Backend: Node.js with Express
- PDF Parsing: pdf-parse, pdf-lib
- AI Integration: OpenAI API
Claude
- Bash, llm-claude-3
OpenAI backend
- Node.js (v14 or higher)
- npm or yarn
- OpenAI API key (a valid key from your OpenAI account)
Claude backend
- Python
- Claude API Key
- Installation