hyuncat/congressLLM

congressLLM

Coded in 24 hours by Sarah Hong, Moises Mata, and Andromeda Kepecs for DevFest 2024.

Won: Best Use of Google Cloud!

Inspiration

Our government websites are full of legal jargon that we simply cannot be bothered to try to understand. So, we often rely on news sources for our understanding of legal matters. However, these sources can introduce sensationalism or bias, making the information time-consuming to sift through. Our aim is to transform the often dry and overwhelming Congress proceedings into easily digestible and accessible information. We want to foster a greater understanding of legal proceedings for those who may not possess an avid interest in the intricacies of the legal system.

What it does

Our website is all about simplicity and has two main functions: a summary page and a search function. The home page displays the top 10 key moments from recent Congressional meetings, ranked by how frequently each key moment's umbrella topic appears, with a summary of each. The search offers 20 umbrella categories, which can be selected from a drop-down menu. Once a category is selected, the user is shown a list of relevant moments from recent Congressional meetings. For example, selecting "healthcare" and sorting by relevance will show all recent bills and discussions related to healthcare, in order of relevance (determined by our algorithm). This approach allows for an unbiased but palatable presentation of legal current events.

How we built it

The web application is built on Flask. To collect data, we made use of a number of Google Cloud Platform (GCP) APIs. We first scraped data and links to PDFs of Senate proceedings from congress.gov, and then used the GCP Document AI API to parse the PDFs into smaller chunks of text. From there, we used the GCP Natural Language API's content classification to tag each text chunk with its most similar category. This API yielded a very large number of categories, many of which were unrelated to politics, so we implemented an algorithm that determines the semantic similarity of each of these categories to our own list of 20 relevant categories. We then created a data frame containing each document's PDF link, the date of the proceedings, each chunk with the category generated by GCP, and the broader category that we assigned. Our search algorithm matches the selected category against each text chunk's broad category, and provides information about the more specific category as well.
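The mapping and search steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the `BROAD_CATEGORIES` entries, the record field names, and the threshold are all hypothetical, and a simple word-overlap (Jaccard) score stands in for the semantic similarity measure, which is not specified here.

```python
import re

# Hypothetical subset of the 20 umbrella categories (the real list is not shown here).
BROAD_CATEGORIES = ["healthcare", "immigration", "education", "defense and military"]


def tokens(label):
    """Lowercase word tokens of a label such as '/Law & Government/Military'."""
    return set(re.findall(r"[a-z]+", label.lower()))


def similarity(a, b):
    """Jaccard word overlap: a crude lexical stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def to_broad_category(specific, threshold=0.1):
    """Return the closest umbrella category, or None if nothing is close enough."""
    best = max(BROAD_CATEGORIES, key=lambda c: similarity(specific, c))
    return best if similarity(specific, best) >= threshold else None


def search(records, selected):
    """Keep the records whose broad category matches the selected one."""
    return [r for r in records if r["broad_category"] == selected]


# Hypothetical records mirroring the columns of the data frame described above.
records = [
    {"pdf_link": "a.pdf", "date": "2024-02-01", "chunk": "Remarks on readiness.",
     "specific_category": "/Law & Government/Military"},
    {"pdf_link": "b.pdf", "date": "2024-02-01", "chunk": "Remarks on a match.",
     "specific_category": "/Sports/Team Sports/Soccer"},
]
for r in records:
    r["broad_category"] = to_broad_category(r["specific_category"])

results = search(records, "defense and military")
```

Categories with no close umbrella match map to `None` here, so off-topic chunks drop out of every search. Each result still carries its `specific_category`, which is how the more specific label can be shown alongside the broad one.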

For the homepage, a counter keeps track of category frequencies, and we use the PaLM 2 text generation model through GCP to summarize all the events relating to the top category of the day.
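The frequency counting described above could look something like the sketch below. The record fields, function names, and prompt wording are assumptions made for illustration. The call to the generative model is deliberately left out, so this only shows how the top category and its chunks might be gathered before summarization.

```python
from collections import Counter


def top_category(records):
    """Return the most frequent broad category and the text chunks tagged with it."""
    counts = Counter(r["broad_category"] for r in records if r["broad_category"])
    if not counts:
        return None, []
    category, _ = counts.most_common(1)[0]
    chunks = [r["chunk"] for r in records if r["broad_category"] == category]
    return category, chunks


def build_summary_prompt(category, chunks):
    """Assemble a plain-text prompt (hypothetical wording) for a text generation model."""
    joined = "\n".join(f"- {c}" for c in chunks)
    return f"Summarize today's Senate discussion of {category}:\n{joined}"


# Hypothetical records; uncategorized chunks (None) are ignored by the counter.
records = [
    {"chunk": "Debate on drug pricing.", "broad_category": "healthcare"},
    {"chunk": "Vote on a border bill.", "broad_category": "immigration"},
    {"chunk": "Hearing on hospital funding.", "broad_category": "healthcare"},
    {"chunk": "Procedural remarks.", "broad_category": None},
]

category, chunks = top_category(records)
prompt = build_summary_prompt(category, chunks)
```

The resulting `prompt` string is what would be passed to the text generation model. Keeping prompt assembly separate from the API call makes the counting logic easy to test without cloud credentials.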

Challenges we ran into

We ran into many challenges. We struggled with integrating the backend and frontend, creating the database, and working with the GCP APIs. We spent a lot of our time dealing with various errors, which made it difficult to implement everything that we wanted to. We are currently trying to solve an error where our highly specific categories all show up as "none". We may not solve it in time :( but it is something that we hope to fix in future iterations!

Accomplishments that we're proud of

Our group had little to no experience with HTML, CSS, backend development, GCP, or pretty much anything else that went into this project. We are incredibly proud of how much we figured out together in such a short period of time, and are very happy that we were able to hack together a finished product!

What's next for The Congress Cut

A domain! We also want to give users the option to make profiles so our site can provide more curated digests of legal proceedings, and allow for email opt-in. We also want to improve our search algorithm and make the site more visually appealing.

Installation

Libraries

  • pip3 install --upgrade google-cloud-documentai
  • pip3 install --upgrade google-cloud-storage
  • pip3 install --upgrade google-cloud-documentai-toolbox

About

Using content classification AI to parse Congress documents for searchable topics
