README Outline

Project Description
Prerequisites
Usage
- Building
- Testing
- Execution
Additional Notes
Version History and Retention
License
Contributions
Contact Information
Acknowledgements

Project Description

ROADII Use Case - Categorizing Public Comments

Title: Categorizing Public Comments for CAFE Rulemakings
Purpose and goals of the project: Explore diferent AI approaches to comment categorization to gain an understanding of language processing tools and to improve the efficiency of the public comment response project. This is not intended to remove humans from the public comment response process, it is only intended to speed up the time in which comments are responded to.
Purpose of the source code and how it relates to the overall goals of the project: This code represents a prototype implementation of processing public comments from plain text to binned and categorized comments. This is not a final version and does not represent the current method of processing public comments.
Length of the project: This use case is currently in the exploratory phase.

Prerequisites

This repositories dependencies are listed in the requirements.txt file. Install all dependencies by running: pip install -r /path/to/requirements.txt

LLM Approach

The team utilized LM Studio as with the local server and API approach to attempt to improve upon the traditional machine learning approaches to text recognition and categorization. The files for this approach can be found within the this folder. Most of the code for this approach is located in this file. This file just needs to be run and it will currently read an Excel sheet for comments in a particular column. It will then label those comments based on one of the labels of {legal, compliance, economics, technology, or other/unknown}. An example of the code that can be modified to run the program with various parameters is shown below.

Running LM Studio

Click on developer tab.

Make sure model is loaded and click Start Server (ctrl + R) button.

As long as the model you identify in the code is downloaded, LM Studio will find the model and load it, if it's not already loaded. For example, it will use the Calme Legal model when instructed and switch to the Gemma model, when instructed, as shown below.

Models Tested

gemma-2-27b-it
Calme 2.3 LegalKit 8B
llama-3.2-1b-instruct-q8_0.gguf
llava-hf/llava-1.5-7b-hf
simmo/legal-llama-3
meta-llama/Llama-3.2-1B
Granite 3.1

Parameters varied between 1 and 27 billion for the models that were used.

Process

The LLM approach utilized a tiered prompting approach due to inconsistencies with other approaches. An overview of the steps are as follows:

Calme LegalKit was first prompted with the comment to determine if it was a legal-based comment or not.
If it returns something other than true or false, the model is reprompted to provide either a true or false response
If it is a legal comment, it is labeled legal and the program moves to the next comment.
If not a legal comment, the program then prompts the Gemma 2 model for the next category.
It follows this approach until either a category is found or the model is unable to determine a category and determines it as unknown.

Traditional Neural Network(NN) Approach

The team utilized SciKitLearn to locally run a large selection of traditional machine learning algorithms. The files for this approach can be found within the this folder. Eight differrent models were tested with a large selection of different dataset cleaning and oversampling techniques to determine the best approach for the final use case. The code used to test differnt approaches can be found in this file. The best performing approach was found to be Log Loss SGD, and the selection can be found at the top of the main.py file.

Models Tested

Logistic Regression
Ridge Classifier
K Nearest Neighbors (kNN)
Random Forest Classifier
Linear Support Vector Machine (SVC)
Log Loss Stochastic Gradient Descent (SGD)
Nearest Centroid
Complement Naive Bayes

Process

training data is read, some items are reclassified to reduce similarities between categories.
The selected oversampling technique modifies the training data to account for the size and category imbalance in the data set.
Text to be categorized is loaded and chunked using semantic-text-splitter.
Chunked text and training data are combined and vectorized/tokenized.
Model is trained on training data.
Text chunks are categorized by model.
Model categorization is compared to keyword classification and updated if necessary.
Sequential chunks with matching categorizations are combined to reduce the number of output chunks.

Execution with Machine Learning Approach

The steps to running the main machine learning pipeline are as follows:

Navigate to main.py
Update file location of training data and text file to be chunked and categorized
Fill in keywordDict with any keywords you would like to use for classification
Select True or False for human intervention when keyword and neural network don't match
run main.py

Testing

Basic functionality testing is built into each file and can be run by simply running the file. Testing for edge cases and comparing results to expected values is currently left to human-based quality control.

Additional Notes

Data used for this project is not stored within the repository. See below for information on where data came from and how to access it. Models used for this project are also not stored within the repository. Follow model instructions for local hosting and interfacing with Python.

Context on Previous Related Work

Known Issues:

None identified.

Associated datasets:

Data for this project was supplied by the CAFE public comments team. All data is publicly available at regulations.gov under the prevoius CAFE NPRM.

Version History and Retention

Status: This project is in active development phase.

Release Frequency: This project will be updated when there are stable developments. This will be approximately every month.

Retention: This project will likely remain publicly accessible indefinitely.

Release History: See CHANGELOG.md**

License

This project is dual-licensed under the Apache v2.0 License and Creative Commons 1.0 Universal (CC0 1.0) License. You can choose between one of them if you use this work.

SPDX-License-Identifier: Apache-2.0 OR CC0-1.0

Contributions

Please read [CONTRIBUTING.md] for details on our Code of Conduct, the process for submitting pull requests to us, and how contributions will be released.

Contact Information

Contact Name: Eric Englin Contact Information: Eric.Englin@dot.gov

Contact Name: Andrew DeCandia Contact Information: Andrew.DeCandia@dot.gov

Acknowledgements

Citing this code

To cite this code in a publication or report, please cite our associated report/paper and/or our source code. Below is a sample citation for this code:

Sample citation should be in the below format, with the formatted fields replaced with details of your source code

author_surname_or_organization, first_initial. (year). program_or_source_code_title (code_version) [Source code]. Provided by ITS/JPO and Volpe Center through GitHub.com. Accessed YYYY-MM-DD from doi_url.

When you copy or adapt from this code, please include the original URL you copied the source code from and date of retrieval as a comment in your code. Additional information on how to cite can be found in the ITS CodeHub FAQ.

Contributors

Andrew DeCandia (Volpe)
Vincent Livant (Volpe)
Jeremy Hicks (Volpe)
Robin Wilkinson (Volpe)
Ben Clinton (Volpe)
Ali Brodeur (Volpe)
Billy Chupp (Volpe)
Eric Englin (Volpe)

The development of ROADII that contributed to this public version was funded by the U.S. Intelligent Transportation Systems Joint Program Office (ITS JPO) under IAA HWE3A122. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the ITS JPO.

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
BERT Approach		BERT Approach
CAFE_Comment_Pulling_Tool/src		CAFE_Comment_Pulling_Tool/src
CAFE_LLM_spaCy_LM_Studio		CAFE_LLM_spaCy_LM_Studio
File_Parsing		File_Parsing
Images		Images
LM_Studio		LM_Studio
MinimalComputeApproach		MinimalComputeApproach
RAKE_and_Dify		RAKE_and_Dify
Sample_Comments		Sample_Comments
TextSplitter		TextSplitter
.gitignore		.gitignore
.gitignore.bak		.gitignore.bak
KeywordClassify.py		KeywordClassify.py
LICENSE.Apache-2.0		LICENSE.Apache-2.0
LICENSE.CC0-1.0		LICENSE.CC0-1.0
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Uh oh!

Repository files navigation

README Outline

Project Description

ROADII Use Case - Categorizing Public Comments

Prerequisites