Hello! Welcome to Atlas Chat. This README contains all the information you need to understand the code, run the website, and continue development. If you have any questions, please contact me at maxlaibson@gmail.com.
- Features
- Installation
- Hosting
- LLM Vocabulary
- Structure
- Inner Workings
- Problems & Possible Next Steps
- Contact
## Features

Atlas Chat is designed to help users explore data from the Opportunity Atlas paper. The chat can:

- Find variables
  - Users can search for variables related to various topics and broken down by race, gender, and parent income percentile
- Get location-specific data
  - Data is available for specific census tracts, counties, and commuting zones
  - Users can also get tables with data for all US counties, all the counties in a specific state, or all the census tracts in a specific state
- Calculate statistics
  - The chat can calculate mean, median, standard deviation, and correlation
- Make choropleth maps
  - Maps are available for all the counties in the US, all the counties in a state, and all the census tracts in a state
- Make scatter plots
- Answer questions
In addition to these features, Atlas Chat has a data download page where users can download the variables mentioned in their conversations for different races, genders, percentiles, and geographic levels.
The website also has an error reporting feature that sends the contents of the chat, the contents of the console, and a message entered by the user describing the problem to a Firestore database.
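The error-report payload described above could be assembled and written with the `google-cloud-firestore` client along these lines. This is a sketch, not code from the repository: the function names and the collection name `error_reports` are assumptions.

```python
def build_error_report(chat_contents: str, console_contents: str, user_message: str) -> dict:
    """Bundle the three pieces the README says are sent: the chat, the
    console output, and the user's description of the problem."""
    return {
        "chat": chat_contents,
        "console": console_contents,
        "message": user_message,
    }

def send_error_report(report: dict) -> None:
    """Write the report to Firestore. Requires the service-account key
    (atlas-chat-gcloud-key.json) to be configured as credentials."""
    from google.cloud import firestore  # imported here so the sketch runs without the package

    firestore.Client().collection("error_reports").add(report)
```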
Watch this video to understand how users can interact with the chat. 📹
## Installation

First, open the terminal in a directory of your choosing and clone this repository.

```shell
git clone https://github.com/xamxl/Atlas-Chat.git
```

Next, navigate to the `Flask` folder and install the Python requirements.

```shell
cd Atlas-Chat/Flask
python -m pip install --upgrade pip
pip install -r requirements.txt
```
You now need your own OpenAI API key and a Google Cloud service account (if you want error reporting to work). You can get an OpenAI API key here and a service account here. Put the OpenAI API key in the `.env` file, which you can find in the `Flask` folder. If you get a Google Cloud service account, put your JSON key into the `atlas-chat-gcloud-key.json` file.
Now, run the Python file to start the server.

```shell
python main.py
```

Finally, navigate to the localhost address (http://127.0.0.1:3000/). That's it! 🎉
Alternatively, you can run the chat in a Docker container. A Docker image of the current chat was shared with members of the OI team.
## Hosting

If you want to make a shareable link, you can upload the whole program to Google Cloud Run. Google Cloud Run is convenient because you can easily control how powerful the server is. The chat runs best when you give each instance maximum power: 8 vCPUs and 32 GB of memory. To scale the server up or down, you can change the maximum number of instances. One instance can support at least 20 users at once, and the maximum number of instances is 100.
To do this, first go to the `script.js` file in the project and replace all occurrences of `http://127.0.0.1:3000/` with the link to your Google Cloud Run deployment. You may need to deploy twice: once to figure out what this link is, and a second time with this link in the code.
Then set up a Google Cloud account and install the Google Cloud SDK. Navigate to the `Flask` folder and build and deploy the Docker image.

```shell
cd Atlas-Chat/Flask
gcloud run deploy
```
You may have to wait a while for the files to upload, but after that, the SDK will print out a link in the console and you will be all good to go. ☁️ 🔗
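The resource settings mentioned above (8 vCPUs, 32 GB of memory, a cap of 100 instances) can also be passed directly as flags to the deploy command. This is a sketch: the service name `atlas-chat` is a placeholder, and you should confirm the flags against your SDK version.

```shell
# Deploy from the Flask folder with the resource settings recommended above.
gcloud run deploy atlas-chat \
  --source . \
  --cpu 8 \
  --memory 32Gi \
  --max-instances 100
```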
## LLM Vocabulary

OpenAI offers two services through its API that are used in this project: text generation with ChatGPT and embedding generation.
Text generation involves providing a prompt to a model like gpt-4o and receiving a text-based response. For example, if you prompt “Tell me about standard deviation,” the model might respond with: “Standard deviation is a measure of the amount of variation or dispersion in a set of values. It indicates how much the individual data points in a dataset differ from the mean (average) of the dataset.”
Embedding generation involves inputting text into a model, such as text-embedding-3-large, and receiving a vector representing the meaning of the text. These embeddings are useful in search processes because the distance between two vectors can easily be calculated. A small distance indicates that the two pieces of text have similar meanings. The chat uses embeddings to search through variables in the database to find the one that is most likely to help the user.
While simple text generation is sufficient for many applications, there are cases where it is more helpful for ChatGPT to provide a structured response that can be used to trigger specific functions. For instance, if a user asks, “What is the weather in Boston?”, it would be ideal if ChatGPT could return a JSON object specifying the name of a function to call for retrieving the weather, along with the relevant parameters, such as the city name “Boston”. Luckily, ChatGPT can generate such a function call during text generation.

A function call allows the model to output a structured response instead of plain text. When function calling is enabled, the model can respond with either plain text or a JSON object containing the parameters needed to execute a specific function. There is also a parameter that can force ChatGPT to always use a function call, ensuring that it consistently provides structured output instead of plain text.
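The weather example above would look roughly like this as a request to the OpenAI chat completions API. This is an illustrative sketch, not code from Atlas Chat: the `get_weather` function and its `city` parameter are made up for the example.

```python
# Tool (function call) definition for the hypothetical get_weather function.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Boston"},
            },
            "required": ["city"],
        },
    },
}

# Setting tool_choice to "required" is the parameter that forces the model
# to always respond with a function call instead of plain text.
request_sketch = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [weather_tool],
    "tool_choice": "required",
}
```

When the model decides to call the function, the response contains a JSON arguments string such as `{"city": "Boston"}` rather than a conversational reply.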
## Structure

The image below shows the structure of the program. The program is divided into four main parts. First, there is a webpage consisting of HTML, CSS, and JS. This front end, which runs on the user's computer, can make requests to a backend server programmed in Python using the Flask library. The backend server can get data from the sheets and other files stored in the database. Finally, the backend can make requests to the OpenAI API.
To see the raw data this chat has access to, please click here and scroll down to "The Opportunity Atlas: Mapping the Childhood Roots of Social Mobility".
In this section, there are 13 subsections. Atlas Chat has access to the CSV files from all subsections except "Crosswalk Between 2010 and 2020 US Census Tracts" and "Replication Package". In the code, the subsections that it can access are numbered 1 to 11, with 1 corresponding to "Household Income and Incarceration for Children from Low-Income Households by Census Tract, Race, and Gender" and 11 corresponding to "Neighborhood Characteristics by Commuting Zone". Sheet 12 contains the variables present in sheet 5 but not sheet 4.
All of these sheets consist of columns, each of which holds the data for a specific variable. Variables generally have a category and may also have a race, gender, parent income percentile, and statistic type. Examples of categories include median income, high school graduation rate, and marriage rate. The races are black, white, natam, hisp, asian, other, and pooled. The genders are male, female, and pooled. The percentiles are p1, p10, p25, p50, p75, and p100. The statistics are mean, n, se, imp, and mean_se. For example, one column is `hs_male_black_mean`. This column gives the mean high school graduation rate for black men.
Some folders in the application only contain files corresponding to sheets 1, 4, 9, and 12. This is because these sheets collectively contain all the variables in the database. The other sheets contain these same variables for different geographical levels. The geographical levels are census tract, county, and commuting zone.
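The naming convention can be illustrated with a small parser. This is a sketch, not code from the repository: it assumes the `category_[race]_[gender]_statistic` layout described later in this README, and some real columns (for example, percentile-suffixed ones like `kfr_black_pooled_p50`) deviate from it.

```python
RACES = {"black", "white", "natam", "hisp", "asian", "other", "pooled"}
GENDERS = {"male", "female", "pooled"}
STATISTICS = {"mean", "n", "se", "imp", "mean_se"}

def parse_variable(name: str) -> dict:
    """Split a column name like kfr_black_pooled_mean into its parts."""
    parts = name.split("_")
    if parts[-2:] == ["mean", "se"]:  # mean_se is the one two-token statistic
        parts = parts[:-2] + ["mean_se"]
    if len(parts) != 4:
        raise ValueError(f"unexpected layout: {name}")
    category, race, gender, statistic = parts
    if race not in RACES or gender not in GENDERS or statistic not in STATISTICS:
        raise ValueError(f"unexpected tokens in: {name}")
    return {"category": category, "race": race, "gender": gender, "statistic": statistic}
```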
The list below goes over each file in the project, where it is located, and why it exists.
- `flowchart.jpeg`: Used in the README
- `structure.jpeg`: Used in the README
- `README.md`: A file that helps explain the code
- `setup.py`: Contains the code that was used to construct the database
- `.gitignore`: Prevents sensitive files from uploading to GitHub
- `Flask`: Contains the application itself, including the files for the front end, Flask backend, and database. When running the program, the whole `Flask` folder is part of the execution
  - `.env`: Stores the OpenAI API key
  - `atlas-chat-gcloud-key.json`: Stores the Google Cloud service account key
  - `countycode-countyname.csv`: Stores a table converting county codes to county names
  - `states.csv`: Stores a table converting state names to state IDs
  - `merged_data.csv`: Populated anew each time the server makes a map. There is no need to know what is in this file
  - `requirements.txt`: Lists the Python packages that need to be installed to run the Flask server
  - `main.py`: Contains the server code
  - `static`: Stores the CSS, JS, and images that `main.py` gives to the user's computer when it loads the page
  - `templates`: Stores the HTML that `main.py` gives to the user's computer when it loads the page
  - `map_data`: Contains files with information about the outlines of US counties and census tracts. Used to construct maps
  - `headers`: Contains CSV sheets, each with only one row. This row contains the header names for all the columns in that sheet
  - `data_columns`: Contains the data itself. Each column from the original data was turned into its own sheet, named with the number of the sheet the data came from and the variable name
  - `header_description`: Contains files with descriptions for each variable. There are only four files (fewer files than there are sheets) since many of the sheets have the same variables, just for different geographical levels
  - `label_col_names`: Holds the names of the label columns for each sheet. Examples of label columns are state ID, state name, county name, and county ID
  - `descriptions_units`: Contains each variable name and its description. Importantly, these variables do not specify race, gender, or percentile. For example, `descriptions_units` may contain a variable called `kfr_[race]_[gender]_mean`. `descriptions_units` also contains information on the sheets' units and the different outcomes in each sheet. Think of `descriptions_units` as the information from the README on Opportunity Insights' data page
  - `embeddings`: Contains the embeddings themselves. All columns in the sheets have a corresponding embedding, but each embedding normally corresponds to multiple columns. For example, `kfr_pooled_pooled_mean` and `kfr_black_pooled_mean` both have the same embedding. The embeddings for all label columns are set to zero
  - `Dockerfile`: Used to build a Docker image of the application
## Inner Workings

This section explains how the code works. Please look through the following flowchart before reading the text, since this section references the image.
When the user sends a message to the chat, `sendMessage()` in `script.js` begins handling the request. The function adds the user's message to the `messages` list and calls `useCase()`, which makes a server call to determine which action the chat should take. The endpoint `/useCase` on the server uses a function call from the OpenAI API to make the decision. This represents the first box in the flowchart. Then, `sendMessage()` calls the right function(s) based on which branch of the flowchart the function call picks.
If `/useCase` decides the user wants the chat to calculate a statistic or make a figure, `sendMessage()` calls a function that corresponds to one branch of the flowchart. For illustration, this description will follow the branch the chat picks if it thinks the user wants to make a scatter plot.
In this case, `requestGraphVars()` starts by calling `requestVar()`, which constructs a list of all the available variables and sends them to the server. `/pickGraphVars` on the server uses a function call from the OpenAI API to decide which variables should be used, or to write a message describing why the right variables were not available. Back in `requestGraphVars()`, the chat either prints out the text describing why the right variables were not found or calls `graphVariable()` to create the graph with the variables the server selected.
> **Warning**
> There is a bug in the code. The map download feature is not fully functional.
In this branch of the flowchart, `sendMessage()` starts by calling `variableSearch()`. This function calls `/formulateQueryOrRespond` on the server, which uses a function call from the OpenAI API to either directly respond to the user or fill out the fields required to search for a variable. These fields include race, gender, parental income percentile, location, and query (the keywords to use in the search).

Back in `variableSearch()`, the chat either prints out the response from the server or gets the values of all the fields so that `sendMessage()` can call the right function to find variables for the user. For example, if the user asks a question like "What is a standard deviation?", the chat will answer the question directly while in this part of the flowchart. If the user says, "Get me a median household income variable.", the flowchart will reach this stage and move on to Search after it collects the search fields.
In this branch of the flowchart, `sendMessage()` starts by calling `fetchData()`, which calls `/getRankedVariables` on the server with the keywords the previous function call said should be used for the search. Then, `/getRankedVariables` calls `getRankedVariables()`, which gets the embeddings of the keywords using the OpenAI API.

The rest of this description uses the following example: the keywords are "high school completion rate", which means the variables `getRankedVariables()` should return first are `hs_pooled_pooled_mean`, `hs_pooled_female_mean`, and `hs_pooled_male_mean`.
To find these variables, `getRankedVariables()` goes to the database and gets a set of precalculated embeddings. These embeddings each represent multiple variables. For example, `hs_pooled_pooled_mean`, `hs_pooled_female_mean`, and `hs_pooled_male_mean` are all represented by the same embedding, made with the text "hs - the fraction of children who completed high school." `getRankedVariables()` uses cosine similarity to measure the distances between the embedding of the user's keywords and the precalculated embeddings.
Next, `getRankedVariables()` calculates a "dumb distance", which is the fraction of the user's keywords that appear in each of the texts that were used to make the precalculated embeddings. For example, the text "hs - the fraction of children who completed high school" would have a "dumb distance" of 0.5, since "high" and "school" are in the text but "rate" and "completion" are not.
The two different distance metrics are then averaged, weighting the cosine similarity by 0.8 and the "dumb distance" by 0.2. The specific values 0.8 and 0.2 were chosen by trying different ratios and seeing which ratio most regularly returned the most relevant variables. The cosine similarity embedding distance is used so that terms like "upward mobility" which do not appear in the texts used to make the precalculated embeddings can be matched with phrases from the texts with similar meanings. The "dumb distance" is used so that terms like "individual" which do appear in the texts but whose meaning isn't picked up by the embedding model can still impact the search.
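The blended score can be sketched as follows. The helper names are illustrative rather than taken from the code; only the 0.8/0.2 weights and the "dumb distance" definition come from the text above.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dumb_distance(keywords, text):
    """Fraction of the user's keywords that appear in the description text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    kws = re.findall(r"[a-z]+", keywords.lower())
    return sum(1 for kw in kws if kw in words) / len(kws)

def blended_score(query_embedding, text_embedding, keywords, text):
    # Weights chosen in the README by trying different ratios and seeing
    # which one most regularly returned the most relevant variables.
    return (0.8 * cosine_similarity(query_embedding, text_embedding)
            + 0.2 * dumb_distance(keywords, text))
```

For example, `dumb_distance("high school completion rate", "hs - the fraction of children who completed high school")` returns 0.5, matching the worked example above.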
Finally, `getRankedVariables()` creates a list of all the variable names in the database, ordered by the distance between the text they correspond to and the user's keywords. For example, in this case, the list could start with `hs_pooled_pooled_mean`, `hs_pooled_female_mean`, and `hs_pooled_male_mean`. The lower the index of a variable in the list, the closer the variable is to the keywords. Strings with the variable description and the name of the sheet the variable is from are added to the variable list.
The front end receives the resulting list, and `sendMessage()` calls `makeTable()`, which puts the variables into a hidden HTML table so they can be processed.
`sendMessage()` then calls `condense()`, which removes duplicates from the table that have the same title but different races, genders, and percentiles. It saves the different race, gender, and percentile options for each variable to lists. For example, if `kfr_black_pooled_p50` and `kfr_black_pooled_p25` are both in the data, only one is kept.
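The dedup idea can be sketched in Python (the real `condense()` lives in `script.js` and works on the HTML table, so the names and return shape below are illustrative):

```python
RACES = {"black", "white", "natam", "hisp", "asian", "other", "pooled"}
GENDERS = {"male", "female", "pooled"}
PERCENTILES = {"p1", "p10", "p25", "p50", "p75", "p100"}
OPTION_TOKENS = RACES | GENDERS | PERCENTILES

def condense(variables):
    """Keep one representative per variable family and record the
    race/gender/percentile options seen for that family."""
    kept, options = [], {}
    for name in variables:
        # Replace race/gender/percentile tokens with placeholders to get a
        # family key, e.g. kfr_black_pooled_p50 -> kfr_*_*_*
        key_parts, opts = [], []
        for part in name.split("_"):
            if part in OPTION_TOKENS:
                key_parts.append("*")
                opts.append(part)
            else:
                key_parts.append(part)
        key = "_".join(key_parts)
        if key not in options:
            kept.append(name)       # first member of the family is kept
            options[key] = []
        options[key].append(opts)   # every member's options are recorded
    return kept, options
```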
`sendMessage()` then calls `chooseDropdown()`, which makes sure that the race, gender, and percentile values for each variable are in line with the fields filled out by `/formulateQueryOrRespond`. For example, suppose `condense()` took in `hs_pooled_pooled_mean`, `hs_male_pooled_mean`, and `hs_female_pooled_mean` and removed both `hs_male_pooled_mean` and `hs_female_pooled_mean`, but `/formulateQueryOrRespond` specified that the user wants the data for females only. In that case, `chooseDropdown()` will replace `hs_pooled_pooled_mean` with `hs_female_pooled_mean`.
`sendMessage()` then calls `linkRows()`, which reorders the table so "families" are together (or "linked"). For example, if `kfr_pooled_pooled_mean` and `kfr_pooled_pooled_p50` are both in the table but are not next to each other, then `linkRows()` will move the one farther from the top of the table to be next to the other. The function `linkRows()` also adds text to the descriptions of variables ending in `_n`, `_se`, `_p50`, and `_p25` to make sure the chat understands what these variables mean. This additional text is not visible to the user. Finally, the function returns a string listing the top 10 variables and their descriptions in the table. The value of 10 can easily be changed: when this number is larger, the next step is better at choosing the right variable but costs more.
The next function called by `sendMessage()` is `pickVarAndDescribe()`. This function calls `/pickVarAndDescribe` on the server, which uses a function call from the OpenAI API to decide which of the variables returned by `linkRows()` to use. Along with its choice, the function call also returns a description of the variable.
If the function call does not find a suitable variable, back in `pickVarAndDescribe()` the chat will print an error message to let the user know no variable was found. If a variable was found and no location was specified by the user, the chat displays the resulting variable along with its description.
If a location was specified by the user, then `sendMessage()` calls `getLocationData()`. This function takes different actions depending on what type of location was requested. This description follows the branch of the flowchart where the user requested data for "All US Counties".
In this case, `getLocationData()` calls `fetchDataLoc()`, which calls `/getData` on the server, which finds the correct data and returns it to the front end along with its units. The returned data is a table consisting of the variable the function call decided was best, along with the label columns, which in this case would include the county ID and the county name. The data can be fetched quickly because each variable or label column is stored in its own sheet, named according to the header of the column. `getLocationData()` then adds the resulting data to the chat while keeping it hidden.
Finally, after the data is fetched, `sendMessage()` calls `describeLocationData()`, which then calls `/describeLocationData` on the server. `/describeLocationData` uses a direct request, instead of a function call, to ask the OpenAI API to describe the data. The server then returns the description to the front end, which displays the variable, location-specific data, and description to the user.
## Problems & Possible Next Steps

This section outlines some of the key problems with the chat and ways that these problems could be fixed.
Sometimes, when the variable the user wants is not one of the top 10 variables returned by `linkRows()`, the function call in `/pickVarAndDescribe` still returns the name of the variable that the user wants. Right now this causes the chat to break, because the code can't find the variable from the function call in the list of possible variables. This could be fixed by making the chat directly search for variable names that are returned by the function call in `/pickVarAndDescribe` but are not in the list of possible variables created by `linkRows()`. In other words, if "Pick Final Variable From Resulting List" fails after the function call in `/pickVarAndDescribe` returns a variable name, the chat should first try finding that variable directly in the database before returning an error.
To save money, the function call in `/useCase` used to get only one message to use when deciding which action to take. Since it was missing most of the context of the chat, the function call sometimes decided to take the wrong action. Recently, this was updated to four messages because the OpenAI API got much cheaper, so it made sense to give `/useCase` more messages to use when deciding which action to take. If this problem persists, it might be a good idea to add even more messages to the function call. Another option is to use a cheaper model like gpt-3.5 or gpt-4o-mini for this task and give the function call all of the messages.
If a user knows how to use the chat, then generally they should be able to find what they are looking for. However, if someone is not familiar with the chat's workflow, then trying to get data, calculate statistics, or make figures can be confusing. For example, they may not know that to make a map they first have to request a variable for a specific location and then ask for the map to be created. To fix this, more explanatory text could be added to the chat's error messages and to the prompts given to ChatGPT. Another way to fix this would be to improve the workflow so that the chat can do two tasks following one prompt. For example, the chat could first get the data and then create the map without the user having to break the task down into two prompts.
During a testing session with people from the lab, several OI members asked for the same three features:
- The ability for the chat to manipulate data in more ways. For example, people wanted to be able to get the rows of tables with the biggest or smallest values and to get the difference between two tables. While these features could be added to the chat individually, a better solution would be to give the chat the ability to write and execute JS code on tables so that it could complete any task, even ones it was not preprogrammed to do.
- The ability to understand more about how the data was calculated and what data is available. To do this the chat could be given the ability to search through the Atlas Paper with an embedding search engine to find relevant information to add to its context.
- The ability to fetch multiple variables simultaneously with one request and the ability to fetch data and make a figure with one request.
New embedding models and LLMs that outperform the ones used in this code are regularly being released. Researching the best LLMs and embedding models and replacing the models currently in use will improve the chat. Try looking for ChatGPT 5 and for Llama 3.1 405B hosted by Groq.
Lastly, if more data needs to be added to the chat, make sure to read through the `setup.py` file to see how raw datasets can be converted into a form the chat can understand. Look through the workflow in `main.py` and `script.js` to see the places where functions need to be changed. It will probably be necessary to update the location-fetching code and the code that parses races, genders, parental income percentiles, and statistic types.