Contents
- Data Pipeline
  - Purpose
  - Requirements
  - Functionality
- Web Application: AI Assistant
  - App
  - Functionality
    - Sidebar: Update DB
    - Main Page: Querying the Database
  - Notes
  - Demonstration
- About
- License
Requirements:
- Python 3.12.1
- beautifulsoup4 >= 4.13.3
- chromadb >= 0.6.3
- currencyconverter >= 0.18.3
- dbt-postgres >= 1.9.0
- dotenv >= 0.9.9
- groq >= 0.18.0
- langchain
- langgraph >= 0.3.5
- numpy == 1.26.4
- pandas >= 2.2.3
- pandera >= 0.23.0
- psycopg2-binary >= 2.9.10
- selenium >= 4.29.0
- sentence-transformers >= 3.4.1
- sqlalchemy >= 2.0.38
- streamlit >= 1.42.2
- torch >= 2.6.0
- tqdm >= 4.67.1
This part of the project automates the process of fetching, parsing, and loading flight price data into a database. It combines web scraping, HTML parsing, data validation, and database loading.
- Fetch flight price data for a specified route.
- Parse the HTML content of the scraped flight data.
- Extract relevant flight information, including departure/arrival details, flight numbers, and ticket prices.
- Validate the extracted data with Pandera (a schema sketch follows this list).
- Load the validated data into a PostgreSQL database.
- Trigger dbt (data build tool) models to transform and analyze the loaded data.
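The project's actual Pandera schema is not shown in this README, so the column names and dtypes below are assumptions based on the fields listed above; this is only a minimal sketch of schema-based validation.

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: column names and dtypes are assumed from the fields
# described in this README, not taken from the project's actual code.
flights_schema = pa.DataFrameSchema(
    {
        "departure_city": pa.Column(str),
        "departure_date": pa.Column(str),
        "departure_time": pa.Column(str),
        "arrival_city": pa.Column(str),
        "arrival_time": pa.Column(str),
        "flight_number": pa.Column(str),
        "stops": pa.Column(str, nullable=True),
        "duration": pa.Column(str, nullable=True),
        "price": pa.Column(float, pa.Check.ge(0), nullable=True),
    },
    coerce=True,
)


def validate_flights(df: pd.DataFrame) -> pd.DataFrame:
    """Raise a pandera SchemaError if the scraped data does not match the schema."""
    return flights_schema.validate(df)
```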
The script defines a primary function, `update_db(d, origin_cd, destin_cd)`, which performs the following steps (a rough sketch of the flow follows the list):

1. Date Retrieval:
   - Calculates the search date by adding `d` days to the current date using the `get_date` function from `updatedb.webscraping`.
   - Displays the search date in the Streamlit application.
2. Web Scraping:
   - The `get_flights` function (from `updatedb.webscraping`) fetches the HTML content of the flight search results.
   - The scraped data is saved in the `/data` folder as `flights.html` for parsing.
3. HTML Parsing:
   - Initializes a `Parser` object (from `updatedb.parser`).
   - Opens the `data/flights.html` file using the parser.
   - Finds all flight departure elements using BeautifulSoup.
4. Data Extraction:
   - Iterates through the departure elements and extracts flight details:
     - Departure city, date, and time.
     - Arrival city and time.
     - Flight number, number of stops, and flight duration.
     - Ticket prices.
   - Stores the extracted data in lists.
5. Data Processing:
   - Handles edge cases, such as the first entry having an incorrect date, and fills in missing price data.
   - Creates a Pandas DataFrame from the extracted lists.
   - Fills missing values using forward fill and backward fill.
   - Saves the DataFrame to a CSV file (`data/flights.csv`).
6. Database Loading:
   - Calls the `load_to_sql` function (from `updatedb.load_sql`) to validate the data with Pandera and load it into the SQL database.
7. dbt Execution:
   - Executes the `run_dbt` function (from `updatedb.load_sql`) to trigger dbt models for data transformation and analysis.
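The real implementation of these steps lives in the project's `updatedb` modules. The sketch below only illustrates how the flow could be wired together: the helper signatures (`get_date`, `get_flights`, `load_to_sql`, `run_dbt`), the CSS class used to locate departures, and the use of BeautifulSoup directly in place of the project's `Parser` wrapper are all assumptions.

```python
import pandas as pd
import streamlit as st
from bs4 import BeautifulSoup

# Hypothetical wiring of the steps described above; the helpers are named
# after the project's modules, but their real signatures may differ.
from updatedb.webscraping import get_date, get_flights
from updatedb.load_sql import load_to_sql, run_dbt


def update_db(d: int, origin_cd: str, destin_cd: str) -> None:
    # 1. Date retrieval: the search date is today + d days.
    search_date = get_date(d)
    st.write(f"Searching flights for {search_date}")

    # 2. Web scraping: the results page is saved as data/flights.html.
    get_flights(search_date, origin_cd, destin_cd)

    # 3. HTML parsing: find the departure elements (CSS class is assumed).
    with open("data/flights.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    departures = soup.find_all("div", class_="departure")

    # 4-5. Data extraction and processing: build a DataFrame, fill gaps, save CSV.
    rows = [{"raw": dep.get_text(" ", strip=True)} for dep in departures]
    df = pd.DataFrame(rows)
    df = df.ffill().bfill()
    df.to_csv("data/flights.csv", index=False)

    # 6-7. Validate with Pandera and load into PostgreSQL, then run the dbt models.
    load_to_sql(df)
    run_dbt()
```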
This Streamlit application provides an interface for interacting with flight price data stored in a PostgreSQL database. It offers two main functionalities: updating the database with new flight prices and querying the database using natural language through AI assistants.
The application aims to:
- Update Flight Price Data: Allow users to schedule updates to the database with flight price information for specified date ranges and city codes.
- SQL Query Database via Natural Language: Enable users to ask questions about the flight data in natural language, which are then translated into SQL queries or used to generate analytical insights.
- Ask Analyst AI Assistant: Enable users to ask questions about the flight data to an AI analyst that provides insights.
The application is structured with a sidebar for database updates and a main page for querying the data.
- Allows users to select multiple days ahead for which to update flight prices.
- Provides input fields for origin and destination city codes.
- Includes a button to trigger the `update_db` function (sketched below), which fetches, parses, and loads flight price data into the database.
- Displays a success message upon completion of the database update.
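The project's actual sidebar code is not reproduced here; the snippet below is a rough Streamlit sketch of the controls described above, with widget labels, default values, and the call into `update_db` invented for illustration.

```python
import streamlit as st

# Hypothetical sidebar layout; labels and defaults are illustrative,
# not copied from the project.
with st.sidebar:
    st.header("Update DB")
    days_ahead = st.multiselect("Days ahead", options=list(range(1, 31)), default=[1])
    origin_cd = st.text_input("Origin city code", value="GRU")
    destin_cd = st.text_input("Destination city code", value="JFK")

    if st.button("Update DB"):
        for d in days_ahead:
            update_db(d, origin_cd, destin_cd)  # defined in the sketch above
        st.success("Database updated.")
```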
- Provides two AI assistant options: "SQL Assistant" and "Flight Analyst."
- SQL Assistant:
  - Allows users to enter natural language questions.
  - Uses a LangGraph workflow (sketched below) to:
    - Translate the question into an SQL query.
    - Execute the SQL query against the database.
    - Generate a natural language answer based on the query results.
  - Displays the generated answer and the executed SQL query.
- Flight Analyst:
  - Allows users to enter natural language questions.
  - Uses the `groq_response` function to generate an analytical response based on the question.
  - Displays the generated response.
- Displays the current BRL to USD conversion rate.
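The project's actual LangGraph workflow, prompts, and model choice are not shown in this README. The sketch below only illustrates the translate-execute-answer flow described above; the node names, prompts, Groq model, and `DATABASE_URL` variable are assumptions.

```python
import os
from typing import TypedDict

import pandas as pd
from groq import Groq
from langgraph.graph import StateGraph, START, END
from sqlalchemy import create_engine

# Assumed configuration: environment variable names and model are illustrative.
client = Groq(api_key=os.environ["GROQ_API_KEY"])
engine = create_engine(os.environ["DATABASE_URL"])
MODEL = "llama-3.3-70b-versatile"  # assumed Groq model


class State(TypedDict):
    question: str
    sql: str
    result: str
    answer: str


def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def write_query(state: State) -> dict:
    # Translate the natural language question into SQL.
    sql = ask_llm(f"Write a single PostgreSQL query (no explanation) for: {state['question']}")
    return {"sql": sql}


def run_query(state: State) -> dict:
    # Execute the generated SQL against the flights database.
    df = pd.read_sql(state["sql"], engine)
    return {"result": df.to_string(index=False)}


def write_answer(state: State) -> dict:
    # Turn the raw query result into a natural language answer.
    answer = ask_llm(
        f"Question: {state['question']}\nSQL result:\n{state['result']}\nAnswer briefly."
    )
    return {"answer": answer}


graph = StateGraph(State)
graph.add_node("write_query", write_query)
graph.add_node("run_query", run_query)
graph.add_node("write_answer", write_answer)
graph.add_edge(START, "write_query")
graph.add_edge("write_query", "run_query")
graph.add_edge("run_query", "write_answer")
graph.add_edge("write_answer", END)
sql_assistant = graph.compile()

# Example invocation:
# state = sql_assistant.invoke({"question": "What is the cheapest flight next week?"})
# print(state["sql"], state["answer"])
```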
1. Updating the Database:
   - In the sidebar, select the desired number of days ahead.
   - Enter the origin and destination city codes.
   - Click the "Update DB" button.
2. Querying the Database:
   - On the main page, select either "SQL Assistant" or "Flight Analyst."
   - Enter your question in the text input field.
   - Click the "Run" button.
   - View the generated answer or analysis.
- Ensure that the necessary database connection details and the Groq API key for the AI assistants are configured correctly (a configuration sketch follows these notes).
- The bug fix for the torch class path issue is included in the code.
- The Streamlit page is configured to be wide, and the sidebar is collapsed by default.
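The README does not show how these settings are stored; a common pattern, assumed here, is a `.env` file loaded with `dotenv`, with the variable names below invented for illustration.

```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load variables from a local .env file (variable names are assumptions).
load_dotenv()

GROQ_API_KEY = os.environ["GROQ_API_KEY"]
DATABASE_URL = os.environ["DATABASE_URL"]  # e.g. postgresql+psycopg2://user:pass@host:5432/flights

engine = create_engine(DATABASE_URL)
```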
This project was developed by Gustavo R Santos.
LinkedIn
Website
This project is licensed under the MIT License.