The AI Real Estate Assistant is a conversational AI application designed to help real estate agencies assist potential buyers and renters in finding their ideal properties. The application uses natural language processing and machine learning to understand user preferences and recommend suitable properties from a database.
For detailed project requirements and specifications, see the Product Requirements Document.
The AI Real Estate Assistant implements a Retrieval-Augmented Generation (RAG) system that combines the strengths of large language models with structured property data to provide intelligent property recommendations through natural conversation.
Component | Implementation | Why We Chose It |
---|---|---|
Large Language Models | OpenAI GPT / Llama | Provides natural language understanding and human-like responses with reasoning capabilities |
Data Processing | Pandas | Industry-standard for handling tabular data with powerful querying and transformation features |
Vector Embeddings | FastEmbed | Efficient, lightweight embedding generation for semantic search capabilities |
Vector Storage | DocArrayInMemorySearch, ChromaDB | In-memory solution for quick retrieval in demonstration environments |
Conversation Chain | ConversationalRetrievalChain | Maintains context across multiple turns of conversation with memory |
Web Interface | Streamlit | Rapid development of interactive web applications with minimal code |
-
Data Loading & Preprocessing:
- CSV data is loaded and normalized using Pandas
- Properties are transformed into vector embeddings using FastEmbed
- Embeddings are stored in a vector database for efficient similarity search
-
Conversation Handling:
- User queries are processed through the LLM to extract preferences
- Preferences are used to query the property database
- Results are formatted and presented in a conversational manner
-
Retrieval Enhancement:
- V2 implements RAG to combine retrieved property information with LLM generation
- This produces more relevant and accurate responses than pure LLM generation
- Natural Language Conversation: Engage users in a natural conversation about their property preferences
- Property Search & Filtering: Search properties based on location, budget, type, features, and more
- Multiple Data Sources: Support for local CSV files and external data sources via URLs
- Flexible LLM Integration: Support for OpenAI GPT models and open-source alternatives
- Responsive Web Interface: Easy-to-use Streamlit-based user interface
- Natural Interaction: Users can express preferences in natural language without learning complex search interfaces
- Contextual Understanding: System remembers previous interactions and builds cumulative understanding of user needs
- Flexible Integration: Works with various data sources and LLM providers, avoiding vendor lock-in
- Explainable Recommendations: Provides reasons why specific properties match user criteria
- Rapid Development: Architecture allows fast iteration and feature addition
- Data Dependency: Quality of recommendations depends heavily on the completeness and accuracy of property data
- Cold Start Challenges: Without sufficient initial information, recommendations may be too broad
- Performance Issues: Current implementation has a "db freeze" issue during heavy processing
- Limited Visualization: Currently text-based with limited visual property representation
- Scaling Concerns: In-memory vector store may face challenges with very large property databases
- Frontend: Streamlit
- Backend: Python (3.11+)
- AI/ML: LangChain, OpenAI GPT models, Llama models
- Data Processing: Pandas, FastEmbed
- Vector Storage: DocArrayInMemorySearch, ChromaDB
- Package Management: Poetry
├── ai/ # AI agent implementation
├── app.py # V1 application
├── app_v2.py # V2 application
├── assets/ # Screenshots and images
├── common/ # Common configurations
├── data/ # Data loading modules
├── dataset/ # Sample datasets
├── pyproject.toml # Poetry dependencies
├── README.md # This file
├── PRD.MD # Product Requirements Document
├── run_v1.sh # Script to run V1
├── run_v2.sh # Script to run V2
├── streaming.py # Streaming utilities
├── TODO.MD # Todo list
├── utils/ # Utility scripts
└── utils.py # Helper functions
- Used
langchain_experimental.agents.agent_toolkits.pandas.base.create_pandas_dataframe_agent
- Basic property search functionality with single-turn conversations
- Run with: run_v1.sh | Application file: app.py
- Uses different LLM models (OpenAI GPT, Llama)
- Implements RAG with
ConversationalRetrievalChain
for multi-turn conversations - Enhanced property search and filtering capabilities
- Run with: run_v2.sh | Application file: app_v2.py
Priority | Feature | Technical Approach | Expected Impact |
---|---|---|---|
High | Fix "db freeze" issue | Implement asynchronous processing and background tasks | Improved user experience with responsive UI |
High | Comprehensive testing suite | Add unit and integration tests with pytest | Enhanced reliability and easier maintenance |
Medium | Property visualization features | Integrate image handling and comparison views | Better user understanding of property options |
Medium | Multiple data source integration | Create adapters for RESTful APIs and additional file formats | Broader property selection and up-to-date data |
Medium | Advanced filtering capabilities | Enhance vector search with multi-dimensional preference modeling | More precise matching of user preferences |
Low | Market trend analysis | Add time-series analysis of property prices | Help users make better investment decisions |
-
Performance Optimization:
- Implement connection pooling for database operations
- Add request queuing system for high traffic scenarios
- Optimize embedding generation with batching
-
Enhanced User Experience:
- Add property comparison view for side-by-side evaluation
- Implement proactive suggestions based on user history
- Develop mobile-optimized interface
-
Data Enrichment:
- Integrate with mapping services for location visualization
- Add neighborhood statistics and amenity details
- Include historical price trends and future projections
The application follows a modular architecture with the following components:
- User Interface: Streamlit-based web interface
- AI Agent: LLM-powered conversational agent
- Data Processing: CSV loaders and data formatters
- Data Storage: In-memory database and vector stores
Aspect | V1 Implementation | V2 Implementation | Benefits of V2 |
---|---|---|---|
Agent Type | Pandas DataFrame Agent | ConversationalRetrievalChain | Improved conversation memory and context retention |
LLM Integration | Limited to OpenAI | Multiple model support (OpenAI, Llama) | Greater flexibility and cost options |
Conversation Mode | Single-turn | Multi-turn with history | More natural conversation flow and follow-up questions |
Result Quality | Basic data filtering | Enhanced reasoning with RAG | More relevant and personalized recommendations |
Technical Complexity | Lower | Higher | More powerful but requires more careful implementation |
Approach | Pros | Cons | Why We Chose Our Approach |
---|---|---|---|
Traditional Search Interface | Simpler implementation, faster queries | Limited understanding of complex preferences, requires user training | Our conversational approach provides better user experience and more intuitive interaction |
Pure LLM Without RAG | Simpler architecture, fewer components | Hallucinations, less accurate property details | RAG provides factual grounding and prevents misinformation |
SQL Database Only | Mature technology, optimized queries | Limited semantic understanding, rigid query structure | Vector database enables similarity search based on semantic meaning |
External API Integration | Access to real-time data | Dependency on third-party services, potential costs | Local CSV processing provides control and predictability for demonstrations |
When presenting this solution, focus on:
-
Natural Conversation Flow: Show how users can express complex requirements naturally
- Example: "I need a 2-bedroom apartment in the city center with a parking space under $1500"
-
Preference Refinement: Demonstrate how the system narrows recommendations as users provide more details
- Example: Start broad with "apartments in Warsaw" then refine with budget, amenities, etc.
-
Context Retention: Highlight how the assistant remembers previous statements
- Example: User asks about parking after discussing a property, system knows which property they're referring to
-
Reasoning Capabilities: Show how it explains why properties match criteria
- Example: "This property matches 90% of your criteria: it has the parking and 2 bedrooms you wanted, though it's slightly above your budget"
-
Technical Architecture: Briefly explain the RAG approach and how it prevents hallucinations
The application uses CSV datasets with the following structure. Datasets can be loaded from local files or remote URLs.
This table describes the columns in the DataFrame:
Column Name | Description |
---|---|
id |
Unique identifier for each record. |
city |
Name of the city where the property is located. |
type |
Type of property (e.g., apartment, house). |
square_meters |
Area of the property in square meters. |
rooms |
Number of rooms in the property. |
floor |
Floor number where the property is located. |
floor_count |
Total number of floors in the building. |
build_year |
Year the building was constructed. |
latitude |
Latitude coordinate of the property. |
longitude |
Longitude coordinate of the property. |
centre_distance |
Distance from the property to the city center. |
poi_count |
Number of Points of Interest nearby. |
school_distance |
Distance to the nearest school. |
clinic_distance |
Distance to the nearest clinic. |
post_office_distance |
Distance to the nearest post office. |
kindergarten_distance |
Distance to the nearest kindergarten. |
restaurant_distance |
Distance to the nearest restaurant. |
college_distance |
Distance to the nearest college. |
pharmacy_distance |
Distance to the nearest pharmacy. |
ownership |
Type of ownership (e.g., condominium). |
building_material |
Material used in the construction of the building. |
condition |
Condition of the property (e.g., new, good). |
has_parking_space |
Whether the property has a parking space (True /False ). |
has_balcony |
Whether the property has a balcony (True /False ). |
has_elevator |
Whether the building has an elevator (True /False ). |
has_security |
Whether the property has security (True /False ). |
has_storage_room |
Whether the property has a storage room (True /False ). |
price |
Price of the property. |
price_media |
Median price of similar properties. |
price_delta |
Difference between the property's price and price_media . |
negotiation_rate |
Possibility of negotiation (e.g., High, Medium, Low). |
bathrooms |
Number of bathrooms in the property. |
owner_name |
Name of the property owner. |
owner_phone |
Contact phone number of the property owner. |
has_garden |
Whether the property has a garden (True /False ). |
has_pool |
Whether the property has a pool (True /False ). |
has_garage |
Whether the property has a garage (True /False ). |
has_bike_room |
Whether the property has a bike room (True /False ). |
Challenge | Solution | Technical Details |
---|---|---|
Extracting Structured Preferences from Unstructured Text | LLM-based entity extraction | Using function-calling capabilities of GPT models to convert natural language to structured parameters |
Maintaining Conversation Context | ConversationBufferMemory | Storing and retrieving conversation history to inform subsequent responses |
Efficient Property Retrieval | Vector similarity search | Converting property descriptions to embeddings and finding semantic similarity to user queries |
Handling Multiple Data Formats | Standardized data schema | Normalizing diverse CSV formats to a consistent internal representation |
Performance Optimization | Result caching and parallelization | Implementing caching strategies to avoid redundant computations |
-
Modularity: Separation of concerns between data loading, AI processing, and user interface
# Example from data/csv_loader.py def load_csv_data(file_path: str) -> pd.DataFrame: """Load and normalize CSV data from a file path""" # Implementation follows Single Responsibility Principle
-
Extensibility: Designed for easy addition of new data sources or AI models
# Example from ai/agent.py class PropertyAgent: """Base agent class that can be extended with different LLM implementations""" # Implementation follows Open/Closed Principle
-
Configuration Management: Centralized configuration via environment variables and config files
# Example from common/cfg.py def get_api_key() -> str: """Retrieve API key from environment with appropriate fallbacks""" # Implementation follows best security practices
-
Error Handling: Robust error handling with graceful degradation
# Example strategy for handling API failures try: # Attempt primary model except Exception: # Fall back to alternative model or cached results
-
Testing Strategy: Unit tests for core components with integration testing for end-to-end flows
- Python 3.11 or later
- Poetry for dependency management
- OpenAI API Key (optional, for GPT models)
git clone https://github.com/AleksNeStu/ai-real-estate-assistant.git
cd ai-real-estate-assistant
# Install pip and poetry if needed
python -m ensurepip --upgrade
curl -sSL https://install.python-poetry.org | python3 - --version 1.7.0
# Configure Poetry
poetry config virtualenvs.in-project true
poetry config virtualenvs.prompt 'ai-real-estate-assistant'
# Install dependencies
poetry install --no-root
source .venv/bin/activate
Create a .env
file in the project root with the following contents:
OPENAI_API_KEY=your-api-key-here
Run either version of the application:
# Run V1
streamlit run app.py
# OR
./run_v1.sh
# Run V2
streamlit run app_v2.py
# OR
./run_v2.sh
For additional configuration options:
# Local run with custom settings
./utils/run_local.sh
This application can be deployed to Streamlit Cloud. For details, see the Streamlit Deployment Documentation.
# Build and run Docker container
./utils/run_docker.sh
# Run in development container
./utils/run_dev_container.sh
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.