
Implement Web Search Integration via Dynamic Web Search Node #28

@MathisVerstrepen

Description

This feature gives our LLM models access to real-time, up-to-date information from the web, improving response accuracy, reducing hallucinations, and expanding the range of queries the application can answer. It uses a hybrid approach: users trigger a web search directly from the chat interface, which dynamically inserts a configurable "Web Search Node" into the graph view, with SearXNG as the metasearch engine.

Depends on #22.

1. Core Concept & Value Proposition

  • Problem: LLMs have a knowledge cutoff and can "hallucinate" information. Users need real-time, factual answers.
  • Solution: Integrate web search to provide fresh, relevant context directly into the LLM prompt.
  • Web Search Node: A new node type in the graph view designed to perform web searches, fetch content, and provide it as context. It will replace the standard "Text to Text" generation node when web search is active.
  • User Benefits: More accurate responses, ability to ask about current events, research specific topics, and verify information.

2. User Flow: Hybrid Activation & Configuration

The user experience for triggering and configuring web search will combine simplicity with advanced control:

  • Chat View Toggle Activation:

    • UI Element: A prominent toggle switch or checkbox will be added near the message input field in the Chat view.
    • Behavior:
      • When this toggle is enabled, and the user sends a message, the system will not use the default "Text to Text" generation node. Instead, it will dynamically create and use a "Web Search Node" in the graph view for that specific chat turn.
      • When the toggle is disabled, the system reverts to using the standard "Text to Text" generation node.
    • Visual Feedback: The toggle should clearly indicate its active state.
  • Dynamic Node Placement in Graph View:

    • When the "Web Search" toggle is enabled in the chat view, the system will swap the active generation node from a "Text to Text" type to a "Web Search Node" for the current chat turn.
    • This Web Search Node will be implicitly created and configured based on global default settings (see Section 4).
    • Goal: Provide the power of a graph node without requiring the user to leave the chat view for basic web search.
  • Configuration & Control:

    • Default Configuration: The Web Search Node dynamically created from the chat toggle will inherit all its search parameters (e.g., number of results, search categories) from the Global Web Search Settings (see Section 4).
    • Editing the Node for More Control:
      • Users can navigate to the graph view.
      • If a Web Search Node was recently active (or persists), users can select and edit it directly.
      • Editing the node in the graph view allows users to override the global default settings for that specific node instance, providing granular control over the search query, number of results, specific domains, etc.
      • This allows for complex, persistent web search contexts within a canvas.
    • Changing Global Defaults: Users can modify the global Web Search settings (Section 4) to change the default behavior for all future dynamic Web Search Nodes created from the chat toggle.
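The inheritance-plus-override behavior above could be sketched as a simple config merge; all names here (`WebSearchConfig`, `resolve_config`) are illustrative, not the project's actual API:

```python
# Hypothetical sketch: a dynamically created Web Search Node resolves its
# effective configuration by layering per-node overrides (only the keys the
# user changed in the graph view) over the global defaults.
from dataclasses import dataclass, asdict


@dataclass
class WebSearchConfig:
    num_results: int = 5
    categories: str = "general"
    time_range: str = ""          # "", "day", "month", "year"
    language: str = "en"


def resolve_config(global_defaults: WebSearchConfig,
                   node_overrides: dict) -> WebSearchConfig:
    """Merge per-node overrides over the global defaults."""
    merged = asdict(global_defaults)
    merged.update({k: v for k, v in node_overrides.items() if v is not None})
    return WebSearchConfig(**merged)


# A node created from the chat toggle has no overrides, so it gets pure
# globals; a node edited in the graph view may override individual fields.
defaults = WebSearchConfig()
edited = resolve_config(defaults, {"num_results": 3, "categories": "news"})
```

A node created from the chat toggle would simply pass an empty overrides dict, so changing the global settings changes all future dynamic nodes without touching edited ones.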

3. Web Search Node Functionality (Backend & Frontend)

The Web Search Node will handle the entire web search lifecycle:

  • Query Formulation:

    • The primary search query will be derived from the user's prompt.
    • Optimization: A smaller LLM will rewrite or expand the user's natural-language query into a more effective search string for SearXNG.
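The rewriting step could use an instruction like the one below; the exact model call depends on the provider, and `build_rewrite_prompt` is only an assumed shape, not a committed API:

```python
# Illustrative prompt for the smaller "query rewriter" LLM: turn a chatty
# user message into a terse search string before it is sent to SearXNG.
def build_rewrite_prompt(user_message: str) -> str:
    return (
        "Rewrite the user's message as a concise web search query. "
        "Keep key entities, dates, and numbers; drop filler words. "
        "Return only the query, nothing else.\n\n"
        f"User message: {user_message}"
    )


prompt = build_rewrite_prompt(
    "Hey, can you tell me who won the 2024 Tour de France?")
```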
  • Search Execution (Via SearXNG):

    • Backend Integration: The system will make API calls to a SearXNG instance.
    • SearXNG Choice: We will use SearXNG because it is a free, open-source, self-hostable metasearch engine that aggregates results from many engines without requiring per-engine API keys.
    • Parameters: The API call will include:
      • The formulated search query.
      • Number of results to fetch (default from global settings, overrideable per node).
      • Optional: Specific categories (e.g., news, images, general), time range, language.
      • Optional: Specific domains (site:example.com).
    • Output: SearXNG returns a list of search results, each containing a title, snippet, and URL.
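A minimal sketch of the backend call, using SearXNG's JSON API (`GET /search` with `format=json`; the instance must have the `json` format enabled in its settings). `SEARXNG_URL` and the helper names are placeholders:

```python
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://localhost:8080"  # assumption: self-hosted instance


def build_search_params(query: str, categories: str = "general",
                        time_range: str = "", language: str = "en") -> dict:
    """Query-string parameters for SearXNG's JSON search endpoint."""
    params = {"q": query, "format": "json",
              "categories": categories, "language": language}
    if time_range:                       # "day", "month", or "year"
        params["time_range"] = time_range
    return params


def searxng_search(query: str, num_results: int = 5, **kwargs) -> list[dict]:
    url = f"{SEARXNG_URL}/search?" + urllib.parse.urlencode(
        build_search_params(query, **kwargs))
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.load(resp).get("results", [])
    # SearXNG returns a full page of results; keep only the top N, each
    # with the title/snippet/URL fields used downstream.
    return [{"title": r.get("title"), "snippet": r.get("content"),
             "url": r.get("url")} for r in results[:num_results]]
```

Note that SearXNG paginates rather than taking a result-count parameter, so the configured "number of results" is applied by slicing the first page.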
  • Content Fetching & Parsing:

    • Leverage Existing Feature: For the top N (configurable, default 3-5) relevant URLs returned by SearXNG, the system will use the existing firecrawl (or similar) implementation to fetch and parse the full webpage content. Depends on Implement URL Context Integration into LLM Prompts #22.
    • Error Handling: Gracefully handle URLs that are inaccessible (404, timeouts, paywalls, bot blocking, etc.). These should be skipped, and the LLM should not be fed broken context.
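The "skip broken sources" behavior could look like the sketch below, where `fetch_page` stands in for the existing firecrawl-based fetcher from #22 (its name and signature are assumptions). Any source that raises (404, timeout, bot block) is dropped so the LLM never receives broken context:

```python
def fetch_top_sources(urls: list[str], fetch_page, top_n: int = 3) -> list[dict]:
    """Fetch up to top_n pages, silently skipping inaccessible URLs."""
    fetched = []
    for url in urls:
        if len(fetched) >= top_n:
            break
        try:
            fetched.append({"url": url, "content": fetch_page(url)})
        except Exception:
            continue                  # inaccessible source: skip, don't fail


    return fetched


# usage with a fake fetcher (the real one would call firecrawl)
def _fake_fetch(url):
    if "broken" in url:
        raise TimeoutError("unreachable")
    return f"content of {url}"


sources = fetch_top_sources(
    ["https://a.example", "https://broken.example", "https://b.example"],
    _fake_fetch, top_n=2)
```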
  • Content Summarization & Formatting for LLM:

    • Purpose: Raw fetched content is often too verbose and exceeds LLM token limits.
    • Method:
      • Use an LLM (potentially a smaller, cheaper one than the main chat LLM) to summarize the extracted text from each relevant URL.
      • Focus on extracting key facts, figures, and direct answers pertinent to the original query.
      • Define strict token limits for the summarized content from each source and for the total aggregated context.
    • Formatting: Concatenate the summarized content into a clear, structured format for the main LLM:
      --- Web Search Result 1: [Title from SearXNG] (Source: [Original URL]) ---
      <Summarized content from URL 1, truncated if necessary>
      --- End of Web Search Result 1 ---
      
      --- Web Search Result 2: [Title from SearXNG] (Source: [Original URL]) ---
      <Summarized content from URL 2, truncated if necessary>
      --- End of Web Search Result 2 ---
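The delimiter format above can be produced with a small helper; the result dict shape (`title`, `url`, `summary`) is an assumption about the summarizer's output:

```python
def format_web_context(results: list[dict]) -> str:
    """Concatenate summarized results into the delimited context format."""
    blocks = []
    for i, r in enumerate(results, start=1):
        blocks.append(
            f"--- Web Search Result {i}: {r['title']} (Source: {r['url']}) ---\n"
            f"{r['summary']}\n"
            f"--- End of Web Search Result {i} ---")
    return "\n\n".join(blocks)


context = format_web_context([
    {"title": "Example", "url": "https://example.com", "summary": "Key facts."},
])
```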
      
  • LLM Prompt Construction:

    • The aggregated, summarized web search context will be injected into the main LLM's prompt, alongside the user's original message and relevant chat history.
    • Token Management (CRITICAL):
      • Perform a pre-flight token calculation for the entire prompt (user message + web context + history).
      • If token limits are exceeded, implement a truncation strategy: prioritize the user's direct prompt, then truncate the web search context (e.g., reduce number of sources, further summarize individual sources), and finally truncate chat history.
      • Provide user feedback if content was truncated.
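The truncation priority above (never cut the user's prompt; shrink web context first, then chat history) could be sketched as follows, using a crude 4-characters-per-token estimate as a stand-in for the real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic; production code would use the model's tokenizer.
    return len(text) // 4 + 1


def fit_prompt(user_msg: str, web_context: str, history: str,
               limit: int) -> tuple[str, str, bool]:
    """Trim (web_context, history) to fit the budget; bool = truncated."""
    budget = limit - estimate_tokens(user_msg)   # user prompt is never cut
    truncated = False

    def over() -> bool:
        return estimate_tokens(web_context) + estimate_tokens(history) > budget

    while over() and web_context:                # 1) shrink web context first
        web_context = web_context[: len(web_context) * 3 // 4]
        truncated = True
    while over() and history:                    # 2) then chat history
        history = history[: len(history) * 3 // 4]
        truncated = True
    return web_context, history, truncated
```

The `truncated` flag is what would drive the "content was truncated" feedback to the user; a real implementation would drop whole sources or re-summarize rather than cutting mid-text.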

4. Global Application Settings for Web Search

A new section in the application's "General Settings" will control global defaults for web search:

  • Enable/Disable Web Search: A master toggle to turn the entire web search feature on or off for the user.
  • Default Web Search Node Configuration:
    • Default Search Query Type: (e.g., "Use user's full message," "Use first sentence," "Automatic intent detection").
    • Default Number of Results: (e.g., 3, 5, 7).
    • Default Search Categories: (e.g., "General," "News," "Academic").
    • Default Timeframe: (e.g., "Anytime," "Past Year," "Past Month").
  • Domain Blacklist/Whitelist:
    • Extend the existing URL blacklist functionality to apply to web search results. Users can list domains to exclude (e.g., pinterest.com, example.com) or only include (if whitelist is chosen).
  • SearXNG Instance URL (Advanced): An option to configure the API endpoint.
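Applying the domain blacklist to search results could look like this sketch (field names are assumptions; subdomains of a blacklisted domain are also excluded):

```python
from urllib.parse import urlparse


def filter_results(results: list[dict], blacklist: set[str]) -> list[dict]:
    """Drop results whose host matches a blacklisted domain or subdomain."""
    def blocked(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in blacklist)

    return [r for r in results if not blocked(r["url"])]


kept = filter_results(
    [{"url": "https://pinterest.com/pin/1"}, {"url": "https://example.org/a"}],
    {"pinterest.com"})
```

A whitelist mode would simply invert the predicate, keeping only matching hosts.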

5. UI/UX Considerations

  • In-Progress Feedback:
    • Crucial for managing user expectations during the search process. Display concise messages in the chat UI: "Searching the web...", "Processing results...", "Summarizing context...".
    • Consider a small spinning indicator on the chat message or the toggle itself.
  • Citations & Source Attribution:
    • Mandatory: For LLM responses that incorporate web search results, clear citations must be provided.
    • Method: Display clickable links/chips below the LLM's response, showing the title of the article and the original URL. Clicking should open the source in a new tab.
    • Example: "According to [Source 1], [information]." with "[1] Article Title - example.com/url" listed below.
  • Error Feedback:
    • If SearXNG fails, or content fetching/parsing fails for a URL, provide clear, user-friendly messages (e.g., "Web search failed," "Could not access some sources").
    • The LLM should also be informed not to hallucinate answers if search context is missing.

6. Technical Challenges & Considerations

  • Token Management:
    • This is the most critical aspect. Aggressive summarization and intelligent truncation are paramount to control costs and stay within LLM limits.
    • Need to clearly define the token budget for web context vs. user prompt vs. chat history.
  • Latency: Web searches add significant latency to response times. Optimize every step (search, fetch, summarize) and provide excellent UI feedback.
  • Resilience:
    • firecrawl failures: Robust error handling for firecrawl when encountering paywalls, CAPTCHAs, or heavily dynamic sites.
    • Network Issues: Graceful degradation if external services are unreachable.
  • Security:
    • SSRF Prevention: Strict validation of URLs fetched by firecrawl to prevent Server-Side Request Forgery.
    • Content Sanitization: Ensure fetched content is safe before being processed by LLMs (e.g., stripping potentially malicious scripts if not handled by firecrawl).
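An illustrative SSRF guard for URLs handed to the fetcher: allow only http(s) and reject private, loopback, link-local, and reserved addresses. This is a sketch only; a real implementation must also validate every resolved IP (to defend against DNS rebinding):

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Reject non-http(s) schemes and internal IP literals."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # Hostname, not an IP literal; resolve and re-check in production.
        return True
    return not (ip.is_private or ip.is_loopback
                or ip.is_link_local or ip.is_reserved)
```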

Metadata

Labels

new-feature: Adding entirely new capabilities or functionalities to the application.

Status

Backlog