diff --git a/examples/gemini/python/docs-agent/README.md b/examples/gemini/python/docs-agent/README.md index e7f11f294..357761fd2 100644 --- a/examples/gemini/python/docs-agent/README.md +++ b/examples/gemini/python/docs-agent/README.md @@ -9,10 +9,10 @@ and project deployment. ## Overview -Docs Agent apps use a technique known as Retrieval Augmented Generation (RAG), which allows -you to bring your own documents as knowledge sources to AI language models. This approach -helps the AI language models to generate relevant and accurate responses that are grounded -in the information that you provide and control. +Docs Agent apps use a technique known as Retrieval Augmented Generation (RAG), which +allows you to bring your own documents as knowledge sources to AI language models. +This approach helps the AI language models to generate relevant and accurate responses +that are grounded in the information that you provide and control. ![Docs Agent architecture](docs/images/docs-agent-architecture-01.png) @@ -26,66 +26,43 @@ the [Set up Docs Agent][set-up-docs-agent] section below. The following list summarizes the tasks and features supported by Docs Agent: -- **Process Markdown**: Split Markdown files into small plain text files. (See the - Python scripts in the [`preprocess`][preprocess-dir] directory.) -- **Generate embeddings**: Use an embedding model to process small plain text files - into embeddings, and store them in a vector database. (See the - [`populate_vector_database.py`][populate-vector-database] script.) -- **Perform semantic search**: Compare embeddings in the vector database to retrieve - most relevant content given user questions. -- **Add context to a user question**: Add a list of text chunks returned from - a semantic search as context in a prompt. (See the - [Structure of a prompt to a language model][prompt-structure] section.) -- **(Experimental) “Fact-check” responses**: This experimental feature composes +- **Process Markdown**: Split Markdown files into small plain text chunks. (See + [Docs Agent chunking process][chunking-process].) +- **Generate embeddings**: Use an embedding model to process text chunks into embeddings + and store them in a vector database. +- **Perform semantic search**: Compare embeddings in a vector database to retrieve + chunks that are most relevant to user questions. +- **Add context to a user question**: Add chunks returned from a semantic search as + [context][prompt-structure] to a prompt. +- **Fact-check responses**: This [experimental feature][fact-check-section] composes a follow-up prompt and asks the language model to “fact-check” its own previous response. - (See the [Using a language model to fact-check its own response][fact-check-section] - section.) -- **Generate related questions**: In addition to displaying a response to the user - question, the web UI displays 5 questions generated by the language model based on - the context of the user question. (See the - [Using a language model to suggest related questions][related-questions-section] - section.) -- **Return URLs of documentation sources**: Docs Agent's vector database stores URLs - as metadata next to embeddings. Whenever the vector database is used to retrieve - text chunks for context, the database can also return the URLs of the sources used - to generate the embeddings. -- **Collect feedback from users**: Docs Agent's chatbot web UI includes buttons that - allow users to [like generated responses][like-generated-responses] or - [submit rewrites][submit-a-rewrite]. 
+- **Generate related questions**: In addition to answering a question, Docs Agent can
+  [suggest related questions][related-questions-section] based on the context of the
+  question.
+- **Return URLs of source documents**: URLs are stored as chunks' metadata. This enables
+  Docs Agent to return the URLs of the source documents.
+- **Collect feedback from users**: Docs Agent's web app has buttons that allow users
+  to [like responses][like-generated-responses] or [submit rewrites][submit-a-rewrite].
 - **Convert Google Docs, PDF, and Gmail into Markdown files**: This feature uses
-  Apps Script to convert Google Docs, PDF, and Gmail into Markdown files, which then
-  can be used as input datasets for Docs Agent. (See the
-  [`apps_script`][apps-script-readme] directory.)
-- **Run benchmark test to monitor the quality of AI-generated responses**: Using
-  Docs Agent, you can run [benchmark test][benchmark-test] to measure and compare
-  the quality of text chunks, embeddings, and AI-generated responses.
-- **Use the Semantic Retrieval API and AQA model**: You can use Gemini's
-  [Semantic Retrieval API][semantic-api] to upload source documents to an online
-  corpus and use the [AQA model][aqa-model] that is specifically created for answering
-  questions using an online corpus.
-- **Manage online corpora using the Docs Agent CLI**: The Docs Agent CLI enables you
-  to create, populate, update and delete online corpora using the Semantic Retrieval AI.
-  For the list of all available Docs Agent command lines, see the
-  [Docs Agent CLI reference][cli-reference] page.
-- **Run the Docs Agent CLI from anywhere in a terminal**: You can set up the
-  Docs Agent CLI to ask questions to the Gemini model from anywhere in a terminal.
-  For more information, see the [Set up Docs Agent CLI][cli-readme] page.
-- **Support the Gemini 1.5 models**: You can use the new Gemini 1.5 models,
-  `gemini-1.5-pro-latest` and `text-embedding-004`, with Docs Agent today.
-  For the moment, the following `config.yaml` setup is recommended:
-
-  ```
-  models:
-    - language_model: "models/aqa"
-      embedding_model: "models/text-embedding-004"
-      api_endpoint: "generativelanguage.googleapis.com"
-  ...
-  app_mode: "1.5"
-  db_type: "chroma"
-  ```
-
-  The setup above uses 3 Gemini models to their strength: AQA (`aqa`),
-  Gemini 1.0 Pro (`gemini-pro`), and Gemini 1.5 Pro (`gemini-1.5-pro-latest`).
+  [Apps Script][apps-script-readme] to convert Google Docs, PDF, and Gmail into
+  Markdown files, which can then be used as input datasets for Docs Agent.
+- **Run benchmark tests**: Docs Agent can [run a benchmark test][benchmark-test] to measure
+  and compare the quality of text chunks, embeddings, and AI-generated responses.
+- **Use the Semantic Retrieval API and AQA model**: Docs Agent can use Gemini's
+  [Semantic Retrieval API][semantic-api] to upload source documents to online corpora
+  and use the [AQA model][aqa-model] for answering questions.
+- **Manage online corpora using the Docs Agent CLI**: The [Docs Agent CLI][cli-reference]
+  lets you create, update, and delete online corpora using the Semantic Retrieval API.
+- **Prevent duplicate chunks and delete obsolete chunks in databases**: Docs Agent
+  uses [metadata in chunks][chunking-process] to prevent uploading duplicate chunks
+  and to delete obsolete chunks that are no longer present in the source.
+- **Run the Docs Agent CLI from anywhere in a terminal**:
+  [Set up the Docs Agent CLI][cli-readme] to make requests to the Gemini models
+  from anywhere in a terminal.
+- **Support the Gemini 1.5 models**: Docs Agent works with the Gemini 1.5 models, + `gemini-1.5-pro-latest` and `text-embedding-004`. The new ["1.5"][new-15-mode] web app + mode uses all three Gemini models to their strength: AQA (`aqa`), Gemini 1.0 Pro + (`gemini-pro`), and Gemini 1.5 Pro (`gemini-1.5-pro-latest`). For more information on Docs Agent's architecture and features, see the [Docs Agent concepts][docs-agent-concepts] page. @@ -122,26 +99,26 @@ Update your host machine's environment to prepare for the Docs Agent setup: 1. Update the Linux package repositories on the host machine: - ```posix-terminal + ``` sudo apt update ``` 2. Install the following dependencies: - ```posix-terminal + ``` sudo apt install git pipx python3-venv ``` 3. Install `poetry`: - ```posix-terminal + ``` pipx install poetry ``` 4. To add `$HOME/.local/bin` to your `PATH` variable, run the following command: - ```posix-terminal + ``` pipx ensurepath ``` @@ -157,7 +134,7 @@ Update your host machine's environment to prepare for the Docs Agent setup: 6. Update your environment: - ```posix-termainl + ``` source ~/.bashrc ``` @@ -202,25 +179,25 @@ Clone the Docs Agent project and install dependencies: 1. Clone the following repo: - ```posix-terminal + ``` git clone https://github.com/google/generative-ai-docs.git ``` 2. Go to the Docs Agent project directory: - ```posix-terminal + ``` cd generative-ai-docs/examples/gemini/python/docs-agent ``` 3. Install dependencies using `poetry`: - ```posix-terminal + ``` poetry install ``` 4. Enter the `poetry` shell environment: - ```posix-terminal + ``` poetry shell ``` @@ -437,3 +414,5 @@ Meggin Kearney (`@Meggin`), and Kyo Lee (`@kyolee415`). [oauth-client]: https://ai.google.dev/docs/oauth_quickstart#set-cloud [cli-readme]: docs_agent/interfaces/README.md [cli-reference]: docs/cli-reference.md +[chunking-process]: docs/chunking-process.md +[new-15-mode]: docs/config-reference.md#app_mode diff --git a/examples/gemini/python/docs-agent/docs/chunking-process.md b/examples/gemini/python/docs-agent/docs/chunking-process.md new file mode 100644 index 000000000..e4c054b92 --- /dev/null +++ b/examples/gemini/python/docs-agent/docs/chunking-process.md @@ -0,0 +1,68 @@ +# Docs Agent chunking process + +This page describes Docs Agent's chunking process and potential optimizations. + +Currently, Docs Agent utilizes Markdown headings (`#`, `##`, and `###`) to +split documents into smaller, manageable chunks. However, the Docs Agent team +is actively developing more advanced strategies to improve the quality and +relevance of these chunks for retrieval. + +## Chunking technique + +In Retrieval Augmented Generation ([RAG][rag]) based systems, ensuring each +chunk contains the right information and context is crucial for accurate +retrieval. The goal of an effective chunking process is to ensure that each +chunk encapsulates a focused topic, which enhances the accuracy of retrieval +and ultimately leads to better answers. At the same time, the Docs Agent team +acknowledges the importance of a flexible approach that allows for +customization based on specific datasets and use cases. + +Key characteristics in Docs Agent’s chunking process include: + +- **Docs Agent splits documents based on Markdown headings.** However, + this approach has limitations, especially when dealing with large sections. +- **Docs Agent chunks are smaller than 5000 bytes (characters).** This size + limit is set by the embedding model used in generating embeddings. 
+- **Docs Agent enhances chunks with additional metadata.** The metadata helps
+  Docs Agent execute operations efficiently, such as preventing duplicate
+  chunks in databases and deleting obsolete chunks that are no longer
+  present in the source.
+- **Docs Agent retrieves the top 5 chunks and displays the top chunk's URL.**
+  However, this is adjustable in Docs Agent’s configuration (see the `widget`
+  and `experimental` app modes).
+
+The Docs Agent team continues to explore various optimizations to enhance
+the functionality and effectiveness of the chunking process. These efforts
+include refining the chunking algorithm itself and developing advanced
+post-processing techniques, for instance, reconstructing chunks into their
+original documents after retrieval.
+
+Additionally, the team has been exploring methods for co-optimizing content
+structure and chunking strategies, with the aim of maximizing retrieval
+effectiveness by ensuring that the structure of the source document itself
+complements the chunking process.
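+
+To make the heading-based splitting concrete, here is a minimal sketch of
+the idea (illustrative only; Docs Agent's actual implementation lives in the
+`preprocess` scripts and also handles metadata, front matter, and other
+edge cases):
+
+```
+import re
+
+CHUNK_SIZE_LIMIT = 5000  # The size limit imposed by the embedding model.
+
+def split_markdown(text: str) -> list[str]:
+    """Split a Markdown document on #, ##, and ### headings."""
+    sections: list[str] = []
+    current: list[str] = []
+    for line in text.splitlines():
+        # Start a new section at every level-1 to level-3 heading.
+        if re.match(r"^#{1,3} ", line) and current:
+            sections.append("\n".join(current))
+            current = []
+        current.append(line)
+    if current:
+        sections.append("\n".join(current))
+    # Oversized sections need a second pass to stay under the size limit.
+    return [
+        section[i : i + CHUNK_SIZE_LIMIT]
+        for section in sections
+        for i in range(0, len(section), CHUNK_SIZE_LIMIT)
+    ]
+```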
+
+## Chunks retrieval
+
+Docs Agent employs two distinct approaches for storing and retrieving chunks:
+
+- **The local database approach uses a [Chroma][chroma] vector database.**
+  This approach grants greater control over the chunking and retrieval
+  process. This option is recommended for development and experimental
+  setups.
+- **The online corpus approach uses Gemini’s
+  [Semantic Retrieval API][semantic-retrieval].** This approach provides
+  the advantages of centrally hosted online databases, ensuring
+  accessibility for all users throughout the organization. It also has
+  some drawbacks: control is reduced because the API may dictate how
+  chunks are selected and where customization can be applied.
+
+Choosing between these approaches depends on the specific needs of each
+deployment, balancing control and transparency against possible gains in
+performance, broader reach, and ease of use.
+
+
+
+[rag]: concepts.md
+[chroma]: https://docs.trychroma.com/
+[semantic-retrieval]: https://ai.google.dev/gemini-api/docs/semantic_retrieval
diff --git a/examples/gemini/python/docs-agent/docs/cli-reference.md b/examples/gemini/python/docs-agent/docs/cli-reference.md
index c3d3100d9..e7385aa92 100644
--- a/examples/gemini/python/docs-agent/docs/cli-reference.md
+++ b/examples/gemini/python/docs-agent/docs/cli-reference.md
@@ -3,10 +3,15 @@
 This page provides a list of the Docs Agent command lines and their usages
 and examples.
 
-**Important**: All `agent` commands in this page need to run in the
-`poetry shell` environment.
+The Docs Agent CLI helps developers manage the Docs Agent project and
+interact with language models. It can handle various tasks such as
+processing documents, populating vector databases, launching the chatbot,
+running benchmark tests, sending prompts to language models, and more.
 
-## Processing of Markdown files
+**Important**: All `agent` commands need to run in the `poetry shell`
+environment.
+
+## Processing documents
 
 ### Chunk Markdown files into small text chunks
 
@@ -53,7 +58,16 @@ The command below deletes development databases specified in the
 agent cleanup-dev
 ```
 
-## Docs Agent chatbot web app
+### Write logs to a CSV file
+
+The command below writes the summaries of all captured debugging information
+(in the `logs/debugs` directory) to a `.csv` file:
+
+```sh
+agent write-logs-to-csv
+```
+
+## Launching the chatbot web app
 
 ### Launch the Docs Agent web app
 
@@ -89,7 +103,7 @@ a log view page (which is accessible at `/logs`):
 agent chatbot --enable_show_logs
 ```
 
-## Docs Agent benchmark test
+## Running the benchmark test
 
 ### Run the Docs Agent benchmark test
 
@@ -158,7 +172,44 @@ absolure or relative path, for example:
 agent helpme write comments for this C++ file? --file ../my-project/test.cc
 ```
 
-## Online corpus management
+### Ask for advice in a session
+
+The command below starts a new session (`--new`), which tracks responses,
+before running the `agent helpme` command:
+
+```sh
+agent helpme <REQUEST> --file <PATH_TO_FILE> --new
+```
+
+For example:
+
+```sh
+agent helpme write a draft of all features found in this README file? --file ./README.md --new
+```
+
+After starting a session, use the `--cont` flag to include the previous
+responses as context for the request:
+
+```sh
+agent helpme <REQUEST> --cont
+```
+
+For example:
+
+```sh
+agent helpme write a concept doc that delves into more details of these features? --cont
+```
+
+### Ask for advice using RAG
+
+The command below uses a local or online vector database (specified in
+the `config.yaml` file) to retrieve relevant context for the request:
+
+```sh
+agent helpme <REQUEST> --file <PATH_TO_FILE> --rag
+```
+
+## Managing online corpora
 
 ### List all existing online corpora
 
diff --git a/examples/gemini/python/docs-agent/docs/config-reference.md b/examples/gemini/python/docs-agent/docs/config-reference.md
new file mode 100644
index 000000000..cbb5eda9e
--- /dev/null
+++ b/examples/gemini/python/docs-agent/docs/config-reference.md
@@ -0,0 +1,156 @@
+# Docs Agent configuration reference
+
+This page provides a list of additional options that can be specified
+in the Docs Agent configuration file ([`config.yaml`][config-yaml]).
+
+## Web application options
+
+### app_port
+
+This field sets the port on which the Docs Agent web app runs.
+
+```
+app_port: 5001
+```
+
+By default, the web app is set to use port 5000.
+
+### app_mode
+
+This field controls the user interface mode of the web app.
+
+The options are:
+
+* `widget`: This mode launches a compact widget-style interface, suitable
+  for being embedded within a webpage.
+
+  ```
+  app_mode: "widget"
+  ```
+
+* `1.5`: This special mode is designed to be used with the Gemini 1.5 Pro
+  model.
+
+  ```
+  models:
+    - language_model: "models/aqa"
+  ...
+  app_mode: "1.5"
+  ```
+
+When this field is not specified, the web app is set to use the standard mode.
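+
+As a rough sketch, this field can be thought of as selecting one of the
+template sets that ship with Docs Agent (the folder names below exist in
+this project, but the helper function itself is hypothetical):
+
+```
+def select_template_dir(app_mode: str = "") -> str:
+    """Map the configured app_mode to a template directory."""
+    if app_mode == "widget":
+        return "chat-widget"  # Compact widget-style interface.
+    if app_mode == "1.5":
+        return "chatui-1.5"  # Interface designed for the Gemini 1.5 Pro model.
+    return "chatui"  # Standard mode (default).
+```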
+
+## User feedback options
+
+### feedback_mode
+
+This field sets the type of feedback mechanism available to users for rating
+the quality or relevance of responses.
+
+The options are:
+
+* `feedback`: This is the default setting.
+
+  ```
+  feedback_mode: "feedback"
+  ```
+
+* `rewrite`: This option provides the "Rewrite this response" button, which
+  allows users to suggest alternative responses.
+
+  ```
+  feedback_mode: "rewrite"
+  ```
+
+* `like_and_dislike`: This option provides simple "Like" and "Dislike" buttons.
+
+  ```
+  feedback_mode: "like_and_dislike"
+  ```
+
+## Logging options
+
+### log_level
+
+This field controls the level of detail captured in the logs generated by Docs
+Agent.
+
+Setting it to `VERBOSE` provides more comprehensive logging information:
+
+```
+log_level: "VERBOSE"
+```
+
+This field is set to `NORMAL` by default.
+
+### enable_show_logs
+
+Setting this field to `"True"` allows logs to be displayed in a web browser
+(accessible at `/logs`):
+
+```
+enable_show_logs: "True"
+```
+
+### enable_logs_to_markdown
+
+Setting this field to `"True"` saves the generated answers as Markdown pages
+on the host machine:
+
+```
+enable_logs_to_markdown: "True"
+```
+
+### enable_logs_for_debugging
+
+Setting this field to `"True"` generates detailed logs for debugging purposes:
+
+```
+enable_logs_for_debugging: "True"
+```
+
+## Database management options
+
+### enable_delete_chunks
+
+Setting this field to `"True"` enables deleting outdated, stale text chunks
+from the vector databases:
+
+```
+enable_delete_chunks: "True"
+```
+
+## Secondary database configuration
+
+Docs Agent can use a secondary database alongside the primary one to provide
+additional context from a different source.
+
+### secondary_db_type
+
+This field specifies the type of secondary database to be used:
+
+```
+secondary_db_type: "google_semantic_retrieval"
+```
+
+or
+
+```
+secondary_db_type: "chroma"
+```
+
+When `chroma` is specified, the collection in the `vector_stores/chroma`
+directory is used as the secondary database.
+
+### secondary_corpus_name
+
+This field defines the name of the corpus for the secondary database,
+for example:
+
+```
+secondary_corpus_name: "corpora/my-example-corpus"
+```
+
+
+
+[config-yaml]: ../config.yaml
diff --git a/examples/gemini/python/docs-agent/docs/whats-new.md b/examples/gemini/python/docs-agent/docs/whats-new.md
index fd2a5d09d..7794324ac 100644
--- a/examples/gemini/python/docs-agent/docs/whats-new.md
+++ b/examples/gemini/python/docs-agent/docs/whats-new.md
@@ -1,5 +1,31 @@
 # What's new in Docs Agent
 
+## April 2024
+
+* **Focus: Feature enhancements and usability improvements**
+* Expanded CLI functionality with options for managing online corpora and interacting with files.
+* Fixed bugs and refactored code for improved stability and maintainability.
+* Added a new chat app template specifically designed for the **Gemini 1.5 model**.
+* Updated GenAI SDK version to `0.5.0`.
+* Introduced a splitter for handling Fuchsia’s FIDL protocol files in the preprocessing stage.
+
+## March 2024
+
+* **Milestone: Introduction of the Docs Agent CLI**
+* Added the `tellme` command for direct interaction with Gemini from a Linux terminal.
+* Expanded CLI options for corpora management, including creation, deletion, and permission control.
+* Enhanced the chat app UI with a "loading" animation and probability-based response pre-screening.
+* Enabled displaying more URLs retrieved from the AQA model in the widget mode.
+* Added support for including URLs as metadata when uploading chunks to online corpora.
+
+## February 2024
+
+* **Focus: Refining AQA model integration**
+* Improved UI rendering of AQA model responses, especially for code segments.
+* Fixed bugs to handle unexpected AQA model responses.
+* Generated related questions by using retrieved context instead of a user question.
+* Started logging `answerable_probability` for AQA model responses.
+ ## January 2024 * **Milestone: Docs Agent uses AQA model and Semantric Retrieval API** diff --git a/examples/gemini/python/docs-agent/docs_agent/agents/__init__.py b/examples/gemini/python/docs-agent/docs_agent/agents/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py b/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py index 2bb8fd0d6..820c22034 100644 --- a/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py +++ b/examples/gemini/python/docs-agent/docs_agent/agents/docs_agent.py @@ -41,7 +41,12 @@ class DocsAgent: """DocsAgent class""" # Temporary parameter of init_chroma - def __init__(self, config: ProductConfig, init_chroma: bool = True, init_semantic: bool = True): + def __init__( + self, + config: ProductConfig, + init_chroma: bool = True, + init_semantic: bool = True, + ): # Models settings self.config = config self.embedding_model = str(self.config.models.embedding_model) @@ -297,6 +302,22 @@ def ask_aqa_model(self, question): response = self.ask_aqa_model_using_local_vector_store(question) return response + # Retrieve and return chunks that are most relevant to the input question. + def retrieve_chunks_from_corpus(self, question, corpus_name: str = "None"): + if corpus_name == "None": + corpus_name = self.corpus_name + user_query = question + results_count = 5 + # Quick fix: This was needed to allow the method to be called + # even when the model is not set to `models/aqa`. + retriever_service_client = glm.RetrieverServiceClient() + # Make the request + request = glm.QueryCorpusRequest( + name=corpus_name, query=user_query, results_count=results_count + ) + query_corpus_response = retriever_service_client.query_corpus(request) + return query_corpus_response + # Use this method for asking a Gemini content model for fact-checking def ask_content_model_to_fact_check(self, context, prev_response): question = self.config.conditions.fact_check_question + "\n\nText: " diff --git a/examples/gemini/python/docs-agent/docs_agent/benchmarks/__init__.py b/examples/gemini/python/docs-agent/docs_agent/benchmarks/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/__init__.py b/examples/gemini/python/docs-agent/docs_agent/interfaces/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/chatui.py b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/chatui.py index 4d41a7abf..730353687 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/chatui.py +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/chatui.py @@ -41,7 +41,13 @@ from docs_agent.storage.chroma import Format from docs_agent.agents.docs_agent import DocsAgent -from docs_agent.memory.logging import log_question, log_like +from docs_agent.memory.logging import ( + log_question, + log_debug_info_to_file, + log_feedback_to_file, + log_like, + log_dislike, +) # This is used to define the app blueprint using a productConfig @@ -56,7 +62,9 @@ def construct_blueprint( # A local Chroma DB is not needed for the Semantic Retreiver only mode. 
docs_agent = DocsAgent(config=product_config, init_chroma=False) elif product_config.db_type == "none": - docs_agent = DocsAgent(config=product_config, init_chroma=False, init_semantic=False) + docs_agent = DocsAgent( + config=product_config, init_chroma=False, init_semantic=False + ) else: docs_agent = DocsAgent(config=product_config, init_chroma=True) logging.info( @@ -119,9 +127,17 @@ def api(): def like(): if request.method == "POST": json_data = json.loads(request.data) + uuid_found = str(json_data.get("uuid")).strip() is_like = json_data.get("like") - uuid_found = json_data.get("uuid") - log_like(is_like, str(uuid_found).strip()) + if is_like != None: + log_like(is_like, uuid_found) + is_dislike = json_data.get("dislike") + if is_dislike != None: + log_dislike(is_dislike, uuid_found) + # Check if the server has the `debugs` directory. + debug_dir = "logs/debugs" + if os.path.exists(debug_dir): + log_feedback_to_file(uuid_found, is_like, is_dislike) return "OK" else: return redirect(url_for(redirect_index)) @@ -191,9 +207,7 @@ def feedback(): date_format = "%m%d%Y-%H%M%S" date = datetime.now(tz=pytz.utc) date = date.astimezone(pytz.timezone("US/Pacific")) - print( - "[" + date.strftime(date_format) + "] A user has submitted feedback." - ) + print("[" + date.strftime(date_format) + "] A user has submitted feedback.") print("Submitted by: " + user_id + "\n") print("# " + question.strip() + "\n") print("## Response\n") @@ -244,11 +258,20 @@ def question(ask): else: return redirect(url_for(redirect_index)) - # Render the log view page + # Render the log view page. @bp.route("/logs", methods=["GET", "POST"]) def logs(): return show_logs(agent=docs_agent) + # Render the debug view page. + @bp.route("/debugs/", methods=["GET", "POST"]) + def debugs(filename): + if request.method == "GET": + filename = urllib.parse.unquote_plus(filename) + return show_debug_info(agent=docs_agent, filename=filename) + else: + return redirect(url_for(redirect_index)) + return bp @@ -279,12 +302,12 @@ def ask_model(question, agent, template: str = "chatui/index.html"): aqa_response_in_html = "" # Debugging feature: Do not log this question if it ends with `?do_not_log`. - can_be_logged = "True" + can_be_logged = True question_match = re.search(r"^(.*)\?do_not_log$", question) if question_match: # Update the question to remove `do_not_log`. question = question_match[1] + "?" - can_be_logged = "False" + can_be_logged = False # Retrieve context and ask the question. if "gemini" in docs_agent.config.models.language_model: @@ -432,24 +455,44 @@ def ask_model(question, agent, template: str = "chatui/index.html"): else: summary_response = "" log_lines = f"{response}" - + ### LOG THIS REQUEST. - if docs_agent.config.enable_logs_to_markdown == "True": - log_question( - new_uuid, - question, - log_lines, - probability, - save=can_be_logged, - logs_to_markdown="True", - ) - else: - log_question(new_uuid, question, log_lines, probability, save=can_be_logged) + if can_be_logged: + if docs_agent.config.enable_logs_to_markdown == "True": + log_question( + new_uuid, + question, + log_lines, + probability, + save=True, + logs_to_markdown="True", + ) + else: + log_question(new_uuid, question, log_lines, probability, save=True) + # Log debug information. 
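+        # Each debug record captures the question, response, retrieved context,
+        # and source URLs, and is saved as a text file under `logs/debugs`
+        # (see `log_debug_info_to_file` in docs_agent/memory/logging.py).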
+ + if docs_agent.config.enable_logs_for_debugging == "True": + top_source_url = search_result[0].section.url + source_urls = "" + index = 1 + for result in search_result: + source_urls += "[" + str(index) + "]: " + str(result.section.url) + "\n" + index += 1 + log_debug_info_to_file( + uid=new_uuid, + user_question=question, + response=log_lines, + context=final_context, + top_source_url=top_source_url, + source_urls=source_urls, + probability=probability, + server_url=server_url, + ) ### Check the feedback mode in the `config.yaml` file. feedback_mode = "feedback" - if hasattr(docs_agent.config, "feedback_mode") and docs_agent.config.feedback_mode == "rewrite": - feedback_mode = "rewrite" + if hasattr(docs_agent.config, "feedback_mode"): + feedback_mode = str(docs_agent.config.feedback_mode) return render_template( template, @@ -519,3 +562,24 @@ def show_logs(agent, template: str = "admin/logs.html"): logs=log_contents, answerable_logs=answerable_contents, ) + + +# Display a page showing debug information. +def show_debug_info(agent, filename: str, template: str = "admin/debugs.html"): + docs_agent = agent + product = docs_agent.config.product_name + debug_dir = "logs/debugs" + debug_filename = f"{debug_dir}/{filename}" + debug_info = "" + if docs_agent.config.enable_logs_for_debugging == "True": + try: + if debug_filename.endswith("txt"): + with open(debug_filename, "r", encoding="utf-8") as file: + debug_info = file.read() + except: + debug_info = "Cannot find or open this file." + return render_template( + template, + product=product, + debug_info=debug_info, + ) diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui-1.5.css b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui-1.5.css index a076ba1d5..4cdb442e3 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui-1.5.css +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui-1.5.css @@ -247,12 +247,18 @@ body { cursor:pointer; } - .selected { + #like-button.selected { background-color: #1e6a9c; padding-top: 7px; padding-bottom: 7px; } + #dislike-button.selected { + background-color: #CF5C3F; + padding-top: 7px; + padding-bottom: 7px; + } + .selected:hover { background-color: #0a619a; cursor:pointer; @@ -329,6 +335,15 @@ body { cursor:pointer; } + #dislike-button { + border: 0; + color: #fff; + padding-left: 7px; + padding-right: 7px; + border-radius: 5px; + cursor:pointer; + } + #submit-button { border: 0; background: none; diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui.css b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui.css index 8ee1fa684..7267c9b22 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui.css +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-chatui.css @@ -222,12 +222,18 @@ body { cursor:pointer; } - .selected { + #like-button.selected { background-color: #1e6a9c; padding-top: 7px; padding-bottom: 7px; } + #dislike-button.selected { + background-color: #CF5C3F; + padding-top: 7px; + padding-bottom: 7px; + } + .selected:hover { background-color: #0a619a; cursor:pointer; @@ -304,6 +310,15 @@ body { cursor:pointer; } + #dislike-button { + border: 0; + color: #fff; + padding-left: 7px; + padding-right: 7px; + border-radius: 5px; + cursor:pointer; + } + #submit-button { border: 0; 
background: none; diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-widget.css b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-widget.css index 37e7e5947..9658ef6a7 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-widget.css +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/css/style-widget.css @@ -41,7 +41,7 @@ body { font-weight: 500; font-size: 1.3em; margin-top: 0.1em; - margin-left: 0.1em; + margin-left: 1.0em; margin-bottom: 0.9em; } @@ -238,12 +238,18 @@ body { cursor:pointer; } - .selected { + #like-button.selected { background-color: #1e6a9c; padding-top: 7px; padding-bottom: 7px; } + #dislike-button.selected { + background-color: #CF5C3F; + padding-top: 7px; + padding-bottom: 7px; + } + .selected:hover { background-color: #0a619a; cursor:pointer; @@ -326,6 +332,15 @@ body { cursor:pointer; } + #dislike-button { + border: 0; + color: #fff; + padding-left: 7px; + padding-right: 7px; + border-radius: 5px; + cursor:pointer; + } + #submit-button { border: 0; background: none; diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/javascript/app.js b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/javascript/app.js index 6007e7f80..cd3e9aba9 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/javascript/app.js +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/static/javascript/app.js @@ -88,6 +88,7 @@ if (feedbackButton != null){ // Toggle the selected class on the `like this response` button. let likeButton = document.getElementById('like-button'); +let dislikeButton = document.getElementById('dislike-button'); if (likeButton != null){ likeButton.addEventListener('click',function (){ @@ -100,6 +101,13 @@ if (likeButton != null){ if (uuidBox != null){ uuid = uuidBox.textContent; } + if (dislikeButton != null){ + dislikeButton.classList.add("hidden"); + if (dislikeButton.classList.contains("selected")) { + dislikeButton.classList.remove("selected"); + dislikeButton.classList.add("notselected"); + } + } let xhr = new XMLHttpRequest(); // The value of `urlLike` is specified in the html template, // which is set by the Flask server. @@ -118,6 +126,13 @@ if (likeButton != null){ if (uuidBox != null){ uuid = uuidBox.textContent; } + if (dislikeButton != null){ + dislikeButton.classList.remove("hidden"); + if (dislikeButton.classList.contains("selected")) { + dislikeButton.classList.remove("selected"); + dislikeButton.classList.add("notselected"); + } + } let xhr = new XMLHttpRequest(); // The value of `urlLike` is specified in the html template, // which is set by the Flask server. @@ -131,6 +146,63 @@ if (likeButton != null){ }); } +// Toggle the selected class on the `dislike` button. 
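+// Note: The like and dislike buttons are mutually exclusive. Selecting one
+// hides the other, and deselecting it makes the other visible again.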
+if (dislikeButton != null){ + dislikeButton.addEventListener('click',function (){ + if (dislikeButton.classList.contains("notselected")) { + this.classList.remove("notselected"); + this.classList.add("selected"); + this.value = "Disliked" + let uuidBox = document.getElementById('uuid-box'); + let uuid = "Unknown"; + if (uuidBox != null){ + uuid = uuidBox.textContent; + } + if (likeButton != null){ + likeButton.classList.add("hidden"); + if (likeButton.classList.contains("selected")) { + likeButton.classList.remove("selected"); + likeButton.classList.add("notselected"); + } + } + let xhr = new XMLHttpRequest(); + // The value of `urlLike` is specified in the html template, + // which is set by the Flask server. + // See chatbot/templates/chatui/base.html + xhr.open("POST", urlLike, true); + xhr.setRequestHeader("Accept", "application/json"); + xhr.setRequestHeader("Content-Type", "application/json"); + let data = JSON.stringify({"dislike": true, "uuid": uuid}); + xhr.send(data); + }else{ + this.classList.remove("selected"); + this.classList.add("notselected"); + this.value = 'Dislike \uD83D\uDC4E'; + let uuidBox = document.getElementById('uuid-box'); + let uuid = "Unknown"; + if (uuidBox != null){ + uuid = uuidBox.textContent; + } + if (likeButton != null){ + likeButton.classList.remove("hidden"); + if (likeButton.classList.contains("selected")) { + likeButton.classList.remove("selected"); + likeButton.classList.add("notselected"); + } + } + let xhr = new XMLHttpRequest(); + // The value of `urlLike` is specified in the html template, + // which is set by the Flask server. + // See chatbot/templates/chatui/base.html + xhr.open("POST", urlLike, true); + xhr.setRequestHeader("Accept", "application/json"); + xhr.setRequestHeader("Content-Type", "application/json"); + let data = JSON.stringify({"dislike": false, "uuid": uuid}); + xhr.send(data); + } + }); +} + // Adjust the size of the `edit-text-area` textarea. let rewriteTextArea = document.getElementById('edit-text-area'); diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/admin/debugs.html b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/admin/debugs.html new file mode 100644 index 000000000..4a04c62e3 --- /dev/null +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/admin/debugs.html @@ -0,0 +1,24 @@ + + + + + + + + + {{ product }} Docs Agent admin page + + + {% if debug_info %} +
+  <body>
+    <div class="main-section">
+      <h2>{{ product }} Docs Agent</h2>
+      <h3>Debug information</h3>
+      <pre>{{ debug_info }}</pre>
+    </div>
+  {% endif %}
+  </body>
+</html>
diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chat-widget/result.html b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chat-widget/result.html
index aa249e3eb..c0c4e9856 100644
--- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chat-widget/result.html
+++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chat-widget/result.html
@@ -167,6 +167,25 @@

Rewrite:

+{% elif feedback_mode == "like_and_dislike" %}
+<div class="feedback-buttons">
+  <input type="button" id="like-button" class="notselected" value="Like &#128077;">
+  <input type="button" id="dislike-button" class="notselected" value="Dislike &#128078;">
+  {% if aqa_response_in_html != "" %}
+  <span class="probability">
+    (Answerable probability: {{"%.3f"|format(probability|float)}})
+  </span>
+  {% endif %}
+</div>
 {% else %}
diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui-1.5/result.html b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui-1.5/result.html index 3b17fafe9..3fca7dea5 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui-1.5/result.html +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui-1.5/result.html @@ -160,6 +160,25 @@

Rewrite:

+{% elif feedback_mode == "like_and_dislike" %}
+<div class="feedback-buttons">
+  <input type="button" id="like-button" class="notselected" value="Like &#128077;">
+  <input type="button" id="dislike-button" class="notselected" value="Dislike &#128078;">
+  {% if aqa_response_in_html != "" %}
+  <span class="probability">
+    (Answerable probability: {{"%.3f"|format(probability|float)}})
+  </span>
+  {% endif %}
+</div>
 {% else %}
diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui/result.html b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui/result.html index e6df8b9b7..0999c7a55 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui/result.html +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/chatbot/templates/chatui/result.html @@ -137,6 +137,25 @@

Rewrite:

+{% elif feedback_mode == "like_and_dislike" %}
+<div class="feedback-buttons">
+  <input type="button" id="like-button" class="notselected" value="Like &#128077;">
+  <input type="button" id="dislike-button" class="notselected" value="Dislike &#128078;">
+  {% if aqa_response_in_html != "" %}
+  <span class="probability">
+    (Answerable probability: {{"%.3f"|format(probability|float)}})
+  </span>
+  {% endif %}
+</div>
 {% else %}
diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/cli.py b/examples/gemini/python/docs-agent/docs_agent/interfaces/cli.py index 2d76fe3a4..2acf992f1 100644 --- a/examples/gemini/python/docs-agent/docs_agent/interfaces/cli.py +++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/cli.py @@ -28,6 +28,7 @@ return_pure_dir, end_path_backslash, start_path_no_backslash, + resolve_path, ) from docs_agent.preprocess import files_to_plain_text as chunker from docs_agent.preprocess import populate_vector_database as populate_script @@ -36,10 +37,12 @@ from docs_agent.interfaces import run_console as console from docs_agent.storage.google_semantic_retriever import SemanticRetriever from docs_agent.storage.chroma import ChromaEnhanced +from docs_agent.memory.logging import write_logs_to_csv_file import socket import os import string import re +import time def common_options(func): @@ -209,12 +212,18 @@ def show_config(config_file: typing.Optional[str], product: list[str] = [""]): @cli_client.command(name="tellme") @click.argument("words", nargs=-1) +@click.option( + "--model", + help="Specify a model to use. Overrides the model (language_model) for all loaded product configurations.", + default=None, + multiple=False, +) @common_options -def tellme(words, config_file: typing.Optional[str], product: list[str] = [""]): +def tellme(words, config_file: typing.Optional[str], product: list[str] = [""], model: typing.Optional[str] = None): """Answer a question related to the product.""" # Loads configurations from common options loaded_config, product_configs = return_config_and_product( - config_file=config_file, product=product + config_file=config_file, product=product, model=model ) if product == (): # Extracts the first product by default @@ -229,70 +238,231 @@ def tellme(words, config_file: typing.Optional[str], product: list[str] = [""]): @cli_client.command() @click.argument("words", nargs=-1) +@click.option( + "--model", + help="Specify a model to use. Overrides the model (language_model) for all loaded product configurations.", + default=None, + multiple=False, +) @click.option( "--file", type=click.Path(), - help="Only available with Gemini 1.5 Pro. Specify a file with which a task is performed.", + help="Specify a file to be included as context.", +) +@click.option( + "--path", + type=click.Path(), + help="Specify a path where the request will be applied to all files found in the directory.", +) +@click.option( + "--file_ext", + help="Works with --path. Specify the file type to be selected. By default it is set to None.", ) @click.option( "--rag", is_flag=True, - help="Only works with --file. Augments the context input with content from your configured RAG system.", + help="Add entries found in the knowledge database as context.", +) +@click.option( + "--yaml", + is_flag=True, + help="Works with --path. Store the output as a responses.yaml file.", +) +@click.option( + "--new", + is_flag=True, + help="Works with --file. Clear the previous responses buffer.", +) +@click.option( + "--cont", + is_flag=True, + help="Use the previous responses buffer as context.", ) @common_options def helpme( words, config_file: typing.Optional[str], file: typing.Optional[str] = None, + path: typing.Optional[str] = None, + file_ext: typing.Optional[str] = None, rag: bool = False, + yaml: bool = False, + new: bool = False, + cont: bool = False, product: list[str] = [""], + model: typing.Optional[str] = None, ): - """(Experimental) Ask AI to perform a task using console output. 
-    Use --file to perform a task on a specific file."""
+    """Ask the AI model to perform a task from the terminal."""
     # Loads configurations from common options
     loaded_config, product_configs = return_config_and_product(
-        config_file=config_file, product=product
+        config_file=config_file, product=product, model=model
     )
     if product == ():
         # Extracts the first product by default
         product_config = ConfigFile(products=[product_configs.products[0]])
     else:
         product_config = product_configs
+
+    # Get the language model.
+    this_model = product_config.products[0].models.language_model
+    # This feature is only available to the Gemini Pro models (not AQA).
+    if not this_model.startswith("models/gemini"):
+        click.echo(f"File mode is not supported with this model: {this_model}")
+        exit(1)
+
+    # Get the question string.
     question = ""
     for word in words:
         question += word + " "
     question = question.strip()
-    if file and file != "None":
-        # This feature is only available in gemini 1.5 pro (large context)
-        if product_config.products[0].models.language_model.startswith(
-            "models/gemini-1.5-pro"
-        ):
-            try:
-                this_file = os.path.realpath(os.path.join(os.getcwd(), file))
-                with open(this_file, "r", encoding="utf-8") as auto:
-                    # Read the input content
-                    content = auto.read()
-                    auto.close()
-                final_file = f"THE CONTENT OF THE FILE {file} BELOW:\n\n" + content
-                console.ask_model_with_file(
-                    question.strip(), product_config, file=final_file, rag=rag
-                )
-            except FileNotFoundError:
-                click.echo(f"File not found: {file}")
+
+    # Set the filename for recording exchanges with the Gemini models.
+    history_file = "/tmp/docs_agent_responses"
+
+    # Four different modes: Terminal output (default), All files, Single file,
+    # and Previous exchanges.
+    helpme_mode = "TERMINAL_OUTPUT"
+    if path:
+        helpme_mode = "ALL_FILES"
+    elif file and file != "None":
+        helpme_mode = "SINGLE_FILE"
+    elif cont:
+        helpme_mode = "PREVIOUS_EXCHANGES"
+
+    if helpme_mode == "PREVIOUS_EXCHANGES":
+        # Continue mode, which uses the previous exchanges as the main context.
+        this_output = console.ask_model_with_file(
+            question.strip(),
+            product_config,
+            context_file=history_file,
+            rag=rag,
+            return_output=True,
+        )
+        print()
+        print(f"{this_output}")
+        print()
+        with open(history_file, "a", encoding="utf-8") as out_file:
+            out_file.write(f"QUESTION: {question}\n\n")
+            out_file.write(f"RESPONSE:\n\n{this_output}\n")
+            out_file.close()
+    elif helpme_mode == "SINGLE_FILE":
+        # Single file mode.
+        this_file = os.path.realpath(os.path.join(os.getcwd(), file))
+        this_output = ""
+        if cont:
+            # If the `--cont` flag is set, include the previous exchanges as additional context.
+            this_output = console.ask_model_with_file(
+                question.strip(),
+                product_config,
+                file=this_file,
+                context_file=history_file,
+                rag=rag,
+                return_output=True,
+            )
         else:
-            click.echo(
-                f"File mode is not supported with this model: {product_config.products[0].models.language_model}"
+            # By default, do not include any additional context.
+            this_output = console.ask_model_with_file(
+                question.strip(),
+                product_config,
+                file=this_file,
+                rag=rag,
+                return_output=True,
             )
+        print()
+        print(f"{this_output}")
+        print()
+        # Read the file content to be included in the history file.
+        file_content = ""
+        with open(this_file, "r", encoding="utf-8") as target_file:
+            file_content = target_file.read()
+            target_file.close()
+        # If the `--new` flag is set, overwrite the history file.
+        write_mode = "a"
+        if new:
+            write_mode = "w"
+        # Record this exchange in the history file.
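+        # The history file stores plain-text records in the form of
+        # QUESTION, FILE NAME, FILE CONTENT, and RESPONSE entries.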
+        with open(history_file, write_mode, encoding="utf-8") as out_file:
+            out_file.write(f"QUESTION: {question}\n\n")
+            out_file.write(f"FILE NAME: {file}\n")
+            out_file.write(f"FILE CONTENT:\n\n{file_content}\n")
+            out_file.write(f"RESPONSE:\n\n{this_output}\n")
+            out_file.close()
+    elif helpme_mode == "ALL_FILES":
+        # All files mode, which makes the request to all files in the path.
+        # Set the `file_type` variable for display only.
+        file_type = "." + str(file_ext)
+        if file_ext == None:
+            file_type = "All types"
+
+        # Ask the user to confirm.
+        if click.confirm(
+            f"Making a request to all files found in the path below:\n"
+            + f"Question: {question}\nPath: {path}\nFile type: {file_type}\n"
+            + f"Do you want to continue?",
+            abort=True,
+        ):
+            print()
+            out_buffer = ""
+            for root, dirs, files in os.walk(resolve_path(path)):
+                for file in files:
+                    file_path = os.path.realpath(os.path.join(root, file))
+                    if file_ext == None:
+                        # Apply to all files.
+                        print(f"# File: {file_path}")
+                        this_output = console.ask_model_with_file(
+                            question,
+                            product_config,
+                            file=file_path,
+                            rag=rag,
+                            return_output=True,
+                        )
+                        this_output = this_output.strip()
+                        print()
+                        print(f"{this_output}")
+                        print()
+                        if yaml is True:
+                            out_buffer += (
+                                f"  - question: {question}\n"
+                                + f"    response: {this_output}\n"
+                                + f"    file: {file_path}\n"
+                            )
+                        time.sleep(2)
+                    elif file.endswith(file_ext):
+                        print(f"# File: {file_path}")
+                        this_output = console.ask_model_with_file(
+                            question,
+                            product_config,
+                            file=file_path,
+                            rag=rag,
+                            return_output=True,
+                        )
+                        this_output = this_output.strip()
+                        print()
+                        print(f"{this_output}")
+                        print()
+                        if yaml is True:
+                            out_buffer += (
+                                f"  - question: {question}\n"
+                                + f"    response: {this_output}\n"
+                                + f"    file: {file_path}\n"
+                            )
+                        time.sleep(2)
+
+            if yaml is True:
+                output_filename = "./responses.yaml"
+                with open(output_filename, "w", encoding="utf-8") as out_file:
+                    out_file.write("benchmarks:\n")
+                    out_file.write(out_buffer)
+                    out_file.close()
     else:
+        # Terminal output mode, which reads the terminal output as context.
         terminal_output = ""
         # Set the default filename created from the `script` command.
         file_path = "/tmp/docs_agent_console_input"
         # Set the maximum number of lines to read from the terminal.
         lines_limit = -150
-        if product_config.products[0].models.language_model.startswith(
-            "models/gemini-1.5-pro"
-        ):
-            # Read, at the most, 5000 lines printed from the latest command ran on the terminal.
+        # For the new 1.5 pro model, increase the limit to 5000 lines.
+        if this_model.startswith("models/gemini-1.5-pro"):
             lines_limit = -5000
         try:
             with open(file_path, "r", encoding="utf-8") as file:
@@ -308,16 +478,22 @@ def helpme(
         except:
             terminal_output = "No console output is provided."
         context = "THE FOLLOWING IS MY CONSOLE OUTPUT:\n\n" + terminal_output
-        console.ask_model_for_help(question.strip(), context, product_config)
+        console.ask_model_for_help(question, context, product_config)
 
 
 @cli_admin.command()
+@click.option(
+    "--model",
+    help="Specify a model to use. Overrides the model (language_model) for all loaded product configurations.",
+    default=None,
+    multiple=False,
+)
 @common_options
-def benchmark(config_file: typing.Optional[str], product: list[str] = [""]):
+def benchmark(config_file: typing.Optional[str], product: list[str] = [""], model: typing.Optional[str] = None):
     """Run the Docs Agent benchmark test."""
     # Loads configurations from common options
     loaded_config, product_config = return_config_and_product(
-        config_file=config_file, product=product
+        config_file=config_file, product=product, model=model
     )
     benchmarks.run_benchmarks()
@@ -538,5 +714,34 @@ def backup_chroma(
         click.echo(f"Can't backup chroma database specified: {input_chroma}")
 
 
+@cli_admin.command()
+@click.option("--date", default="None")
+@common_options
+def write_logs_to_csv(
+    date: typing.Optional[str],
+    config_file: typing.Optional[str],
+    product: list[str] = [""],
+):
+    """Write captured debug information to a CSV file."""
+    # Loads configurations from common options
+    loaded_config, product_config = return_config_and_product(
+        config_file=config_file, product=product
+    )
+    output_filename = "debug-info-all.csv"
+    if date != "None":
+        output_filename = "debug-info-" + str(date) + ".csv"
+        click.echo(
+            f"Writing all debug logs from {date} to the logs/{output_filename} file:\n"
+        )
+    else:
+        click.echo(f"Writing all debug logs to the logs/{output_filename} file:\n")
+    # Write the target debug logs to a CSV file.
+    write_logs_to_csv_file(log_date=date)
+    # Print the content of the CSV file.
+    with open("./logs/" + output_filename, "r") as f:
+        print(f.read())
+        f.close()
+
+
 if __name__ == "__main__":
     cli()
diff --git a/examples/gemini/python/docs-agent/docs_agent/interfaces/run_console.py b/examples/gemini/python/docs-agent/docs_agent/interfaces/run_console.py
index 0b11d9bc5..84b3ca775 100644
--- a/examples/gemini/python/docs-agent/docs_agent/interfaces/run_console.py
+++ b/examples/gemini/python/docs-agent/docs_agent/interfaces/run_console.py
@@ -218,187 +218,146 @@ def ask_model_with_file(
     question: str,
     product_configs: ConfigFile,
     file: typing.Optional[str] = None,
+    context_file: typing.Optional[str] = None,
     rag: bool = False,
+    return_output: bool = False,
 ):
     # Initialize Rich console
     ai_console = Console(width=160)
     full_prompt = ""
     final_context = ""
-    results_num = 5
-    # Initialize Docs Agent
+    response = ""
+
+    # Get the content of the target file.
+    file_content = ""
+    if file != None:
+        try:
+            with open(file, "r", encoding="utf-8") as auto:
+                content = auto.read()
+                auto.close()
+            file_content = f"\nTHE CONTENT BELOW IS FROM THE FILE {file}:\n\n" + content
+        except:
+            print(f"Cannot open the file {file}")
+            exit(1)
+
+    # Get the content of the context file.
+    context_file_content = ""
+    if context_file != None:
+        try:
+            with open(context_file, "r", encoding="utf-8") as auto:
+                content = auto.read()
+                auto.close()
+            context_file_content = (
+                f"\nTHE CONTENT BELOW IS FROM THE PREVIOUS EXCHANGES WITH GEMINI:\n\n"
+                + content
+            )
+            file_content = context_file_content + "\n\n" + file_content
+        except:
+            print(f"Cannot open the context file {context_file}")
+            exit(1)
+
+    # Use the first product by default.
+ product = product_configs.products[0] + language_model = product.models.language_model with Progress(transient=True) as progress: - search_results = [] - responses = [] - links = [] task_docs_agent = progress.add_task( "[turquoise4 bold]Starting Docs Agent ", total=None, refresh=True ) - for product in product_configs.products: - if "gemini" in product.models.language_model: - if rag and product.db_type == "chroma": - docs_agent = DocsAgent(config=product, init_chroma=True) - else: - docs_agent = DocsAgent(config=product, init_chroma=False) + if rag: + # RAG specified. Retrieve additional context from a database. + if product.db_type == "chroma": + # Use a local Chroma database. + # Initialize Docs Agent. + docs_agent = DocsAgent( + config=product, init_chroma=True, init_semantic=False + ) + # Get the Chroma collection name. + collection = docs_agent.return_chroma_collection() + # Set the progress bar. + label = f"[turquoise4 bold]Asking Gemini (model: {language_model}, source: {collection}) " progress.update( - task_docs_agent, - description=f"[turquoise4 bold]Asking Gemini (model: {product.models.language_model}, source: {docs_agent.return_chroma_collection()}) ", - total=None, + task_docs_agent, description=label, total=None, refresh=True ) - if docs_agent.config.docs_agent_config == "experimental": - results_num = 10 - new_question_count = 5 - else: - results_num = 5 - new_question_count = 5 - if file is not None: - results_num = 15 - if rag: - ( - search_result, - final_context, - ) = docs_agent.query_vector_store_to_build( - question=question, - token_limit=500000, - results_num=results_num, - max_sources=results_num, - ) - final_context = ( - "This is the file: \n" + str(file) + "\n\n" + final_context - ) - else: - search_result = [] - link = "" - final_context = "This is the file: \n" + str(file) + "\n" - else: - # Issue if max_sources > results_num, so leave the same for now - ( - search_result, - final_context, - ) = docs_agent.query_vector_store_to_build( - question=question, - token_limit=30000, - results_num=results_num, - max_sources=results_num, - ) + # Retrieve context from the local Chroma database. 
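+                # `query_vector_store_to_build` returns the matching chunks
+                # and a context string assembled from them, capped at
+                # `token_limit`.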
( - response, - full_prompt, - ) = docs_agent.ask_content_model_with_context_prompt( - context=final_context, question=question + search_result, + returned_context, + ) = docs_agent.query_vector_store_to_build( + question=question, + token_limit=500000, + results_num=5, + max_sources=5, ) - if len(search_result) >= 1: - if search_result[0].section.url == "": - link = str(search_result[0].section) - else: - link = search_result[0].section.url - search_results.append(search_result) - responses.append(response) - links.append(link) - elif "aqa" in product.models.language_model: - if product.db_type == "google_semantic_retriever": - docs_agent = DocsAgent(config=product, init_chroma=False) - label = f"[turquoise4 bold]Asking Gemini (model: {product.models.language_model}, " - corpus_name = "" - link = "" - for db_config in product.db_configs: - if db_config.db_type == "google_semantic_retriever": - corpus_name = db_config.corpus_name - if corpus_name != "": - label += "source: " + corpus_name + ") " - progress.update( - task_docs_agent, description=label, total=None, refresh=True - ) - (response, search_result) = docs_agent.ask_aqa_model_using_corpora( - question=question - ) - if len(search_result) >= 1: - if search_result[0].section.url == "": - link = str(search_result[0].section) - else: - link = search_result[0].section.url - search_results.append(search_result) - responses.append(response) - link = "" - links.append(link) - elif product.db_type == "chroma": - docs_agent = DocsAgent(config=product, init_chroma=True) - progress.update( - task_docs_agent, - description=f"[turquoise4 bold]Asking Gemini (model: {product.models.language_model}, source: {docs_agent.return_chroma_collection()}) ", - total=None, - ) - ( - response, - search_result, - ) = docs_agent.ask_aqa_model_using_local_vector_store( - question=question, results_num=results_num - ) - if len(search_result) >= 1: - if search_result[0].section.url == "": - link = str(search_result[0].section) - else: - link = search_result[0].section.url - search_results.append(search_result) - responses.append(response) - links.append(link) - else: - logging.error(f"Unknown db_type: {product.db_type}") - final_search = [] - final_responses = [] - final_response_md = "" - final_links = [] - count = 0 - # Prune the generated answers as needed - for item in search_results: - # Gemini-pro + chroma will give distance instead of probability - # if (item[0].probability != 0): - final_search.append(search_results[count]) - final_responses.append(responses[count]) - final_response_md += responses[count] + "\n" - final_links.append(links[count]) - count += 1 - count = 0 - synthesize = False - synthesize_product = None - # Currently only triggers from a gemini entry into the provided products - for product in product_configs.products: - if "gemini" in product.models.language_model: - synthesize_product = product - if synthesize and not (synthesize_product == None): - docs_agent = DocsAgent(config=synthesize_product, init_chroma=False) + context_from_database = "" + chunk_count = 0 + for item in search_result: + context_from_database += item.section.content + context_from_database += f"\nReference [{chunk_count}]\n\n" + chunk_count += 1 + # Construct the final context for the question. + final_context = ( + "\nTHE CONTENT BELOW IS RETRIEVED FROM THE LOCAL DATABASE:\n\n" + + context_from_database.strip() + + "\n\n" + + str(file_content) + ) + elif product.db_type == "google_semantic_retriever": + # Use an online corpus from the Semantic Retrieval API. 
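+                # The corpus name (for example, `corpora/my-example-corpus`)
+                # comes from the `db_configs` entries in `config.yaml`.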
+ # Initialize Docs Agent. + docs_agent = DocsAgent( + config=product, init_chroma=False, init_semantic=True + ) + # Get the corpus name. + corpus_name = "" + for db_config in product.db_configs: + if db_config.db_type == "google_semantic_retriever": + corpus_name = db_config.corpus_name + # Set the progress bar. + label = f"[turquoise4 bold]Asking Gemini (model: {language_model}, source: {corpus_name}) " + progress.update( + task_docs_agent, description=label, total=None, refresh=True + ) + # Retrieve context from the online corpus. + context_chunks = docs_agent.retrieve_chunks_from_corpus( + question, corpus_name=str(corpus_name) + ) + context_from_corpus = "" + chunk_count = 0 + for chunk in context_chunks.relevant_chunks: + context_from_corpus += chunk.chunk.data.string_value + context_from_corpus += f"\nReference [{chunk_count}]\n\n" + chunk_count += 1 + # Construct the final context for the question. + final_context = ( + "\nTHE CONTENT BELOW IS RETRIEVED FROM THE ONLINE DATABASE:\n\n" + + context_from_corpus.strip() + + "\n\n" + + str(file_content) + ) + else: + logging.error(f"Unknown db_type: {product.db_type}") + exit(1) + else: + # No RAG specified. No additional context to be retrieved from a database. + docs_agent = DocsAgent( + config=product, init_chroma=False, init_semantic=False + ) progress.update( task_docs_agent, - description=f"[turquoise4 bold]Asking {docs_agent.context_model} to synthesize a response", + description=f"[turquoise4 bold]Asking Gemini (model: {language_model}) ", total=None, ) - new_question = f"Help me synthesize the context above into a cohesive response to help me answer my original question. Only use content from the provided context. My original question: {question}" - ( - good_response, - new_prompt, - ) = docs_agent.ask_content_model_with_context_prompt( - context=final_response_md, question=new_question, prompt="" - ) - progress.update(task_docs_agent, visible=False, refresh=True) - # Final printing to console - ai_console.print() - ai_console.print(Markdown(good_response)) - else: - # This returns the responses as is for each agent interaction - progress.update(task_docs_agent, visible=False, refresh=True) - ai_console.print() - ai_console.print(Markdown(final_response_md)) - # Print results - for item in final_search: - count += 1 - count = 0 + final_context = file_content + # print(f"Context:\n{final_context}") + # print(f"Question:\n{question}") + # Ask Gemini with the question and final context. 
+ ( + response, + full_prompt, + ) = docs_agent.ask_content_model_with_context_prompt( + context=final_context, question=question + ) + if return_output: + return str(response.strip()) ai_console.print() - md_links = "" - for item in final_links: - if isinstance(item, str): - if not item.startswith("UUID"): - md_links += f"\n* [{item}]({item})\n" - count += 1 - # Avoid printing links if first link is blank - if final_links[0] != "": - ai_console.print(Markdown("To verify this information, see:\n")) - ai_console.print(Markdown(md_links)) + ai_console.print(Markdown(response.strip())) diff --git a/examples/gemini/python/docs-agent/docs_agent/memory/__init__.py b/examples/gemini/python/docs-agent/docs_agent/memory/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/examples/gemini/python/docs-agent/docs_agent/memory/logging.py b/examples/gemini/python/docs-agent/docs_agent/memory/logging.py index beb1280bc..fc443c169 100644 --- a/examples/gemini/python/docs-agent/docs_agent/memory/logging.py +++ b/examples/gemini/python/docs-agent/docs_agent/memory/logging.py @@ -17,6 +17,8 @@ from datetime import datetime import pytz import os +import re +from uuid import UUID """Module to log interactions with the chatbot""" @@ -49,13 +51,185 @@ def log_answerable_probability(user_question: str, probability: str): log_file.close() +# Save a detailed record of a question and response pair as a file for debugging. +def log_debug_info_to_file( + uid: UUID, + user_question: str, + response: str, + context: str, + top_source_url: str, + source_urls: str, + probability: str = "None", + server_url: str = "None", +): + # Compose a filename + date = datetime.now(tz=pytz.utc) + date = date.astimezone(pytz.timezone("US/Pacific")) + date_formatted = str(date.strftime("%Y-%m-%d-%H-%M-%S")) + question_formatted = str(user_question).lower().replace(" ", "-").replace("?", "") + filename = date_formatted + "-" + question_formatted + "-" + str(uid) + ".txt" + # Set the directory location. + debug_dir = "./logs/debugs" + debug_filename = f"{debug_dir}/{filename}" + if not os.path.exists(debug_dir): + os.makedirs(debug_dir) + with open(debug_filename, "w", encoding="utf-8") as debug_file: + debug_file.write("UID: " + str(uid) + "\n") + debug_file.write("DATE: " + str(date) + "\n") + debug_file.write("SERVER URL: " + server_url.strip() + "\n") + debug_file.write("\n") + debug_file.write("TOP SOURCE URL: " + top_source_url.strip() + "\n") + debug_file.write("ANSWERABLE PROBABILITY: " + str(probability) + "\n") + debug_file.write("\n") + debug_file.write("QUESTION: " + user_question.strip() + "\n\n") + debug_file.write("RESPONSE:\n\n") + debug_file.write(response.strip() + "\n\n") + debug_file.write("CONTEXT:\n\n") + debug_file.write(context.strip() + "\n\n") + debug_file.write("SOURCE URLS:\n\n") + debug_file.write(source_urls.strip() + "\n\n") + debug_file.close() + + +# Save the like and dislike interactions to the file for debugging. +def log_feedback_to_file(uid: str, is_like, is_dislike): + # Search the debugs directory. 
+    debug_dir = "./logs/debugs"
+    target_string = str(uid) + ".txt"
+    debug_filename = ""
+    for root, dirs, files in os.walk(debug_dir):
+        for file in files:
+            if file.endswith(target_string):
+                debug_filename = f"{debug_dir}/{file}"
+    if debug_filename != "":
+        with open(debug_filename, "a", encoding="utf-8") as debug_file:
+            if is_like is not None:
+                debug_file.write("LIKE: " + str(is_like) + "\n")
+            if is_dislike is not None:
+                debug_file.write("DISLIKE: " + str(is_dislike) + "\n")
+            debug_file.close()
+
+
+# Write captured debug logs into a CSV file.
+def write_logs_to_csv_file(log_date: str = "None"):
+    # Compose the output CSV filename.
+    output_filename = "debug-info-all.csv"
+    if log_date != "None":
+        output_filename = "debug-info-" + str(log_date) + ".csv"
+    # Write a header for this CSV file.
+    log_dir = "./logs"
+    out_csv_filename = log_dir + "/" + output_filename
+    line = "DATE, UID, QUESTION, PROBABILITY, TOP SOURCE URL, DEBUG LINK, FEEDBACK"
+    with open(out_csv_filename, "w", encoding="utf-8") as csv_file:
+        csv_file.write(line + "\n")
+        csv_file.close()
+    # Search the debugs directory.
+    debug_dir = "./logs/debugs"
+    debug_filename = ""
+    for root, dirs, files in os.walk(debug_dir):
+        for file in files:
+            # Read all files if date is "None" else read files from the input date only.
+            ok_to_read = False
+            if file.endswith("txt"):
+                ok_to_read = True
+            if log_date != "None":
+                if file.startswith(log_date):
+                    ok_to_read = True
+                else:
+                    ok_to_read = False
+            if ok_to_read:
+                debug_filename = f"{debug_dir}/{file}"
+                debug_record = ""
+                # Open and read this debug information file.
+                with open(debug_filename, "r", encoding="utf-8") as debug_file:
+                    debug_record = debug_file.readlines()
+                    debug_file.close()
+                uid = ""
+                date = ""
+                server_url = "None"
+                top_source_url = ""
+                probability = ""
+                question = ""
+                like = "None"
+                dislike = "None"
+                filename = str(file)
+                # Scan the lines from this debug info and extract fields.
+                for line in debug_record:
+                    match_uid = re.search(r"^UID:\s+(.*)$", line)
+                    match_date = re.search(r"^DATE:\s+(.*)$", line)
+                    match_server_url = re.search(r"^SERVER URL:\s+(.*)$", line)
+                    match_top_url = re.search(r"^TOP SOURCE URL:\s+(.*)$", line)
+                    match_prob = re.search(r"^ANSWERABLE PROBABILITY:\s+(.*)$", line)
+                    match_question = re.search(r"^QUESTION:\s+(.*)$", line)
+                    match_like = re.search(r"^LIKE:\s+(.*)$", line)
+                    match_dislike = re.search(r"^DISLIKE:\s+(.*)$", line)
+                    # Extract fields.
+                    if match_uid:
+                        uid = match_uid.group(1)
+                    elif match_date:
+                        date = match_date.group(1)
+                    elif match_server_url:
+                        server_url = match_server_url.group(1)
+                    elif match_top_url:
+                        top_source_url = match_top_url.group(1)
+                    elif match_prob:
+                        probability = match_prob.group(1)
+                    elif match_question:
+                        question = match_question.group(1)
+                    elif match_like:
+                        like = match_like.group(1)
+                    elif match_dislike:
+                        dislike = match_dislike.group(1)
+                # Write a short version of this debug information to the CSV file.
+                debug_file_link = filename
+                if server_url != "None":
+                    debug_file_link = server_url + "debugs/" + filename
+                feedback = "None"
+                if like == "True":
+                    feedback = "Like"
+                elif dislike == "True":
+                    feedback = "Dislike"
+                log_debug_info_to_csv_file(
+                    output_filename=output_filename,
+                    date=date,
+                    uid=uid,
+                    question=question,
+                    probability=probability,
+                    top_source_url=top_source_url,
+                    debug_file_link=debug_file_link,
+                    feedback=feedback,
+                )
+
+
+# Save the short version of debug information into a CSV file.
+def log_debug_info_to_csv_file(
+    output_filename: str,
+    date: str,
+    uid: str,
+    question: str,
+    probability: str = "None",
+    top_source_url: str = "None",
+    debug_file_link: str = "None",
+    feedback: str = "None",
+):
+    log_dir = "./logs"
+    log_filename = log_dir + "/" + str(output_filename)
+    if not os.path.exists(log_dir):
+        os.makedirs(log_dir)
+    question_formatted = str(question).lower().replace(",", " ").replace(";", "")
+    line = f"{date}, {uid}, {question_formatted}, {probability}, {top_source_url}, {debug_file_link}, {feedback}"
+    with open(log_filename, "a", encoding="utf-8") as log_file:
+        log_file.write(line + "\n")
+        log_file.close()
+
+
 # Print and log the question and response.
 def log_question(
     uid,
     user_question: str,
     response: str,
     probability: str = "None",
-    save: str = "True",
+    save: bool = True,
     logs_to_markdown: str = "False",
 ):
     date_format = "%m/%d/%Y %H:%M:%S %Z"
@@ -68,7 +242,7 @@ def log_question(
     # For the AQA model, also print the response's answerable_probability
     if probability != "None":
         print("Answerable probability: " + str(probability) + "\n")
-    if save == "True":
+    if save:
         log_dir = "./logs"
         log_filename = log_dir + "/chatui_logs.txt"
         if not os.path.exists(log_dir):
@@ -89,13 +263,14 @@ def log_question(
         log_question_to_file(user_question, response, probability)
 
 
-def log_like(is_like, uid, save: str = "True"):
+def log_like(is_like, uid, save: bool = True):
     date_format = "%m/%d/%Y %H:%M:%S %Z"
     date = datetime.now(tz=pytz.utc)
     date = date.astimezone(pytz.timezone("US/Pacific"))
+    print()
     print("UID: " + str(uid))
     print("Like: " + str(is_like))
-    if save == "True":
+    if save:
         log_dir = "./logs"
         log_filename = log_dir + "/chatui_logs.txt"
         with open(log_filename, "a", encoding="utf-8") as log_file:
@@ -104,3 +279,21 @@ def log_like(is_like, uid, save: str = "True"):
             )
             log_file.write("Like: " + str(is_like) + "\n\n")
             log_file.close()
+
+
+def log_dislike(is_dislike, uid, save: bool = True):
+    date_format = "%m/%d/%Y %H:%M:%S %Z"
+    date = datetime.now(tz=pytz.utc)
+    date = date.astimezone(pytz.timezone("US/Pacific"))
+    print()
+    print("UID: " + str(uid))
+    print("Dislike: " + str(is_dislike))
+    if save:
+        log_dir = "./logs"
+        log_filename = log_dir + "/chatui_logs.txt"
+        with open(log_filename, "a", encoding="utf-8") as log_file:
+            log_file.write(
+                "[" + date.strftime(date_format) + "][UID " + str(uid) + "]\n"
+            )
+            log_file.write("Dislike: " + str(is_dislike) + "\n\n")
+            log_file.close()
diff --git a/examples/gemini/python/docs-agent/docs_agent/models/__init__.py b/examples/gemini/python/docs-agent/docs_agent/models/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/postprocess/__init__.py b/examples/gemini/python/docs-agent/docs_agent/postprocess/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/preprocess/README.md b/examples/gemini/python/docs-agent/docs_agent/preprocess/README.md
index 6086f70a1..0cb807fd8 100644
--- a/examples/gemini/python/docs-agent/docs_agent/preprocess/README.md
+++ b/examples/gemini/python/docs-agent/docs_agent/preprocess/README.md
@@ -1,30 +1,95 @@
 # Processing source documents in Docs Agent
 
 This `README` file provides information on the steps involved in the following two Python scripts,
-which are used to process source documents into plain text chunks and store them next to their
-generated embeddings in the vector database:
+which are used to process source documents into plain text chunks and generate embeddings for
+vector databases:
 
-- [`markdown_to_plain_text.py`][markdown-to-plain-text]
+- [`files_to_plain_text.py`][files-to-plain-text]
 - [`populate_vector_database.py`][populate-vector-database]
 
-**Important**: The `markdown_to_plain_text.py` script is being deprecated in
-favor of the [`files_to_plain_text.py`][files-to-plain-text] script.
-
 ![Docs Agent pre-processing flow](../../docs/images/docs-agent-pre-processing-01.png)
 
 **Figure 1**. Docs Agent's pre-processing flow from source documents to the vector database.
 
-## Steps in the markdown_to_plain_text.py script
+**Note**: The `markdown_to_plain_text.py` script is deprecated in favor of
+the `files_to_plain_text.py` script.
+
+## Docs Agent chunking technique example
+
+The [`files_to_plain_text.py`][files-to-plain-text] script splits documents
+into smaller chunks based on Markdown headings (#, ##, and ###).
+
+For example, consider the following Markdown page:
+
+```
+# Page title
+
+This is the introduction paragraph of this page.
+
+## Section 1
+
+This is the paragraph of section 1.
+
+### Sub-section 1.1
+
+This is the paragraph of sub-section 1.1.
+
+### Sub-section 1.2
+
+This is the paragraph of sub-section 1.2.
+
+## Section 2
+
+This is the paragraph of section 2.
+```
+
+This example Markdown page is split into the following 5 chunks:
+
+```
+# Page title
+
+This is the introduction paragraph of this page.
+```
-When processing Markdown files to plain text using the
-[`markdown_to_plain_text.py`][markdown-to-plain-text] script, the following events take place:
+```
+## Section 1
+
+This is the paragraph of section 1.
+```
+
+```
+### Sub-section 1.1
+
+This is the paragraph of sub-section 1.1.
+```
+
+```
+### Sub-section 1.2
+
+This is the paragraph of sub-section 1.2.
+```
+
+```
+## Section 2
+
+This is the paragraph of section 2.
+```
+
+Additionally, because of the token size limits of embedding models, the script
+recursively splits the chunks above until each chunk's size is less than
+5000 bytes (characters).
+
+## Steps in the files_to_plain_text.py script
+
+With the default settings, when processing Markdown files to plain text using the
+[`files_to_plain_text.py`][files-to-plain-text] script, the following events take place:
 
 1. Read the configuration file ([`config.yaml`][config-yaml]) to identify input
    and output directories.
-1. Construct an array of input sources.
+1. Construct an array of input sources (which are the `path` entries).
 1. **For** each input source, do the following:
-   1. Extract all input fields (path, URL, and excluded path).
-   1. Call the `process_markdown_files_from_source()` method using these input fields.
+   1. Extract all input fields (`path`, `url_prefix`, and more).
+   1. Call the `process_files_from_input()` method using these input fields.
 1. **For** each sub-path in the input directory and **for** each file in these directories:
    1. Check if the file extension is `.md` (that is, a Markdown file).
    1. Construct an output directory that preserves the path structure.
@@ -63,7 +128,7 @@ When processing plain text chunks to embeddings using the
 
 1. Read the configuration file ([`config.yaml`][config-yaml]) to identify
    the plain text directory and Chroma settings.
-1. Set up the PaLM API environment.
+1. Set up the Gemini API environment.
 1. Select the embeddings model.
 1. Configure the embedding function (including the API call limit).
 1. **For** each sub-path in the plain text directory and **for** each file in these directories:
@@ -73,12 +138,29 @@ When processing plain text chunks to embeddings using the
     1. Read the metadata associated with the text chunk file.
     1. Store the text chunk and metabase to the vector database, which also
        generates an embedding for the text chunk at the time of insertion.
-    1. Skip if the file size is larger than 10000 bytes (due to the API limit).
+    1. Skip if the file size is larger than 5000 bytes (due to the API limit).
     1. Skip if the text chunk is already in the vector database and the checksum hasn’t
        changed.
 
+### Delete chunks process
+
+The process below describes how the delete chunks feature is implemented in the
+`populate_vector_database.py` script:
+
+1. Read all existing entries in the target database.
+2. Read all candidate entries in the `file_index.json` file (created after running the
+   `agent chunk` command).
+3. For each entry in the existing entries found in step 1:
+
+   Compare the `text_chunk_filename` fields (included in the entry's `metadata`).
+
+   1. If the filename is not found in the candidate entries from step 2, delete this entry from the database.
+
+   1. If it is found, compare the `md_hash` fields:
+
+      If they differ, delete this entry from the database.
+
-[markdown-to-plain-text]: markdown_to_plain_text.py
 [files-to-plain-text]: files_to_plain_text.py
 [populate-vector-database]: populate_vector_database.py
 [config-yaml]: ../../config.yaml
diff --git a/examples/gemini/python/docs-agent/docs_agent/preprocess/__init__.py b/examples/gemini/python/docs-agent/docs_agent/preprocess/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/preprocess/files_to_plain_text.py b/examples/gemini/python/docs-agent/docs_agent/preprocess/files_to_plain_text.py
index 141156053..f8926aee8 100644
--- a/examples/gemini/python/docs-agent/docs_agent/preprocess/files_to_plain_text.py
+++ b/examples/gemini/python/docs-agent/docs_agent/preprocess/files_to_plain_text.py
@@ -325,6 +325,8 @@ def process_fidl_file(
         filename_to_save = make_file_chunk_name(
             new_path=new_path, filename_prefix=filename_prefix, index=chunk_number
         )
+        # Get the text chunk filename string (after `docs-agent/data`).
+        text_chunk_filename = get_relative_path_and_filename(filename_to_save)
         # Prepare metadata for this FIDL protocol chunk.
         md_hash = uuid.uuid3(namespace_uuid, fidl_protocol)
         uuid_file = uuid.uuid3(namespace_uuid, filename_to_save)
@@ -350,6 +352,7 @@ def process_fidl_file(
             "previous_id": int(1),
             "URL": str(fidl_url),
             "md_hash": str(md_hash),
+            "text_chunk_filename": str(text_chunk_filename),
             "token_estimate": float(1.0),
             "full_token_estimate": float(1.0),
         }
@@ -410,7 +413,7 @@ def process_files_from_input(
     file_index = []
     full_file_metadata = {}
     resolved_output_path = resolve_path(product_config.output_path)
-    chunk_group_name = "text_chunks_" + '{:03d}'.format(input_path_count)
+    chunk_group_name = "text_chunks_" + "{:03d}".format(input_path_count)
     # Get the total file count.
     file_count = sum(len(files) for _, _, files in os.walk(resolve_path(inputpath)))
     # Set up a status bar for the terminal display.
@@ -432,11 +435,17 @@
             # Displays status bar
             progress_bar.set_description_str(f"Processing file {file}", refresh=True)
             progress_bar.update(1)
+            # Skip this file if it starts with `_`.
+            if file.startswith("_"):
+                continue
             # Get the full path to this input file.
             filename_to_open = os.path.join(root, file)
             # Construct a new sub-directory for storing output plain text files.
-            new_path = resolved_output_path + "/" + chunk_group_name + re.sub(
-                resolve_path(inputpath), "", os.path.join(root, "")
+            new_path = (
+                resolved_output_path
+                + "/"
+                + chunk_group_name
+                + re.sub(resolve_path(inputpath), "", os.path.join(root, ""))
             )
             is_exist = os.path.exists(new_path)
             if not is_exist:
@@ -445,66 +454,67 @@
             relative_path = make_relative_path(
                 file=file, root=root, inputpath=inputpath
             )
-            # Check the input file type: Markdown, FIDL, or HTML.
-            if file.endswith(".md") and not file.startswith("_"):
-                # Add filename to a list
-                file_index.append(relative_path)
-                # Increment the Markdown file count.
-                md_count += 1
-                # Process a Markdown file.
-                this_file_metadata = process_markdown_file(
-                    filename_to_open,
-                    root,
-                    inputpathitem,
-                    splitter,
-                    new_path,
-                    file,
-                    namespace_uuid,
-                    relative_path,
-                    url_prefix,
-                )
-                # Merge this file's metadata to the global metadata.
-                full_file_metadata.update(this_file_metadata)
-            elif file.endswith(".fidl") and not file.startswith("_"):
-                # Add filename to a list
-                file_index.append(relative_path)
-                # Increment the FIDL file count.
-                fidl_count += 1
-                # Process a FIDL protocol file.
-                this_file_metadata = process_fidl_file(
-                    filename_to_open,
-                    root,
-                    inputpathitem,
-                    splitter,
-                    new_path,
-                    file,
-                    namespace_uuid,
-                    relative_path,
-                    url_prefix,
-                )
-                # Merge this file's metadata to the global metadata.
-                full_file_metadata.update(this_file_metadata)
-            elif (
-                file.endswith(".htm") or file.endswith(".html")
-            ) and not file.startswith("_"):
-                # Add filename to a list
-                file_index.append(relative_path)
-                # Increment the HTML file count.
-                html_count += 1
-                # Process a HTML file.
-                this_file_metadata = process_html_file(
-                    filename_to_open,
-                    root,
-                    inputpathitem,
-                    splitter,
-                    new_path,
-                    file,
-                    namespace_uuid,
-                    relative_path,
-                    url_prefix,
-                )
-                # Merge this file's metadata to the global metadata.
-                full_file_metadata.update(this_file_metadata)
+            # Select the splitter mode: Markdown, FIDL, or HTML.
+            if splitter == "token_splitter" or splitter == "process_sections":
+                if file.endswith(".md"):
+                    # Add filename to a list
+                    file_index.append(relative_path)
+                    # Increment the Markdown file count.
+                    md_count += 1
+                    # Process a Markdown file.
+                    this_file_metadata = process_markdown_file(
+                        filename_to_open,
+                        root,
+                        inputpathitem,
+                        splitter,
+                        new_path,
+                        file,
+                        namespace_uuid,
+                        relative_path,
+                        url_prefix,
+                    )
+                    # Merge this file's metadata to the global metadata.
+                    full_file_metadata.update(this_file_metadata)
+            elif splitter == "fidl_splitter":
+                if file.endswith(".fidl"):
+                    # Add filename to a list
+                    file_index.append(relative_path)
+                    # Increment the FIDL file count.
+                    fidl_count += 1
+                    # Process a FIDL protocol file.
+                    this_file_metadata = process_fidl_file(
+                        filename_to_open,
+                        root,
+                        inputpathitem,
+                        splitter,
+                        new_path,
+                        file,
+                        namespace_uuid,
+                        relative_path,
+                        url_prefix,
+                    )
+                    # Merge this file's metadata to the global metadata.
+                    full_file_metadata.update(this_file_metadata)
+            else:
+                if file.endswith(".htm") or file.endswith(".html"):
+                    # Add filename to a list
+                    file_index.append(relative_path)
+                    # Increment the HTML file count.
+                    html_count += 1
+                    # Process an HTML file.
+                    this_file_metadata = process_html_file(
+                        filename_to_open,
+                        root,
+                        inputpathitem,
+                        splitter,
+                        new_path,
+                        file,
+                        namespace_uuid,
+                        relative_path,
+                        url_prefix,
+                    )
+                    # Merge this file's metadata to the global metadata.
+                    full_file_metadata.update(this_file_metadata)
 
     # The processing of input files is finished.
     progress_bar.set_description_str(f"Finished processing files.", refresh=False)
@@ -705,19 +715,21 @@ def get_chunk_size_distribution_from_product(input_product: ProductConfig):
         else:
             count = chunk_size_map["6000"]
             chunk_size_map["6000"] = count + 1
+        total_file_count += 1
     # Print the distribution result.
-    print("\nDistribution of text chunk sizes and counts:")
+    print("\nSpread of text chunk sizes and counts:")
     prev_size = 0
     for key in list(chunk_size_map):
         count = chunk_size_map[key]
         if int(key) == 50:
-            print(f"Chunks smaller than {key} bytes: {count}")
+            print(f"- Chunks smaller than {key} bytes: {count}")
         elif int(key) == 6000:
-            print(f"Chunks larger than {key} bytes: {count}")
+            print(f"- Chunks larger than {key} bytes: {count}")
         else:
-            print(f"Chunks between {prev_size} and {key} bytes: {count}")
+            print(f"- Chunks between {prev_size} and {key} bytes: {count}")
         prev_size = int(key)
+    print(f"\nTotal number of chunks: {total_file_count}")
 
 
 # Given a ReadConfig object, process all products
diff --git a/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/__init__.py b/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/fidl_splitter.py b/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/fidl_splitter.py
index 8b7322d2b..50e409b3c 100644
--- a/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/fidl_splitter.py
+++ b/examples/gemini/python/docs-agent/docs_agent/preprocess/splitters/fidl_splitter.py
@@ -58,8 +58,8 @@ def divide_a_protocol(lines):
 def construct_chunks(library_name: str, protocol_name: str, lines):
     contents = []
     buffer_size = get_byte_size(lines)
-    if int(buffer_size) > 6000:
-        # If the protocol is larget than 6KB, divide it into two.
+    if int(buffer_size) > 5000:
+        # If the protocol is larger than 5KB, divide it into two.
         logging.info(
             "Found a text chunk ("
             + str(protocol_name)
diff --git a/examples/gemini/python/docs-agent/docs_agent/storage/__init__.py b/examples/gemini/python/docs-agent/docs_agent/storage/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/tests/__init__.py b/examples/gemini/python/docs-agent/docs_agent/tests/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/tests/cli_unit_tests.py b/examples/gemini/python/docs-agent/docs_agent/tests/cli_unit_tests.py
new file mode 100644
index 000000000..9cfb28c48
--- /dev/null
+++ b/examples/gemini/python/docs-agent/docs_agent/tests/cli_unit_tests.py
@@ -0,0 +1,11 @@
+"""Unit tests for Docs Agent CLI."""
+
+import unittest
+
+
+class DocsAgentCLIUnitTest(unittest.TestCase):
+    def test_basic(self):
+        self.assertEqual("hello".upper(), "HELLO")
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/examples/gemini/python/docs-agent/docs_agent/testing/test_vector_database.py b/examples/gemini/python/docs-agent/docs_agent/tests/test_vector_database.py
similarity index 100%
rename from examples/gemini/python/docs-agent/docs_agent/testing/test_vector_database.py
rename to examples/gemini/python/docs-agent/docs_agent/tests/test_vector_database.py
diff --git a/examples/gemini/python/docs-agent/docs_agent/utilities/__init__.py b/examples/gemini/python/docs-agent/docs_agent/utilities/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/gemini/python/docs-agent/docs_agent/utilities/config.py b/examples/gemini/python/docs-agent/docs_agent/utilities/config.py
index 84c32e9f4..aa1846d93 100644
--- a/examples/gemini/python/docs-agent/docs_agent/utilities/config.py
+++ b/examples/gemini/python/docs-agent/docs_agent/utilities/config.py
@@ -309,6 +309,7 @@ def __init__(
         feedback_mode: str = "rewrite",
         enable_show_logs: str = "False",
         enable_logs_to_markdown: str = "False",
+        enable_logs_for_debugging: str = "False",
         enable_delete_chunks: str = "False",
         secondary_db_type: typing.Optional[str] = None,
         secondary_corpus_name: typing.Optional[str] = None,
@@ -328,6 +329,7 @@ def __init__(
         self.feedback_mode = feedback_mode
         self.enable_show_logs = enable_show_logs
         self.enable_logs_to_markdown = enable_logs_to_markdown
+        self.enable_logs_for_debugging = enable_logs_for_debugging
         self.enable_delete_chunks = enable_delete_chunks
         self.secondary_db_type = secondary_db_type
         self.secondary_corpus_name = secondary_corpus_name
@@ -350,6 +352,7 @@ def __str__(self):
 Feedback mode: {self.feedback_mode}\n\
 Enable show logs: {self.enable_show_logs}\n\
 Enable logs to Markdown: {self.enable_logs_to_markdown}\n\
+Enable logs for debugging: {self.enable_logs_for_debugging}\n\
 Enable delete chunks: {self.enable_delete_chunks}\n\
 Markdown splitter: {self.markdown_splitter}\n\
 Database type: {self.db_type}\n\
@@ -443,6 +446,10 @@ def returnProducts(self, product: typing.Optional[str] = None) -> ConfigFile:
             enable_logs_to_markdown = item["enable_logs_to_markdown"]
         except KeyError:
             enable_logs_to_markdown = "False"
+        try:
+            enable_logs_for_debugging = item["enable_logs_for_debugging"]
+        except KeyError:
+            enable_logs_for_debugging = "False"
         try:
             enable_delete_chunks = item["enable_delete_chunks"]
         except KeyError:
@@ -471,7 +478,8 @@ def returnProducts(self, product: typing.Optional[str] = None) -> ConfigFile:
                 app_port=app_port,
                 feedback_mode=feedback_mode,
                 enable_show_logs=enable_show_logs,
                 enable_logs_to_markdown=enable_logs_to_markdown,
+                enable_logs_for_debugging=enable_logs_for_debugging,
                 enable_delete_chunks=enable_delete_chunks,
                 secondary_db_type=secondary_db_type,
                 secondary_corpus_name=secondary_corpus_name,
@@ -528,7 +535,7 @@ def returnProducts(self, product: typing.Optional[str] = None) -> ConfigFile:
 
 # Function to make using common_options simpler
 def return_config_and_product(
-    config_file: typing.Optional[str] = None, product: list[str] = [""]
+    config_file: typing.Optional[str] = None, product: list[str] = [""], model: typing.Optional[str] = None
 ):
     if config_file is None:
         loaded_config = ReadConfig()
@@ -542,5 +549,9 @@ def return_config_and_product(
     for item in product:
         product_config = loaded_config.returnProducts(product=item)
         final_products.append(product_config.products[0])
+    # Overwrite the language model for all products.
+    if model is not None:
+        for item in final_products:
+            item.models.language_model = model
     product_config = ConfigFile(products=final_products)
     return loaded_config, product_config
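
For reference, the chunking scheme that the preprocess `README.md` above describes (split on Markdown headings `#`, `##`, and `###`, then recursively split any chunk that exceeds 5000 bytes) can be sketched in a few lines of Python. This is a minimal illustration under assumed function names and an assumed line-based halving strategy, not the actual code in `files_to_plain_text.py`:

```
import re


def split_by_headings(markdown: str) -> list[str]:
    # Split a Markdown page into chunks at lines starting with #, ##, or ###.
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [part.strip() for part in parts if part.strip()]


def enforce_size_limit(chunk: str, max_bytes: int = 5000) -> list[str]:
    # Recursively halve a chunk until each piece fits under max_bytes.
    if len(chunk.encode("utf-8")) <= max_bytes:
        return [chunk]
    lines = chunk.splitlines()
    middle = len(lines) // 2
    if middle == 0:
        # A single oversized line: fall back to splitting by characters.
        half = len(chunk) // 2
        return enforce_size_limit(chunk[:half], max_bytes) + enforce_size_limit(
            chunk[half:], max_bytes
        )
    first = "\n".join(lines[:middle])
    second = "\n".join(lines[middle:])
    return enforce_size_limit(first, max_bytes) + enforce_size_limit(second, max_bytes)


page = "# Page title\n\nIntro paragraph.\n\n## Section 1\n\nSection 1 paragraph.\n"
chunks = [piece for c in split_by_headings(page) for piece in enforce_size_limit(c)]
print(len(chunks))  # Prints: 2
```

Splitting at heading boundaries keeps each chunk self-describing, since a chunk's heading travels with its body into the prompt context.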
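The delete chunks process in the same `README.md` reduces to a dictionary lookup over two metadata fields. The entry shapes and the helper name below are assumptions for illustration; the actual logic lives in `populate_vector_database.py`:

```
def find_obsolete_entries(existing: list[dict], candidates: list[dict]) -> list[dict]:
    # Index the candidate chunks (from file_index.json) by text_chunk_filename.
    candidate_hashes = {c["text_chunk_filename"]: c["md_hash"] for c in candidates}
    obsolete = []
    for entry in existing:
        filename = entry["metadata"]["text_chunk_filename"]
        if filename not in candidate_hashes:
            # The source chunk no longer exists: delete this entry.
            obsolete.append(entry)
        elif entry["metadata"]["md_hash"] != candidate_hashes[filename]:
            # The source chunk has changed: delete this entry.
            obsolete.append(entry)
    return obsolete


# Example: "a.txt" is unchanged, "b.txt" has a new hash, "c.txt" was removed.
existing = [
    {"id": "1", "metadata": {"text_chunk_filename": "a.txt", "md_hash": "h1"}},
    {"id": "2", "metadata": {"text_chunk_filename": "b.txt", "md_hash": "h2"}},
    {"id": "3", "metadata": {"text_chunk_filename": "c.txt", "md_hash": "h3"}},
]
candidates = [
    {"text_chunk_filename": "a.txt", "md_hash": "h1"},
    {"text_chunk_filename": "b.txt", "md_hash": "h2-new"},
]
print([e["id"] for e in find_obsolete_entries(existing, candidates)])  # ['2', '3']
```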
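The new `model` parameter on `return_config_and_product()` in `docs_agent/utilities/config.py` above overrides the configured language model for every selected product. A minimal usage sketch, assuming a standard `config.yaml` and an illustrative model name:

```
from docs_agent.utilities.config import return_config_and_product

# Load config.yaml, then override the language model for all selected products.
loaded_config, product_config = return_config_and_product(
    config_file="config.yaml",
    product=[""],
    model="models/gemini-1.5-pro-latest",
)
for item in product_config.products:
    print(item.models.language_model)  # Prints the overridden model name.
```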