Commit 80af41d: "Update advanced"
2 parents 7aea2d8 + a26589b

File tree: 5 files changed (+169 / -19 lines)


README.md

Lines changed: 64 additions & 1 deletion
```diff
@@ -7,7 +7,7 @@
 - [What's LangChain?](#whats-langchain)
 - [How to Use docGPT?](#how-to-use-docgpt)
 - [How to Develop a docGPT with Streamlit?](#how-to-develop-a-docgpt-with-streamlit)
-
+- [Advanced - How to build a better model in langchain](#advanced---how-to-build-a-better-model-in-langchain)

 * Main Development Software and Packages:
   * `Python 3.8.6`
```
@@ -107,4 +107,67 @@ There are two methods:

* Click "Deploy an App" and paste your GitHub URL.
* Complete the deployment of your [application](https://docgpt-app.streamlit.app/).

---

### Advanced - How to build a better model in langchain

When building docGPT with Langchain, paying attention to the following details can make your model more powerful:
1. **Language Model**

   Choosing the right LLM can save you time and effort. For example, you can choose OpenAI's `gpt-3.5-turbo` (the default is `text-davinci-003`):

   ```python
   # ./docGPT/docGPT.py
   from langchain.chat_models import ChatOpenAI

   llm = ChatOpenAI(
       temperature=0.2,
       max_tokens=2000,
       model_name='gpt-3.5-turbo'
   )
   ```
   Please note that there is no single best or worst model. You need to try several models to find the one that fits your use case best. For more OpenAI models, refer to the [documentation](https://platform.openai.com/docs/models).

   (Some models support up to 16,000 tokens!)
2. **PDF Loader**

   There are various PDF text loaders available in Python, each with its own advantages and disadvantages. Here are three loaders the authors have used ([official Langchain documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf)):

   * `PyPDF`: simple and easy to use.
   * `PyMuPDF`: reads documents very **quickly** and provides additional metadata such as page numbers and document dates.
   * `PDFPlumber`: can **extract text inside tables**. Like PyMuPDF, it provides metadata, but it takes longer to parse.

   If your document contains multiple tables and the important information sits inside them, try `PDFPlumber`; it may give you unexpectedly good results!

   Please do not overlook this detail: if the text is not parsed correctly from the document, even the most powerful LLM is useless!
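   As a rough decision aid, the trade-off above can be written down as a small dispatch function. This is an illustrative sketch only, not code from docGPT; the returned class names (`PyPDFLoader`, `PyMuPDFLoader`, `PDFPlumberLoader`) are the wrapper classes in `langchain.document_loaders`, while the selection logic is just an example.

   ```python
   # Illustrative sketch: map document needs to a langchain PDF loader class name.
   def choose_pdf_loader(has_tables: bool, need_speed: bool) -> str:
       if has_tables:
           # PDFPlumber is slower, but can extract text inside tables
           return "PDFPlumberLoader"
       if need_speed:
           # PyMuPDF is the fastest and also returns rich metadata
           return "PyMuPDFLoader"
       # PyPDF is the simple default
       return "PyPDFLoader"

   print(choose_pdf_loader(has_tables=True, need_speed=True))  # PDFPlumberLoader
   ```

   Note the precedence: table extraction wins over speed, because a fast loader that drops table text loses exactly the information you care about.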
3. **Tracking Token Usage**

   This doesn't make the model more powerful, but it lets you track token usage and OpenAI API key consumption during the QA chain process.

   When calling `chain.run`, you can use the [method](https://python.langchain.com/docs/modules/model_io/models/llms/how_to/token_usage_tracking) provided by Langchain to track token usage:

   ```python
   from langchain.callbacks import get_openai_callback

   with get_openai_callback() as callback:
       response = self.qa_chain.run(query)
       print(callback)

   # Result of print
   """
   chain...
   ...
   > Finished chain.
   Total Tokens: 1506
   Prompt Tokens: 1350
   Completion Tokens: 156
   Total Cost (USD): $0.03012
   """
   ```
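   The cost figure in the callback is plain arithmetic: it matches `text-davinci-003`'s flat rate of $0.02 per 1,000 tokens. A minimal sketch (rates vary per model and change over time, so treat the default here as an illustration only; chat models also price prompt and completion tokens differently):

   ```python
   # Illustrative sketch: reproduce the callback's cost arithmetic.
   # $0.02 per 1K tokens is text-davinci-003's historical flat rate.
   def estimate_cost(prompt_tokens: int, completion_tokens: int,
                     usd_per_1k_tokens: float = 0.02) -> float:
       total = prompt_tokens + completion_tokens
       return round(total * usd_per_1k_tokens / 1000, 5)

   print(estimate_cost(1350, 156))  # 0.03012, matching the callback output above
   ```

   A back-of-the-envelope function like this is handy for budgeting before you run a long chain.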
<a href="#top">Back to top</a>

README.zh-TW.md

Lines changed: 69 additions & 6 deletions
```diff
@@ -7,7 +7,7 @@
 - [What's LangChain?](#whats-langchain)
 - [How to Use docGPT?](#how-to-use-docgpt)
 - [How to develop a docGPT with streamlit?](#how-to-develope-a-docgpt-with-streamlit)
-
+- [Advanced - How to build a better model in langchain](#advanced---how-to-build-a-better-model-in-langchain)

 * Main development software and packages:
   * `Python 3.8.6`
```
```diff
@@ -26,7 +26,7 @@
 * Integrate the LLM with other tools for **internet access**; this project uses the Serp API as an example, and through the Langchain framework you can ask the model about **current events** (i.e. via the **Google search engine**)
 * Integrate the LLM with the **LLM Math model** so the model can perform **accurate mathematical calculations**
 * The project's design consists of three main components:
-  * [`DataConnection`](../model/data_connection.py): lets the LLM communicate with external data, i.e. read PDF files, and splits large PDFs into chunks to avoid exceeding OpenAI's 4000-token limit
+  * [`DataConnection`](../model/data_connection.py): lets the LLM communicate with external data, i.e. read PDF files, and splits large PDFs into chunks to avoid exceeding OpenAI's 4096-token limit
   * [`docGPT`](../docGPT/): the core that lets the model understand the PDF content, including embedding the PDF text and building Langchain's RetrievalQA model; see the [documentation](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa) for details
   * [`agent`](../agent/agent.py): manages the tools the model uses and **automatically decides** which tool to apply based on the user's question; the tools include
     * `SerpAI`: when the user's question concerns **current events**, this tool performs a **Google search**
```
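The chunking idea behind `DataConnection` can be sketched in plain Python. This is an illustration only: docGPT itself uses Langchain's text splitters, and the chunk size and overlap below are arbitrary example values, not the project's settings.

```python
from typing import List

# Illustrative sketch: split a long document into overlapping chunks so each
# piece stays under a model's context limit.
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # step forward by chunk_size minus overlap so adjacent chunks share context
        start += chunk_size - overlap
    return chunks

parts = split_text("x" * 2500)
print(len(parts))  # 3 chunks; consecutive chunks share a 100-character overlap
```

The overlap matters for question answering: without it, a sentence cut at a chunk boundary is invisible to the retriever.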
```diff
@@ -39,14 +39,15 @@
 ### What's LangChain?

 * LangChain is a **framework for developing applications powered by language models**. It supports applications that:
-  1. connect an LLM to external data sources
-  2. allow interaction with an LLM
+    1. connect an LLM to external data sources
+    2. allow interaction with an LLM
+
 * For an introduction to langchain, see the official documentation and the [Github project](https://github.com/hwchase17/langchain)


 **Questions ChatGPT cannot answer, leave them to Langchain!**

-Here the author briefly introduces the differences between langchain and chatgpt; once you (你) understand the examples below, you will be amazed by the langchain open-source project!
+Here the author briefly introduces the differences between langchain and chatgpt; once you (您, polite form) understand the examples below, you will be amazed by the langchain open-source project!

 > Imagine that chatgpt cannot answer math questions or events after 2020 (e.g. who is your country's president in 2023?)
 >
```
```diff
@@ -88,7 +89,7 @@

 ### How to develope a docGPT with streamlit?

-A hands-on tutorial so you (你) can quickly build your own chatGPT!
+A hands-on tutorial so you (您, polite form) can quickly build your own chatGPT!

 First run `git clone https://github.com/Lin-jun-xiang/docGPT-streamlit.git`
```

@@ -105,4 +106,66 @@

* Click "Deploy an App", then paste your GitHub URL
* Complete deployment of the [application](https://docgpt-app.streamlit.app//)

---

### Advanced - How to build a better model in langchain

When building docGPT with Langchain, pay attention to the following points; these small details can make your model more powerful:
1. **Language Model**

   Using the right LLM gets you twice the result with half the effort. For example, you can choose OpenAI's `gpt-3.5-turbo` (the default is `text-davinci-003`):

   ```python
   # ./docGPT/docGPT.py
   from langchain.chat_models import ChatOpenAI

   llm = ChatOpenAI(
       temperature=0.2,
       max_tokens=2000,
       model_name='gpt-3.5-turbo'
   )
   ```
   Please note that there is no best or worst model; you need to try several models to find the one that fits your case best. For more OpenAI models, see the [documentation](https://platform.openai.com/docs/models).

   (Some models can use up to 16,000 tokens!)
2. **PDF Loader**

   Python has many loaders for parsing PDF text, each with its own pros and cons. Here are three the author has used ([official Langchain introduction](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf)):

   * `PyPDF`: simple and easy to use.
   * `PyMuPDF`: reads documents **very quickly**; besides parsing text, it can retrieve metadata such as page numbers and document dates.
   * `PDFPlumber`: can parse **text inside tables**; its usage is similar to `PyMuPDF` and it also retrieves metadata, but parsing takes longer.

   If your document has multiple tables and the important information lives in them, try `PDFPlumber`; it may give you unexpectedly good results!

   Please do not overlook this detail: if the text is not parsed correctly from the document, even the most powerful LLM is useless!
3. **Tracking Token Usage**

   This does not make the model more powerful, but it lets you see clearly, during the QA chain process, how many tokens and how much of your OpenAI API quota you are using.

   When calling `chain.run`, you can use the [method](https://python.langchain.com/docs/modules/model_io/models/llms/how_to/token_usage_tracking) provided by langchain:

   ```python
   from langchain.callbacks import get_openai_callback

   with get_openai_callback() as callback:
       response = self.qa_chain.run(query)
       print(callback)

   # Result of print
   """
   chain...
   ...
   > Finished chain.
   Total Tokens: 1506
   Prompt Tokens: 1350
   Completion Tokens: 156
   Total Cost (USD): $0.03012
   """
   ```
<a href="#top">Back to top</a>

agent/agent.py

Lines changed: 19 additions & 1 deletion
```diff
@@ -5,8 +5,9 @@
 from langchain import LLMMathChain, SerpAPIWrapper
 from langchain.agents import AgentType, Tool, initialize_agent
 from langchain.callbacks import get_openai_callback
+from langchain.chains import LLMChain
 from langchain.llms import OpenAI
-
+from langchain.prompts import PromptTemplate

 openai.api_key = os.getenv('OPENAI_API_KEY')
 os.environ['SERPAPI_API_KEY'] = os.getenv('SERPAPI_API_KEY')
```
```diff
@@ -51,6 +52,20 @@ def create_doc_chat(self, docGPT) -> Tool:
         )
         return tool

+    def create_llm_chain(self) -> Tool:
+        """Add a general-purpose LLM tool."""
+        prompt = PromptTemplate(
+            input_variables=['query'],
+            template='{query}'
+        )
+        llm_chain = LLMChain(llm=self.llm, prompt=prompt)
+
+        tool = Tool(
+            name='LLM',
+            func=llm_chain.run,
+            description='useful for general purpose queries and logic'
+        )
+        return tool
+
     def initialize(self, tools):
         for tool in tools:
             if isinstance(tool, Tool):
```
```diff
@@ -66,6 +81,9 @@ def initialize(self, tools):
     def query(self, query: str) -> Optional[str]:
         response = None
         with get_openai_callback() as callback:
+            # TODO: the true result may be hidden in the 'Observation' step
+            # https://github.com/hwchase17/langchain/issues/4916
+            # https://python.langchain.com/docs/modules/agents/how_to/intermediate_steps
             response = self.agent_.run(query)
             print(callback)
         return response
```
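The `Tool` pattern added here (a name, a callable, and a description) can be illustrated with a plain-Python stand-in. This is a sketch only: Langchain's real agent chooses a tool by reasoning over each tool's description with the LLM, whereas the keyword routing below merely demonstrates the dispatch idea; `Calculator` and the lambda bodies are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Plain-Python stand-in for langchain's Tool: a named callable with a
# description the agent uses to decide when to invoke it.
@dataclass
class Tool:
    name: str
    func: Callable[[str], str]
    description: str

tools = [
    Tool('Calculator', lambda q: str(eval(q)), 'useful for math questions'),
    Tool('LLM', lambda q: f'LLM answer to: {q}', 'useful for general purpose queries and logic'),
]

def route(query: str) -> str:
    # naive dispatch: send arithmetic-looking queries to the Calculator,
    # everything else to the general-purpose LLM tool
    tool = tools[0] if any(c in query for c in '+-*/') else tools[1]
    return tool.func(query)

print(route('3 * 7'))  # 21
```

Adding the general-purpose `LLM` tool matters because an agent with only specialized tools has nowhere to send ordinary questions.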

app.py

Lines changed: 16 additions & 11 deletions
```diff
@@ -1,4 +1,3 @@
-import asyncio
 import os
 import tempfile
 from functools import lru_cache
```
```diff
@@ -67,7 +66,7 @@ def load_api_key() -> None:
         type="password",
         key='SERPAPI_API_KEY'
     )
-    st.session_state.serpai_api_key = SERPAPI_API_KEY
+    st.session_state.serpapi_api_key = SERPAPI_API_KEY

     os.environ['SERPAPI_API_KEY'] = SERPAPI_API_KEY
```

```diff
@@ -97,29 +96,35 @@ def load_api_key() -> None:
         docGPT.create_qa_chain(
             chain_type='refine',
         )
+
         docGPT_tool = agent_.create_doc_chat(docGPT)
+        calculate_tool = agent_.get_calculate_chain
+        llm_tool = agent_.create_llm_chain()
+
     except Exception as e:
-        st.write(e)
+        print(e)

     try:
         search_tool = agent_.get_searp_chain
     except Exception as e:
-        st.warning('⚠️ You have not pass SEARPAPI key. (Or your api key cannot use.)')
+        print(e)

     try:
-        calculate_tool = agent_.get_calculate_chain
-
         tools = [
             docGPT_tool,
-            search_tool
+            search_tool,
+            llm_tool
         ]
         agent_.initialize(tools)
     except Exception as e:
         st.write(e)


 if not st.session_state['openai_api_key']:
-    st.error('⚠️ :red[You have not pass OpenAPI key. (Or your api key cannot use.)] Necessary')
+    st.error('⚠️ :red[You have not passed an OpenAI key (or your key is invalid).] Required')
+
+if not st.session_state['serpapi_api_key']:
+    st.warning('⚠️ You have not passed a SerpAPI key. (You cannot ask about current events.)')

 st.write('---')
```

```diff
@@ -131,13 +136,13 @@ def load_api_key() -> None:


 @lru_cache(maxsize=20)
-async def get_response(query: str):
+def get_response(query: str):
     try:
         if agent_.agent_ is not None:
             response = agent_.query(query)
             return response
     except Exception as e:
-        pass
+        print(e)

 query = st.text_input(
     "#### Question:",
@@ -149,7 +154,7 @@ async def get_response(query: str):

 with user_container:
     if query and query != '':
-        response = asyncio.run(get_response(query))
+        response = get_response(query)
         st.session_state.query.append(query)
         st.session_state.response.append(response)
```
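The switch from `async def` to a plain function matters here: `functools.lru_cache` caches whatever the wrapped function returns, and for a coroutine function that is the coroutine object itself, which cannot be awaited a second time. With a plain function, repeated identical questions are served from the cache. A minimal sketch of the caching behaviour (the `get_response` body below is a hypothetical stand-in, not docGPT's agent call):

```python
from functools import lru_cache

calls = 0

# Hypothetical stand-in for the agent query; counts real invocations so the
# effect of lru_cache(maxsize=20) is visible.
@lru_cache(maxsize=20)
def get_response(query: str) -> str:
    global calls
    calls += 1
    return f'answer to: {query}'

get_response('What is docGPT?')
get_response('What is docGPT?')  # served from the cache, no second call
print(calls)  # 1
```

Note the trade-off: cached answers are only correct while the underlying documents and tools do not change within a session.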

requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -5,3 +5,4 @@ streamlit_chat==0.1.1
 pymupdf
 chromadb
 tiktoken
+google-search-results==2.4.2
```

0 commit comments
