Commit 80af41d: "Update advanced"
2 parents 7aea2d8 + a26589b

File tree: 5 files changed (+169 / -19 lines)


README.md

Lines changed: 64 additions & 1 deletion
```diff
@@ -7,7 +7,7 @@
 - [What's LangChain?](#whats-langchain)
 - [How to Use docGPT?](#how-to-use-docgpt)
 - [How to Develop a docGPT with Streamlit?](#how-to-develop-a-docgpt-with-streamlit)
-
+- [Advanced - How to build a better model in langchain](#advanced---how-to-build-a-better-model-in-langchain)

 * Main Development Software and Packages:
   * `Python 3.8.6`
```
@@ -107,4 +107,67 @@ There are two methods:

* Click "Deploy an App" and paste your GitHub URL.
* Complete the deployment of your [application](https://docgpt-app.streamlit.app/).

---

### Advanced - How to build a better model in langchain

When building docGPT with Langchain, paying attention to the following details can make your model more powerful:
1. **Language Model**

   Choosing the right LLM can save you time and effort. For example, you can choose OpenAI's `gpt-3.5-turbo` (the default is `text-davinci-003`):

   ```python
   # ./docGPT/docGPT.py
   from langchain.chat_models import ChatOpenAI

   llm = ChatOpenAI(
       temperature=0.2,
       max_tokens=2000,
       model_name='gpt-3.5-turbo'
   )
   ```
   Please note that there is no single best or worst model. You need to try several models to find the one that fits your use case best. For more OpenAI models, refer to the [documentation](https://platform.openai.com/docs/models).

   (Some models support up to 16,000 tokens!)
2. **PDF Loader**

   There are various PDF text loaders available in Python, each with its own advantages and disadvantages. Here are three loaders the authors have used ([official Langchain documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf)):

   * `PyPDF`: simple and easy to use.
   * `PyMuPDF`: reads documents very **quickly** and provides additional metadata such as page numbers and document dates.
   * `PDFPlumber`: can **extract text inside tables**. Like PyMuPDF, it provides metadata, but it takes longer to parse.

   If your document contains multiple tables and the important information sits inside them, try `PDFPlumber`; it may give you unexpectedly good results!

   Please do not overlook this detail: if the text is not parsed correctly from the document, even the most powerful LLM is useless!
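   As a rough decision aid, the trade-off above can be written down as a small dispatch function. This is an illustrative sketch only, not code from docGPT; the returned class names (`PyPDFLoader`, `PyMuPDFLoader`, `PDFPlumberLoader`) are the wrapper classes in `langchain.document_loaders`, while the selection logic is just an example.

   ```python
   # Illustrative sketch: map document needs to a langchain PDF loader class name.
   def choose_pdf_loader(has_tables: bool, need_speed: bool) -> str:
       if has_tables:
           # PDFPlumber is slower, but can extract text inside tables
           return "PDFPlumberLoader"
       if need_speed:
           # PyMuPDF is the fastest and also returns rich metadata
           return "PyMuPDFLoader"
       # PyPDF is the simple default
       return "PyPDFLoader"

   print(choose_pdf_loader(has_tables=True, need_speed=True))  # PDFPlumberLoader
   ```

   Note the precedence: table extraction wins over speed, because a fast loader that drops table text loses exactly the information you care about.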
3. **Tracking Token Usage**

   This doesn't make the model more powerful, but it lets you track token usage and OpenAI API key consumption during the QA chain process.

   When calling `chain.run`, you can use the [method](https://python.langchain.com/docs/modules/model_io/models/llms/how_to/token_usage_tracking) provided by Langchain to track token usage:

   ```python
   from langchain.callbacks import get_openai_callback

   with get_openai_callback() as callback:
       response = self.qa_chain.run(query)
       print(callback)

   # Result of print
   """
   chain...
   ...
   > Finished chain.
   Total Tokens: 1506
   Prompt Tokens: 1350
   Completion Tokens: 156
   Total Cost (USD): $0.03012
   """
   ```
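   The cost figure in the callback is plain arithmetic: it matches `text-davinci-003`'s flat rate of $0.02 per 1,000 tokens. A minimal sketch (rates vary per model and change over time, so treat the default here as an illustration only; chat models also price prompt and completion tokens differently):

   ```python
   # Illustrative sketch: reproduce the callback's cost arithmetic.
   # $0.02 per 1K tokens is text-davinci-003's historical flat rate.
   def estimate_cost(prompt_tokens: int, completion_tokens: int,
                     usd_per_1k_tokens: float = 0.02) -> float:
       total = prompt_tokens + completion_tokens
       return round(total * usd_per_1k_tokens / 1000, 5)

   print(estimate_cost(1350, 156))  # 0.03012, matching the callback output above
   ```

   A back-of-the-envelope function like this is handy for budgeting before you run a long chain.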
<a href="#top">Back to top</a>

README.zh-TW.md

Lines changed: 69 additions & 6 deletions
```diff
@@ -7,7 +7,7 @@
 - [What's LangChain?](#whats-langchain)
 - [How to Use docGPT?](#how-to-use-docgpt)
 - [How to develop a docGPT with streamlit?](#how-to-develope-a-docgpt-with-streamlit)
-
+- [Advanced - How to build a better model in langchain](#advanced---how-to-build-a-better-model-in-langchain)

 * Main development software and packages:
   * `Python 3.8.6`
```
```diff
@@ -26,7 +26,7 @@
 * Integrate the LLM with other tools for **internet access**; this project uses the Serp API as an example, and through the Langchain framework you can ask the model about **current events** (i.e. via the **Google search engine**)
 * Integrate the LLM with the **LLM Math model** so the model can perform **accurate mathematical calculations**
 * The project's design consists of three main components:
-  * [`DataConnection`](../model/data_connection.py): lets the LLM communicate with external data, i.e. read PDF files, and splits large PDFs into chunks to avoid exceeding OpenAI's 4000-token limit
+  * [`DataConnection`](../model/data_connection.py): lets the LLM communicate with external data, i.e. read PDF files, and splits large PDFs into chunks to avoid exceeding OpenAI's 4096-token limit
   * [`docGPT`](../docGPT/): the core that lets the model understand the PDF content, including embedding the PDF text and building Langchain's RetrievalQA model; see the [documentation](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa) for details
   * [`agent`](../agent/agent.py): manages the tools the model uses and **automatically decides** which tool to apply based on the user's question; the tools include
     * `SerpAI`: when the user's question concerns **current events**, this tool performs a **Google search**
```
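The chunking idea behind `DataConnection` can be sketched in plain Python. This is an illustration only: docGPT itself uses Langchain's text splitters, and the chunk size and overlap below are arbitrary example values, not the project's settings.

```python
from typing import List

# Illustrative sketch: split a long document into overlapping chunks so each
# piece stays under a model's context limit.
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # step forward by chunk_size minus overlap so adjacent chunks share context
        start += chunk_size - overlap
    return chunks

parts = split_text("x" * 2500)
print(len(parts))  # 3 chunks; consecutive chunks share a 100-character overlap
```

The overlap matters for question answering: without it, a sentence cut at a chunk boundary is invisible to the retriever.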
```diff
@@ -39,14 +39,15 @@
 ### What's LangChain?

 * LangChain is a **framework for developing applications powered by language models**. It supports applications that:
-  1. connect an LLM to external data sources
-  2. allow interaction with an LLM
+    1. connect an LLM to external data sources
+    2. allow interaction with an LLM
+
 * For an introduction to langchain, see the official documentation and the [Github project](https://github.com/hwchase17/langchain)


 **Questions ChatGPT cannot answer, leave them to Langchain!**

-Here the author briefly introduces the differences between langchain and chatgpt; once you (你) understand the examples below, you will be amazed by the langchain open-source project!
+Here the author briefly introduces the differences between langchain and chatgpt; once you (您, polite form) understand the examples below, you will be amazed by the langchain open-source project!

 > Imagine that chatgpt cannot answer math questions or events after 2020 (e.g. who is your country's president in 2023?)
 >
```
```diff
@@ -88,7 +89,7 @@

 ### How to develope a docGPT with streamlit?

-A hands-on tutorial so you (你) can quickly build your own chatGPT!
+A hands-on tutorial so you (您, polite form) can quickly build your own chatGPT!

 First run `git clone https://github.com/Lin-jun-xiang/docGPT-streamlit.git`
```

@@ -105,4 +106,66 @@

* Click "Deploy an App", then paste your GitHub URL
* Complete deployment of the [application](https://docgpt-app.streamlit.app//)

---

### Advanced - How to build a better model in langchain

When building docGPT with Langchain, pay attention to the following points; these small details can make your model more powerful:
1. **Language Model**

   Using the right LLM gets you twice the result with half the effort. For example, you can choose OpenAI's `gpt-3.5-turbo` (the default is `text-davinci-003`):

   ```python
   # ./docGPT/docGPT.py
   from langchain.chat_models import ChatOpenAI

   llm = ChatOpenAI(
       temperature=0.2,
       max_tokens=2000,
       model_name='gpt-3.5-turbo'
   )
   ```
   Please note that there is no best or worst model; you need to try several models to find the one that fits your case best. For more OpenAI models, see the [documentation](https://platform.openai.com/docs/models).

   (Some models can use up to 16,000 tokens!)
2. **PDF Loader**

   Python has many loaders for parsing PDF text, each with its own pros and cons. Here are three the author has used ([official Langchain introduction](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf)):

   * `PyPDF`: simple and easy to use.
   * `PyMuPDF`: reads documents **very quickly**; besides parsing text, it can retrieve metadata such as page numbers and document dates.
   * `PDFPlumber`: can parse **text inside tables**; its usage is similar to `PyMuPDF` and it also retrieves metadata, but parsing takes longer.

   If your document has multiple tables and the important information lives in them, try `PDFPlumber`; it may give you unexpectedly good results!

   Please do not overlook this detail: if the text is not parsed correctly from the document, even the most powerful LLM is useless!
3. **Tracking Token Usage**

   This does not make the model more powerful, but it lets you see clearly, during the QA chain process, how many tokens and how much of your OpenAI API quota you are using.

   When calling `chain.run`, you can use the [method](https://python.langchain.com/docs/modules/model_io/models/llms/how_to/token_usage_tracking) provided by langchain:

   ```python
   from langchain.callbacks import get_openai_callback

   with get_openai_callback() as callback:
       response = self.qa_chain.run(query)
       print(callback)

   # Result of print
   """
   chain...
   ...
   > Finished chain.
   Total Tokens: 1506
   Prompt Tokens: 1350
   Completion Tokens: 156
   Total Cost (USD): $0.03012
   """
   ```
<a href="#top">Back to top</a>

agent/agent.py

Lines changed: 19 additions & 1 deletion
```diff
@@ -5,8 +5,9 @@
 from langchain import LLMMathChain, SerpAPIWrapper
 from langchain.agents import AgentType, Tool, initialize_agent
 from langchain.callbacks import get_openai_callback
+from langchain.chains import LLMChain
 from langchain.llms import OpenAI
-
+from langchain.prompts import PromptTemplate

 openai.api_key = os.getenv('OPENAI_API_KEY')
 os.environ['SERPAPI_API_KEY'] = os.getenv('SERPAPI_API_KEY')
```
```diff
@@ -51,6 +52,20 @@ def create_doc_chat(self, docGPT) -> Tool:
         )
         return tool

+    def create_llm_chain(self) -> Tool:
+        """Add a general-purpose LLM tool."""
+        prompt = PromptTemplate(
+            input_variables=['query'],
+            template='{query}'
+        )
+        llm_chain = LLMChain(llm=self.llm, prompt=prompt)
+
+        tool = Tool(
+            name='LLM',
+            func=llm_chain.run,
+            description='useful for general purpose queries and logic'
+        )
+        return tool
+
     def initialize(self, tools):
         for tool in tools:
             if isinstance(tool, Tool):
```
```diff
@@ -66,6 +81,9 @@ def initialize(self, tools):
     def query(self, query: str) -> Optional[str]:
         response = None
         with get_openai_callback() as callback:
+            # TODO: the true result may be hidden in the 'Observation' step
+            # https://github.com/hwchase17/langchain/issues/4916
+            # https://python.langchain.com/docs/modules/agents/how_to/intermediate_steps
             response = self.agent_.run(query)
             print(callback)
         return response
```
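The `Tool` pattern added here (a name, a callable, and a description) can be illustrated with a plain-Python stand-in. This is a sketch only: Langchain's real agent chooses a tool by reasoning over each tool's description with the LLM, whereas the keyword routing below merely demonstrates the dispatch idea; `Calculator` and the lambda bodies are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Plain-Python stand-in for langchain's Tool: a named callable with a
# description the agent uses to decide when to invoke it.
@dataclass
class Tool:
    name: str
    func: Callable[[str], str]
    description: str

tools = [
    Tool('Calculator', lambda q: str(eval(q)), 'useful for math questions'),
    Tool('LLM', lambda q: f'LLM answer to: {q}', 'useful for general purpose queries and logic'),
]

def route(query: str) -> str:
    # naive dispatch: send arithmetic-looking queries to the Calculator,
    # everything else to the general-purpose LLM tool
    tool = tools[0] if any(c in query for c in '+-*/') else tools[1]
    return tool.func(query)

print(route('3 * 7'))  # 21
```

Adding the general-purpose `LLM` tool matters because an agent with only specialized tools has nowhere to send ordinary questions.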

app.py

Lines changed: 16 additions & 11 deletions
```diff
@@ -1,4 +1,3 @@
-import asyncio
 import os
 import tempfile
 from functools import lru_cache
```
```diff
@@ -67,7 +66,7 @@ def load_api_key() -> None:
         type="password",
         key='SERPAPI_API_KEY'
     )
-    st.session_state.serpai_api_key = SERPAPI_API_KEY
+    st.session_state.serpapi_api_key = SERPAPI_API_KEY

     os.environ['SERPAPI_API_KEY'] = SERPAPI_API_KEY
```

```diff
@@ -97,29 +96,35 @@ def load_api_key() -> None:
         docGPT.create_qa_chain(
             chain_type='refine',
         )
+
         docGPT_tool = agent_.create_doc_chat(docGPT)
+        calculate_tool = agent_.get_calculate_chain
+        llm_tool = agent_.create_llm_chain()
+
     except Exception as e:
-        st.write(e)
+        print(e)

     try:
         search_tool = agent_.get_searp_chain
     except Exception as e:
-        st.warning('⚠️ You have not pass SEARPAPI key. (Or your api key cannot use.)')
+        print(e)

     try:
-        calculate_tool = agent_.get_calculate_chain
-
         tools = [
             docGPT_tool,
-            search_tool
+            search_tool,
+            llm_tool
         ]
         agent_.initialize(tools)
     except Exception as e:
         st.write(e)


 if not st.session_state['openai_api_key']:
-    st.error('⚠️ :red[You have not pass OpenAPI key. (Or your api key cannot use.)] Necessary')
+    st.error('⚠️ :red[You have not passed an OpenAI key (or your key is invalid).] Required')
+
+if not st.session_state['serpapi_api_key']:
+    st.warning('⚠️ You have not passed a SerpAPI key. (You cannot ask about current events.)')

 st.write('---')
```

```diff
@@ -131,13 +136,13 @@ def load_api_key() -> None:


 @lru_cache(maxsize=20)
-async def get_response(query: str):
+def get_response(query: str):
     try:
         if agent_.agent_ is not None:
             response = agent_.query(query)
             return response
     except Exception as e:
-        pass
+        print(e)

 query = st.text_input(
     "#### Question:",
@@ -149,7 +154,7 @@ async def get_response(query: str):

 with user_container:
     if query and query != '':
-        response = asyncio.run(get_response(query))
+        response = get_response(query)
         st.session_state.query.append(query)
         st.session_state.response.append(response)
```
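The switch from `async def` to a plain function matters here: `functools.lru_cache` caches whatever the wrapped function returns, and for a coroutine function that is the coroutine object itself, which cannot be awaited a second time. With a plain function, repeated identical questions are served from the cache. A minimal sketch of the caching behaviour (the `get_response` body below is a hypothetical stand-in, not docGPT's agent call):

```python
from functools import lru_cache

calls = 0

# Hypothetical stand-in for the agent query; counts real invocations so the
# effect of lru_cache(maxsize=20) is visible.
@lru_cache(maxsize=20)
def get_response(query: str) -> str:
    global calls
    calls += 1
    return f'answer to: {query}'

get_response('What is docGPT?')
get_response('What is docGPT?')  # served from the cache, no second call
print(calls)  # 1
```

Note the trade-off: cached answers are only correct while the underlying documents and tools do not change within a session.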

requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -5,3 +5,4 @@ streamlit_chat==0.1.1
 pymupdf
 chromadb
 tiktoken
+google-search-results==2.4.2
```

0 commit comments
