
Commit 63ff0d6

Parse a PDF URL link by crawling it, as an alternative to uploading a local file
1 parent 1a3c903 commit 63ff0d6

5 files changed: 65 additions and 6 deletions

README.md

Lines changed: 12 additions & 1 deletion
@@ -62,6 +62,15 @@ If you like this project, please give it a ⭐`Star` to support the developers~

 ---

+### 🧨Features
+
+- **`gpt4free` Integration**: Everyone can use `docGPT` for **free** without needing an OpenAI API key.
+- **Direct PDF URL Input**: Users can input PDF `URL` links for parsing without uploading `.pdf` files.
+- **Langchain Agent**: Enables AI to answer current questions and achieve Google search-like functionality.
+- **User-Friendly Environment**: Easy-to-use interface for simple operations.
+
+---
+
 ### 🦜️What's LangChain?

 * LangChain is a framework for developing applications powered by language models. It supports the following applications:
@@ -101,7 +110,9 @@ Through LangChain, you can create a universal AI model or tailor it for business
 - `OpenAI API KEY`: Ensure you have available usage.
 - `SERPAPI API KEY`: Required if you want to query content not present in the PDF.

-3. 📁Upload a PDF file from local storage.
+3. 📁Upload a PDF file (choose one method)
+* Method 1: Browse and upload your own `.pdf` file from your local machine.
+* Method 2: Enter the PDF `URL` link directly.

 4. 🚀Start asking questions!

README.zh-TW.md

Lines changed: 13 additions & 1 deletion
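@@ -65,6 +65,15 @@

 ---

+### 🧨Features
+
+- **`gpt4free` Integration**: Anyone can use GPT4 for free without entering an OpenAI API key.
+- **Direct PDF URL Input**: Users can enter a PDF URL directly for parsing, without uploading a .pdf file.
+- **Langchain Agent**: The AI can answer current questions, achieving Google search-like functionality.
+- **Easy-to-Use Environment**: A friendly interface that is simple to operate.
+
+---
+
 ### 🦜️What's LangChain?

 * LangChain is a **framework for developing applications powered by language models**. It supports the following applications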
@@ -105,7 +114,10 @@ LangChain fills the gaps left by ChatGPT. Through the following example, you can
 * `OpenAI API KEY`: Make sure you still have available usage.
 * `SERPAPI API KEY`: Required if you want to query content not present in the PDF.

-3. 📁Upload a PDF file from local storage
+3. 📁Upload a PDF file from local storage (choose one method)
+* Method 1: Browse and upload your own `.pdf` from your local machine
+* Method 2: Enter the PDF URL link
+
 4. 🚀Start asking questions!

 ![RGB_cleanup](https://github.com/Lin-jun-xiang/docGPT-streamlit/blob/main/img/docGPT.gif?raw=true)

app.py

Lines changed: 25 additions & 3 deletions
@@ -42,7 +42,9 @@ def theme() -> None:
 1. Enter your API keys: (You can choose to skip it and use the `gpt4free` free model)
 * `OpenAI API Key`: Make sure you still have usage left
 * `SERPAPI API Key`: Optional. If you want to ask questions about content not appearing in the PDF document, you need this key.
-2. Upload a PDF file from your local machine.
+2. Upload a PDF file (choose one method):
+* method1: Browse and upload your own `.pdf` file from your local machine.
+* method2: Enter the PDF `URL` link directly.
 3. Start asking questions!
 4. More details.(https://github.com/Lin-jun-xiang/docGPT-streamlit)
 5. If you have any questions, feel free to leave comments and engage in discussions.(https://github.com/Lin-jun-xiang/docGPT-streamlit/issues)
@@ -92,10 +94,30 @@ def load_api_key() -> None:


 def upload_and_process_pdf() -> list:
-    upload_file = st.file_uploader('#### Upload a PDF file:', type='pdf')
+    st.write('#### Upload a PDF file:')
+    browse, url_link = st.tabs(
+        ['Drag and drop file (Browse files)', 'Enter PDF URL link']
+    )
+    with browse:
+        upload_file = st.file_uploader(
+            'Browse file',
+            type='pdf',
+            label_visibility='hidden'
+        )
+        upload_file = upload_file.read() if upload_file else None
+
+    with url_link:
+        pdf_url = st.text_input(
+            "Enter PDF URL Link",
+            placeholder='https://www.xxx/uploads/file.pdf',
+            label_visibility='hidden'
+        )
+        if pdf_url:
+            upload_file = PDFLoader.crawl_pdf_file(pdf_url)
+
     if upload_file:
         temp_file = tempfile.NamedTemporaryFile(delete=False)
-        temp_file.write(upload_file.read())
+        temp_file.write(upload_file)
         temp_file_path = temp_file.name

         docs = PDFLoader.load_documents(temp_file_path)
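
Two details of this hunk may be worth a second look. When a URL is entered, the result of `crawl_pdf_file` replaces `upload_file` even if the fetch failed and a file was already uploaded in the other tab. The visible lines also do not show the temporary file being flushed or closed before `load_documents` reads it, which can leave buffered bytes unwritten. The following is a minimal sketch of one way to handle both, independent of Streamlit; the helper names `pick_pdf_bytes` and `write_pdf_to_temp` are hypothetical and not part of the commit.

```python
import os
import tempfile
from typing import Optional


def pick_pdf_bytes(
    uploaded: Optional[bytes], fetched: Optional[bytes]
) -> Optional[bytes]:
    """Prefer bytes fetched from a URL, but fall back to the uploaded file.

    Assumption of this sketch: a failed fetch is represented as None (or
    empty bytes) and should not discard an existing upload.
    """
    return fetched if fetched else uploaded


def write_pdf_to_temp(data: bytes) -> str:
    """Write PDF bytes to a temporary file and return its path.

    Leaving the `with` block closes the file, so every byte is on disk
    before another library opens the path. delete=False keeps the file
    around; the caller is responsible for removing it later.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
        temp_file.write(data)
    return temp_file.name


if __name__ == "__main__":
    pdf_bytes = pick_pdf_bytes(b"%PDF-1.4 uploaded", None)
    path = write_pdf_to_temp(pdf_bytes)
    print(path)
    os.remove(path)  # clean up the demo file
```

In `upload_and_process_pdf`, the path returned by such a helper could be passed to `PDFLoader.load_documents` in place of `temp_file.name`.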

model/data_connection.py

Lines changed: 14 additions & 1 deletion
@@ -1,9 +1,10 @@
-import json
 import os
 from typing import Iterator

+import requests
 from langchain.document_loaders import PyMuPDFLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
+import streamlit as st


 class PDFLoader:
@@ -35,3 +36,15 @@ def split_documents(
         )

         return splitter.split_documents(document)
+
+    @staticmethod
+    def crawl_pdf_file(url: str) -> str:
+        try:
+            response = requests.get(url)
+            content_type = response.headers.get('content-type')
+            if response.status_code == 200 and 'pdf' in content_type:
+                return response.content
+            else:
+                st.warning('Url cannot parse to PDF')
+        except:
+            st.warning('Url cannot parse to PDF')
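
A few points in `crawl_pdf_file` stand out on review. It is annotated `-> str`, but `response.content` is `bytes` and the warning paths return `None`. `content_type` is `None` when the header is missing, so `'pdf' in content_type` raises a `TypeError` that only the bare `except:` hides. `requests.get` is also called without a `timeout`. Below is a minimal sketch of the validation step as a pure function that can be tested without network access. The name `extract_pdf_bytes`, the `%PDF-` signature check, and the choice to accept either signal are assumptions of this sketch, not part of the commit.

```python
from typing import Optional

# PDF files begin with this header signature.
PDF_MAGIC = b"%PDF-"


def extract_pdf_bytes(
    status_code: int, content_type: Optional[str], body: bytes
) -> Optional[bytes]:
    """Return the body if the response looks like a PDF, otherwise None.

    Accepts the body when the status is 200 and either the Content-Type
    header mentions "pdf" or the payload starts with the PDF signature.
    A missing header is treated as an empty string instead of raising.
    """
    if status_code != 200:
        return None
    declared_pdf = "pdf" in (content_type or "").lower()
    looks_like_pdf = body.startswith(PDF_MAGIC)
    return body if (declared_pdf or looks_like_pdf) else None


if __name__ == "__main__":
    print(extract_pdf_bytes(200, "application/pdf", b"%PDF-1.7 ...") is not None)
    print(extract_pdf_bytes(200, None, b"<html></html>") is None)
```

Inside `crawl_pdf_file`, a helper like this could be fed `response.status_code`, `response.headers.get('content-type')`, and `response.content` after a call such as `requests.get(url, timeout=10)`, with `except requests.RequestException:` in place of the bare `except:`. The 10-second value is illustrative.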

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ tiktoken==0.4.0
 tenacity==8.1.0
 google-search-results==2.4.2
 sentence_transformers
+requests
