Checked other resources
I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
Commit to Help
I commit to help with one of those options 👆
Example Code
from typing import List

from langchain_core.documents import Document
from langchain_community.document_loaders import (
    PyMuPDFLoader,
    TextLoader,
    UnstructuredWordDocumentLoader,
)

def load_document(file_path: str, file_extension: str) -> List[Document]:
    """Loads a supported file (.pdf, .txt, .doc, .docx) into a list of documents."""
    if file_extension.lower() == ".pdf":
        loader = PyMuPDFLoader(file_path)
    elif file_extension.lower() == ".txt":
        loader = TextLoader(file_path)
    elif file_extension.lower() in [".doc", ".docx"]:
        loader = UnstructuredWordDocumentLoader(file_path)
    else:
        raise ValueError(f"Unsupported file extension: {file_extension}")
    return loader.load()
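
For context, the function is called with a local file path and that file's extension. A minimal invocation looks like this (the path is only a placeholder for one of the files copied out of Blob storage, not the real service path):

docs = load_document(file_path="/tmp/indexing/sample.docx", file_extension=".docx")
print(f"Loaded {len(docs)} document(s)")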
Description
I am running an indexing service inside an Azure App Service. I call an App Service endpoint to trigger indexing of documents that are kept in Azure Blob Storage. When indexing is triggered via the API call, the service copies the files from the Blob container into a temporary folder inside the App Service, and that local path is passed as file_path to the load_document function above. PDF and TXT files are loaded and indexed correctly, but the .docx files fail with a 403 error and the traceback below. The traceback ends inside unstructured's attempt to download NLTK data via urllib.request, so the 403 appears to come from that outbound HTTP request rather than from reading the .docx file itself.
Traceback (most recent call last):
  File "/app/rag/tool_indexer.py", line 80, in index_files
    doc = load_document(file_path=file, file_extension=file_extension)
  File "/app/rag/document_loader_pdf.py", line 40, in load_document
    print("Sustain-check DL content information:", loader.load())
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 30, in load
    return list(self.lazy_load())
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 89, in lazy_load
    elements = self._get_elements()
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/word_document.py", line 126, in _get_elements
    return partition_docx(filename=self.file_path, **self.unstructured_kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 149, in partition_docx
    return list(elements)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 380, in _iter_document_elements
    yield from self._iter_paragraph_elements(block_item)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 598, in _iter_paragraph_elements
    yield from self._classify_paragraph_to_element(item)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 441, in _classify_paragraph_to_element
    TextSubCls = self._parse_paragraph_text_for_element_type(paragraph)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 902, in _parse_paragraph_text_for_element_type
    if is_possible_narrative_text(text):
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 74, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 270, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 219, in sentence_count
    sentences = sent_tokenize(text)
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 134, in sent_tokenize
    _download_nltk_packages_if_not_present()
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 128, in _download_nltk_packages_if_not_present
    download_nltk_packages()
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 86, in download_nltk_packages
    urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file_path)
  File "/usr/local/lib/python3.11/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
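
For reference, the ingest path is roughly the sketch below. This is only a simplified illustration of the flow described above, assuming the azure-storage-blob SDK; the connection string, container name, and function name are placeholders, not the actual service code.

import os
import tempfile

from azure.storage.blob import ContainerClient

def index_container(connection_string: str, container_name: str) -> None:
    """Copy each blob to a temporary folder, then hand the local path to load_document."""
    container = ContainerClient.from_connection_string(connection_string, container_name)
    with tempfile.TemporaryDirectory() as tmp_dir:
        for blob in container.list_blobs():
            # Download the blob into temporary local storage inside the App Service.
            local_path = os.path.join(tmp_dir, os.path.basename(blob.name))
            with open(local_path, "wb") as fh:
                fh.write(container.download_blob(blob.name).readall())
            # Pass the local path and extension to the loader shown under Example Code.
            extension = os.path.splitext(local_path)[1]
            docs = load_document(file_path=local_path, file_extension=extension)
            # ... documents are then chunked and pushed to the index ...

The PDF and TXT branches complete this flow without issue; only the .docx branch ends in the NLTK download shown at the bottom of the traceback.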
System Info
Python - 3.11
langchain==0.2.17
langchain-community==0.2.16
langchain-openai==0.1.17