Checked other resources
I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
Commit to Help
I commit to help with one of those options 👆
Example Code
from typing import List

from langchain_core.documents import Document
from langchain_community.document_loaders import (
    PyMuPDFLoader,
    TextLoader,
    UnstructuredWordDocumentLoader,
)

def load_document(file_path: str, file_extension: str) -> List[Document]:
    """Loads a supported file (.pdf, .txt, .doc, .docx) into a list of documents."""
    if file_extension.lower() == ".pdf":
        loader = PyMuPDFLoader(file_path)
    elif file_extension.lower() == ".txt":
        loader = TextLoader(file_path)
    elif file_extension.lower() in [".doc", ".docx"]:
        loader = UnstructuredWordDocumentLoader(file_path)
    else:
        raise ValueError(f"Unsupported file extension: {file_extension}")
    return loader.load()
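
For context, the function is called with a local file path and that file's extension. A minimal invocation looks like this (the path is only a placeholder for one of the files copied out of Blob storage, not the real service path):

docs = load_document(file_path="/tmp/indexing/sample.docx", file_extension=".docx")
print(f"Loaded {len(docs)} document(s)")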
Description
I am running an indexing service inside an Azure App Service. I call an App Service endpoint to trigger indexing of documents that are kept in Azure Blob Storage. When indexing is triggered via the API call, the service copies the files from the Blob container into a temporary folder inside the App Service, and that local path is passed as file_path to the load_document function above. PDF and TXT files are loaded and indexed correctly, but the .docx files fail with a 403 error and the traceback below. The traceback ends inside unstructured's attempt to download NLTK data via urllib.request, so the 403 appears to come from that outbound HTTP request rather than from reading the .docx file itself.
Traceback (most recent call last):
  File "/app/rag/tool_indexer.py", line 80, in index_files
    doc = load_document(file_path=file, file_extension=file_extension)
  File "/app/rag/document_loader_pdf.py", line 40, in load_document
    print("Sustain-check DL content information:", loader.load())
  File "/usr/local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 30, in load
    return list(self.lazy_load())
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 89, in lazy_load
    elements = self._get_elements()
  File "/usr/local/lib/python3.11/site-packages/langchain_community/document_loaders/word_document.py", line 126, in _get_elements
    return partition_docx(filename=self.file_path, **self.unstructured_kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 149, in partition_docx
    return list(elements)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 380, in _iter_document_elements
    yield from self._iter_paragraph_elements(block_item)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 598, in _iter_paragraph_elements
    yield from self._classify_paragraph_to_element(item)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 441, in _classify_paragraph_to_element
    TextSubCls = self._parse_paragraph_text_for_element_type(paragraph)
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/docx.py", line 902, in _parse_paragraph_text_for_element_type
    if is_possible_narrative_text(text):
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 74, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 270, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
  File "/usr/local/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 219, in sentence_count
    sentences = sent_tokenize(text)
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 134, in sent_tokenize
    _download_nltk_packages_if_not_present()
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 128, in _download_nltk_packages_if_not_present
    download_nltk_packages()
  File "/usr/local/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 86, in download_nltk_packages
    urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file_path)
  File "/usr/local/lib/python3.11/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
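
For reference, the ingest path is roughly the sketch below. This is only a simplified illustration of the flow described above, assuming the azure-storage-blob SDK; the connection string, container name, and function name are placeholders, not the actual service code.

import os
import tempfile

from azure.storage.blob import ContainerClient

def index_container(connection_string: str, container_name: str) -> None:
    """Copy each blob to a temporary folder, then hand the local path to load_document."""
    container = ContainerClient.from_connection_string(connection_string, container_name)
    with tempfile.TemporaryDirectory() as tmp_dir:
        for blob in container.list_blobs():
            # Download the blob into temporary local storage inside the App Service.
            local_path = os.path.join(tmp_dir, os.path.basename(blob.name))
            with open(local_path, "wb") as fh:
                fh.write(container.download_blob(blob.name).readall())
            # Pass the local path and extension to the loader shown under Example Code.
            extension = os.path.splitext(local_path)[1]
            docs = load_document(file_path=local_path, file_extension=extension)
            # ... documents are then chunked and pushed to the index ...

The PDF and TXT branches complete this flow without issue; only the .docx branch ends in the NLTK download shown at the bottom of the traceback.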
System Info
Python - 3.11
langchain==0.2.17
langchain-community==0.2.16
langchain-openai==0.1.17