Missing text when parsing #115

tomercagan opened this issue Mar 25, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@tomercagan

First off - I've been evaluating the service recently and it is nice to use. Quite accurate and easy to work with.

I want to report an issue. While working with a technical specification document, I ran into an instance where some content is simply omitted from the output. I assume it is a bug.

Here is the code I used, loosely based on the advanced demo example:

from llama_parse import LlamaParse

parser_md = LlamaParse(
    api_key=llama_parse_api_key,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    # num_workers=4, # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en"  # Optionally you can define a language, default=en
)

doc_md = parser_md.load_data("example_pdfs/ts_136124v170100p.pdf")[0]

# print the relevant section
start = doc_md.text.index("Interpretation of the measurement results")
start = doc_md.text.index("Interpretation of the measurement results", start+100)
for line in doc_md.text[start: start + 900].split("\n"):
    print(line)

The output of the code above is:

image

Following is the corresponding page from the PDF (Page numbered 17, the 18th page in the document). Note that before the table, there are two additional paragraphs ("Table 4 specifies..." and "A confidence level of ..."), as well as the caption of the table, that are not in the output of the parsed document.

image

I double checked whether the text appears elsewhere using the following code:

txts = ["Table 4 specifies", "Maximum measurement uncertainty of the Test System", "A confidence level of 95"]
for txt in txts:
    try:
        doc_md.text.index(txt)
    except ValueError:
        print(f"The text '{txt}' is not in parsed document")

which results in the following output:

The text 'Table 4 specifies' is not in parsed document
The text 'Maximum measurement uncertainty of the Test System' is not in parsed document
The text 'A confidence level of 95' is not in parsed document

So this seems like a bug...

When working with text output, that paragraph is parsed correctly. When working with JSON output (see below), these sections are also missing.

parser = LlamaParse(api_key=llama_parse_api_key, verbose=True)
json_objs = parser.get_json_result("example_pdfs/ts_136124v170100p.pdf")
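
To narrow down which pages are affected, the JSON result can be scanned page by page. The following is only a minimal sketch. It assumes each entry returned by get_json_result has a "pages" list whose items carry "page", "text" and "md" keys; that shape is an assumption and is not confirmed anywhere in this thread. find_phrase_by_page is a hypothetical helper, and the sample data is a stand-in, not real parser output.

```python
def find_phrase_by_page(json_objs, phrase):
    """Return (page, in_text, in_md) for each page where the phrase appears
    in at least one of the two fields (assumed keys: "page", "text", "md")."""
    hits = []
    for doc in json_objs:
        for page in doc.get("pages", []):
            in_text = phrase in page.get("text", "")
            in_md = phrase in page.get("md", "")
            if in_text or in_md:
                hits.append((page.get("page"), in_text, in_md))
    return hits


# Stand-in data with the assumed shape; replace with the real json_objs.
sample = [{"pages": [
    {"page": 18, "text": "Table 4 specifies ...", "md": "| a | b |"},
]}]
print(find_phrase_by_page(sample, "Table 4 specifies"))
```

A page reported with in_text=True and in_md=False would be one where the markdown rendering dropped the phrase.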

I have the following versions installed:

# main deps
llama-index                                       0.10.20
llama-index-core                                  0.10.21.post1
llama-parse                                       0.3.9
llama-index-readers-file                          0.1.11
llama-index-readers-llama-parse                   0.1.3

# others - maybe relevant
llama-index-agent-openai                          0.1.6
llama-index-embeddings-huggingface                0.1.4
llama-index-embeddings-openai                     0.1.7
llama-index-indices-managed-llama-cloud           0.1.4
llama-index-legacy                                0.9.48
llama-index-llms-anthropic                        0.1.5
llama-index-llms-openai                           0.1.12
llama-index-multi-modal-llms-anthropic            0.1.2
llama-index-multi-modal-llms-openai               0.1.4
llama-index-postprocessor-flag-embedding-reranker 0.1.2
llama-index-program-openai                        0.1.4
llama-index-question-gen-openai                   0.1.3
llamaindex-py-client                              0.1.13

hexapode added the bug label Mar 25, 2024
hexapode self-assigned this Mar 25, 2024

@tritos-design

I can second the exact same behavior when using the plain API via curl. It seems like sometimes text before tables is omitted in markdown output, but the text is present in text output.
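
Since the text output seems to keep what the markdown output drops, one way to spot the gaps automatically is to compare the two. The sketch below is a rough heuristic, not part of LlamaParse: normalize and missing_lines are hypothetical helpers, and min_len is an arbitrary threshold chosen to skip short fragments such as table cells.

```python
import re


def normalize(s):
    """Drop common markdown markup and collapse whitespace so that plain
    text and markdown can be compared as substrings."""
    s = re.sub(r"[|#*_`]", " ", s)
    return re.sub(r"\s+", " ", s).strip().lower()


def missing_lines(text_output, md_output, min_len=20):
    """Return lines of the text output whose normalized form does not occur
    anywhere in the normalized markdown output."""
    md_norm = normalize(md_output)
    missing = []
    for line in text_output.splitlines():
        norm = normalize(line)
        if len(norm) >= min_len and norm not in md_norm:
            missing.append(line.strip())
    return missing


text_output = "Intro paragraph that comes before the table.\nA  B\n1  2"
md_output = "| A | B |\n|---|---|\n| 1 | 2 |"
print(missing_lines(text_output, md_output))
```

Long table rows and lines that wrap differently in the two outputs can show up as false positives, so the result is a list of candidates to review rather than a definitive diff.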

@aukinfo

aukinfo commented Apr 8, 2024

I have the same problem with other documents

@ggjx22

ggjx22 commented Apr 25, 2024

I can confirm this bug still exists. My use case is parsing documents such as purchase orders and invoices. When a table gets very large (around 30 lines), the parser gets "lazy" and only returns the first few line items. This behavior is inconsistent: when the table continues onto the next page of the document and ends there (fewer line items, around 20 lines), the parser handles ALL rows almost perfectly. I have tested this several times with result_type='markdown'. The parser also struggles when it encounters 1) columns that have no title but are not empty, and 2) columns that have a title but are empty. Using parsing_instruction to control the results is not effective either (e.g. "do this if the table is like that"). This leaves the need to consistently verify the parsing quality before indexing the documents.

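from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# create a simple directory loader
# (<my_parsing_instructions> below is a placeholder for the instruction string)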
parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type='markdown',
    parsing_instruction=<my_parsing_instructions>
)
file_extractor = {'.pdf': parser}

document = SimpleDirectoryReader(
    input_dir='data/',
    file_extractor=file_extractor,
).load_data()

image
image
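
One cheap way to verify the parsing quality before indexing is to count the data rows in each markdown table and compare that with the number of line items expected from the source document. This is a minimal sketch under the assumption that tables come back as pipe tables with a header row and a dash separator row; is_separator and count_table_rows are hypothetical helpers, not part of LlamaParse.

```python
def is_separator(line):
    """True for a markdown header separator row such as |---|:---:|."""
    cells = line.strip().strip("|").split("|")
    return all(c.strip() and set(c.strip()) <= set("-:") for c in cells)


def count_table_rows(md):
    """Return the number of data rows (header excluded) for each pipe table
    found in the markdown string, in document order."""
    counts, block = [], []
    for line in md.splitlines() + [""]:  # trailing "" flushes the last table
        if line.strip().startswith("|"):
            block.append(line)
        elif block:
            data = [row for row in block if not is_separator(row)]
            counts.append(max(len(data) - 1, 0))  # minus the header row
            block = []
    return counts


md = "Order\n\n| item | qty |\n|---|---|\n| a | 1 |\n| b | 2 |\n"
print(count_table_rows(md))
```

If a purchase order is known to have about 30 line items and the count comes back much lower, the document can be flagged for a re-parse or manual review.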

@adiko4

adiko4 commented Jun 23, 2024

How can I help with that? I'm experiencing the same behavior while extracting PDF to Markdown..
Some text ahead of a table is thrown away.
Would love to solve this :)

@ilyav123

I have the same issue (missing text in MD/JSON compared to parsed text) even on very simple PDF files. This is very embarrassing as some strings from the document may be critical for the correct interpretation of the content and therefore for RAG.

Attaching a very simple example that will hopefully help the developers resolve the bug. For this document I consistently get nothing before the table in MD (see jobs efc31bb5-f90d-486a-ad39-9dea0f820ebd, d635040b-7af7-4061-a756-d09f232566e3, 182d7cc9-77c1-4ff7-8065-954d54dd09a7).

Workaround: if I add "Return as much information from the document as possible, don't skip any text from the document. Parse tables into tables." as a parsing instruction, the text from the document starts to appear in the MD output! (I didn't do hundreds of test iterations, but for 5 it was OK.)
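
For reference, a minimal sketch of how that instruction could be passed, using the parsing_instruction parameter that appears earlier in this thread. build_parser_kwargs and make_parser are hypothetical helper names, and the LlamaParse import is done lazily so the configuration can be inspected without the package installed.

```python
WORKAROUND_INSTRUCTION = (
    "Return as much information from the document as possible, "
    "don't skip any text from the document. Parse tables into tables."
)


def build_parser_kwargs(api_key):
    """Keyword arguments for LlamaParse with the workaround instruction."""
    return {
        "api_key": api_key,
        "result_type": "markdown",
        "parsing_instruction": WORKAROUND_INSTRUCTION,
    }


def make_parser(api_key):
    """Create the parser; requires the llama-parse package to be installed."""
    from llama_parse import LlamaParse  # imported lazily on purpose
    return LlamaParse(**build_parser_kwargs(api_key))
```

Later comments in this thread report that the instruction did not help in every case, so it is worth checking the output rather than relying on it.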

LTD Example.pdf

@dominikpeter

We have the same issue. However, with the workaround from @ilyav123 it worked for us too.

@chengyin38

chengyin38 commented Nov 26, 2024

I'm also noticing that LlamaParse skips text, particularly in a column whose values span several lines. Adding the instruction as suggested here (#115 (comment)) still resulted in incomplete parsing.

@Atharwa1234

It still persists even after the instruction was added!

@vivkver

vivkver commented Mar 27, 2025

Facing a similar issue: details are missing from the markdown output but present in the text output.

@tr-amogha

facing the same issue!
