Missing text when parsing #115

tomercagan · 2024-03-25T11:33:42Z

First off - I've been evaluating the service recently and it is nice to use. Quite accurate and easy to work with.

I want to report an issue. While working with a technical specification document, I ran into a case where some content is simply omitted from the parsed output. I assume it is a bug.

Here is the code I used, loosely based on the advanced demo example:

from llama_parse import LlamaParse

parser_md = LlamaParse(
    api_key=llama_parse_api_key,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    # num_workers=4, # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

doc_md = parser_md.load_data("example_pdfs/ts_136124v170100p.pdf")[0]

# print the relevant section
start = doc_md.text.index("Interpretation of the measurement results")
start = doc_md.text.index("Interpretation of the measurement results", start+100)
for line in doc_md.text[start: start + 900].split("\n"):
    print(line)

The output of the code above is:

Following is the corresponding page from the PDF (Page numbered 17, the 18th page in the document). Note that before the table, there are two additional paragraphs ("Table 4 specifies..." and "A confidence level of ..."), as well as the caption of the table, that are not in the output of the parsed document.

I double checked whether the text appears elsewhere using the following code:

txts = ["Table 4 specifies", "Maximum measurement uncertainty of the Test System", "A confidence level of 95"]
for txt in txts:
    try:
        doc_md.text.index(txt)
    except ValueError:
        print(f"The text '{txt}' is not in parsed document")

which results in the following output:

The text 'Table 4 specifies' is not in parsed document
The text 'Maximum measurement uncertainty of the Test System' is not in parsed document
The text 'A confidence level of 95' is not in parsed document

So this seems like a bug...

When working with text output, that paragraph is parsed correctly. When working with JSON output (see below), these sections are also missing.

parser = LlamaParse(api_key=llama_parse_api_key, verbose=True)
json_objs = parser.get_json_result("example_pdfs/ts_136124v170100p.pdf")
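Since the text output contains the paragraphs that the markdown output drops, one way to spot the gaps is to diff the two results. The following is only a rough sketch under assumptions: `normalize` and `lines_missing_from_markdown` are hypothetical helper names, not part of llama_parse, and both outputs are assumed to be available as plain strings. Table rows will likely show up as false positives, because markdown renders them with pipes.

```python
import re


def normalize(s: str) -> str:
    """Collapse whitespace and lowercase, so line wrapping does not matter."""
    return re.sub(r"\s+", " ", s).strip().lower()


def lines_missing_from_markdown(text_output: str, markdown_output: str,
                                min_len: int = 20) -> list[str]:
    """Return lines of the text output that cannot be found in the markdown.

    Lines shorter than `min_len` characters (after normalization) are skipped,
    since short fragments such as page numbers produce too much noise.
    """
    md = normalize(markdown_output)
    missing = []
    for line in text_output.splitlines():
        norm = normalize(line)
        if len(norm) >= min_len and norm not in md:
            missing.append(line.strip())
    return missing
```

With the parsers from this report, the call would look roughly like `lines_missing_from_markdown(doc_text.text, doc_md.text)`, where `doc_text` is assumed to be the same document parsed with result_type="text".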

I have the following versions installed:

# main deps
llama-index                                       0.10.20
llama-index-core                                  0.10.21.post1
llama-parse                                       0.3.9
llama-index-readers-file                          0.1.11
llama-index-readers-llama-parse                   0.1.3

# others - maybe relevant
llama-index-agent-openai                          0.1.6
llama-index-embeddings-huggingface                0.1.4
llama-index-embeddings-openai                     0.1.7
llama-index-indices-managed-llama-cloud           0.1.4
llama-index-legacy                                0.9.48
llama-index-llms-anthropic                        0.1.5
llama-index-llms-openai                           0.1.12
llama-index-multi-modal-llms-anthropic            0.1.2
llama-index-multi-modal-llms-openai               0.1.4
llama-index-postprocessor-flag-embedding-reranker 0.1.2
llama-index-program-openai                        0.1.4
llama-index-question-gen-openai                   0.1.3
llamaindex-py-client                              0.1.13


tritos-design · 2024-04-03T12:36:41Z

I can second the exact same behavior when using the plain API via curl. It seems like sometimes text before tables is omitted in markdown output, but the text is present in text output.

aukinfo · 2024-04-08T08:34:23Z

I have the same problem with other documents

ggjx22 · 2024-04-25T02:59:55Z

I can confirm this bug still exists. My use case is parsing documents such as purchase orders and invoices. When a table gets very large (around 30 lines), the parser gets "lazy" and only returns the first few line items. This behavior is inconsistent: when the table continues onto the next page, where it ends (fewer line items, around 20 lines), the parser handles ALL rows almost perfectly. I have tested this several times with result_type='markdown'. The parser also struggles when it encounters 1) columns with no title that are not empty, or 2) columns with a title that are empty. Using parsing_instruction to control the results is not effective either (e.g. "do this if the table looks like that"). This leaves the need to consistently verify the parsing quality before indexing the documents.

from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# create a simple directory loader
parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type='markdown',
    parsing_instruction=my_parsing_instructions,  # placeholder: your own instruction string
)
file_extractor = {'.pdf': parser}

document = SimpleDirectoryReader(
    input_dir='data/',
    file_extractor=file_extractor,
).load_data()
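To verify parsing quality before indexing, a simple check is to count the data rows of each markdown table and compare the numbers with the line-item count you expect. This is a minimal sketch, assuming the parser emits standard pipe tables with a header row; `count_table_rows` is a hypothetical helper, not part of llama_parse or llama_index.

```python
import re


def count_table_rows(markdown: str) -> list[int]:
    """Return the number of data rows (header excluded) of each pipe table."""
    counts = []
    rows = None  # None while we are outside a table
    # The trailing "" acts as a sentinel so a table at end-of-text is flushed.
    for line in markdown.splitlines() + [""]:
        s = line.strip()
        if s.startswith("|"):
            cells = [c.strip() for c in s.strip("|").split("|")]
            if all(re.fullmatch(r":?-+:?", c) for c in cells):
                continue  # separator row such as |---|---|
            rows = 1 if rows is None else rows + 1
        elif rows is not None:
            counts.append(rows - 1)  # subtract the header row
            rows = None
    return counts
```

For an invoice with a known number of line items, a row count well below that number would flag the document for manual review or a re-parse.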

adiko4 · 2024-06-23T18:54:23Z

How can I help with that? I'm experiencing the same behavior when converting PDF to Markdown.
Some text ahead of a table is thrown away.
Would love to solve this :)

ilyav123 · 2024-08-13T09:33:07Z

I have the same issue (missing text in MD/JSON compared to parsed text) even on very simple PDF files. This is very embarrassing as some strings from the document may be critical for the correct interpretation of the content and therefore for RAG.

I am attaching a very simple example that will hopefully help the developers resolve the bug. For this document I consistently get nothing before the table in MD (see jobs efc31bb5-f90d-486a-ad39-9dea0f820ebd, d635040b-7af7-4061-a756-d09f232566e3, 182d7cc9-77c1-4ff7-8065-954d54dd09a7).

Workaround: if I add "Return as much information from the document as possible, don't skip any text from the document. Parse tables into tables." as a parsing instruction, the text from the document starts to appear in the MD output! (I didn't run hundreds of test iterations, but it was OK in 5 out of 5.)
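For anyone who wants to try the workaround, it could be wired in roughly as follows. This is a sketch, not a confirmed fix: it assumes the `parsing_instruction` parameter used earlier in this thread is available in your llama_parse version, and `parser_kwargs` and `make_parser` are hypothetical helper names.

```python
WORKAROUND_INSTRUCTION = (
    "Return as much information from the document as possible, "
    "don't skip any text from the document. Parse tables into tables."
)


def parser_kwargs(api_key: str) -> dict:
    """Build the keyword arguments for a markdown parser with the workaround."""
    return {
        "api_key": api_key,
        "result_type": "markdown",
        "parsing_instruction": WORKAROUND_INSTRUCTION,
    }


def make_parser(api_key: str):
    """Create the parser; llama_parse is imported lazily, only when called."""
    from llama_parse import LlamaParse

    return LlamaParse(**parser_kwargs(api_key))
```

Usage would be along the lines of `make_parser(llama_parse_api_key).load_data("LTD Example.pdf")`. The output is still worth checking, since the instruction is only a hint to the parser.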

LTD Example.pdf

dominikpeter · 2024-08-22T12:37:00Z

We have the same issue. However, with the workaround from @ilyav123 it worked for us too.

chengyin38 · 2024-11-26T21:29:09Z

Also noticing that LlamaParse skips text, particularly for a column whose values span several lines. Adding the instruction suggested above (#115 (comment)) still resulted in incomplete parsing.

Atharwa1234 · 2024-12-03T08:41:02Z

It still persists even after the instruction was added!

vivkver · 2025-03-27T17:23:25Z

Facing a similar issue: details are missing from the markdown output but present in the text output.

tr-amogha · 2025-03-30T20:33:29Z

facing the same issue!

hexapode added the bug (Something isn't working) label Mar 25, 2024

hexapode self-assigned this Mar 25, 2024
