Missing text when parsing #115
Comments
I can second the exact same behavior when using the plain API via curl. It seems that text before tables is sometimes omitted in the markdown output, even though the same text is present in the text output.
I have the same problem with other documents.
I can confirm this bug still exists. My use case is parsing documents such as purchase orders and invoices. When the table gets very large (around 30 lines), the parser gets "lazy" and only returns the first few line items. This behavior is inconsistent: when the table spills onto the next page of the document, where it ends (fewer line items, around 20 lines), it parses ALL rows almost perfectly. This has been tested several times with the following setup:
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# create the parser (LLAMA_CLOUD_API_KEY and the instructions are defined elsewhere)
parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type='markdown',
    parsing_instruction=<my_parsing_instructions>
)
# create a simple directory loader that routes PDFs through the parser
file_extractor = {'.pdf': parser}
document = SimpleDirectoryReader(
    input_dir='data/',
    file_extractor=file_extractor,
).load_data()
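One way to make the "only the first few line items" symptom measurable is to count the data rows of each table in the markdown output and compare that with the number of line items you expect. The following is a minimal sketch, assuming the markdown output uses pipe-style tables; `count_table_rows` and `_is_separator` are hypothetical helper names, not part of LlamaParse.

```python
def _is_separator(row: str) -> bool:
    """True for a header separator row such as |---|:---:|."""
    cells = [c.strip() for c in row.strip("|").split("|")]
    return all(cell and set(cell) <= set("-:") for cell in cells)


def count_table_rows(markdown: str) -> list:
    """Return the number of data rows for each pipe table found in `markdown`."""
    tables = []
    block = []
    # The trailing empty string acts as a sentinel that flushes the last table.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            block.append(stripped)
        elif block:
            tables.append(block)
            block = []
    # The first row of each block is treated as the header; separator rows are skipped.
    return [
        len([row for row in rows[1:] if not _is_separator(row)])
        for rows in tables
    ]
```

If an invoice is known to contain about 30 line items and the count for its table is much lower, the parse was probably truncated and is worth re-running or inspecting.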
How can I help with that? I'm experiencing the same behavior while extracting PDF to Markdown.
I have the same issue (missing text in MD/JSON compared to parsed text) even on very simple PDF files. This is very embarrassing, as some strings from the document may be critical for the correct interpretation of the content and therefore for RAG. Attaching a very simple example that will hopefully help the developers resolve the bug: for this document I consistently get nothing before the table in MD (see jobs efc31bb5-f90d-486a-ad39-9dea0f820ebd, d635040b-7af7-4061-a756-d09f232566e3, 182d7cc9-77c1-4ff7-8065-954d54dd09a7). Workaround: if I add "Return as much information from the document as possible, don't skip any text from the document. Parse tables into tables." as a parsing instruction, the text from the document starts to appear in the MD output! (I didn't run hundreds of test iterations, but it was OK in all 5 that I tried.)
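For reference, the workaround could be wired into the parser roughly like this. It is a sketch, assuming the same `LlamaParse` keyword arguments used in the earlier comment (`api_key`, `result_type`, `parsing_instruction`); `build_parser` and `FULL_TEXT_INSTRUCTION` are hypothetical names introduced only for illustration.

```python
# Instruction text quoted from the workaround described in this thread.
FULL_TEXT_INSTRUCTION = (
    "Return as much information from the document as possible, "
    "don't skip any text from the document. Parse tables into tables."
)


def build_parser(api_key: str):
    """Create a markdown parser that carries the workaround instruction."""
    # Imported lazily so this module loads even without llama_parse installed.
    from llama_parse import LlamaParse

    return LlamaParse(
        api_key=api_key,
        result_type="markdown",
        parsing_instruction=FULL_TEXT_INSTRUCTION,
    )
```

As later comments report, the instruction does not help with every document, so it is still worth checking the output for missing text.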
We have the same issue. However, the workaround from @ilyav123 worked for us too.
I'm also noticing that LlamaParse skips text, particularly in a column whose values span several lines. Adding the instruction suggested here (#115 (comment)) still resulted in incomplete parsing.
It still persists even after the instruction was added!
Facing similar issue, details missing from markdown output but present in text output |
facing the same issue! |
First off - I've been evaluating the service recently and it is nice to use. Quite accurate and easy to work with.
I want to report an issue. While working with a technical specification document, I ran into an instance where some content is simply omitted. I assume it is a bug.
Here is the code I used, loosely based on the advanced demo example:
The output of the code above is:
Following is the corresponding page from the PDF (Page numbered 17, the 18th page in the document). Note that before the table, there are two additional paragraphs ("Table 4 specifies..." and "A confidence level of ..."), as well as the caption of the table, that are not in the output of the parsed document.
I double checked whether the text appears elsewhere using the following code:
which results in the following output:
So this seems like a bug...
When working with text output, those paragraphs are parsed correctly. When working with JSON output (see following), these sections are also missing.
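Since the text output appears to be complete while the markdown and JSON outputs are not, one rough way to detect the problem automatically is to look for lines of the text output that have no counterpart in the markdown output. The sketch below assumes both outputs are already available as plain strings; `find_missing_lines` and `_normalize` are hypothetical helpers, and the comparison ignores everything except ASCII letters and digits, so it will not work well for non-Latin text.

```python
import re


def _normalize(s: str) -> str:
    """Lowercase and keep only ASCII letters and digits, dropping markup and spacing."""
    return re.sub(r"[^a-z0-9]+", "", s.lower())


def find_missing_lines(text_output: str, markdown_output: str, min_chars: int = 20) -> list:
    """Return lines of `text_output` whose content does not occur in `markdown_output`."""
    haystack = _normalize(markdown_output)
    missing = []
    for line in text_output.splitlines():
        key = _normalize(line)
        # Very short lines (page numbers, single cells) are skipped to limit noise.
        if len(key) >= min_chars and key not in haystack:
            missing.append(line.strip())
    return missing
```

Any line this returns is a candidate for omitted content. Line wrapping can differ between the two outputs, so results should be treated as hints rather than proof.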
I have the following versions installed: