parse: support partitioning files before parsing #709

Open
pmrowla wants to merge 5 commits into main from pdf-partition

Conversation

Contributor

@pmrowla pmrowla commented May 1, 2025

  • Adds LlamaParse.partition_pages to allow partitioning multi-page docs into separate parsing jobs (see the usage sketch below)
    • Defaults to None (docs are not partitioned)
    • When set, docs will be split into segments of up to partition_pages pages, and each segment will be parsed in a separate job
    • Works in conjunction with target_pages and max_pages
  • When parsing multiple documents, partition segments will be parsed sequentially per document (num_workers batching is handled at the file level, and segments within a single file are not split into additional batches)
  • When parsing a single document, partitioned segment jobs will be batched by num_workers if target_pages is set. When target_pages is None, segments are parsed sequentially. This is done since we have to naively try to parse each segment until we either reach max_pages or error out from trying to parse past the end of the file.
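
A minimal usage sketch of the new option (the file path and numeric values are illustrative; max_pages, target_pages, and num_workers are existing LlamaParse parameters):

from llama_parse import LlamaParse

# Illustrative values only: split the doc into 50-page segments, each parsed as a
# separate job, stopping once 200 pages have been parsed in total.
parser = LlamaParse(
    partition_pages=50,  # new in this PR: max pages per parsing job; None disables partitioning
    max_pages=200,
    num_workers=4,       # batching is applied per file; see the notes above for single-file behavior
)

documents = parser.load_data("./long_report.pdf")  # hypothetical input file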

Known limitations:

  • Docs are re-uploaded for each partitioned segment. This can be resolved once parse can reference previously uploaded docs in the backend.
  • Finding the end of a document is done naively in order to avoid opening docs locally (and adding dependencies to determine page counts for supported doc types). This could be resolved if we had a way to retrieve the total page count for previously uploaded docs via the API.

Requires:

@pmrowla pmrowla self-assigned this May 1, 2025
@pmrowla pmrowla force-pushed the pdf-partition branch 2 times, most recently from 4483d47 to d4a050f on May 2, 2025 04:29
Comment on lines +1067 to +1068
# Fetch JSON result type first to get accurate pagination data
# and then fetch the user's desired result type if needed
Contributor Author

May be worth considering returning pagination information as a part of job_metadata in the API? Currently we return job_pages which corresponds to the number of pages which were parsed (but may be less than the actual number of pages returned in the response if this was a cache hit)
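
For illustration, the kind of shape that suggestion implies (field names other than job_pages are hypothetical, not the current API):

job_metadata = {
    "job_pages": 50,     # pages parsed by this job (may be less than the pages returned on a cache hit)
    "total_pages": 137,  # hypothetical field: total pages in the uploaded doc, enabling exact pagination
}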

@pmrowla pmrowla changed the title [WIP] parse: support partitioning files before parsing parse: support partitioning files before parsing May 2, 2025
@pmrowla pmrowla marked this pull request as ready for review May 2, 2025 04:44
@pmrowla pmrowla requested a review from logan-markewich May 7, 2025 07:49
Contributor

@logan-markewich logan-markewich left a comment

The methods in this lib are starting to get very confusing to follow 😅 Going to check out the code locally to better review this...

if result_type == ResultType.JSON.value:
    job_result = json_result
else:
    job_result = await self._get_job_result(
Contributor

If we already have the JSON result, no need to make another API call, right? We can get any info we need out of the JSON?

Contributor Author

fixed in 0d6c3bf

    )
except Exception:
    if results:
        # API Error is expected if we try to read past the end of the file
Contributor

If specific errors are expected, we should catch them explicitly
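
For example, a sketch of narrowing the handler (httpx and the status-code check are assumptions about what the expected "past end of file" error looks like, not necessarily what the actual fix does):

import httpx

try:
    ...  # parse the next partition segment
except httpx.HTTPStatusError as err:
    # only swallow the expected "requested range starts past EOF" failure
    if results and err.response.status_code == 400:
        pass
    else:
        raise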

Contributor Author

fixed in 72b8b63


total = 0
results: List[Tuple[str, Dict[str, Any]]] = []
while self.max_pages is None or total < self.max_pages:
Contributor

oh boy this loop took me a while to understand. To confirm:

  1. The user has enabled partition_pages
  2. The user has set max_pages
  3. The user has not set target_pages

So, we parse in batches of pages until we hit max_pages, and return the result.

Is that correct?

Contributor Author

Yes, that's correct.

The issue is that we don't have a way to get the total number of pages in the doc locally. So in the case where the user wants to parse the entire doc (and is not using target_pages), we have to take the naive approach and loop and parse each batch of pages until either:

  • we reach max_pages if it is set by the user
  • we reach EOF, which can be indicated by either:
    • parse returns fewer pages than the batch range we requested, meaning the requested range starts on a valid page and then runs past the end of the doc
    • the API returns a page-does-not-exist error, meaning the requested range starts past the end of the doc (this should only happen when the previous range ends exactly on the last page of the doc)

In the future, we can simplify this behavior if we actually return the total number of available pages within a job result (then we would know exactly how many remaining batches to request, and could run them concurrently instead of doing this sequential loop).
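
Putting that together, a condensed sketch of the sequential loop (parse_segment and PageDoesNotExistError are placeholders, not the actual implementation):

from typing import Any, Callable, Dict, List, Optional, Tuple

class PageDoesNotExistError(Exception):
    """Placeholder for the API error raised when the requested range starts past EOF."""

def parse_partitioned(
    parse_segment: Callable[[int, int], List[Tuple[str, Dict[str, Any]]]],
    partition_pages: int,
    max_pages: Optional[int],
) -> List[Tuple[str, Dict[str, Any]]]:
    # parse_segment(start, end) stands in for "upload the doc and parse pages [start, end)"
    total = 0
    results: List[Tuple[str, Dict[str, Any]]] = []
    start = 0
    while max_pages is None or total < max_pages:
        try:
            pages = parse_segment(start, start + partition_pages)
        except PageDoesNotExistError:
            # the requested range starts past the end of the doc; only happens when the
            # previous range ended exactly on the last page
            break
        results.extend(pages)
        total += len(pages)
        if len(pages) < partition_pages:
            # fewer pages than requested: the range ran past the end of the doc
            break
        start += partition_pages
    return results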
