parse: support partitioning files before parsing #709

Open
pmrowla wants to merge 5 commits into main from pdf-partition

Conversation

Contributor

@pmrowla pmrowla commented May 1, 2025

  • Adds LlamaParse.partition_pages to allow partitioning multi-page docs into separate parsing jobs (see the usage sketch below)
    • Defaults to None (docs are not partitioned)
    • When set, docs will be split into segments of up to partition_pages pages, and each segment will be parsed in a separate job
    • Works in conjunction with target_pages and max_pages
  • When parsing multiple documents, partition segments will be parsed sequentially per document (num_workers batching is handled at the file level, and segments within a single file are not split into additional batches)
  • When parsing a single document, partitioned segment jobs will be batched by num_workers if target_pages is set. When target_pages is None, segments are parsed sequentially. This is done since we have to naively try to parse each segment until we either reach max_pages or error out from trying to parse past the end of the file.
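
A minimal usage sketch of the new option (the file path and numeric values are illustrative; max_pages, target_pages, and num_workers are existing LlamaParse parameters):

from llama_parse import LlamaParse

# Illustrative values only: split the doc into 50-page segments, each parsed as a
# separate job, stopping once 200 pages have been parsed in total.
parser = LlamaParse(
    partition_pages=50,  # new in this PR: max pages per parsing job; None disables partitioning
    max_pages=200,
    num_workers=4,       # batching is applied per file; see the notes above for single-file behavior
)

documents = parser.load_data("./long_report.pdf")  # hypothetical input file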

Known limitations:

  • Docs are re-uploaded for each partitioned segment. This can be resolved once parse can reference previously uploaded docs in the backend.
  • Finding the end of a document is done naively in order to avoid opening docs locally (and adding dependencies to determine page counts for supported doc types). This could be resolved if we had a way to retrieve the total page count for previously uploaded docs via the API.

Requires:

@pmrowla pmrowla self-assigned this May 1, 2025
@pmrowla pmrowla force-pushed the pdf-partition branch 2 times, most recently from 4483d47 to d4a050f on May 2, 2025 04:29
Comment on lines +1067 to +1068
# Fetch JSON result type first to get accurate pagination data
# and then fetch the user's desired result type if needed
Contributor Author

May be worth considering returning pagination information as a part of job_metadata in the API? Currently we return job_pages which corresponds to the number of pages which were parsed (but may be less than the actual number of pages returned in the response if this was a cache hit)
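
For illustration, the kind of shape that suggestion implies (field names other than job_pages are hypothetical, not the current API):

job_metadata = {
    "job_pages": 50,     # pages parsed by this job (may be less than the pages returned on a cache hit)
    "total_pages": 137,  # hypothetical field: total pages in the uploaded doc, enabling exact pagination
}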

@pmrowla pmrowla changed the title [WIP] parse: support partitioning files before parsing parse: support partitioning files before parsing May 2, 2025
@pmrowla pmrowla marked this pull request as ready for review May 2, 2025 04:44
@pmrowla pmrowla requested a review from logan-markewich May 7, 2025 07:49
Contributor

@logan-markewich logan-markewich left a comment

The methods in this lib are starting to get very confusing to follow 😅 Going to check out the code locally to better review this...

if result_type == ResultType.JSON.value:
    job_result = json_result
else:
    job_result = await self._get_job_result(
Contributor

If we already have the JSON result, no need to make another API call, right? We can get any info we need out of the JSON?

Contributor Author

fixed in 0d6c3bf

    )
except Exception:
    if results:
        # API Error is expected if we try to read past the end of the file
Contributor

If specific errors are expected, we should catch them explicitly
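
For example, a sketch of narrowing the handler (httpx and the status-code check are assumptions about what the expected "past end of file" error looks like, not necessarily what the actual fix does):

import httpx

try:
    ...  # parse the next partition segment
except httpx.HTTPStatusError as err:
    # only swallow the expected "requested range starts past EOF" failure
    if results and err.response.status_code == 400:
        pass
    else:
        raise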

Contributor Author

fixed in 72b8b63


total = 0
results: List[Tuple[str, Dict[str, Any]]] = []
while self.max_pages is None or total < self.max_pages:
Contributor

oh boy this loop took me a while to understand. To confirm:

  1. The user has enabled partition_pages
  2. The user has set max_pages
  3. The user has not set target_pages

So, we parse in batches of pages until we hit max_pages, and return the result.

Is that correct?

Contributor Author

Yes, that's correct.

The issue is that we don't have a way to get the total number of pages in the doc locally. So in the case where the user wants to parse the entire doc (and is not using target_pages), we have to take the naive approach and loop and parse each batch of pages until either:

  • we reach max_pages if it is set by the user
  • we reach EOF, which can be indicated by either:
    • parse returns fewer pages than the batch range we requested, meaning the requested range starts on a valid page and then runs past the end of the doc
    • the API returns a page-does-not-exist error, meaning the requested range starts past the end of the doc (this should only happen when the previous range ends exactly on the last page of the doc)

In the future, we can simplify this behavior if we actually return the total number of available pages within a job result (then we would know exactly how many remaining batches to request, and could run them concurrently instead of doing this sequential loop).
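
Putting that together, a condensed sketch of the sequential loop (parse_segment and PageDoesNotExistError are placeholders, not the actual implementation):

from typing import Any, Callable, Dict, List, Optional, Tuple

class PageDoesNotExistError(Exception):
    """Placeholder for the API error raised when the requested range starts past EOF."""

def parse_partitioned(
    parse_segment: Callable[[int, int], List[Tuple[str, Dict[str, Any]]]],
    partition_pages: int,
    max_pages: Optional[int],
) -> List[Tuple[str, Dict[str, Any]]]:
    # parse_segment(start, end) stands in for "upload the doc and parse pages [start, end)"
    total = 0
    results: List[Tuple[str, Dict[str, Any]]] = []
    start = 0
    while max_pages is None or total < max_pages:
        try:
            pages = parse_segment(start, start + partition_pages)
        except PageDoesNotExistError:
            # the requested range starts past the end of the doc; only happens when the
            # previous range ended exactly on the last page
            break
        results.extend(pages)
        total += len(pages)
        if len(pages) < partition_pages:
            # fewer pages than requested: the range ran past the end of the doc
            break
        start += partition_pages
    return results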
