parse: support partitioning files before parsing #709
base: main
Conversation
4483d47 to d4a050f
# Fetch JSON result type first to get accurate pagination data
# and then fetch the user's desired result type if needed
May be worth considering returning pagination information as part of job_metadata in the API? Currently we return job_pages, which corresponds to the number of pages that were parsed (but may be less than the actual number of pages returned in the response if this was a cache hit).
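A hypothetical sketch of what this suggestion could look like; the field names below are assumptions for illustration, not the current API:

```python
# Hypothetical shape for the reviewer's suggestion: surface pagination info
# in job_metadata instead of relying on job_pages alone. Field names here
# are assumptions, not the current API.
job_metadata = {
    "job_pages": 0,     # pages parsed by this job (could be 0 on a cache hit)
    "total_pages": 12,  # proposed addition: total pages in the response
}
```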
The methods in this lib are starting to get very confusing to follow 😅 Going to check out the code locally to better review this...
llama_cloud_services/parse/base.py
Outdated
if result_type == ResultType.JSON.value:
    job_result = json_result
else:
    job_result = await self._get_job_result(
If we already have the JSON result, no need to make another API call right? We can get any info we need out of the JSON?
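A minimal sketch of the reviewer's point, using stand-in names (`fetch_result` stands in for `_get_job_result`; the signatures are illustrative): if the JSON result was already fetched for pagination data, it can be reused rather than issuing a second API call.

```python
import asyncio
from typing import Any, Dict

# Placeholder for a real API call such as _get_job_result (assumed name).
async def fetch_result(job_id: str, result_type: str) -> Dict[str, Any]:
    return {"job_id": job_id, "type": result_type}

async def get_result(
    job_id: str, result_type: str, json_result: Dict[str, Any]
) -> Dict[str, Any]:
    if result_type == "json":
        return json_result  # reuse the result we already have
    return await fetch_result(job_id, result_type)

json_result = {"job_id": "abc", "type": "json", "pages": 3}
result = asyncio.run(get_result("abc", "json", json_result))
# For the JSON result type, no extra API round trip is made.
```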
fixed in 0d6c3bf
llama_cloud_services/parse/base.py
Outdated
)
except Exception:
    if results:
        # API Error is expected if we try to read past the end of the file
If specific errors are expected, we should catch them explicitly
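A sketch of catching the expected error explicitly instead of a bare `Exception`. `PageOutOfRangeError` is a hypothetical stand-in for the API's "page does not exist" error; the real client may raise a different type.

```python
from typing import List

# Hypothetical exception standing in for the API's "page does not exist" error.
class PageOutOfRangeError(Exception):
    pass

def read_batch(start: int, count: int, doc_pages: int) -> List[int]:
    """Return up to `count` page indices starting at `start`."""
    if start >= doc_pages:
        raise PageOutOfRangeError(f"page {start} does not exist")
    return list(range(start, min(start + count, doc_pages)))

results: List[int] = []
try:
    results += read_batch(0, 2, 3)  # pages 0-1
    results += read_batch(2, 2, 3)  # page 2 (partial batch at EOF)
    results += read_batch(3, 2, 3)  # starts past EOF -> raises
except PageOutOfRangeError:
    pass  # expected when reading past the end of the file; keep partial results
```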
fixed in 72b8b63
total = 0
results: List[Tuple[str, Dict[str, Any]]] = []
while self.max_pages is None or total < self.max_pages:
oh boy this loop took me a while to understand. To confirm:
- The user has enabled partition_pages
- The user has set max_pages
- The user has not set target_pages
So, we parse in batches of pages until we hit max_pages, and return the result.
Is that correct?
Yes that's correct.
The issue is that we don't have a way to get the total number of pages in the doc locally. So in the case where the user wants to parse the entire doc (and is not using target_pages), we have to take the naive approach and loop, parsing each batch of pages until either:
- we reach max_pages, if it is set by the user
- we reach EOF, which can be indicated by either:
  - parse returns fewer pages than the batch range we requested, meaning the requested range starts on a valid page and then runs past the end of the doc
  - the API returns a "page does not exist" error, meaning the requested range starts past the end of the doc (this should only happen when the previous range ends exactly on the last page of the doc)
In the future, we can simplify this behavior if we actually return the total number of available pages within a job result (then we would know exactly how many remaining batches we need to request, and could run them concurrently instead of doing this sequential loop)
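The termination conditions described above can be sketched as a small self-contained loop. `parse_range` is a stand-in for the real parse call and `doc_pages` simulates the unknown document length; all names are illustrative, not the library's implementation.

```python
from typing import List, Optional

def parse_range(start: int, count: int, doc_pages: int) -> List[int]:
    """Stand-in for a parse call: return up to `count` page indices from `start`."""
    if start >= doc_pages:
        raise ValueError("page does not exist")  # API error past EOF
    return list(range(start, min(start + count, doc_pages)))

def parse_all(batch: int, doc_pages: int, max_pages: Optional[int] = None) -> List[int]:
    total = 0
    results: List[int] = []
    while max_pages is None or total < max_pages:
        try:
            pages = parse_range(total, batch, doc_pages)
        except ValueError:
            break  # previous batch ended exactly on the last page
        results.extend(pages)
        total += len(pages)
        if len(pages) < batch:
            break  # fewer pages than requested -> we ran past EOF
    return results
```

The two EOF branches correspond to the two bullet points above: a short batch (started on a valid page, ran past the end) and an error (started past the end because the previous batch ended exactly on the last page).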
Adds LlamaParse.partition_pages to allow partitioning multiple-page docs into separate parsing jobs:
- Defaults to None (docs are not partitioned)
- When set, docs are partitioned into segments of partition_pages pages, and each segment will be parsed in a separate job
- Works with target_pages and max_pages (num_workers batching is handled at the file level, and segments within a single file are not split into additional batches)
- Segments are parsed concurrently with num_workers if target_pages is set. When target_pages is None, segments are parsed sequentially. This is done since we have to naively try to parse each segment until we either reach max_pages or error out from trying to parse past the end of the file.

Known limitations:

Requires:
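When the total page count is known (e.g. when target_pages is set), the partitioning described above amounts to slicing the document into fixed-size page ranges. An illustrative helper, not the library's implementation:

```python
from typing import List, Tuple

def partition(doc_pages: int, partition_pages: int) -> List[Tuple[int, int]]:
    """Return (start, end) page ranges, end exclusive, one per parsing job."""
    return [
        (start, min(start + partition_pages, doc_pages))
        for start in range(0, doc_pages, partition_pages)
    ]
```

Each resulting range would then be submitted as its own job; when the page count is unknown, the sequential probing loop discussed in the review thread is used instead.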