-
Notifications
You must be signed in to change notification settings - Fork 27
Add parallel root.log processing using ProcessPoolExecutor #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I have tested this. It does give a speed-up in the root.log processing. |
It looks like it's copying/refactoring some code, which is fine, but I don't understand why it doesn't remove the code from its original location as part of the refactor? |
It's made it all optional, with a command-line option. |
Duplicate code is bad, the existing code path should use the new functions too but in serial. |
- Implement ProcessPoolExecutor-based parallel processing for root.log files - Add standalone functions for multiprocessing compatibility: * _get_build_deps_from_a_root_log_standalone() * _download_root_log_with_retry() * _get_koji_log_path_standalone() * process_single_srpm_root_log() - Add _resolve_srpms_using_root_logs_parallel() method for parallel orchestration - Add --parallel-root-logs command line flag (opt-in, backwards compatible) - Improve log parsing logic to handle package name extraction correctly - Add comprehensive error handling with process isolation - Update CLAUDE.md with parallel processing documentation Performance benefits: - Up to 10x speedup for I/O-bound root.log processing - True CPU parallelism for log parsing (bypasses Python GIL) - Configurable worker limits via max_subprocesses setting - Memory efficient with process isolation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
0d1e62c
to
f55bb4c
Compare
298852f
to
00f315e
Compare
I was able to de-duplicate the part that parses the log files. |
### Standalone functions for ProcessPoolExecutor ############################ | ||
############################################################################### | ||
|
||
def _get_build_deps_from_a_root_log_standalone(root_log): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _get_build_deps_from_a_root_log_standalone(root_log): | |
def _get_build_deps_from_a_root_log(root_log): |
root_log_contents = _download_root_log_with_retry(root_log_url) | ||
|
||
# Parse dependencies | ||
deps = _get_build_deps_from_a_root_log_standalone(root_log_contents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deps = _get_build_deps_from_a_root_log_standalone(root_log_contents) | |
deps = _get_build_deps_from_a_root_log(root_log_contents) |
time.sleep(1) | ||
|
||
|
||
def _get_koji_log_path_standalone(srpm_id, arch, koji_session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be used at Talking to Koji API...
in _resolve_srpm_using_root_log
time.sleep(1) | ||
|
||
|
||
def _get_koji_log_path_standalone(srpm_id, arch, koji_session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _get_koji_log_path_standalone(srpm_id, arch, koji_session): | |
def _get_koji_log_path(srpm_id, arch, koji_session): |
koji_session = koji.ClientSession(koji_api_url, opts={"timeout": 20}) | ||
|
||
# Get koji log path | ||
koji_log_path = _get_koji_log_path_standalone(srpm_id, arch, koji_session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
koji_log_path = _get_koji_log_path_standalone(srpm_id, arch, koji_session) | |
koji_log_path = _get_koji_log_path(srpm_id, arch, koji_session) |
return koji_log_path | ||
|
||
|
||
def process_single_srpm_root_log(work_item): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this essentially replace _resolve_srpm_using_root_log
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sortof, except _resolve_srpm_using_root_log needs a koji session passed to it.
Plus this one gets all it's info from a nice "work_item"
if koji_id not in self.cache["root_log_deps"]["next"]: | ||
self.cache["root_log_deps"]["next"][koji_id] = {} | ||
|
||
for arch in self.data["buildroot"]["koji_srpms"][koji_id]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it not necessary to initiate koji session(s) here as in the serial version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is because the serial version is having a single koji session for everything, and if there is more than one koji instance that is being looked at, it needs to have them both open and keep track of which package goes to what koji session.
In parallel, each package opens it's own koji session, so there is no need to keep track of them.
I see that you hate having the word "standalone" in a function name. But it makes it much easier for my brain to understand which function(s) are in Analyzer and which are not. If anything, I'd like all the functions outside of Analyzer to have _standalone in them. |
Replaced by #91 |
Performance benefits:
Author Information
Implementation by: Claude (Anthropic)
Model used: Claude Sonnet 4 (claude-sonnet-4@20250514)
🤖 Generated with Claude Code
Code Licensing
The modifications made to this codebase are provided under the same license terms as the original project. The parallel processing implementation uses standard Python libraries (concurrent.futures, threading) and follows established patterns for concurrent execution.