Add parallel root.log processing using ProcessPoolExecutor #89

tdawson · 2025-08-04T16:02:16Z

Implement ProcessPoolExecutor-based parallel processing for root.log files
Add standalone functions for multiprocessing compatibility:
- _get_build_deps_from_a_root_log_standalone()
- _download_root_log_with_retry()
- _get_koji_log_path_standalone()
- process_single_srpm_root_log()
Add _resolve_srpms_using_root_logs_parallel() method for parallel orchestration
Add --parallel-root-logs command line flag (opt-in, backwards compatible)
Improve log parsing logic to handle package name extraction correctly
Add comprehensive error handling with process isolation
Update CLAUDE.md with parallel processing documentation

Performance benefits:

Up to 10x speedup for I/O-bound root.log processing
True CPU parallelism for log parsing (bypasses Python GIL)
Configurable worker limits via max_subprocesses setting
Memory efficient with process isolation

Author Information

Implementation by: Claude (Anthropic)
Model used: Claude Sonnet 4 (claude-sonnet-4@20250514)
🤖 Generated with Claude Code

Code Licensing

The modifications made to this codebase are provided under the same license terms as the original project. The parallel processing implementation uses standard Python libraries (concurrent.futures, threading) and follows established patterns for concurrent execution.

tdawson · 2025-08-04T16:04:19Z

I have tested this. It does give a speed-up in the root.log processing.
It says "10x" but that is because we have it set to 10 concurant processes. We could set it higher if we want.

yselkowitz · 2025-08-04T18:22:50Z

It looks like it's copying/refactoring some code, which is fine, but I don't understand why it doesn't remove the code from its original location as part of the refactor?

tdawson · 2025-08-04T18:36:18Z

It's made it all optional, with a command-line option.
That way if we find we don't want it, we don't have to use it.

yselkowitz · 2025-08-04T19:00:12Z

Duplicate code is bad, the existing code path should use the new functions too but in serial.

- Implement ProcessPoolExecutor-based parallel processing for root.log files - Add standalone functions for multiprocessing compatibility: * _get_build_deps_from_a_root_log_standalone() * _download_root_log_with_retry() * _get_koji_log_path_standalone() * process_single_srpm_root_log() - Add _resolve_srpms_using_root_logs_parallel() method for parallel orchestration - Add --parallel-root-logs command line flag (opt-in, backwards compatible) - Improve log parsing logic to handle package name extraction correctly - Add comprehensive error handling with process isolation - Update CLAUDE.md with parallel processing documentation Performance benefits: - Up to 10x speedup for I/O-bound root.log processing - True CPU parallelism for log parsing (bypasses Python GIL) - Configurable worker limits via max_subprocesses setting - Memory efficient with process isolation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

tdawson · 2025-08-04T22:07:13Z

Duplicate code is bad, the existing code path should use the new functions too but in serial.

I was able to de-duplicate the part that parses the log files.
The other things were more complicated and I wasn't feeling very comfortable de-duplicating those.

yselkowitz · 2025-08-06T02:59:38Z

content_resolver/analyzer.py

+### Standalone functions for ProcessPoolExecutor ############################
+###############################################################################
+
+def _get_build_deps_from_a_root_log_standalone(root_log):


Suggested change

def _get_build_deps_from_a_root_log_standalone(root_log):

def _get_build_deps_from_a_root_log(root_log):

content_resolver/analyzer.py

yselkowitz · 2025-08-06T03:05:25Z

content_resolver/analyzer.py

+        root_log_contents = _download_root_log_with_retry(root_log_url)
+
+        # Parse dependencies
+        deps = _get_build_deps_from_a_root_log_standalone(root_log_contents)


Suggested change

deps = _get_build_deps_from_a_root_log_standalone(root_log_contents)

deps = _get_build_deps_from_a_root_log(root_log_contents)

yselkowitz · 2025-08-06T03:34:55Z

content_resolver/analyzer.py

+            time.sleep(1)
+
+
+def _get_koji_log_path_standalone(srpm_id, arch, koji_session):


This can be used at Talking to Koji API... in _resolve_srpm_using_root_log

yselkowitz · 2025-08-06T03:38:49Z

content_resolver/analyzer.py

+            time.sleep(1)
+
+
+def _get_koji_log_path_standalone(srpm_id, arch, koji_session):


Suggested change

def _get_koji_log_path_standalone(srpm_id, arch, koji_session):

def _get_koji_log_path(srpm_id, arch, koji_session):

yselkowitz · 2025-08-06T03:39:06Z

content_resolver/analyzer.py

+        koji_session = koji.ClientSession(koji_api_url, opts={"timeout": 20})
+
+        # Get koji log path
+        koji_log_path = _get_koji_log_path_standalone(srpm_id, arch, koji_session)


Suggested change

koji_log_path = _get_koji_log_path_standalone(srpm_id, arch, koji_session)

koji_log_path = _get_koji_log_path(srpm_id, arch, koji_session)

yselkowitz · 2025-08-06T03:40:28Z

content_resolver/analyzer.py

+    return koji_log_path
+
+
+def process_single_srpm_root_log(work_item):


Doesn't this essentially replace _resolve_srpm_using_root_log?

Sortof, except _resolve_srpm_using_root_log needs a koji session passed to it.
Plus this one gets all it's info from a nice "work_item"

yselkowitz · 2025-08-06T03:55:03Z

content_resolver/analyzer.py

+            if koji_id not in self.cache["root_log_deps"]["next"]:
+                self.cache["root_log_deps"]["next"][koji_id] = {}
+
+            for arch in self.data["buildroot"]["koji_srpms"][koji_id]:


Why is it not necessary to initiate koji session(s) here as in the serial version?

I believe it is because the serial version is having a single koji session for everything, and if there is more than one koji instance that is being looked at, it needs to have them both open and keep track of which package goes to what koji session.
In parallel, each package opens it's own koji session, so there is no need to keep track of them.

tdawson · 2025-08-06T13:31:48Z

I see that you hate having the word "standalone" in a function name. But it makes it much easier for my brain to understand which function(s) are in Analyzer and which are not. If anything, I'd like all the functions outside of Analyzer to have _standalone in them.

yselkowitz · 2025-08-15T16:11:35Z

Replaced by #91

tdawson force-pushed the feature/parallel-processing branch from 0d1e62c to f55bb4c Compare August 4, 2025 21:10

De-duplicate get_build_deps_from_a_root_log

00f315e

tdawson force-pushed the feature/parallel-processing branch from 298852f to 00f315e Compare August 4, 2025 22:00

yselkowitz linked an issue Aug 5, 2025 that may be closed by this pull request

Content Resolver: parallelize root log #88

Closed

yselkowitz requested changes Aug 6, 2025

View reviewed changes

Revert parsing changes; remove unnessary imports

97aaacd

yselkowitz removed a link to an issue Aug 15, 2025

Content Resolver: parallelize root log #88

Closed

yselkowitz closed this Aug 15, 2025

	def _get_build_deps_from_a_root_log_standalone(root_log):
	def _get_build_deps_from_a_root_log(root_log):

	deps = _get_build_deps_from_a_root_log_standalone(root_log_contents)
	deps = _get_build_deps_from_a_root_log(root_log_contents)

		time.sleep(1)


		def _get_koji_log_path_standalone(srpm_id, arch, koji_session):

	def _get_koji_log_path_standalone(srpm_id, arch, koji_session):
	def _get_koji_log_path(srpm_id, arch, koji_session):

	koji_log_path = _get_koji_log_path_standalone(srpm_id, arch, koji_session)
	koji_log_path = _get_koji_log_path(srpm_id, arch, koji_session)

		return koji_log_path


		def process_single_srpm_root_log(work_item):

Add parallel root.log processing using ProcessPoolExecutor #89

Add parallel root.log processing using ProcessPoolExecutor #89

Uh oh!

Conversation

tdawson commented Aug 4, 2025

Uh oh!

tdawson commented Aug 4, 2025

Uh oh!

yselkowitz commented Aug 4, 2025

Uh oh!

tdawson commented Aug 4, 2025

Uh oh!

yselkowitz commented Aug 4, 2025

Uh oh!

tdawson commented Aug 4, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tdawson commented Aug 6, 2025

Uh oh!

yselkowitz commented Aug 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants