Purpose
Parallelize the refresh pipeline so that the download and analysis of large projects can run concurrently.
Part 1 (Proof of Concept)
Process
For a proof of concept (POC), parallelize the parsing of .mbox files. Each job will process one month's data, so that jobs remain independent and manageable. Python will manage the job queue and dispatcher, assigning jobs to threads as resources become available (scheduling policy TBD).
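As a rough illustration of the "one job per month" split, the sketch below builds a list of job descriptions for a date range. The function name `monthly_jobs`, the job dictionary keys, and the `<YYYY-MM>.mbox` file naming are all assumptions made for this example, not part of the existing code base.

```python
from datetime import date
from pathlib import PurePosixPath


def monthly_jobs(start: date, end: date, mbox_dir: str) -> list[dict]:
    """Return one job description per calendar month from start to end (inclusive)."""
    jobs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        label = f"{year:04d}-{month:02d}"
        jobs.append({
            "month": label,
            # Hypothetical naming scheme: one <YYYY-MM>.mbox file per month.
            "mbox_path": str(PurePosixPath(mbox_dir) / f"{label}.mbox"),
        })
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return jobs
```

Because each job only names its own month and file, jobs share no state and can be handed to the dispatcher in any order.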
Workflow
- The user initiates parsing of .mbox files from the CLI via the /exec R scripts.
- The /exec script parses its arguments and invokes the Python script.
- Python manages the job queue, assigning jobs to threads.
- Each thread calls /exec/parse_mbox.R to handle the parsing of its assigned file.
- Scripts perform their tasks and save output.
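The dispatch step of this workflow could look roughly like the sketch below, which uses `ThreadPoolExecutor` to run one external process per file. The helper names (`rscript_command`, `run_jobs`), the relative path `exec/parse_mbox.R`, and the single positional argument passed to it are assumptions for illustration; the actual script interface is still to be defined. With this approach the scheduling is simply first-in, first-out as worker threads become free, which is one possible answer to the open scheduling question, not a decision.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def rscript_command(mbox_path: str) -> list[str]:
    # Assumed invocation; the real arguments of parse_mbox.R are not specified yet.
    return ["Rscript", "exec/parse_mbox.R", mbox_path]


def run_jobs(mbox_paths, build_command=rscript_command, max_workers=4):
    """Run one external process per file and return {mbox_path: exit_code}."""

    def run_one(path):
        # Each thread blocks on its own child process, so jobs stay independent.
        completed = subprocess.run(build_command(path), capture_output=True, text=True)
        return completed.returncode

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the file it is processing.
        futures = {pool.submit(run_one, path): path for path in mbox_paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads are sufficient here because the heavy work happens in the child R processes; the Python side only waits on them. Passing `build_command` as a parameter keeps the dispatcher independent of how the R script is ultimately invoked.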
Task List
- Define project structure. Find a home for Python scripts (cannot live in R/).
- Format a configuration file to use with parse managers.
- A new R/exec script that parses a single mbox file into a table by calling the parse_mbox() function. R/exec scripts should take a config file as input rather than their own set of parameters.
- A Python script that can call said R script in parallel.
- Update documentation.
Libraries
- httr
- stringi
- yaml
- ThreadPoolExecutor (from Python's concurrent.futures module)
Part 2
Implement parallel processing for all downloaders. This section will be addressed after the successful implementation of Part 1.
References
Issue #248: Scaling Analysis with BatchJobPool
Issue #231: Parallel Git Log Entity Analysis
PR #234: Adds Parallelization Support for Git Log Entities