Description
This high-level issue describes a new workflow to be added to Enduro.
In some cases, content has not yet been appraised, but information professionals still need to establish basic bit-level preservation and intellectual control over the materials until they can be more thoroughly reviewed and selected from.
To support such cases, we propose adding a new Analysis workflow to Enduro. The purpose of an initial analysis workflow is to:
- Document the state of a given package at receipt
- Provide a high-level overview of the contents of a package that can support appraisal, selection, arrangement, and deduplication
- Generate a report that captures key information about the package contents for later review and selection
- Generate checksums for all submitted content that can then be used to verify integrity until final preservation decisions are made
Analysis workflow overview
A detailed proposal document has been prepared here:
What follows is a summary.
Enduro operators should be able to:
- Upload a package
- Initiate an analysis workflow
- Download a report of the analysis findings
This report can then be used to assist in appraisal and selection prior to preservation ingest.
Initial analysis workflow activities will include:
- Receipt
- Package checksum calculation
- Extraction
- File checksum list generation
- File identification via Siegfried
- Duplicate checking among package files
- Directory tree documentation
- Report generation
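As an illustration of the checksum-related activities listed above, here is a minimal Python sketch. It is not Enduro code. The function names are hypothetical, and SHA-256 is an assumption, since this issue does not name a checksum algorithm. The same `sha256_of_file` helper could be applied to the zipped package on receipt (package checksum) and to each extracted file (file checksum list).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.sha256()  # Assumption: the issue does not specify an algorithm.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_checksum_list(extracted_root: Path) -> list[tuple[str, str]]:
    """Return (checksum, relative path) pairs for every file under the root."""
    entries = []
    for path in sorted(extracted_root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(extracted_root).as_posix()
            entries.append((sha256_of_file(path), relative))
    return entries


def format_manifest(entries: list[tuple[str, str]]) -> str:
    """Render one 'checksum, two spaces, path' line per file."""
    return "".join(f"{checksum}  {rel_path}\n" for checksum, rel_path in entries)
```

Writing the manifest as "checksum, two spaces, path" lines keeps it compatible with tools such as `sha256sum -c`, which would let operators re-verify integrity later without Enduro.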
Depending on development time and priorities during our iterative process, this may further include:
- Bagging of content
- Storage in a new location
Workflow initiation
Packages submitted for analysis must be zipped. An operator could launch the workflow manually by either:
- Adding a zipped package to a new, secondary "analysis" watched MinIO bucket, or
- Using the Enduro UI uploader, and selecting the target workflow from a new upload configuration option
In this initial implementation, we will let the sponsoring client determine the preferred method. In the future, it would be possible to support both methods (in addition to others, such as launch via API, a watched filesystem directory, etc.).
If they choose the Enduro UI uploader option, here is a simple mockup of the new configuration option we would propose adding:
Development cadence and deliverables
This issue can either be reused or treated as an epic and linked to other issues as needed. We propose organizing this work into a series of deliverable-focused sprints:
Deliverable goal 1 - Support for multiple workflows
- Focus on internal work to sustainably add support in Enduro for multiple workflows.
- Ensure the new workflow can support child workflows; in the future, users may wish to customize the Analysis workflow with custom business logic
- Consider any impacted enums, internal naming conventions, UI changes, etc. that are required
Deliverable goal 2 - Initial workflow creation and basic file outputs
- Add initial workflow activities:
  - Receipt
  - Package checksum calculation
  - Extraction
  - File checksum calculation and text file generation
  - Directory tree calculation and text file generation
  - Bundling of 2 text files in a ZIP and delivery via UI link
- Ensure workflow, tasks, etc are all shown in Enduro UI
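The directory tree and bundling steps in this deliverable could look roughly like the following Python sketch. It is illustrative only. The function names, the two-space indentation, and the `checksums.txt` and `tree.txt` file names are assumptions, not part of the proposal.

```python
import io
import zipfile
from pathlib import Path


def directory_tree(root: Path, indent: str = "") -> list[str]:
    """Return an indented listing of everything under root, directories first."""
    lines = []
    children = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for child in children:
        if child.is_dir():
            lines.append(f"{indent}{child.name}/")
            lines.extend(directory_tree(child, indent + "  "))
        else:
            lines.append(f"{indent}{child.name}")
    return lines


def bundle_outputs(checksums_text: str, tree_text: str) -> bytes:
    """Pack the two text outputs into one in-memory ZIP for download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Hypothetical member names; the issue does not specify them.
        bundle.writestr("checksums.txt", checksums_text)
        bundle.writestr("tree.txt", tree_text)
    return buffer.getvalue()
```

Building the ZIP in memory keeps the sketch short. For very large manifests, writing to a temporary file or directly to object storage would probably be a better fit.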
Deliverable goal 3 - additional activities
- Add the following activities:
  - File identification using Siegfried
  - Check package for file duplicates
- Consider how best to make this new information available to operators
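For the duplicate check, one simple approach is to group files by checksum and report any checksum shared by more than one path. The Python sketch below assumes its input is a list of (checksum, relative path) pairs. That input shape and the function name are hypothetical, and file identification with Siegfried is not covered here.

```python
from collections import defaultdict


def find_duplicates(entries: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each checksum that occurs more than once to the sorted paths sharing it."""
    by_checksum: dict[str, list[str]] = defaultdict(list)
    for checksum, rel_path in entries:
        by_checksum[checksum].append(rel_path)
    return {
        checksum: sorted(paths)
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }
```

This only detects byte-identical files within a single package. Near-duplicates, such as the same image saved in two formats, would need a different technique.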
Deliverable goal 4 - improved reporting
- Depending on client preference, return the analysis outcome as:
  - A CSV (possibly with the tree diagram kept in a separate TXT file), or
  - A PDF
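If the CSV option is chosen, the report could be one row per file. The Python sketch below is only a guess at a possible layout. The column names (`path`, `sha256`, `format_id`, `duplicate_of`) are hypothetical and would need to be agreed with the client.

```python
import csv
import io

# Hypothetical columns; the issue does not define the report schema.
REPORT_COLUMNS = ["path", "sha256", "format_id", "duplicate_of"]


def build_csv_report(rows: list[dict[str, str]]) -> str:
    """Serialize per-file analysis rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Using `csv.DictWriter` means paths containing commas or quotes are escaped correctly, which matters because file names in submitted packages are not under the operator's control.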
Deliverable goal 5 - refinement; optional work
- Incorporate any feedback
- Consider usability improvements (e.g. workflow type filter on browse pages)
- Time and interest allowing, implement optional work:
  - Add activities for package bagging and moving
  - Consider if these last activities can be made configurable
  - etc.