preprocessing-sfa is an Enduro preprocessing workflow for SFA SIPs.
The preprocessing workers need to share the filesystem with Enduro's a3m or Archivematica workers, and both sides must connect to the same Temporal server and be tied together by matching namespace, task queue, and workflow name values.
The required configuration for the preprocessing worker:

```toml
debug = false
verbosity = 0
sharedPath = "/home/enduro/preprocessing"

[temporal]
address = "temporal.enduro-sdps:7233"
namespace = "default"
taskQueue = "preprocessing"
workflowName = "preprocessing"

[worker]
maxConcurrentSessions = 1
```
Optional BagIt bag configuration:

```toml
[bagit]
checksumAlgorithm = "md5"
```
The preprocessing section for Enduro's configuration:

```toml
[preprocessing]
enabled = true
extract = true # Extract must be true for the preprocessing-sfa workflow.
sharedPath = "/home/enduro/preprocessing"

[preprocessing.temporal]
namespace = "default"
taskQueue = "preprocessing"
workflowName = "preprocessing"

# Enable the AIS poststorage workflow.
[[poststorage]]
namespace = "default"
taskQueue = "ais"
workflowName = "ais"
```
This project uses Tilt to set up a local environment, building the Docker images in a Kubernetes cluster. It has been tested with k3d, Minikube, and Kind.

A local Kubernetes cluster is required. Tilt can also run with other solutions like Microk8s or Docker for Desktop/Mac, and even against remote clusters; check Tilt's Choosing a Local Dev Cluster and Install documentation for more information on installing these requirements.
Additionally, follow the Manage Docker as a non-root user post-install guide so that you don't have to run Tilt with `sudo`. Note that managing Docker as a non-root user is different from running the docker daemon as a non-root user (rootless).
While we run the services inside a Kubernetes cluster, we recommend installing Go and other tools locally to ease the development process.
Start a local Kubernetes cluster with a local registry. For example, with k3d:

```bash
k3d cluster create preprocessing --registry-create sdps-registry
```

Or using an existing registry:

```bash
k3d cluster create preprocessing --registry-use sdps-registry
```
Make sure `kubectl` is available and configured to use that cluster:

```bash
kubectl config view
```
Clone this repository and move into its folder if you have not done that previously:

```bash
git clone git@github.com:artefactual-sdps/preprocessing-sfa.git
cd preprocessing-sfa
```
Bring up the environment:

```bash
tilt up
```
While the Docker images are built/downloaded and the Kubernetes resources are created, hit `space` to open the Tilt UI in your browser. Check the Tilt UI documentation to learn more about it.
Tilt, by default, will watch for file changes in the project folder, and it will sync those changes, rebuild the Docker images, and recreate the resources when necessary. However, we have disabled auto-load within the Tiltfile to reduce the use of hardware resources. There are refresh buttons on each resource in the Tilt UI that allow triggering manual updates and re-executing jobs and local resources. You can also set the `trigger_mode` env string to `TRIGGER_MODE_AUTO` within your local `.tilt.env` file to override this change and enable auto mode.
Run `ctrl-c` on the terminal where `tilt up` is running and stop the cluster with:

```bash
k3d cluster stop preprocessing
```
To start the environment again:

```bash
k3d cluster start preprocessing
tilt up
```
Check the Tilt UI helpers below if you only want to flush the existing data.
To remove the resources created by Tilt in the cluster, execute:

```bash
tilt down
```
Note that it will take some time to delete the persistent volumes when you run `tilt down`, and that flushing the existing data does not delete the cluster. To delete the volumes immediately, delete the cluster instead. Deleting the cluster removes all its resources immediately, including the cluster containers on the host. With k3d, run:

```bash
k3d cluster delete preprocessing
```
A few configuration options can be changed by having a `.tilt.env` file located in the root of the project. Example:

```
TRIGGER_MODE_AUTO=true
```

`TRIGGER_MODE_AUTO` enables live updates on code changes for the preprocessing worker.
In the Tilt UI header there is a cloud icon/button that can trigger the preprocessing workflow. Click the caret to set the path to a file/directory in the host, then click the cloud icon to trigger the workflow.
Also in the Tilt UI header, click the trash button to flush the existing data. This will recreate the MySQL databases and restart the required resources.
The Makefile provides developer utility scripts via command line `make` tasks. Running `make` with no arguments (or `make help`) prints the help message. Dependencies are downloaded automatically.
The debug mode produces more output, including the commands executed; see the `DBG_MAKEFILE=1` example at the end of this document.

The preprocessing workflow runs the following activities:
- Calculate SIP checksum
- Check for duplicate SIP
- Unbag SIP
- Identify SIP structure
- Validate SIP structure
- Validate SIP name
- Verify SIP manifest
- Verify SIP checksums
- Validate SIP files
- Validate logical metadata
- Create premis.xml
- Restructure SIP
- Create identifiers.json
- Other activities
Part 1 of a 2-part activity around duplicate checking - see also: Check for duplicate SIP.

Generates and stores a checksum for the entire SIP, so it can be used to check for duplicates.

- Generate a SHA256 checksum for the incoming package
- Read SIP name
- Store SIP name and checksum in the persistence layer (`sips` table)

- A SHA256 checksum is successfully generated for the SIP
- The SIP name and generated checksum are stored in the persistence layer
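As a rough illustration of this step, here is a minimal Go sketch that streams a packaged SIP through SHA-256; the function name and the single-file input are assumptions for the example, not the repository's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksumSIP streams a packaged SIP through SHA-256 and returns the
// hex-encoded digest. Hypothetical helper, not the repository's code.
func checksumSIP(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := checksumSIP("SIP_20240101_office_ref.zip") // example path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}
```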
Part 2 of a 2-part activity around duplicate checking - see also: Calculate SIP checksum.

Determines if an identical SIP has previously been ingested.

- Use the generated checksum from part 1 to search for an existing match in the `sips` database table
  - If an existing match is found, return a content error for a duplicate SIP and terminate the workflow
  - Else, continue to the next activity

- The activity is able to read the generated checksum and the `sips` database table
- No matching checksum is found in the `sips` database table
Extracts the contents of the bag. This activity only runs if the SIP is a BagIt bag; otherwise it is skipped.

- Check if SIP is a bag
  - If yes, extract the contents of the bag for additional ingest processing
  - Else, skip

- Bag is successfully extracted
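A BagIt bag can be recognized by the `bagit.txt` declaration at its root; a minimal check in Go might look like this (a sketch only, not the activity's actual implementation, which likely relies on Enduro's bag tooling):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isBag reports whether dir looks like a BagIt bag by checking for
// the bagit.txt declaration file at its root.
func isBag(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "bagit.txt"))
	return err == nil && !info.IsDir()
}

func main() {
	fmt.Println(isBag("./sip")) // example path
}
```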
Determines the SIP type by analyzing the name and distinguishing features of the package, based on eCH-0160 requirements and other internal policies.
Package types include:
- BornDigitalSIP
- DigitizedSIP
- BornDigitalAIP
- DigitizedAIP
- Base type is BornDigitalSIP; assume this is the SIP type unless other conditions are met
- Check if the package contains a `Prozess_Digitalisierung_PREMIS.xml` file
  - If yes, it is a Digitized package - either DigitizedSIP or DigitizedAIP
- Check if the package contains an `additional` directory
  - If yes, it is a migration AIP - either BornDigitalAIP or DigitizedAIP
- Compare check results and determine package type
- Package is successfully identified as one of the 4 supported types
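To make the decision table concrete, here is a hedged Go sketch of the two checks and the resulting type; the layout assumptions (`additional` at the package root, a recursive search for the PREMIS file) are illustrative only:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

type packageType string

const (
	bornDigitalSIP packageType = "BornDigitalSIP"
	digitizedSIP   packageType = "DigitizedSIP"
	bornDigitalAIP packageType = "BornDigitalAIP"
	digitizedAIP   packageType = "DigitizedAIP"
)

// identify applies the checks described above: a
// Prozess_Digitalisierung_PREMIS.xml file marks a digitized package,
// and an "additional" directory marks a migration AIP.
func identify(root string) packageType {
	digitized := false
	filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err == nil && !d.IsDir() && d.Name() == "Prozess_Digitalisierung_PREMIS.xml" {
			digitized = true
		}
		return nil
	})
	info, err := os.Stat(filepath.Join(root, "additional"))
	isAIP := err == nil && info.IsDir()

	switch {
	case digitized && isAIP:
		return digitizedAIP
	case digitized:
		return digitizedSIP
	case isAIP:
		return bornDigitalAIP
	default:
		return bornDigitalSIP
	}
}

func main() { fmt.Println(identify("./sip")) }
```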
Ensures that the SIP directory structure conforms to eCH-0160 specifications, that no empty directories are included, and that there are no disallowed characters used in file and directory names.
Note: Character restrictions for file and directory names are based on some of the requirements of the tools used by Archivematica during preservation processing - at present, the file name cleanup steps in Archivematica cannot be modified or disabled without forking. To ensure that SFA package metadata matches the content, this validation check ensures that no disallowed characters are included in file or directory names that might be automatically changed once received by Archivematica.
- Read SIP type from previous activity
- Check for presence of `content` and `header` directories
- Check all file and directory names for invalid characters
- Check for empty directories

- Files and directories only contain valid characters: `A-Z`, `a-z`, `0-9`, or `-_.()`
- SIPs contain `content` and `header` directories
  - If content type is an AIP, it also contains an `additional` directory
- No empty directories are found
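The character and empty-directory rules translate naturally into a walk over the package. A minimal sketch; the regular expression mirrors the allowed set listed above, everything else is an assumption:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
)

// validName matches the allowed character set: A-Z, a-z, 0-9, and -_.()
var validName = regexp.MustCompile(`^[A-Za-z0-9\-_.()]+$`)

// checkStructure reports invalid names and empty directories under root.
func checkStructure(root string) error {
	return filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if p != root && !validName.MatchString(d.Name()) {
			return fmt.Errorf("invalid characters in %q", p)
		}
		if d.IsDir() {
			entries, err := os.ReadDir(p)
			if err != nil {
				return err
			}
			if len(entries) == 0 {
				return fmt.Errorf("empty directory %q", p)
			}
		}
		return nil
	})
}

func main() {
	if err := checkStructure("./sip"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```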
Ensures that submitted SIPs use the required naming convention for the identified package type.
- Read SIP type from previous activity
- Use a regular expression to validate the SIP name based on the identified type

- SIP follows the expected naming convention for its package type:
  - BornDigitalSIP: `SIP_[YYYYMMDD]_[delivering office]_[reference]`
  - DigitizedSIP: `SIP_[YYYYMMDD]_Vecteur_[delivering office]_[reference]`
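The exact regular expressions live in the repository; purely as an illustration, patterns along these lines would accept the two conventions (the character classes for the office and reference segments are guesses):

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; the repository's actual expressions may
// be stricter or looser about the office and reference segments.
var (
	bornDigitalRe = regexp.MustCompile(`^SIP_\d{8}_[A-Za-z0-9]+_[A-Za-z0-9-]+$`)
	digitizedRe   = regexp.MustCompile(`^SIP_\d{8}_Vecteur_[A-Za-z0-9]+_[A-Za-z0-9-]+$`)
)

func validSIPName(name string, digitized bool) bool {
	if digitized {
		return digitizedRe.MatchString(name)
	}
	return bornDigitalRe.MatchString(name)
}

func main() {
	fmt.Println(validSIPName("SIP_20240101_Vecteur_BAR_123", true)) // true
	fmt.Println(validSIPName("SIP_20240101_BAR_123", false))        // true
}
```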
Checks if all files and directories listed in the metadata manifest match those found in the SIP, and that no extra files or directories are found.
- Load SIP metadata manifest into memory
- Parse the manifest contents and return a list of files and directories
- Parse the SIP and return a list of files and directories
- Compare lists
  - Return a list of any missing files found in the manifest but not the SIP
  - Return a list of unexpected files found in the SIP but not the manifest

- There is a matching file or directory for every entry found in the `metadata.xml` (or `UpdatedAreldaMetadata.xml`) manifest
- No unexpected files that are not listed in the manifest are found
Confirms that the checksums included in the metadata manifest match those calculated during validation.
- Check if a given file exists in the manifest
  - If yes, calculate a checksum; else skip
- Compare calculated checksum to manifest checksum

- A checksum calculated using the same algorithm as the one used in the metadata file returns the same value as the one included in the metadata manifest for each file listed
Ensures that files included in the SIP are well-formed and match their format specifications.
- For PDF/As, use VeraPDF to validate against the PDF/A specification
- Note: additional format validation checks will be added in the future
- All files pass validation
Ensures that a logical metadata file is included for AIPs being migrated from DIR, and validates the file against a PREMIS schema file.

Note: this activity uses some custom workflow code and a locally stored copy of the PREMIS schema to run the general Temporal activity `xmlvalidate`.

- Read package type from memory
- If package type is BornDigitalAIP or DigitizedAIP, check for an XML file in the `additional` directory
- If found, validate the XML file against a locally stored copy of the PREMIS schema; fail ingest if any errors are returned

- Logical metadata file is found in the `additional` directory of the package
- Logical metadata file validates against the PREMIS 3.x schema
Generates a PREMIS XML file that captures ingest preservation actions performed by Enduro as PREMIS events for inclusion in the resulting AIP METS file.
NOTE: This activity is broken up into 3 different activity files in `/internal/activities`:

- `add_premis_agent.go`
- `add_premis_event.go`
- `add_premisobjects.go`

The XML output is then assembled via `/internal/premis/premis.go`.
- Review event details for all successful tasks
- Create premis.xml file in a new metadata directory
  - Write PREMIS objects to file
  - Write PREMIS events to file
  - Write PREMIS agents to file

- A `premis.xml` file is successfully generated with ingest events
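For a feel of what the output looks like, here is a pared-down sketch that writes a single PREMIS event with Go's `encoding/xml`; the element set is heavily simplified relative to the real activity:

```go
package main

import (
	"encoding/xml"
	"os"
)

// Minimal, illustrative structures; the real activity writes full
// PREMIS objects, events, and agents via /internal/premis/premis.go.
type premisDoc struct {
	XMLName xml.Name      `xml:"premis"`
	Xmlns   string        `xml:"xmlns,attr"`
	Events  []premisEvent `xml:"event"`
}

type premisEvent struct {
	Type    string `xml:"eventType"`
	Date    string `xml:"eventDateTime"`
	Outcome string `xml:"eventOutcomeInformation>eventOutcome"`
}

func main() {
	doc := premisDoc{
		Xmlns: "http://www.loc.gov/premis/v3",
		Events: []premisEvent{{
			Type:    "validation",
			Date:    "2024-01-01T12:00:00Z",
			Outcome: "valid",
		}},
	}

	if err := os.MkdirAll("metadata", 0o755); err != nil {
		panic(err)
	}
	f, err := os.Create("metadata/premis.xml")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	f.WriteString(xml.Header)
	enc := xml.NewEncoder(f)
	enc.Indent("", "  ")
	if err := enc.Encode(doc); err != nil {
		panic(err)
	}
}
```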
Reorganizes SIP directory structure into a Preservation Information Package (PIP) that the preservation engine (Archivematica) can process.
- Check if a `metadata` directory exists, else create a new `metadata` directory
- Move the `Prozess_Digitalisierung_PREMIS.xml` file to the `metadata` directory
- For AIPs, move the `UpdatedAreldaMetadata.xml` and logical metadata files to the `metadata` directory
- Create an `objects` directory, and in that directory create a sub-directory with the SIP name
- Delete the `xsd` directory and its contents from the `header` directory
- Move the `content` directory into the new `objects` directory
- Create a new `header` directory in `objects`
- Move the `metadata.xml` file into the new `header` directory
- Delete original top-level directories
- XSD files are removed
- Restructured package now has `objects` and `metadata` directories immediately inside the parent container
- All content for preservation is within the `objects` directory
- Enduro-generated PREMIS file is in the `metadata` directory
- For Digitized packages, the `Prozess_Digitalisierung_PREMIS.xml` file is in the `metadata` directory
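Mechanically, the moves boil down to `os.MkdirAll`, `os.Rename`, and `os.RemoveAll` calls; a trimmed sketch under that assumption, with paths and ordering simplified from the list above:

```go
package main

import (
	"os"
	"path/filepath"
)

// restructure sketches the moves described above for a package at
// root with the given SIP name. Simplified; not the activity's code.
func restructure(root, sipName string) error {
	objects := filepath.Join(root, "objects", sipName)

	// metadata/ for the Enduro-generated PREMIS file.
	if err := os.MkdirAll(filepath.Join(root, "metadata"), 0o755); err != nil {
		return err
	}
	// Drop the XSD files shipped under header/xsd.
	if err := os.RemoveAll(filepath.Join(root, "header", "xsd")); err != nil {
		return err
	}
	// Move content/ under objects/<sipName>/.
	if err := os.MkdirAll(objects, 0o755); err != nil {
		return err
	}
	if err := os.Rename(filepath.Join(root, "content"), filepath.Join(objects, "content")); err != nil {
		return err
	}
	// Recreate header/ under objects/<sipName>/ and move metadata.xml into it.
	hdr := filepath.Join(objects, "header")
	if err := os.MkdirAll(hdr, 0o755); err != nil {
		return err
	}
	if err := os.Rename(filepath.Join(root, "header", "metadata.xml"), filepath.Join(hdr, "metadata.xml")); err != nil {
		return err
	}
	// Remove the now-empty original top-level header directory.
	return os.Remove(filepath.Join(root, "header"))
}

func main() {
	if err := restructure("./sip", "SIP_20240101_office_ref"); err != nil {
		panic(err)
	}
}
```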
Extracts original UUIDs from the SIP metadata file and adds them to an `identifiers.json` file placed in the `metadata` directory of the package, for parsing by the preservation engine.
- Parse SIP metadata file
- Extract persistent identifiers and write to memory
- Convert manifest file paths to the restructured PIP file paths
  - Exclude any files in the manifest that aren't found in the PIP
- Using extracted identifiers, generate an `identifiers.json` file that conforms to Archivematica's expectations
- Move generated file to the package `metadata` directory

- An `identifiers.json` file is added to the `metadata` directory of the package
- UUIDs present in the original SIP metadata are maintained and used by the preservation engine during preservation processing
The SFA workflow that invokes the activities listed above (see the preprocessing.go file) also uses a number of other, more general Enduro Temporal activities, including:

- `archiveextract`
- `bagcreate`
- `bagvalidate`
- `ffvalidate`
- `xmlvalidate`
There is also one custom post-preservation workflow activity maintained in this repository:
Extracts all relevant metadata from the SIP and resulting AIP and delivers it to the AIS for synchronization.
- Generate a new XML document that combines the contents of the two source files (the SIP `metadata.xml` or `UpdatedAreldaMetadata.xml` file, and the AIP METS file)
- ZIP the generated file and deposit it in an `ais` MinIO bucket

- Metadata bundle is successfully generated and deposited
- AIS is able to receive and ingest the metadata
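As a small illustration of the bundling step, here is a sketch that zips a single generated file with `archive/zip`; the upload to the `ais` MinIO bucket (e.g. via a MinIO client) is omitted, and the file names are invented for the example:

```go
package main

import (
	"archive/zip"
	"io"
	"os"
)

// zipFile writes src into a new ZIP archive at dst.
func zipFile(src, dst string) error {
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	defer zw.Close()

	w, err := zw.Create(src)
	if err != nil {
		return err
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	_, err = io.Copy(w, in)
	return err
}

func main() {
	if err := zipFile("combined_metadata.xml", "combined_metadata.zip"); err != nil {
		panic(err)
	}
}
```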
The Makefile debug mode example referenced earlier:

```bash
$ make env DBG_MAKEFILE=1
Makefile:10: ***** starting Makefile for goal(s) "env"
Makefile:11: ***** Fri 10 Nov 2023 11:16:16 AM CET
go env
GO111MODULE=''
GOARCH='amd64'
...
```