Robot code for accessioning and preservation of Web Archiving Service Seed and Crawl objects.
[Deprecated] Check the Wiki in the robot-master repo.
To run, use the lyber-core infrastructure, which uses `bundle exec controller boot` to start all robots defined in `config/environments/robots_ENV.yml`.
Various dependencies, including cdxj-indexer (installed via pip3 and uv), are listed in `config/settings.yml` and in shared_configs (the was-robotsxxx branches). To install cdxj-indexer:

```
$ uv sync
```

And then to run it:

```
$ uv run cdxj-indexer --args --follow --here
```

See below.
See the Consul pages in the Web Archival portal, especially the Web Archiving Development Documentation.
Preassembly workflow for web archiving crawl objects (which include WARC or ARC files) that extracts and creates metadata. It consists of these robots:
- build-was-crawl-druid-tree: reads the crawl object content (ARCs or WARCs, plus logs) from the directory defined by the crawl object label, builds a druid tree, and copies the content into the druid tree's content directory (see the sketch after this list).
- end_was_crawl_preassembly: initiates the accessionWF (of common-accessioning).
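The druid tree splits a druid such as bb123cd4567 into nested directories (bb/123/cd/4567/bb123cd4567). A minimal sketch of that step in Ruby, where the base path and the source-directory convention are assumptions rather than the robot's actual configuration:

```ruby
require 'fileutils'

# Hedged sketch: derive the druid-tree content directory and copy crawl files
# into it. The base path and source location are illustrative assumptions.
def druid_tree_content_dir(druid, base)
  id = druid.delete_prefix('druid:')
  aa, nnn, bb, nnnn = id.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/).captures
  File.join(base, aa, nnn, bb, nnnn, id, 'content')
end

source = '/was_unaccessioned_data/jobs/EXAMPLE_CRAWL_LABEL' # from the crawl object label (illustrative)
dest   = druid_tree_content_dir('druid:bb123cd4567', '/dor/workspace')
FileUtils.mkdir_p(dest)
FileUtils.cp(Dir.glob(File.join(source, '*.{warc,arc}.gz')), dest)
```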
Dissemination workflow for web archiving crawl objects. It is kicked off by the common-accessioning end-accession step, which looks up the disseminationWF suitable for this object type from the APO. It consists of these robots:
- warc-extractor: extracts WARC files from WACZ files.
- cdxj-generator: performs the basic indexing of the WARC/ARC files and generates CDXJ files (web archiving index files used by pywb). One CDXJ file is generated for each WARC file; the generated CDXJ files are copied to /web-archiving-stacks.
- cdxj-merge: performs two main tasks: 1) merges the individual CDXJ files generated in the previous step with the main index file (/web-archiving-stacks/data/indexes/cdxj/level0.cdxj) and 2) sorts the newly generated index file (see the sketch below).
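A minimal sketch of the cdxj-merge idea in Ruby, assuming the freshly generated per-WARC CDXJ files sit in a staging directory (an assumption); locking and error handling are omitted:

```ruby
# Hedged sketch of cdxj-merge: fold newly generated per-WARC CDXJ files into
# the main level0 index and keep it sorted (pywb expects a sorted index).
level0   = '/web-archiving-stacks/data/indexes/cdxj/level0.cdxj'
new_cdxj = Dir.glob('/tmp/example-job/*.cdxj') # illustrative staging location

lines = File.exist?(level0) ? File.readlines(level0, chomp: true) : []
new_cdxj.each { |f| lines.concat(File.readlines(f, chomp: true)) }

# CDXJ lines sort lexicographically by SURT key and timestamp.
File.write(level0, lines.sort.uniq.join("\n") + "\n")
```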
Preassembly workflow for web archiving seed objects.
It consists of 4 robots:
- desc-metadata-generator: generates the descMetadata in MODS format for the seed object.
- thumbnail-generator: captures a screenshot of the first memento using Puppeteer and includes it as the main image for the object. This image is used in Argo and SearchWorks' sul-embed. If the robot fails to generate a thumbnail, this shows as an error in Argo.
- content-metadata-generator: generates contentMetadata.xml for the thumbnail by processing the contentMetadata XSLT template against the available thumbnail.jp2 (see the sketch after this list).
- end-was-seed-preassembly: initiates the accessionWF (of common-accessioning) and opens/closes the version for the old object.
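A minimal sketch of the content-metadata-generator transform using Nokogiri's XSLT support; the template path and the shape of the input document are assumptions, not the robot's actual files:

```ruby
require 'nokogiri'

# Hedged sketch: apply an XSLT template to a small source document describing
# the thumbnail to produce contentMetadata.xml. Paths and the input shape are
# illustrative assumptions.
xslt   = Nokogiri::XSLT(File.read('config/contentMetadata.xslt'))
source = Nokogiri::XML(<<~XML)
  <object druid="druid:bb123cd4567">
    <file id="thumbnail.jp2" mimetype="image/jp2"/>
  </object>
XML
File.write('contentMetadata.xml', xslt.transform(source).to_xml)
```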
Workflow to route web archiving objects to wasCrawlDisseminationWF based on content type. Note that the wasDisseminationWF itself is fired off by the accessionWF by using the administrative.disseminationWorkflow value in the APO. For example, if the APO has the following, it'll fire off wasDisseminationWF:
"administrative": {
"disseminationWorkflow": "wasDisseminationWF",
It consists of 1 robot:
- start_special_dissemination: sends objects with content type webarchive-binary to wasCrawlDisseminationWF (see the sketch below).
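A minimal sketch of that routing decision, modeled on a dor-workflow-client style call; the content-type constant, method names, and object shape are assumptions, not the robot's verbatim code:

```ruby
# Hedged sketch of start_special_dissemination's routing check.
WEBARCHIVE_BINARY_TYPE = 'https://cocina.sul.stanford.edu/models/webarchive-binary' # illustrative

def start_special_dissemination(cocina_object, workflow_client)
  return unless cocina_object.type == WEBARCHIVE_BINARY_TYPE

  workflow_client.create_workflow_by_name(cocina_object.externalIdentifier,
                                          'wasCrawlDisseminationWF',
                                          version: cocina_object.version)
end
```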
There is a scheduled task to roll up the level0.cdxj files into level1 each night, plus additional rollups to level2 and level3, monthly and yearly respectively.
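A minimal sketch of one rollup step, reusing the same merge-and-sort idea; every path except level0.cdxj, and the scheduling itself, are assumptions:

```ruby
INDEX_DIR = '/web-archiving-stacks/data/indexes/cdxj'

# Hedged sketch: append the lower-level index into the next level, re-sort it,
# and empty the lower level so it starts fresh for the next period.
def roll_up(from_level, to_level)
  from   = File.join(INDEX_DIR, "level#{from_level}.cdxj")
  to     = File.join(INDEX_DIR, "level#{to_level}.cdxj")
  merged = File.exist?(to) ? File.readlines(to, chomp: true) : []
  merged.concat(File.readlines(from, chomp: true))
  File.write(to, merged.sort.uniq.join("\n") + "\n")
  File.truncate(from, 0)
end

roll_up(0, 1) # nightly; roll_up(1, 2) monthly and roll_up(2, 3) yearly
```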
- Kakadu Proprietary Software Binaries - for JP2 generation
- libvips
- Exiftool
- Puppeteer
- Google Chrome
Download and install demonstration binaries from Kakadu: http://kakadusoftware.com/downloads/
NOTE: If you have upgraded to El Capitan on OS X, you will need to download and re-install the latest version of Kakadu, due to changes made with SIP (System Integrity Protection). These changes moved the old executable binaries to an inaccessible location.
Install libvips:

```
brew install libvips        # macOS (Homebrew)
sudo apt install libvips42  # Debian/Ubuntu
```

Download the latest version of ExifTool from http://www.sno.phy.queensu.ca/~phil/exiftool, then:

```
tar -xf Image-ExifTool-#.##.tar.gz
cd Image-ExifTool-#.##
perl Makefile.PL
make test
sudo make install
```

Install the Node.js dependencies:

```
yarn install
```

- Verify there are no jobs on the was-robots at https://robot-console-stage.stanford.edu/busy
- Clear collections: `rm -rf /web-archiving-stacks/data/collections/*`
- Clear indexes: `rm -rf /web-archiving-stacks/data/indexes/*`
- Clear seeds: `rm -rf /was_unaccessioned_data/seed/*`
- Clear jobs: `rm -rf /was_unaccessioned_data/jobs/*`