digipres/digipres-practice-index

digipres-practice-index

An experiment in gathering together sources of information about digital preservation practices

The initial plan is to experiment with using Python to gather useful information sources, starting with iPRES, and then see if this can usefully be transformed into something searchable using Datasette or Datasette Lite.

This originally relied on a tool called DVC. Why DVC? Because I wanted to manage how the data is compiled, and I liked the way it handles checking data dependencies. Very #DigiPres... It also offers, e.g., remote storage integration for data sets on Google Drive.

However, having tried both DVC and Snakemake, I found them very difficult to work with: lots of complex dependencies that don't always install easily, and over-engineered for this use case. So, instead, build pipelines are managed the old-fashioned way, using Make. There are lots of tutorials for Make, and the Turing Way book has a really good section called Reproducibility with Make.
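The dependency checking that Make provides comes down to comparing modification times: a target is rebuilt when it is missing or older than any of its inputs. A minimal sketch of that rule in Python, for illustration only (the function name is hypothetical, and the real pipeline lives in the Makefile):

```python
from pathlib import Path


def is_stale(target: Path, dependencies: list[Path]) -> bool:
    """Return True if `target` needs rebuilding, following Make's basic rule."""
    # A missing target always needs to be built.
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    # Rebuild if any input was modified more recently than the target.
    return any(dep.stat().st_mtime > target_mtime for dep in dependencies)
```

Make applies this check to every rule, which is why editing one source file only rebuilds the derivatives that depend on it.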

Development Setup

You need Python 3 and Make.

Clone this repo. Set up a Python 3 virtual env, e.g.

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install .

Optionally, install NLP data required for some analysis/processing (not in production use):

python -m spacy download en_core_web_lg

Local Usage

Build the data:

make

Try the Datasette view:

datasette serve practice.db --setting truncate_cells_html 120

After that, you should be able to browse to, e.g., http://127.0.0.1:8001/practice/publications?_facet=type&_searchmode=raw&_facet=year&_facet_array=creators&_facet_array=institutions&_facet_size=10&_sort=year
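That URL is easier to read when it is broken into its query parameters. As a sketch, the same address can be assembled in Python with the standard library; the parameter names and values are taken from the URL above, and the comments give my reading of what each Datasette option does:

```python
from urllib.parse import urlencode

# Table page served by the local Datasette instance.
BASE_URL = "http://127.0.0.1:8001/practice/publications"

# A list of pairs (not a dict) because some keys repeat.
params = [
    ("_facet", "type"),                 # facet on the "type" column
    ("_searchmode", "raw"),             # pass search terms through as raw full-text queries
    ("_facet", "year"),                 # facet on the "year" column
    ("_facet_array", "creators"),       # facet on values inside a JSON array column
    ("_facet_array", "institutions"),
    ("_facet_size", "10"),              # number of values shown per facet
    ("_sort", "year"),                  # sort rows by year, ascending
]

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

Adding or removing a pair from the list is a quick way to try other facets without hand-editing the query string.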

Other build targets generate other derivatives. Check the Makefile for details.

Sources of Practice

iPRES

Where are the papers and metadata? The links on https://iPRES-conference.org/ are not complete.

It may make more sense to use JSON to store this data, and use JSON Schema in VSCode to make it easier to edit. That can then be consumed by the gathering scripts as well as being used to generate tabular forms like this.
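A rough sketch of how JSON source data could feed a tabular form, using only the Python standard library. The field names (`year`, `platform`, `notes`) and the helper `records_to_csv` are hypothetical and are not the repository's actual schema or code:

```python
import csv
import io
import json

# Hypothetical records; the field names are illustrative only.
SAMPLE_JSON = """
[
  {"year": 2022, "platform": "OSF", "notes": "Gathered via Zotero"},
  {"year": 2023, "platform": "IDEALS"}
]
"""


def records_to_csv(records, fieldnames):
    """Render a list of dicts as CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = json.loads(SAMPLE_JSON)
table = records_to_csv(records, ["year", "platform", "notes"])
print(table)
```

The same parsed `records` list could be handed to a gathering script, so the JSON file stays the single place where the data is edited.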

The information about each iPRES conference is now stored as a set of Markdown+metadata files in the publications repository, and is summarised at http://www.digipres.org/publications/ipres/

PHAIDRA

OSF

  • More recent conferences appear in OSF, which has a much more complicated structure, but allows more types of materials to be stored.
  • The iPRES 2022 data in OSF has been gathered via a Zotero collection, using sub-collections to make it clear what the publication type is.
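Using sub-collections to signal the publication type means the type can be recovered by looking up which collection each item sits in. A minimal sketch, assuming data shaped like Zotero Web API JSON (items list their collection keys under `data.collections`, and collections carry a `data.name`); the keys, titles and collection names below are made up for illustration:

```python
def publication_types(items, collections):
    """Map each item key to the names of the collections it belongs to."""
    names = {c["key"]: c["data"]["name"] for c in collections}
    return {
        item["key"]: [names[k] for k in item["data"]["collections"] if k in names]
        for item in items
    }


# Hypothetical sample data, not real entries from the iPRES library.
collections = [
    {"key": "AAAA1111", "data": {"name": "Papers"}},
    {"key": "BBBB2222", "data": {"name": "Posters"}},
]
items = [
    {"key": "ITEM0001", "data": {"title": "Example paper", "collections": ["AAAA1111"]}},
    {"key": "ITEM0002", "data": {"title": "Example poster", "collections": ["BBBB2222"]}},
]

types = publication_types(items, collections)
print(types)
```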

IDEALS

  • The IDEALS service was used for iPRES 2023, and offers an OAI-PMH API endpoint for gathering collections.
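As a sketch of what gathering over OAI-PMH involves: a harvester issues `ListRecords` requests and follows the `resumptionToken` until none is returned. The example below only builds the request URL and parses a canned response, so it runs offline. The endpoint `https://example.org/oai`, the set name and the sample XML are placeholders, not the real IDEALS values:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(endpoint, metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumptionToken must be sent on its own, without other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{endpoint}?{urlencode(params)}"


def parse_titles(xml_text):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.iterfind(".//dc:title", NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return titles, token or None


# Placeholder response, trimmed to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example paper title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

titles, token = parse_titles(SAMPLE_RESPONSE)
first_url = list_records_url("https://example.org/oai", set_spec="col_1")
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

In a real harvest, each URL would be fetched in turn and the loop would stop once `parse_titles` returns `None` for the token.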

EventsAir

  • iPRES 2022 used the EventsAir platform, and the JSON file that powered the conference programme has been captured here.
  • A manual process was necessary to gather and map Zotero entries referring to OSF and match them with EventsAir entries. This code merges these mappings.
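To illustrate the kind of merge involved, here is a minimal sketch that joins two lists of records through a hand-maintained mapping and reports anything left unmatched. It is not the repository's merging code; the function name, field names and sample values are all hypothetical:

```python
def merge_entries(zotero_items, eventsair_items, mapping):
    """Join Zotero and EventsAir records via a manual key-to-id mapping.

    Returns (merged_records, unmatched_zotero_keys).
    """
    eventsair_by_id = {item["id"]: item for item in eventsair_items}
    merged, unmatched = [], []
    for z in zotero_items:
        ea = eventsair_by_id.get(mapping.get(z["key"]))
        if ea is None:
            # No mapping yet, or the mapping points at a missing entry.
            unmatched.append(z["key"])
            continue
        merged.append({
            "zotero_key": z["key"],
            "eventsair_id": ea["id"],
            "title": z["title"],
            "osf_url": z["url"],
            "session": ea["session"],
        })
    return merged, unmatched


# Hypothetical sample data.
zotero_items = [
    {"key": "Z1", "title": "Example paper", "url": "https://osf.io/example1"},
    {"key": "Z2", "title": "Unmapped poster", "url": "https://osf.io/example2"},
]
eventsair_items = [{"id": 101, "session": "Session A"}]
mapping = {"Z1": 101}

merged, unmatched = merge_entries(zotero_items, eventsair_items, mapping)
```

Returning the unmatched keys separately makes it easy to see which entries still need a manual mapping.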

PubPub

  • iPRES 2023 uses the PubPub platform, and the committee provided a detailed spreadsheet containing the necessary data and pointing to the PubPub items.

Zotero

  • I (ANJ) set up an iPRES group library on Zotero: https://www.zotero.org/groups/5564150/ipres/items/CB68T53W/library
  • This has been used to gather the iPRES 2022 data, as setting up suitable groups then adding them via the Zotero browser extension is a reasonably fast way of working.
  • The other iPRES conferences that make use of OSF could use this approach.
  • It is an open question as to whether much of this publications data would be better managed as a Zotero library.
