digipres-practice-index

An experiment in gathering together sources of information about digital preservation practices

The initial plan is to experiment with using Python to gather useful information sources, starting with iPRES, and then see whether the results can usefully be transformed into something searchable using Datasette or Datasette Lite.

This originally relied on a tool called DVC. Why DVC? Because I wanted to manage how the data is compiled, and I liked the way it handles checking data dependencies. Very #DigiPres... It also offers, e.g., remote storage integration for data sets on Google Drive.

However, having tried both DVC and Snakemake, both seem very difficult to work with: they bring lots of complex dependencies that don't always install easily, and they are over-engineered for this use case. So, instead, build pipelines are managed the old-fashioned way, using Make. There are lots of tutorials for Make, and the Turing Way book has a really good section called Reproducibility with Make.

Development Setup

You need Python 3 and Make.

Clone this repo. Set up a Python 3 virtual env, e.g.

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install .

Optionally, install NLP data required for some analysis/processing (not in production use):

python -m spacy download en_core_web_lg

Local Usage

Build the data:

make

Try the Datasette view:

datasette serve practice.db --setting truncate_cells_html 120

After which you should be able to go to e.g. http://127.0.0.1:8001/practice/publications?_facet=type&_searchmode=raw&_facet=year&_facet_array=creators&_facet_array=institutions&_facet_size=10&_sort=year
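That URL is long, so here is a minimal sketch of how it could be assembled from lists of column names. The function name `publications_url` is made up for illustration and is not part of this repo; the query parameters are the ones used in the URL above.

```python
from urllib.parse import urlencode

# Local Datasette table view, as in the URL above.
BASE_URL = "http://127.0.0.1:8001/practice/publications"


def publications_url(facets=("type", "year"),
                     array_facets=("creators", "institutions"),
                     facet_size=10, sort="year"):
    """Build a faceted Datasette table URL (illustrative helper, not repo code)."""
    # Plain facets first, then facets over JSON-array columns.
    params = [("_facet", name) for name in facets]
    params += [("_facet_array", name) for name in array_facets]
    params += [
        ("_searchmode", "raw"),
        ("_facet_size", str(facet_size)),
        ("_sort", sort),
    ]
    return BASE_URL + "?" + urlencode(params)


print(publications_url())
```

The parameter order differs slightly from the URL above, which should not matter to Datasette.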

Other build targets generate other derivatives. Check the Makefile for details.
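For orientation, here is a rough sketch of the kind of SQLite table the Datasette view above expects. This is not the actual build code: the column names `type`, `year`, `creators` and `institutions` are inferred from the facets in the URL, `title` is an assumption, and `build_db` is a hypothetical helper. The `_facet_array` facets work on columns that hold JSON arrays, so list values are stored as JSON text.

```python
import json
import sqlite3


def build_db(path, records):
    """Write publication records to a SQLite file that Datasette could serve.

    Hypothetical helper: the real schema is whatever the Makefile targets produce.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publications ("
        "title TEXT, type TEXT, year INTEGER, creators TEXT, institutions TEXT)"
    )
    conn.executemany(
        "INSERT INTO publications VALUES (?, ?, ?, ?, ?)",
        [
            (
                r["title"],
                r["type"],
                r["year"],
                json.dumps(r["creators"]),      # JSON array for _facet_array
                json.dumps(r["institutions"]),  # JSON array for _facet_array
            )
            for r in records
        ],
    )
    conn.commit()
    return conn


# Placeholder record, not real data.
records = [
    {
        "title": "Example paper",
        "type": "paper",
        "year": 2022,
        "creators": ["Author, A."],
        "institutions": ["Example Institution"],
    }
]
conn = build_db(":memory:", records)
print(conn.execute("SELECT year, creators FROM publications").fetchall())
```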

Sources of Practice

iPRES

Where are the papers and metadata? The links on https://iPRES-conference.org/ are not complete.

It may make more sense to use JSON to store this data, and use JSON Schema in VSCode to make it easier to edit. The JSON can then be consumed by the gathering scripts as well as being used to generate tabular forms like this.
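As a sketch of that idea, the snippet below turns a small JSON document into a Markdown table. The field names `year` and `platform` and the helper `to_markdown_table` are assumptions for illustration only; the two rows reflect the platforms mentioned in the sections below.

```python
import json

# Hypothetical JSON layout: one object per conference.
CONFERENCES_JSON = """
[
  {"year": 2022, "platform": "OSF"},
  {"year": 2023, "platform": "IDEALS"}
]
"""


def to_markdown_table(records, columns):
    """Render a list of dicts as a Markdown pipe table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = [
        "| " + " | ".join(str(rec.get(col, "")) for col in columns) + " |"
        for rec in records
    ]
    return "\n".join([header, divider] + rows)


conferences = json.loads(CONFERENCES_JSON)
print(to_markdown_table(conferences, ["year", "platform"]))
```

The same parsed `conferences` list could be handed directly to the gathering scripts, so the JSON stays the single source for both uses.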

The information about each iPRES conference is now stored as a set of Markdown+metadata files in the publications repository, and is summarised at http://www.digipres.org/publications/ipres/

PHAIDRA

Seems to contain PDFs of individual contributions and whole-conference proceedings documents.
There are e.g. posters as well as articles and it should be possible to distinguish them.
There do not seem to be any other materials, e.g. links to recordings, etc.
Has a kind of implicit API that seems to expose parts of Solr, e.g. items from the iPRES 2004 collection, which can be simplified to this
Might be easier to just download the CSV for each iPRES collection manually, and then use the object IDs.
Once you have an object ID for an article, it's straightforward to get:

OSF

More recent conferences appear in OSF, which has a much more complicated structure, but allows more types of materials to be stored.
The iPRES 2022 data in OSF has been gathered via a Zotero collection, using sub-collections to make it clear what the publication type is.

IDEALS

The IDEALS service was used for iPRES 2023, and offers an OAI-PMH API endpoint for gathering collections.
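A minimal sketch of harvesting over OAI-PMH, using only the standard library. It is not the code used in this repo. The verb, the `oai_dc` metadata prefix, the `resumptionToken` paging and the XML namespaces come from the OAI-PMH standard; the function names are made up, and the IDEALS endpoint URL and set name are not given here, so `base_url` and `set_spec` are left as parameters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumption token replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", namespaces=NS),
            "titles": [el.text for el in record.iterfind(".//dc:title", NS)],
            "creators": [el.text for el in record.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, ((token or "").strip() or None)


def harvest(base_url, set_spec=None):
    """Yield all records in a set, following resumption tokens page by page."""
    token = None
    while True:
        with urlopen(list_records_url(base_url, set_spec, token)) as response:
            records, token = parse_list_records(response.read())
        yield from records
        if not token:
            break
```

Usage would be something like `for rec in harvest(endpoint_url, set_name): ...`, where both arguments are whatever the IDEALS service documents for the iPRES collection.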

EventsAir

iPRES 2022 used the EventsAir platform, and the JSON file that powered the conference programme has been captured here.
A manual process was necessary to gather and map Zotero entries referring to OSF and match them with EventsAir entries. This code merges these mappings.
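To give a feel for that merge step, here is a rough sketch and not the repo's actual code. All field names (`id`, `key`, `url`, `zotero_key`, `osf_url`) and the function `merge_entries` are assumptions; the hand-built mapping is modelled as a dict from EventsAir entry ID to Zotero item key.

```python
def merge_entries(eventsair_entries, zotero_items, mapping):
    """Join EventsAir programme entries to Zotero items via a manual mapping.

    mapping: EventsAir entry id -> Zotero item key (built by hand).
    Returns (merged entries, ids of entries with no usable mapping).
    """
    zotero_by_key = {item["key"]: item for item in zotero_items}
    merged, unmatched = [], []
    for entry in eventsair_entries:
        key = mapping.get(entry["id"])
        item = zotero_by_key.get(key)
        if item is None:
            # Keep track of gaps so they can be mapped manually later.
            unmatched.append(entry["id"])
            continue
        merged.append({**entry, "zotero_key": key, "osf_url": item.get("url")})
    return merged, unmatched


# Placeholder data, not real programme entries.
entries = [{"id": "e1", "title": "Example talk"}, {"id": "e2", "title": "Unmapped talk"}]
items = [{"key": "Z1", "url": "https://example.org/placeholder"}]
merged, unmatched = merge_entries(entries, items, {"e1": "Z1"})
print(merged, unmatched)
```

Returning the unmatched IDs separately makes it easy to see which entries still need a manual mapping.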

PubPub

iPRES 2023 uses the PubPub platform, and the committee provided a detailed spreadsheet containing the necessary data and pointing to the PubPub items.

Zotero

I (ANJ) set up an iPRES group library on Zotero: https://www.zotero.org/groups/5564150/ipres/items/CB68T53W/library
This has been used to gather the iPRES 2022 data, as setting up suitable groups then adding them via the Zotero browser extension is a reasonably fast way of working.
The other iPRES conferences that make use of OSF could use this approach.
It is an open question as to whether much of this publications data would be better managed as a Zotero library.
