Follow DOIs. Produce Artifacts. Reverse URLs into DOIs.
## Operation - Collecting data.
Back-fill DOIs to catch up with all DOIs. Dates use the formats understood by the Crossref MDAPI (i.e. `YYYY`, `YYYY-MM` or `YYYY-MM-DD`). For example:

    lein with-profile dev run update-items-many 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08 2016-09 2016-10
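
The list of month arguments can be generated rather than typed by hand. The following is a minimal sketch in Python, assuming only the three date formats named above; `is_mdapi_date` and `month_range` are hypothetical helper names and are not part of this project.

```python
import re

# Syntactic check for YYYY, YYYY-MM or YYYY-MM-DD.
# It does not verify that the day exists in the given month.
DATE_RE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def is_mdapi_date(value):
    """Return True if value looks like YYYY, YYYY-MM or YYYY-MM-DD."""
    return DATE_RE.fullmatch(value) is not None


def month_range(start, end):
    """Return every YYYY-MM from start to end inclusive, each exactly once."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


if __name__ == "__main__":
    # Build the back-fill command line for January to October 2016.
    args = month_range("2016-01", "2016-10")
    print("lein with-profile dev run update-items-many " + " ".join(args))
```

Generating the arguments this way avoids accidentally repeating or skipping a month.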
Daily, update the DOIs.
TODO
Crawl every Item's DOI to find the Resource URL. This will run until it's finished.

    lein with-profile dev run update-resource-urls
Run this once after back-fill. Then run every day.
Take a sample of every referrer domain's resource URLs. Record naïve redirects. Takes approximately 6 hours for 20 samples on 2000 domains.

    sample-naive-redirect-urls
Run this once after back-fill. Then run every day.
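
The timing quoted for this step implies a rough per-request budget. The following is a small worked calculation in Python. It assumes that "20 samples on 2000 domains" means 20 sampled URLs for each domain and that requests run one after another; both are interpretations and are not stated above. The function name is hypothetical.

```python
def seconds_per_request(samples_per_domain, domains, hours):
    """Average time available per request if requests run sequentially."""
    total_requests = samples_per_domain * domains
    return hours * 3600 / total_requests


# Assumption: 20 URLs are sampled from each of the 2000 domains.
total = 20 * 2000
budget = round(seconds_per_request(20, 2000, 6), 2)
print(total, budget)  # 40000 0.54
```

Under those assumptions the run makes 40,000 requests at roughly half a second each, which fits comfortably inside a daily schedule.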
Take a sample of every referrer domain's resource URLs. Record redirects using headless browser.
TODO
Run this once after back-fill. Then run every day.
For those domains that require naïve redirects, follow resource URLs for all Items where data is missing.
TODO
Run this once after back-fill. Then run every day.
For those domains that require browser redirects, follow resource URLs for all Items where data is missing.
TODO
Run this once after back-fill. Then run every day.
derive-heuristics
TODO
Run every day.
Create the Artifact for the domain list, archive it, and upload to the Evidence Service.
TODO
Run this once a week.
Create the Artifact for the URL DOI list, archive it, and upload to the Evidence Service.
TODO
Run this once a week.
Run the service.

    lein with-profile dev run server
The service returns only valid, existing DOIs. It accepts the following input:
- landing page URL
- DOI
- free text
It returns a DOI. The returned DOI is the 'canonical' version, which means that if you pass in a valid DOI, you may get a different DOI back when there is a conflict. The following rules apply:
- if an item is aliased to another, the DOI of the other item is returned
- if two items are registered for the metadata and the abstract of a work (this can happen with SICIs), the DOI for the metadata is returned
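
The two rules above can be sketched as a pair of lookups. This is a minimal illustration in Python, not the service's actual implementation. The function name `canonical_doi`, the two mappings, and the order in which the rules are applied are all assumptions, and the DOIs in the usage example are made up.

```python
def canonical_doi(doi, aliases, abstract_to_metadata):
    """Return the canonical DOI for `doi`.

    aliases: maps an aliased DOI to the DOI it is an alias of.
    abstract_to_metadata: maps a DOI registered for a work's abstract
        to the DOI registered for the same work's metadata.
    """
    # Rule 1: an aliased item resolves to the DOI of the other item.
    doi = aliases.get(doi, doi)
    # Rule 2: prefer the metadata DOI over the abstract DOI.
    # Applying this after rule 1 is an assumption of this sketch.
    doi = abstract_to_metadata.get(doi, doi)
    return doi


# Hypothetical data for illustration only.
aliases = {"10.5555/old": "10.5555/new"}
abstract_to_metadata = {"10.5555/abstract": "10.5555/metadata"}

print(canonical_doi("10.5555/old", aliases, abstract_to_metadata))
print(canonical_doi("10.5555/abstract", aliases, abstract_to_metadata))
print(canonical_doi("10.5555/plain", aliases, abstract_to_metadata))
```

A DOI that matches neither rule is returned unchanged, which is why a valid input usually comes back as itself.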
Run a back-fill process in parallel for a number of dates, e.g.

    lein with-profile dev run update-items-many 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08 2016-09 2016-10
To get the `timeout` utility with Homebrew:

    brew install coreutils
    sudo ln -s /usr/local/bin/gtimeout /usr/local/bin/timeout
Follow link shorteners, and try to reverse the URL into a DOI at each stage of the redirect chain, e.g. https://t.co/VIXpgGrl8p
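
One way to read "try reverse at each stage" is to attempt the URL-to-DOI lookup on every hop of the redirect chain, not only on the final URL. The following is a minimal sketch in Python under that assumption. `next_hop` and `reverse` are hypothetical callables supplied by the caller (for example, an HTTP client that returns the redirect target, and a lookup against the URL DOI list), and the URLs and DOI in the usage example are invented.

```python
def reverse_following_redirects(url, next_hop, reverse, max_hops=10):
    """Try to reverse `url` into a DOI, following redirects one hop at a time.

    next_hop(url) -> the URL this one redirects to, or None if it does not redirect.
    reverse(url)  -> a DOI string if the URL is known, else None.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url is None or url in seen:  # end of the chain, or a redirect loop
            return None
        seen.add(url)
        doi = reverse(url)  # try to reverse at this stage
        if doi is not None:
            return doi
        url = next_hop(url)
    return None  # gave up after max_hops redirects


# Stub data standing in for real HTTP requests and a real lookup table.
hops = {"https://short.example/abc": "https://publisher.example/article/1"}
known = {"https://publisher.example/article/1": "10.5555/12345678"}

print(reverse_following_redirects("https://short.example/abc", hops.get, known.get))
```

Passing the two lookups in as functions keeps the sketch free of network calls, so the hop-by-hop logic can be checked in isolation.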
Sanity checks:
- no aliased DOI is an alias of another
- no doi.org resource URLs
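
Both checks can be expressed as small functions over the collected data. The following is a minimal sketch in Python. It assumes the first check means that the target of an alias must not itself be aliased, and that the data is available as plain dictionaries; the function names, the data layout, and the example values are hypothetical.

```python
from urllib.parse import urlparse


def alias_chain_violations(aliases):
    """Return aliased DOIs whose target is itself aliased to another DOI.

    aliases: maps an aliased DOI to the DOI it is an alias of.
    """
    return sorted(doi for doi, target in aliases.items() if target in aliases)


def doi_org_resource_urls(resource_urls):
    """Return DOIs whose resource URL is still on doi.org or a subdomain of it.

    resource_urls: maps a DOI to the resource URL recorded for it.
    """
    bad = []
    for doi, url in resource_urls.items():
        host = (urlparse(url).hostname or "").lower()
        if host == "doi.org" or host.endswith(".doi.org"):
            bad.append(doi)
    return sorted(bad)


# Hypothetical data for illustration only.
aliases = {"10.5555/a": "10.5555/b", "10.5555/b": "10.5555/c"}
resource_urls = {
    "10.5555/a": "https://publisher.example/a",
    "10.5555/d": "https://dx.doi.org/10.5555/d",
}

print(alias_chain_violations(aliases))
print(doi_org_resource_urls(resource_urls))
```

An empty list from both functions means the data passes the checks.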