@kleintom kleintom commented Nov 1, 2025

Functionality to add (please correct/amend/etc.)

@mjy has a much fuller/clearer picture of the goals here; these are my scattered recollections of what we discussed.

  • User is on a TaxonName
    • They may even already have a source and maybe citation on the TN
      • They find that source in BHL
      • They want to record the BHL reference information with their TN

The BHL reference should be more than just a URL: it should include a page mapping (from BHL page numbers to article page numbers) and perhaps an OCR mapping as well

  • with eventually maybe a way to send this validated-by-human mapping back to GN/BHL
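
To make the page mapping idea concrete, here is a minimal sketch, assuming a mapping keyed by BHL page id that stores the page label printed in the article plus a human-validated flag. `PageMapEntry`, `PageMapping`, and the ids used are all hypothetical; nothing here exists in TaxonWorks or BHL.

```ruby
# Hypothetical sketch: none of these names exist in TaxonWorks or BHL.
# One entry links a BHL page to the page label printed in the article.
PageMapEntry = Struct.new(:bhl_page_id, :article_page, :validated, keyword_init: true)

class PageMapping
  def initialize
    @entries = {}
  end

  # Record (or overwrite) the printed page label for a BHL page.
  def add(bhl_page_id:, article_page:, validated: false)
    @entries[bhl_page_id] = PageMapEntry.new(
      bhl_page_id: bhl_page_id, article_page: article_page, validated: validated
    )
  end

  # Printed page label for a BHL page id, or nil when unmapped.
  def article_page_for(bhl_page_id)
    @entries[bhl_page_id]&.article_page
  end

  # BHL page ids whose printed label matches (labels can repeat, e.g. plates).
  def bhl_pages_for(article_page)
    @entries.values.select { |e| e.article_page == article_page }.map(&:bhl_page_id)
  end

  # Only human-confirmed entries, i.e. what might later be sent back upstream.
  def validated_entries
    @entries.values.select(&:validated)
  end
end

# Illustrative ids only.
mapping = PageMapping.new
mapping.add(bhl_page_id: 1001, article_page: '12', validated: true)
mapping.add(bhl_page_id: 1002, article_page: '13')
```

Page labels are kept as strings because printed pagination is not always numeric (roman numerals, plates).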

Relation to #4603: as discussed there, likely we'll want to create a 'virtual document' to represent the BHL document, and a (new) CitationDocumentation object to associate the (TN, TN source, BHL copy of TN source) data, which is where the page mapping data etc. will live.

Mainly Claude code - the initial prompt is below. It needed some help with:
* correct BokChoy usage
* how to find and access (with project_token) public TW projects via the API
* differentiating/parsing different types of page numbers from the BHL URL string and TW
* using the correct attributes on hash data returned from the APIs
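
As an illustration of the page-number point, here is a rough sketch of separating the numbers that can appear in a BHL URL. The two URL shapes in the comment are assumptions about BHL's URL scheme rather than something taken from this PR, and `parse_bhl_url` is a hypothetical helper.

```ruby
require 'uri'

# Assumed BHL URL shapes (not taken from this PR; verify against real URLs):
#   https://www.biodiversitylibrary.org/page/<PageID>
#   https://www.biodiversitylibrary.org/item/<ItemID>#page/<sequence>/mode/1up
# Neither number is the page number printed on the article page.
def parse_bhl_url(url)
  uri = URI.parse(url)
  kind, id = uri.path.to_s.split('/').reject(&:empty?).first(2)
  return nil unless %w[page item].include?(kind) && id.to_s.match?(/\A\d+\z/)

  result = { kind: kind.to_sym, id: id.to_i }
  # The fragment of an item URL carries a position in the scan sequence.
  if kind == 'item' && (m = uri.fragment.to_s.match(%r{\Apage/(\d+)}))
    result[:sequence] = m[1].to_i
  end
  result
rescue URI::InvalidURIError
  nil
end
```

Keeping `:id` and `:sequence` as separate keys avoids treating either one as the printed page number, which has to come from BHL metadata or the TW citation instead.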

The initial prompt:
Use metadata from any 2 sources to predict and extend the attributes of either of those sources.  We're reconciling data against each other, and predicting improvements, or matching elements. Ultimately we'll present these to a human user for confirmation, refinement, or selection such that curatorial decisions improve data.

* The TaxonWorks root API for data is at https://sfg.taxonworks.org/api/v1
* The TaxonWorks root API documentation is at https://api.taxonworks.org/
* The GlobalNames BHLNames api documentation is at https://bhlnames.globalnames.org/apidoc/index.html
* The Ruby BHL gem wrapping BHLNames is at https://github.com/SpeciesFileGroup/bok_choy
* The Ruby COL gem, useful for more identifiers on names is at https://github.com/SpeciesFileGroup/colrapi
* The TaxonWorks code base is at https://github.com/SpeciesFileGroup/taxonworks
* The Global names organization at GH is at https://github.com/gnames/

We want to bootstrap the infrastructure with a basic use case.

* User is navigating BHL and finds a page that contains information.
* We use the URL and a taxon name parameter against several APIs, collectively wrapped in a meta-service
* The meta-service queries TaxonWorks API, GlobalNames BHLNames API, and others it might need to resolve the problem
* It seeks to predict the citation/source/reference that the URI refers to.
  * It should return or infer the exact page number as physically indicated for the URI
  * It should confirm the presence of the name string on that page
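
The name-presence check could look something like the following minimal sketch, assuming the page's OCR text is already available as a string. `normalize_for_match` and `name_on_page?` are hypothetical helpers, and the matching is plain substring matching, not fuzzy matching.

```ruby
# Hypothetical helpers: true when the name string occurs in a page's OCR text.
# Matching ignores case, line-break hyphenation, and runs of whitespace,
# since OCR output often splits a binomial across lines.
# It does not correct OCR character errors.
def normalize_for_match(text)
  text.gsub(/-\s*\n\s*/, '') # rejoin words hyphenated at a line break
      .gsub(/\s+/, ' ')
      .strip
      .downcase
end

def name_on_page?(name, ocr_text)
  normalize_for_match(ocr_text).include?(normalize_for_match(name))
end
```

A name split across lines, such as "Apis melli-" followed by "fera", still matches "Apis mellifera" after normalization.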

* As a proof of concept we'll use Ruby to act as the meta-service
 * Use Thor to handle command line parameters
 * Take a name param, and a url param as input
 * Use the referenced APIs as data sources
* Return a list of 5 sources in a ranked order, with the most probable source that the URL comes from at the top
* Return a list of IDs for the TaxonName from at least the TaxonWorks API, and any other IDs you can discover from other APIs (e.g. global names)
* Suggest a diff between the metadata directly tied to the BHL "source" page and the TaxonWorks Source metadata
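
The ranking and diff steps above can be sketched roughly as follows, assuming both sides have already been reduced to plain hashes. The field names in `COMPARED_FIELDS` and the scoring rule (count of agreeing fields) are illustrative assumptions, not the TaxonWorks or BHLNames response schema.

```ruby
# Hypothetical sketch: field names and scoring are illustrative only.
COMPARED_FIELDS = %i[title year volume pages].freeze

def normalize(value)
  value.to_s.strip.downcase
end

# Number of compared fields on which a candidate agrees with the BHL metadata.
def agreement_score(bhl_meta, candidate)
  COMPARED_FIELDS.count do |f|
    !normalize(bhl_meta[f]).empty? && normalize(bhl_meta[f]) == normalize(candidate[f])
  end
end

# Up to `limit` candidates, most probable first. Ties keep input order.
def rank_sources(bhl_meta, candidates, limit: 5)
  candidates.each_with_index
            .sort_by { |c, i| [-agreement_score(bhl_meta, c), i] }
            .first(limit)
            .map(&:first)
end

# Fields where the two records disagree, as { field => { bhl:, taxonworks: } }.
def metadata_diff(bhl_meta, tw_source)
  COMPARED_FIELDS.each_with_object({}) do |f, diff|
    next if normalize(bhl_meta[f]) == normalize(tw_source[f])

    diff[f] = { bhl: bhl_meta[f], taxonworks: tw_source[f] }
  end
end
```

A Thor command taking the name and URL parameters could call `rank_sources` and then print `metadata_diff` for the top candidate. A real scorer would likely weight fields rather than count them equally.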

* A simple, executable Ruby script that ties these together
* When there is no direct path to linking API endpoints, then you should suggest API endpoints that would resolve the problem.  When you do this you must NOT add new functionality that is being encoded in this service, i.e. the new endpoints should be RESTful in nature with respect to complexity

kleintom commented Nov 1, 2025

This is just a small first start with many TODOs.

The main blocker in sight at this point is that Document has multiple required (null: false) document_file_* attributes, which will make it challenging to subclass to a Document::Virtual class that has no local file associations:

taxonworks/db/schema.rb

Lines 744 to 748 in 47b28d3

create_table "documents", id: :serial, force: :cascade do |t|
  t.string "document_file_file_name", null: false
  t.string "document_file_content_type", null: false
  t.integer "document_file_file_size", null: false
  t.datetime "document_file_updated_at", precision: nil, null: false
