Releases · buzzbangorg/bsbang-crawler · GitHub

26 Feb 16:22

justinccdev

0.0.4 alpha Latest

Allowed the indexer to extract values from JSON-LD value objects (those where the value is in the @value property)
Made crawler more robust when it encounters a sitemap that is not valid XML
Fixed a bug where the crawler would error if there were blank lines in a URL configuration file
Added configuration options for locating the crawl database in places other than data/crawl.db
Added configuration options for locating the Solr instance in places other than Solr's default
Improved crawling instructions
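
As an illustration of the value-object change listed above, here is a minimal sketch of how a JSON-LD value object can be unwrapped. The function name `unwrap_value` and the sample document are assumptions made for this example; they are not taken from the bsbang-crawler source.

```python
def unwrap_value(node):
    """Return the plain value of a JSON-LD node, unwrapping value objects.

    Hypothetical helper for illustration only, not the project's actual code.
    """
    # A value object is a dict that carries its payload in "@value".
    if isinstance(node, dict) and "@value" in node:
        return node["@value"]
    # A property may hold a list that mixes plain values and value objects.
    if isinstance(node, list):
        return [unwrap_value(item) for item in node]
    # Anything else (strings, numbers, nested nodes) is returned unchanged.
    return node


# Made-up sample document, used only to exercise the helper.
doc = {
    "@type": "DataCatalog",
    "name": {"@value": "Example catalog", "@language": "en"},
    "keywords": ["proteins", {"@value": "genes"}],
    "url": "https://example.org",
}

flattened = {key: unwrap_value(value) for key, value in doc.items()}
```

In this sketch, `flattened["name"]` becomes the plain string and `flattened["keywords"]` becomes a list of plain strings, which is the shape an indexer would typically want.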

Many thanks to @innovationchef, @aswanipranjal and @HaoPatrick for contributions on this release.

14 Feb 17:26

justinccdev

0.0.3 alpha

Implemented optional crawled schema properties. These are JSON-LD properties that will be indexed only if present. See bioschema.__init__.py for more details.
Implemented property remapping, where an older bioschemas/schema.org property can be mapped to the current one (e.g. PhysicalEntity.biologicalType -> PhysicalEntity.additionalType). This is not especially useful yet, since bioschemas is young, but it will be needed in the future.
Updated default crawl to capture Thing.alternateName from schema.org
Converted print statements to logging
Allowed the crawler to load sitemaps over https, and sitemaps that lack xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
Crawling and indexing are now 3 separate stages (crawl, extract, index), with an sqlite3 database storing the JSON-LD in between. This allows us to
** perform these operations separately
** rebuild the index without needing to re-crawl pages each time
** crawl very large sites in multiple stages
Added bsbang-crawl.py --force-sitemap option
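
The property remapping described in this release can be sketched roughly as follows. The table name `PROPERTY_REMAP`, the function `remap_properties`, and the sample record are hypothetical, chosen for illustration; the project's real configuration may be structured differently.

```python
# Hypothetical remapping table: old property name -> current property name.
# The single entry mirrors the biologicalType -> additionalType example above.
PROPERTY_REMAP = {
    "biologicalType": "additionalType",
}


def remap_properties(jsonld, remap=PROPERTY_REMAP):
    """Return a copy of a JSON-LD dict with old property names remapped."""
    remapped = {}
    for key, value in jsonld.items():
        new_key = remap.get(key, key)
        # If the current name is already present, keep it and drop the
        # value that came in under the old name.
        if key != new_key and new_key in remapped:
            continue
        remapped[new_key] = value
    return remapped


# Made-up record using the older property name.
old_record = {"@type": "PhysicalEntity", "biologicalType": "protein"}
new_record = remap_properties(old_record)
```

In this sketch a value supplied under the current name takes precedence over one supplied under the old name, whichever order they appear in. That is a design choice made for the example, not a documented behaviour of the crawler.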

19 Oct 10:48

justinccdev

0.0.2 alpha

Now parsing DataCatalog from front pages as well as the old PhysicalEntity from pages and sitemap XML (will be changed to BioChemEntity sometime). This currently covers only the basics ('@type', 'name', 'url', 'description', 'keywords').
Now saving @type information for reuse.
Added a default list of sites to crawl, which includes identifiers.org, guidetopharmacology.org, fairsharing.org, scientificdata.isa-explorer.org
