Releases: buzzbangorg/bsbang-crawler

0.0.4 alpha

26 Feb 16:22

  • The indexer can now extract values from JSON-LD value objects (those where the value is in the @value property)
  • Made the crawler more robust when it encounters a sitemap that is not valid XML
  • Fixed a bug where the crawler would error if there were blank lines in a URL configuration file
  • Added configuration options for locating the crawl database in places other than data/crawl.db
  • Added configuration options for locating the Solr instance in places other than Solr's default
  • Improved crawling instructions
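
As an illustration of the value-object change above, here is a minimal sketch of how a value held in the @value property can be unwrapped. The function name `jsonld_value` and the sample document are hypothetical and are not taken from the crawler's code.

```python
def jsonld_value(node):
    """Return the plain value of a JSON-LD node (illustrative helper).

    A value object such as {"@value": "BRCA1", "@language": "en"} yields
    "BRCA1"; lists are unwrapped element by element; anything else is
    returned unchanged.
    """
    if isinstance(node, dict) and "@value" in node:
        return node["@value"]
    if isinstance(node, list):
        return [jsonld_value(item) for item in node]
    return node


# Hypothetical document mixing plain strings and value objects.
doc = {
    "@type": "PhysicalEntity",
    "name": {"@value": "BRCA1", "@language": "en"},
    "alternateName": ["BRCC1", {"@value": "FANCS"}],
}
flat = {key: jsonld_value(value) for key, value in doc.items()}
```

With this approach, plain strings and value objects end up in the same form before indexing, so the indexer does not need to care which one a page used.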

Many thanks to @innovationchef, @aswanipranjal and @HaoPatrick for contributions on this release.

0.0.3 alpha

14 Feb 17:26

  • Implemented optional crawled schema properties. These are JSON-LD properties that will be indexed only if present. See bioschema.__init__.py for more details.
  • Implemented property remapping, where an older bioschemas/schema.org property can be mapped to the current one (e.g. PhysicalEntity.biologicalType -> PhysicalEntity.additionalType). This is not very useful yet because bioschemas is young, but it will be needed in the future.
  • Updated default crawl to capture Thing.alternateName from schema.org
  • Converted print statements to logging
  • The crawler can now load sitemaps served over https and sitemaps that lack the xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" declaration
  • Crawling and indexing are now 3 separate stages (crawl, extract, index) with an sqlite3 database storing the JSON-LD in between. This allows us to
    • perform these operations separately
    • re-build the index without needing to re-crawl pages each time
    • crawl very large sites in multiple stages
  • Added bsbang-crawl.py --force-sitemap option
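
The three-stage split described above can be sketched roughly as follows. This is an assumption-laden illustration, not the crawler's actual code: the table names, the function names (`open_db`, `crawl`, `extract`, `index`), the regex-based extraction, and the use of a dict in place of real HTTP fetching and a list in place of Solr are all hypothetical.

```python
import json
import re
import sqlite3

# Hypothetical pattern for embedded JSON-LD script tags.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def open_db(path=":memory:"):
    """Open the intermediate database (schema is an assumption)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS jsonld (url TEXT, body TEXT)")
    return conn


def crawl(conn, fetched):
    """Stage 1: store raw pages. `fetched` maps URL -> HTML, standing in for HTTP."""
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", list(fetched.items()))
    conn.commit()


def extract(conn):
    """Stage 2: pull JSON-LD out of stored pages, skipping invalid JSON."""
    conn.execute("DELETE FROM jsonld")
    for url, html in conn.execute("SELECT url, html FROM pages").fetchall():
        for raw in SCRIPT_RE.findall(html):
            try:
                body = json.dumps(json.loads(raw))
            except json.JSONDecodeError:
                continue
            conn.execute("INSERT INTO jsonld VALUES (?, ?)", (url, body))
    conn.commit()


def index(conn):
    """Stage 3: build index documents from stored JSON-LD (a list stands in for Solr)."""
    docs = []
    for url, body in conn.execute("SELECT url, body FROM jsonld ORDER BY url"):
        doc = json.loads(body)
        if isinstance(doc, dict):
            docs.append({"url": url, "type": doc.get("@type"), "name": doc.get("name")})
    return docs


conn = open_db()
page = '<html><script type="application/ld+json">{"@type": "DataCatalog", "name": "Example"}</script></html>'
crawl(conn, {"https://example.org/a": page})
extract(conn)
docs = index(conn)
```

Because the JSON-LD is kept in the database, `index(conn)` can be run again on its own to rebuild the index without repeating the crawl.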

0.0.2 alpha

19 Oct 10:48

  • Now parsing DataCatalog from front pages as well as the old PhysicalEntity from pages and sitemap XML (will be changed to BioChemEntity sometime). This currently covers only the basics ('@type', 'name', 'url', 'description', 'keywords').
  • Now saving @type information for reuse.
  • Added a default list of sites to crawl, including identifiers.org, guidetopharmacology.org, fairsharing.org and scientificdata.isa-explorer.org