Releases: buzzbangorg/bsbang-crawler
0.0.4 alpha
- Allow the indexer to extract values from jsonld value objects (those where the value is in the @value property)
- Made the crawler more robust when it encounters a sitemap that is not valid XML
- Fixed a bug where the crawler would error if there were blank lines in a URL configuration file
- Added configuration options for locating the crawl database in places other than data/crawl.db
- Added configuration options for locating the Solr instance in places other than Solr's default
- Improved crawling instructions
Many thanks to @innovationchef, @aswanipranjal and @HaoPatrick for contributions on this release.
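As an illustration of the value-object handling mentioned in the 0.0.4 notes, here is a minimal sketch of how a value might be pulled out of a JSON-LD value object. The function name `extract_value` is hypothetical and is not taken from the bsbang-crawler code; the actual indexer may handle this differently.

```python
def extract_value(node):
    """Return the plain value held by a JSON-LD node.

    A JSON-LD value object is a dict that carries its value in the
    "@value" property, e.g. {"@value": "BRCA1", "@language": "en"}.
    Lists are processed element by element; anything else is returned
    unchanged. (Hypothetical helper, not the project's own function.)
    """
    if isinstance(node, dict) and "@value" in node:
        return node["@value"]
    if isinstance(node, list):
        return [extract_value(item) for item in node]
    return node


# A value object is unwrapped; a plain string passes through as-is.
name = extract_value({"@value": "BRCA1", "@language": "en"})
alternate_names = extract_value(["P53", {"@value": "TP53"}])
```

Under these assumptions, a property can be given either as a plain string or as a value object and still end up as the same indexed value.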
0.0.3 alpha
- Implemented optional crawled schema properties. These are json-ld properties that will be indexed only if present. See bioschema.__init__.py for more details
- Implemented properties remapping, where an older bioschemas/schema.org property can be mapped to the current one (e.g. PhysicalEntity.biologicalType -> PhysicalEntity.additionalType). Not so useful now since bioschemas is young, but the kind of thing that will be needed in the future.
- Updated default crawl to capture Thing.alternateName from schema.org
- Converted print statements to logging
- Allow crawler to load sitemaps from https and those without xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" in their sitemap
- Crawling and indexing are now 3 separate stages (crawl, extract, index) with an sqlite3 database storing the json-ld in between. This is to
** allow us to perform these operations separately
** re-build the index without needing to re-crawl pages each time
** crawl very large sites in multiple stages
- Added bsbang-crawl.py --force-sitemap option
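The property remapping described in the 0.0.3 notes could look roughly like the sketch below. The table `PROPERTY_REMAP` and the function `remap_properties` are hypothetical names chosen for illustration; only the biologicalType -> additionalType example comes from the notes, and the sketch assumes `@type` is a single string.

```python
# Hypothetical remapping table keyed by (type, old property name).
# Only this one entry is taken from the release notes.
PROPERTY_REMAP = {
    ("PhysicalEntity", "biologicalType"): "additionalType",
}


def remap_properties(jsonld):
    """Return a copy of a JSON-LD dict with older property names remapped."""
    schema_type = jsonld.get("@type")
    if not isinstance(schema_type, str):
        # Assumption: list-valued or missing @type is left untouched here.
        return dict(jsonld)

    remapped = {}
    for key, value in jsonld.items():
        new_key = PROPERTY_REMAP.get((schema_type, key), key)
        if new_key != key and new_key in jsonld:
            # The current property is already present; keep it and
            # drop the older duplicate rather than overwrite it.
            continue
        remapped[new_key] = value
    return remapped


old_style = {"@type": "PhysicalEntity", "name": "x", "biologicalType": "protein"}
new_style = remap_properties(old_style)
```

In this sketch a document that already uses the current property name wins over the older one, which is one possible design choice rather than the project's confirmed behaviour.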
0.0.2 alpha
- Now parsing DataCatalog from frontpages as well as the old PhysicalEntity from pages and sitemap XML (will be changed to BioChemEntity sometime). This is currently only the basics ('@type', 'name', 'url', 'description', 'keywords').
- Now saving @type information for reuse.
- Added a default list of sites to crawl, including identifiers.org, guidetopharmacology.org, fairsharing.org and scientificdata.isa-explorer.org