Releases: dathere/datapusher-plus

2.0.0

25 Apr 13:18

[2.0.0] - 2025-04-25

🎉 Data Resource Upload First (DRUF) Workflow is finally here! 🎉

A workflow that flips traditional CKAN data ingestion on its head.

  • Instead of filling out the metadata first and then uploading the data, users upload data resources first.
  • Within a few seconds, even for very large datasets, the data is analyzed and validated while statistical metadata is precompiled.
  • This precompiled metadata is then used by Metadata Formulae defined in the scheming YAML files, either to precompute other metadata fields (at both the package and resource levels) or to offer metadata suggestions.
  • Metadata Formulae use the same powerful Jinja2 template engine that powers CKAN's templating system.
  • It comes with an extensible library of Jinja2 filters/functions that can be used in Metadata Formulae, à la Excel formulas.
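
To make the idea concrete, here is a minimal sketch of how a Jinja2 expression could turn precompiled statistics into a metadata value. It is an illustration only: the `stats` dictionary, its keys, and the `date_range` filter are hypothetical names invented for this example, not DP+'s actual template context or filter library. Only the Jinja2 calls themselves (`Environment`, `filters`, `from_string`, `render`) are real.

```python
from jinja2 import Environment

# Hypothetical precompiled statistics for one resource. The key names are
# assumptions for illustration, not the context DP+ actually exposes.
stats = {
    "record_count": 1200,
    "columns": {
        "inspection_date": {"min": "2021-01-04", "max": "2024-11-30"},
        "score": {"min": 12, "max": 100, "mean": 87.456},
    },
}


def date_range(column):
    """Hypothetical custom filter: format a column's min/max as a range."""
    return f"{column['min']} to {column['max']}"


env = Environment()
env.filters["date_range"] = date_range  # register the custom filter

# One expression that sets a field directly, one that produces a suggestion.
formula = "{{ stats.columns.inspection_date | date_range }}"
suggestion_formula = (
    "{{ stats.record_count }} records; "
    "mean score {{ stats.columns.score.mean | round(1) }}"
)

temporal_coverage = env.from_string(formula).render(stats=stats)
description_hint = env.from_string(suggestion_formula).render(stats=stats)

print(temporal_coverage)
print(description_hint)
```

Because the expressions are ordinary Jinja2, any filter registered on the environment becomes available to every formula, which is what makes the filter library extensible.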

DRUF reinvents CKAN data ingestion by automatically calculating or suggesting "Automagical Metadata": high-quality, high-resolution metadata that reflects and describes what's INSIDE the dataset (e.g. summary stats, frequency tables, spatial extent, date range and outliers, calculated with Metadata Formulae), in addition to metadata about the dataset FILE (e.g. last updated, file size, owner, format and license, as normally found in traditional data catalogs).

Future improvements planned:

  • "entry-time" Metadata Formulae
    In addition to the two existing formula types (formula, which sets a metadata field directly during creation/update, and suggestion_formula, which suggests values using the Bootstrap Popover UI), we'll add the ability for Data Publishers to enter formulas while they're entering metadata, fully embracing the Excel formula UI/UX aesthetic.
  • DCAT3-optimized reference profiles
    Scheming profiles that follow the implementation guidance for both DCAT-US v3 and DCAT-AP 3, with Metadata Formulae that compute recommended and optional properties. This lets publishers take fuller advantage of DCAT3 features and improvements, including metadata properties that are often too laborious to compile manually.
  • Co-Curator AI
    "Automagical metadata" is the perfect context for AI engines, as it summarizes even very large datasets in just a few kilobytes. It allows the Co-Curator [1] to suggest tags, descriptions and links to related datasets, and to chat about the corpus WHILE the Data Publisher is curating the data.
  • Inline Data Validation
    Optional ability to infer an initial JSON Schema validation file and then validate future updates to the dataset against it, leveraging the same blazing-fast qsv engine (validating up to 340,000 records per second [2]).
  • Customizable DRUF Data ingestion pipeline
    Currently, there are numerous configuration settings to fine-tune the DRUF data-ingestion pipeline. However, the built-in default pipeline can only be customized to a limited extent without modifying the code. We will expose hooks that CKAN operators can use to tailor their DRUF pipelines to their requirements, while preserving access to the precompiled statistical metadata that DP+ maintains.
  • Dynamic loading of Formula filters/functions
    This will let users share the custom Jinja2 filters and functions they develop for their Metadata Formulae.
  • Inline Data Enrichment
    Data can optionally be enriched during ingestion using other reference datasets within the same CKAN instance or external sources (e.g. high-value curated sources like the Census, or geocoding).
  • and more!

It took a while for us to bake 2.0.0, but we look forward to picking up the pace and co-innovating with the CKAN ecosystem.

NOTE: To fully experience the DRUF workflow, you'll need to use scheming dataset form pages and apply some CKAN core changes. A detailed installation procedure will be published on the Wiki shortly.


Added

  • Data Resource Upload First (DRUF) Workflow
    • Enhanced resource validation for DRUF workflow
    • Metadata Formulae for precomputing metadata and metadata suggestions
    • Spatial file support - supports GeoJSON and Shapefiles
  • CKAN 2.9 compatibility in CLI operations
  • Enhanced error handling and logging for resource uploads

Changed

  • Updated CLI interface to work with CKAN 2.9
  • Refactored resource upload process to support DRUF workflow
  • Improved error messages and user feedback
  • Enhanced configuration handling

Fixed

  • Various bug fixes and improvements for CKAN 2.9 compatibility
  • Resource upload process reliability improvements

Contributors

Full Changelog: 1.0.4...2.0.0

  1. Inspired by the Curator in Ready Player One

  2. validate_index benchmark - https://qsv.dathere.com/benchmarks

1.0.4

15 Jan 17:35

Full Changelog: 1.0.3...1.0.4

1.0.3

30 Oct 20:29
8c86c1e

What's Changed

Full Changelog: 1.0.2...1.0.3

1.0.2

16 Sep 19:49

What's Changed

New Contributors

Full Changelog: 1.0.1...1.0.2

1.0.1

22 May 17:01
75a6581

What's Changed

Full Changelog: 1.0.0...1.0.1

1.0.0

06 May 17:55
90a4868
Pre-release

What's Changed

New Contributors

Full Changelog: 0.16.4...1.0.0

0.16.4

23 Jan 17:40

What's Changed

Full Changelog: 0.16.3...0.16.4

0.16.3

23 Jan 17:18

What's Changed

Full Changelog: 0.16.2...0.16.3

0.16.2

23 Jan 16:50

CHANGED

  • Explicitly create a large read buffer when reading CSV files for COPY into the datastore.
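
As a rough illustration of the idea (not the project's actual code), the sketch below opens a CSV with an explicitly sized read buffer and streams it to a sink. The buffer size, the `stream_csv` helper, and the in-memory sink are assumptions made for this example. In a real pipeline the open file object would instead be handed to the database driver's COPY call, for instance psycopg2's `cursor.copy_expert(sql, file)`.

```python
import io
import os
import tempfile

# Assumed buffer size for illustration; the value DP+ uses is not stated here.
COPY_READ_BUFFER_SIZE = 8 * 1024 * 1024  # 8 MiB


def stream_csv(csv_path, sink, buffer_size=COPY_READ_BUFFER_SIZE):
    """Stream csv_path into sink through an explicitly sized read buffer.

    Hypothetical helper: returns the number of bytes transferred.
    """
    total = 0
    # buffering=buffer_size asks Python for a BufferedReader of that size,
    # so the OS is hit with fewer, larger reads than with the default.
    with open(csv_path, "rb", buffering=buffer_size) as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            sink.write(chunk)
            total += len(chunk)
    return total


# Small demonstration with a throwaway file and an in-memory sink.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,alpha\n2,beta\n")
    demo_path = tmp.name

sink = io.BytesIO()
copied = stream_csv(demo_path, sink)
os.remove(demo_path)
print(copied, "bytes streamed")
```

A larger buffer mainly pays off on big files, where it cuts the number of read system calls made while the data is being loaded.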

Full Changelog: 0.16.1...0.16.2

0.16.1

15 Jan 14:13
cc1fe96

Fixed:

NOTE: You'll need to install uchardet for the encoding check (apt-get install uchardet).

Full Changelog: 0.16.0...0.16.1