Releases: dathere/datapusher-plus
2.0.0
[2.0.0] - 2025-04-25
🎉 Data Resource Upload First (DRUF) Workflow is finally here! 🎉
A workflow that flips traditional CKAN data ingestion on its head.
- Instead of filling out the metadata first and then uploading the data, users upload data resources first
- In a few seconds, even for very large datasets, the data is analyzed and validated while statistical metadata is precompiled
- This precompiled metadata is then used by Metadata Formulae defined in the scheming yaml files to either precompute other metadata fields (at both the package & resource levels) or to offer metadata suggestions
- Metadata Formulae use the same powerful Jinja2 template engine that powers CKAN's templating system.
- It comes with an extensible library of Jinja2 filters/functions that can be used in Metadata Formulae, à la Excel (see the sketch after this list).
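As an illustration, here is a minimal sketch of what Metadata Formulae might look like in a scheming yaml file. The `formula` and `suggestion_formula` keys are the two formula types described below; the field names and the `dpp.*` variables standing in for the precompiled statistics are hypothetical placeholders, not the actual DP+ interface:

```yaml
dataset_fields:
  - field_name: record_count          # hypothetical field name
    label: Record Count
    # set the field directly from a precompiled statistic (illustrative variable)
    formula: "{{ dpp.record_count }}"
  - field_name: temporal_coverage     # hypothetical field name
    label: Temporal Coverage
    # offer a suggestion computed from precompiled date-range statistics
    suggestion_formula: "{{ dpp.date_min }} to {{ dpp.date_max }}"
```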
The DRUF reinvents CKAN data ingestion by automatically calculating/suggesting "Automagical Metadata": high-quality, high-resolution metadata that reflects and describes what's INSIDE the dataset (e.g. summary stats, frequency tables, spatial extent, date range, outliers, etc., calculated with Metadata Formulae) in addition to metadata about the dataset FILE (e.g. last updated, file size, owner, format, license, etc., i.e. what's normally found in traditional data catalogs).
Future improvements planned:
- "entry-time" Metadata Formulae
In addition to the two formula types (formula
to set a metadata field directly during creation/update; andsuggestion_formula
to suggest values using the Bootstap Popover UI), we'll add the ability to allow Data Publishers to enter formulas while they're entering metadata - fully embracing the Excel formula UI/UX aesthetic. - DCAT3-optimized reference profiles
Following implementation guidance for both DCAT-US v3 and DCAT-AP 3 scheming profiles with Metadata Formulae to compute recommended and optional properties that allow publishers to more fully take advantage of DCAT3 features and improvements - metadata properties that are often too laborious to manually compile. - Co-Curator AI
"Automagical metadata" is the perfect context for AI engines - as it summarizes even very large datasets in just a few kilobytes. It allows the Co-Curator1 to suggest tags, descriptions, links to related data sets and chat about the corpus WHILE the Data Publisher is curating the data. - Inline Data Validation
Optional ability to infer an initial JSON Schema validation file, and then validate future updates to the dataset using it, leveraging the same blazing-fast qsv engine (validating up to 340,000 records/per second2). - Customizable DRUF Data ingestion pipeline
Currently, there are numerous configuration settings to fine-tune the DRUF data-ingestion pipeline. However, the built-in default pipeline can only be customized to a limit without customizing the code. We will expose hooks that CKAN operators can take advantage of to tailor their DRUF pipelines to meet their requirements, while preserving the ability to access the precompiled statistical metadata that DP+ maintains. - Dynamic loading of Formula filters/functions
So users can share custom Jinja2 filters and functions they developed for their Metadata Formulae. - Inline Data Enrichment
Data can be optionally enriched while it's being ingested from other reference datasets within the same CKAN instance or external sources (e.g. enriched against high value curated sources like the Census; geocoding, etc.) - and more!
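To make the Inline Data Validation item concrete, here is a rough sketch of the kind of qsv commands involved (file names are illustrative, and how DP+ will wire this into the DRUF pipeline is not yet defined):

```sh
# infer an initial JSON Schema from the uploaded CSV
# (qsv writes it alongside the input as data.csv.schema.json)
qsv schema data.csv

# validate a later update of the dataset against that schema
qsv validate updated-data.csv data.csv.schema.json
```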
It took a while for us to bake 2.0.0, but we look forward to picking up the pace and co-innovating with the CKAN ecosystem.
NOTE: To fully experience the DRUF workflow, you'll need to use scheming dataset form pages and apply some CKAN core changes. A detailed installation procedure will be published on the Wiki shortly.
Added
- Data Resource Upload First (DRUF) Workflow
- Enhanced resource validation for DRUF workflow
- Metadata Formulae for precomputing metadata/metadata suggestions
- Spatial file support (GeoJSON and Shapefiles)
- Support for CKAN 2.9 compatibility in CLI operations
- Enhanced error handling and logging for resource uploads
Changed
- Updated CLI interface to work with CKAN 2.9
- Refactored resource upload process to support DRUF workflow
- Improved error messages and user feedback
- Enhanced configuration handling
Fixed
- Various bug fixes and improvements for CKAN 2.9 compatibility
- Resource upload process reliability improvements
Full Changelog: 1.0.4...2.0.0
[^1]: Inspired by the Curator in Ready Player One
[^2]: validate_index benchmark: https://qsv.dathere.com/benchmarks
1.0.4
Full Changelog: 1.0.3...1.0.4
1.0.3
What's Changed
- Ensure we are always using the same token setting for datapusher by @avdata99 in #168
- Fix iconv by @avdata99 in #169
- Fix the api_token config variable and fix for default views creation by @tino097 in #170
- Migration added by @tino097 in #171
- Migration added by @tino097 in #172
Full Changelog: 1.0.2...1.0.3
1.0.2
What's Changed
- Update README file for DP+ as extension by @tino097 in #143
- Fix MANIFEST.in by @pdelboca in #148
- Migrate cli commands by @tino097 in #150
- [dev-v1.0] Fix init db command by @pdelboca in #151
- Config part by @tino097 in #153
- Database migrations by @tino097 in #154
- Update readme by @tino097 in #156
- Fix yaml extension in MANIFEST.in by @pdelboca in #157
- Fix datefmt compatability with qsv in dev-v1.0 by @Zharktas in #161
- Remove obsolete assets by @pdelboca in #159
Full Changelog: 1.0.1...1.0.2
1.0.1
1.0.0
What's Changed
- Convert the datapusher to work as plugin by @tino097 in #73
- Code cleanup by @tino097 in #89
- [72]Rewrite resource url by @TomeCirun in #109
- Feature db models by @tino097 in #120
- Add migration script by @tino097 in #121
- Code cleanup part two by @tino097 in #123
- Rewrite resource URL if it differs from the defined ckan_url by @jhbruhn in #136
- Fixing issues by @tino097 in #137
- Sync with master by @tino097 in #138
Full Changelog: 0.16.4...1.0.0
0.16.4
What's Changed
- sync read buffer with buffer size of copyexpert by @jqnatividad in #128
Full Changelog: 0.16.3...0.16.4
0.16.3
What's Changed
- make COPY_READBUFFER_SIZE a configurable parameter by @jqnatividad in #127
Full Changelog: 0.16.2...0.16.3
0.16.2
CHANGED
- explicitly create a large read buffer when reading CSVs while COPYing files to the datastore (see the sketch below)
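For context, a minimal sketch of the buffered COPY approach, assuming psycopg2 (whose `copy_expert` the earlier "copyexpert" entries refer to); the buffer-size constant, table, and file names are illustrative, not DP+'s actual code:

```python
# Sketch: stream a CSV into the datastore with an explicit large read buffer.
import psycopg2

COPY_READBUFFER_SIZE = 1024 * 1024  # 1 MiB read buffer (illustrative value)

conn = psycopg2.connect("dbname=datastore_default")
with conn, conn.cursor() as cur, open(
    "resource.csv", "rb", buffering=COPY_READBUFFER_SIZE
) as f:
    # copy_expert streams the file into the datastore table; keeping its
    # read size in sync with the file's buffer avoids redundant small reads.
    cur.copy_expert(
        'COPY "resource_table" FROM STDIN WITH (FORMAT CSV, HEADER TRUE)',
        f,
        size=COPY_READBUFFER_SIZE,
    )
```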
Full Changelog: 0.16.1...0.16.2
0.16.1
Fixed:
- fix utf8 encoding check, replacing the NamedTemporaryFile approach with the TemporaryDirectory approach introduced in https://github.com/dathere/datapusher-plus/pull/117/files

NOTE: you'll need to install `uchardet` for the encoding check (`apt-get install uchardet`)
Full Changelog: 0.16.0...0.16.1