Mu-semtech stack for harvesting data produced from the verenigingen-register.
A scheduled job periodically retrieves datasets from the Verenigingen register and stores them in the database. The responsibility of making this data available to consumers is delegated to the delta-producer, which handles publishing and synchronization.
See the `docker-compose.yml` file.
To start this stack, clone this repository and start it using Docker Compose with the following example snippet:

```shell
git clone git@github.com:lblod/app-verenigingen-loket-harvester.git
cd app-verenigingen-loket-harvester
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
```
After starting, you might still have to wait for all the services to boot up, and for migrations to finish running. You can check this by inspecting the Docker logs and waiting for things to settle down.
The first time you start the stack, you will need to create a local account to log in; see the Authentication section.
Once the stack is up and running without errors, you can visit the frontend in a browser at http://localhost.
The setup is not finished yet. If you want to start a harvesting job, you will need additional configuration so `harvest_scraper` can connect to the right endpoint.
We advise you to ask someone, or to copy the configuration from the DEV/QA environment. The following files should be reviewed:

- `docker-compose.override.yml`
- `config/harvest_scraper/private_key_test.pem` (should also be present locally)
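As an illustration only, such an override typically mounts the key into the scraper container. The service name and mount path below are assumptions; copy the real values from DEV/QA rather than from this sketch:

```yaml
# Hypothetical docker-compose.override.yml fragment; copy real values from DEV/QA.
services:
  harvest_scraper:
    volumes:
      # Container-side path is an assumption; check the service's own documentation.
      - ./config/harvest_scraper/private_key_test.pem:/config/private_key_test.pem
```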
Once this is done, you can schedule a harvesting job in the frontend: go to *scheduled jobs*, choose *harvest url*, and provide a dummy URL. No authentication is needed. The job should eventually be scheduled.
Please note: this ingests the data into the database. Publishing it for external consumers is a separate job; see the delta-producers section.
To ensure that the app can share data, it is necessary to set up the producers. We recommend that you first ensure a significant dataset has been harvested. The more data that has been harvested before setting up the producers, the faster the consumers will retrieve their data.
During its initial run, each producer performs a sync operation, which publishes the dataset as a DCAT dataset. This format can easily be ingested by consumers. After this initial sync, the producer switches to 'normal operation' mode, where it publishes delta files whenever new data is ingested.
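As a rough sketch of what consumers receive in 'normal operation' mode: a delta file is a JSON list of changesets, each with `inserts` and `deletes` arrays of triples. The exact payload is defined by the producer services, and the triple below is invented purely for illustration:

```json
[
  {
    "inserts": [
      {
        "subject":   { "type": "uri", "value": "http://data.lblod.info/id/verenigingen/123" },
        "predicate": { "type": "uri", "value": "http://www.w3.org/2000/01/rdf-schema#label" },
        "object":    { "type": "literal", "value": "Example vereniging" }
      }
    ],
    "deletes": []
  }
]
```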
Check `./config/delta-producer/background-job-initiator/config.json` for the exact timings of the healing job.
We do not provide live streaming for performance reasons. This has led us to skip mu-authorization and update Virtuoso directly.
- Only if you are flushing and restarting from scratch, ensure in `./config/delta-producer/background-job-initiator/config.json`:

  ```json
  [
    {
      "name": "verenigingen",
      // (...) other config
      "startInitialSync": false, // changed from 'true' to 'false'
      // (...) other config
    }
  ]
  ```

- Also ensure that some data has been harvested before starting the initial sync.
- Make sure the app is up and running, and the migrations have run.
- In the `./config/delta-producer/background-job-initiator/config.json` file, update the following configuration:

  ```json
  [
    {
      "name": "verenigingen",
      // (...) other config
      "startInitialSync": true, // changed from 'false' to 'true'
      // (...) other config
    }
  ]
  ```

- Restart the service:

  ```shell
  drc restart delta-producer-background-jobs-initiator
  ```

- You can follow the status of the job through the dashboard frontend.
Dumps are used by consumers as a snapshot to start from. This is faster than consuming all deltas. They are generated by the delta-producer-dump-file-publisher, which is triggered by a task created by the delta-producer-background-jobs-initiator. The necessary config is already present in this repository, but you need to enable it by updating the config. It's recommended to set up dumps on a regular interval, preferably at a time when no harvesting is happening.
To enable dumps, edit `./config/delta-producer/background-job-initiator/config.json`, set `disableDumpFileCreation` to `false`, and set the cron pattern as needed:

```json
{
  // (...) other config
  "dumpFileCreationJobOperation": "http://redpencil.data.gift/id/jobs/concept/JobOperation/deltas/deltaDumpFileCreation/verenigingen",
  "initialPublicationGraphSyncJobOperation": "http://redpencil.data.gift/id/jobs/concept/JobOperation/deltas/initialPublicationGraphSyncing/verenigingen",
  "healingJobOperation": "http://redpencil.data.gift/id/jobs/concept/JobOperation/deltas/healingOperation/verenigingen",
  "cronPatternDumpJob": "0 10 0 * * 6", // six fields: sec min hour day month weekday -> Saturdays at 00:10
  "cronPatternHealingJob": "0 0 2 * * *", // daily at 02:00
  "startInitialSync": false,
  "errorCreatorUri": "http://lblod.data.gift/services/delta-producer-background-jobs-initiator-verenigingen",
  "disableDumpFileCreation": false
}
```
Make sure to restart the `background-job-initiator` service after changing the config:

```shell
docker compose restart delta-producer-background-jobs-initiator
```

Dumps will be generated in `data/files/delta-producer-dumps`.
By default, this application requires authentication. You can generate a migration to add a user account by using mu-cli and running the included project script:

```shell
mu script project-scripts generate-account
```

This will generate a migration in `config/migrations/local` to add the user account. Afterwards, make sure to restart the migration service to execute the migration:

```shell
docker compose restart migrations
```
If you wish to run this application without authentication, this is also possible. You'll need to make the following changes:

```diff
# config/authorization/config.ex
  %GroupSpec{
    name: "harvesting",
    useage: [:write, :read_for_write, :read],
-   access: logged_in_user(),
+   access: %AlwaysAccessible{},
```

```diff
# docker-compose.yml
  identifier:
    environment:
-     DEFAULT_MU_AUTH_ALLOWED_GROUPS_HEADER: '[{"variables":[],"name":"public"},{"variables":[],"name":"clean"}]'
+     DEFAULT_MU_AUTH_ALLOWED_GROUPS_HEADER: '[{"variables":[],"name":"public"},{"variables":[],"name":"harvesting"},{"variables":[],"name":"clean"}]'
  frontend:
    environment:
-     EMBER_AUTHENTICATION_ENABLED: "true"
+     EMBER_AUTHENTICATION_ENABLED: "false"
```
In some cases, you might want to trigger the healing job manually by calling the debug endpoint in the delta-producer-background-jobs-initiator:

```shell
# 'drc' is a common alias for 'docker-compose'
drc exec delta-producer-background-jobs-initiator wget --post-data='' http://localhost/verenigingen/healing-jobs
```
The default Virtuoso settings might be too weak if you need to ingest production data. A better config is available and can be used in your `docker-compose.override.yml`:

```yaml
virtuoso:
  volumes:
    - ./data/db:/data
    - ./config/virtuoso/virtuoso-production.ini:/data/virtuoso.ini
    - ./config/virtuoso/:/opt/virtuoso-scripts
```
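For background: Virtuoso memory tuning mostly revolves around a handful of `virtuoso.ini` parameters. The values below follow Virtuoso's generic sizing guidance and are illustrative only; they are not taken from the `virtuoso-production.ini` shipped in this repository:

```ini
; Illustrative values for roughly 8 GB of RAM dedicated to Virtuoso;
; see the Virtuoso performance tuning documentation for sizing guidance.
[Parameters]
NumberOfBuffers = 680000
MaxDirtyBuffers = 500000
```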
Not all required parameters are provided, as these are deployment-specific. See the delta-producer-report-generator repository.
Credentials must be provided. See the deliver-email-service repository.