Releases: alephdata/ingest-file
4.1.2
This version contains a patch for a security vulnerability in ingest-file, the component that processes files uploaded to Aleph. We recommend that you update Aleph instances you operate to use the latest patched release of ingest-file.
Please find detailed information about the patched vulnerability below.
How to update
If you operate Aleph using Docker Compose, update the ingest-file service in your Docker Compose configuration to use the image ghcr.io/alephdata/ingest-file:4.1.2.
If you operate Aleph using the Helm chart, update the aleph.ingestfile.image.tag value to 4.1.2.
Summary
Previous versions of ingest-file handled 7zip archives containing symbolic links insecurely. When processing 7zip archives, ingest-file followed symbolic links even if they were targeting files outside of the archive. A maliciously crafted archive would allow an attacker to access arbitrary files in the ingest-file container.
Depending on the exact configuration and deployment method, this might include:
- Access to files uploaded to Aleph, if you use the file archive (rather than object storage such as S3 or Google Cloud Storage), because the file archive is mounted into the container.
- Access to environment variables.
- Access to secrets mounted into the container.
Affected versions
All versions of ingest-file prior to 4.1.2 (this release) are affected.
Solution
ingest-file 4.1.2 contains a patch for the security vulnerability. 7zip archives containing symbolic links are now validated and archives containing symbolic links pointing to files outside of the archive are rejected.
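The following is a minimal sketch of this kind of check, not the actual patch shipped in ingest-file. It assumes the archive has already been extracted into a dedicated directory without following links, and the function name find_escaping_symlinks is hypothetical. It walks the extracted tree and reports every symbolic link whose resolved target lies outside that directory.

```python
import os
from pathlib import Path


def find_escaping_symlinks(extract_dir):
    """Return symlinks under extract_dir whose target resolves outside it.

    Hypothetical helper for illustration; not part of ingest-file.
    """
    root = Path(extract_dir).resolve()
    escaping = []
    # followlinks=False: symlinked directories are listed but never entered.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            # realpath resolves relative and chained links, even dangling ones.
            target = Path(os.path.realpath(path))
            if target != root and root not in target.parents:
                escaping.append(path)
    return escaping
```

A caller could reject the whole archive whenever the returned list is non-empty, which matches the behaviour described above: links that stay inside the archive are tolerated, while any link pointing outside causes the archive to be refused.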
Credits
OCCRP would like to thank everyone who identified this vulnerability and contributed to its resolution:
- Responsibly disclosed by InterSecLab
- Patch by Alex Ștefănescu
- Research, Testing, Validation: Alex Ștefănescu, Simon Wörpel, Jan Strozyk, Friedrich Lindenberg
4.1.0
What's Changed
- Update base image to python:3.9 based on debian bookworm by @stchris in #660
- Add Workbook metadata to Table entities by @catileptic in #657
- Improved audio format parsing unit test assertions in e8dd833
- Allow parsing truncated image files in 05ffe34
- Add more frequent keepalives to unreliable download in a99b0e6
Full Changelog: 4.0.2...4.1.0
4.0.2
We're announcing the release of Aleph 4.0.2 (and ingest-file 4.0.2) and highly recommend users of the 4.x branches to update to this release.
What's changed
- Update to servicelayer 1.23.2 which fixes a significant performance regression noticeable especially when there are > 10000 tasks per dataset being processed
Full Changelog: 4.0.1...4.0.2
4.0.1
We're announcing the release of Aleph 4.0.1 (and ingest-file 4.0.1) and highly recommend users of the 4.x branches to update to this release.
What's changed
Bugfix
- Update to servicelayer 1.23.1, which fixes an issue with improper clean-up when a task exhausts its maximum number of retries
Other changes
- chore: Remove project board action by @stchris in #656
- chore: Post release announcements to Discourse by @stchris in #655
Full Changelog: 4.0.0...4.0.1
4.0.0
What's Changed
- Use rabbitmq based task queues for workers by @catileptic and @stchris
This release makes the ingest-file worker into a rabbitmq based worker (see also https://github.com/alephdata/servicelayer/releases/tag/v1.23.0). To highlight some of the changes:
- new settings of the form RABBITMQ_* have been introduced https://github.com/alephdata/servicelayer/blob/131171c137ce2a46d3ca36216b9cd7c2bd70125d/servicelayer/settings.py#L37
- a Redis connection is still needed. Redis is used to coordinate the state of job execution across all workers.
- it is possible to configure the prefetch count for tasks the ingest-file worker will grab at a time (see https://github.com/alephdata/ingest-file/blob/bd321ec7524c15a9ec0396a153f5575856e476f8/ingestors/settings.py#L57)
Other changes
Full Changelog: 3.22.0...4.0.0
3.22.0
Note
Please note that we skipped version 3.21.0. The version preceding this one is therefore 3.20.3.
What's Changed
- Fix multi-line quoted-printable encoded values in vCards by @tillprochaska in #595
- Fix formatting by @stchris in #619
- Bump the dev-dependencies group with 4 updates by @dependabot in #615
- Bump servicelayer[amazon,google] from 1.22.1 to 1.22.2 by @dependabot in #614
- Bump rarfile from 4.1 to 4.2 by @dependabot in #611
- Bump sentry-sdk from 1.39.1 to 2.0.1 by @dependabot in #610
- Introduce a setting to disable sending ProcessingExceptions to Sentry by @stchris in #607
- Bump icalendar from 5.0.11 to 5.0.12 by @dependabot in #602
- Bump google-cloud-vision from 3.5.0 to 3.7.2 by @dependabot in #601
- Bump followthemoney from 3.5.8 to 3.5.9 by @dependabot in #581
- Bump click from 8.1.6 to 8.1.7 by @dependabot in #517
- Bump followthemoney-store[postgresql] from 3.0.6 to 3.1.0 by @dependabot in #598
Full Changelog: 3.20.3...3.22.0
3.20.3
Please refer to the release notes for Aleph 3.15.6 for detailed information.
3.20.2
What's Changed
- Fix TIFF processing by @catileptic in #587
- There was an issue with some types of TIFF files not being properly previewed and OCRd
- Extended test coverage to prevent regressions in OCR for gif, jpg, jp2, tiff, webp
Full Changelog: 3.20.1...3.20.2
3.20.1
What's changed
- Force installing tesserocr from source instead of using wheels because of sirfz/tesserocr#337. This fixes a regression which might have caused certain image file types to not have been OCRd.
- Add a clear-cache command to the ingestors CLI, which allows one to clear the ingest cache. It also takes a prefix (for instance ocr: or pdf:).
Full Changelog: 3.20.0...3.20.1
3.20.0
What's Changed
- Emit a verbose error when processing a password-protected XLS / XLSX file by @catileptic in #551
- Add Prometheus instrumentation for ingest-file workers by @tillprochaska in #550
- Don't tag test- branches with latest by @catileptic in #561
- Bump ruff from 0.0.286 to 0.1.6 by @dependabot in #559
- Dependabot: remove old ignores and group dev deps by @stchris in #563
- Bump the dev-dependencies group with 3 updates by @dependabot in #564
- Bump lxml from 4.9.3 to 5.0.0 by @dependabot in #572
- Bump the dev-dependencies group with 2 updates by @dependabot in #571
- Bump sentry-sdk from 1.30.0 to 1.39.1 by @dependabot in #570
- Bump google-cloud-vision from 3.4.4 to 3.5.0 by @dependabot in #569
- Bump dbf from 0.99.3 to 0.99.9 by @dependabot in #568
- Bump olefile from 0.46 to 0.47 by @dependabot in #567
- Bump icalendar from 5.0.7 to 5.0.11 by @dependabot in #556
- Bump pyicu from 2.11 to 2.12 by @dependabot in #560
- Bump cryptography from 41.0.4 to 41.0.7 by @dependabot in #553
- Bump pymediainfo from 6.0.1 to 6.1.0 by @dependabot in #549
- Bump normality from 2.4.0 to 2.5.0 by @dependabot in #544
- Bump tesserocr from 2.6.1 to 2.6.2 by @dependabot in #547
- Bump pillow from 10.0.0 to 10.1.0 by @dependabot in #543
- Bump rarfile from 4.0 to 4.1 by @dependabot in #527
- Bump FTM version 3.5.2->3.5.8 by @catileptic in #574
Full Changelog: 3.19.3...3.20.0