Hello there!
I am using Siegfried within Archivematica to identify files, and came across an issue that is maybe "minor", but perhaps easy to fix?
When a plain ASCII text file has a .doc extension in its file name, like ASCII_text.doc, Siegfried fails to identify it: it seems to assume it is a Word doc and does not appear to fall back to what's actually in there.
The file command does identify the ASCII_text.doc file as ASCII text 😄
Is there any way we could improve this behavior upstream? Or is this too small to spend time on?
I really think an ASCII text file should not fail to be identified, no matter the filename.
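To illustrate the kind of content-based fallback meant here, a minimal sketch in Python follows. It is an assumption-laden illustration only: it is not how Siegfried or the file command work internally, and the names looks_like_ascii_text and looks_like_ascii_text_file are hypothetical. It simply checks whether every sampled byte is printable ASCII or common whitespace, ignoring the extension entirely.

```python
# Hypothetical helpers for illustration; not part of Siegfried or file(1).

def looks_like_ascii_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or common whitespace."""
    if not data:
        return False  # an empty sample gives no evidence of text
    # Printable ASCII range plus tab, LF, form feed and CR.
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0C, 0x0D}
    return all(b in allowed for b in data)


def looks_like_ascii_text_file(path: str, sample_size: int = 65536) -> bool:
    """Apply the byte check to the first sample_size bytes of a file.

    The file name and extension are deliberately not consulted.
    """
    with open(path, "rb") as fh:
        return looks_like_ascii_text(fh.read(sample_size))
```

With a check like this, a CRLF-terminated text file named something.doc would pass, while a real legacy Word file would not, because its header contains non-ASCII bytes.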
More info:
$ file bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc
bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc: ASCII text, with very long lines (690), with CRLF line terminators
But the Siegfried output for the same file:
$ sf bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc
---
siegfried : 1.11.0
scandate : 2024-07-29T09:58:17Z
signature : archivematica.sig
created : 2023-12-17T15:55:42+01:00
identifiers :
  - name : 'archivematica'
    details : 'wikidata-definitions-3.0.0; extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
---
filename : 'bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc'
filesize : 12736
modified : 2024-04-24T15:36:38Z
errors :
matches :
  - ns : 'archivematica'
    id : 'UNKNOWN'
    format :
    version :
    mime :
    class :
    basis :
    warning : 'no match; possibilities based on extension are x-fmt/42, x-fmt/43, x-fmt/44, x-fmt/131, x-fmt/274, x-fmt/275, x-fmt/276, x-fmt/329, fmt/39, fmt/40, fmt/37, fmt/38, x-fmt/393, x-fmt/394, fmt/473, fmt/609, fmt/754, fmt/892, fmt/1282, fmt/1283, fmt/1688'
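As a possible workaround on the consuming side, the extension-based candidates can be pulled out of the warning text when the id is UNKNOWN. The sketch below is only an illustration under stated assumptions: the match is assumed to be already loaded into a Python dict with the same "id" and "warning" keys as the output above, and extension_candidates is a hypothetical helper, not a Siegfried API.

```python
import re

# PUID pattern as it appears in the warning above, e.g. "x-fmt/42" or "fmt/1688".
PUID_RE = re.compile(r"\b(?:x-)?fmt/\d+\b")


def extension_candidates(match: dict) -> list[str]:
    """Return the PUIDs listed in the warning of an UNKNOWN match.

    Assumes `match` mirrors one entry under "matches" in the output above.
    Returns an empty list when the file was identified.
    """
    if match.get("id") != "UNKNOWN":
        return []
    return PUID_RE.findall(match.get("warning") or "")
```

A caller could then see that every candidate comes from the .doc extension alone and decide to run a content-based check before treating the file as unidentified.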
Thanks for any input on this!