
Autocorrect misidentified Files from remote sources after Exif/pronom extraction #278

@DiegoPino


What?

I should have come up with this before but here I am, just a boy standing in front of an application/octet-stream knowing it is a tiff (but still does not love me?)

How?

Especially when dealing with remote Files and ill-configured HTTP servers, we have ended up with Files being ingested via AMI and identified/routed to as:document because the headers were absent and we did not even have an extension. But once persisted and saved, and once exif/pronom etc. kicked in, we could get the real format from inside (and more precisely than we could ever have gotten by just fetching and downloading). And all this wonderful tech metadata is there and stored. The issue is that the Drupal File entity is already created and the file is in its final position (probably in S3://, using either just a dot, no extension, or stuff like .bin).

But we have 3 last-minute "signs", things we can act on (in the analogy of the boy standing in front, let's say these are orange flowers and chocolate-covered cherries as a response to a smile).

  • We know we have an application/octet-stream at the dr:mimetype level
  • AND We don't have a real extension. (c'mon, .bin is not real)
  • AND We know flv:exif && pronom inferred from signature mimetype are telling us a different story
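
To make the three "signs" concrete, here is a minimal sketch of the check in Python (not the module's actual code, just an illustration). All function names are hypothetical, and the values are assumed to be already pulled out of dr:mimetype, flv:exif and the pronom output as plain strings. It also assumes that when exif and pronom disagree with each other we do nothing, which is a guess at the safest behavior.

```python
GENERIC_MIMETYPE = "application/octet-stream"
# Assumption: an empty extension (a trailing dot) and ".bin" do not count as real.
PLACEHOLDER_EXTENSIONS = {"", "bin"}


def has_real_extension(filename):
    """Sign 2: True only if the base name ends in a non-placeholder extension."""
    basename = filename.rpartition("/")[2]
    _, dot, ext = basename.rpartition(".")
    return bool(dot) and ext.lower() not in PLACEHOLDER_EXTENSIONS


def inferred_mimetype(exif_mimetype, pronom_mimetype):
    """Sign 3: the specific mimetype exif/pronom report, or None.

    Returns None when neither reports something specific, or when both
    do and they disagree (hypothetical conservative choice).
    """
    specific = {
        m for m in (exif_mimetype, pronom_mimetype)
        if m and m != GENERIC_MIMETYPE
    }
    return specific.pop() if len(specific) == 1 else None


def should_autocorrect(dr_mimetype, filename, exif_mimetype, pronom_mimetype):
    """Return the mimetype to correct to when all three signs hold, else None."""
    if dr_mimetype != GENERIC_MIMETYPE:  # Sign 1 fails
        return None
    if has_real_extension(filename):     # Sign 2 fails
        return None
    return inferred_mimetype(exif_mimetype, pronom_mimetype)
```

With this, a `file.bin` stored as application/octet-stream whose exif and pronom both say image/tiff returns `"image/tiff"`, while a `scan.tif` or a File that already has a specific dr:mimetype returns `None` and is left alone.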

Based on that we could kick off a "save the night and dance at least once" action that, under these conditions, makes a last attempt: it deduces the right extension, renames the file name and the S3 file path (cheaper than deleting/re-ingesting), edits the File entity adding the real exif, changing the URL to the source (same size, same checksum), and moves the file to its right place under as:image or as:audio, who knows. Now the question: should this be a setting? Or are the "signs" enough to try one more time and get flowers, at midnight, at the gas station (or from someone's front yard)?
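
The "deduce the extension, rename, re-route" part of that action could look roughly like the Python sketch below. It only does the string work and does not touch S3 or the File entity. The mimetype-to-extension table is a tiny hypothetical sample, the function names are made up, and falling back to as:document for anything that is not image or audio is an assumption.

```python
# Hypothetical sample table; a real one would cover many more formats.
EXTENSION_BY_MIMETYPE = {
    "image/tiff": "tif",
    "image/jpeg": "jpg",
    "audio/mpeg": "mp3",
    "audio/x-wav": "wav",
    "application/pdf": "pdf",
}
# Assumption: a trailing dot and ".bin" are placeholders to be replaced.
PLACEHOLDER_EXTENSIONS = {"", "bin"}


def corrected_uri(uri, real_mimetype):
    """Return the URI with the deduced extension, or None if we cannot deduce one."""
    ext = EXTENSION_BY_MIMETYPE.get(real_mimetype)
    if ext is None:
        return None
    directory, slash, name = uri.rpartition("/")
    stem, dot, old_ext = name.rpartition(".")
    if dot and old_ext.lower() in PLACEHOLDER_EXTENSIONS:
        name = stem  # drop the trailing "." or ".bin"
    return f"{directory}{slash}{name}.{ext}"


def target_key(real_mimetype):
    """Pick the as: key the file should move under (assumed fallback: as:document)."""
    major = real_mimetype.split("/", 1)[0]
    return {"image": "as:image", "audio": "as:audio"}.get(major, "as:document")
```

For example, `corrected_uri("s3://bucket/ab1/file.bin", "image/tiff")` gives `"s3://bucket/ab1/file.tif"` and `target_key("image/tiff")` gives `"as:image"`. Size and checksum stay the same because only the path changes, so a rename/copy on the storage side should be enough.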

@alliomeria ping. This would be just great, because we could re-process (simply save) ADOs that have this issue instead of patching.
