Skip to content

bug fix : correct file extension derivation #416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

rraymondgh
Copy link
Contributor

path of the form movie...mkv would derive an extension of null

  • changed regexp to derive extension from [^/.]\.([a-z0-9]+)$ to .\.([a-z0-9]+)$
  • previous regexp would allow any character except period before period extension delimiter. Changed to any character
  • go code used by classifier / protobuf has been made consistent. Unit tests created for this function

closes issue #414

@mgdigital
Copy link
Collaborator

Thanks for this. It needs some thought as the migration may take a very long time (for large DBs) for what is a fairly minor fix...

@rraymondgh
Copy link
Contributor Author

rraymondgh commented May 12, 2025

migration may take a very long time (for large DBs)

I did consider dropping extension columns completely and move to database access layer. This has fringe benefits

  • guarantees consistency - not coded multiple times, once in go and twice in psql
  • a disk space saving - columns and indexes. I haven't conducted a full impact assessment of dropping indexes, at first review I didn't find and queries using indexes of extension columns

@mgdigital
Copy link
Collaborator

mgdigital commented May 26, 2025

All the proposed solutions entail a long migration for a minor fix so this will be a difficult one to move forward.
Postgres 17 allows updating generated columns so that's a possible path forward but we'd have to drop support for postgres <17, and so this fix might have to wait a while.
The index should be getting used by the file type filter.

@rraymondgh
Copy link
Contributor Author

Postgres 17 allows updating generated columns so that's a possible path forward

I've tried Postgres 17 syntax (have been using 17 for quite a while now). I didn't expect it performs any better on changing definition of a generated column versus doing in three steps. For reference, here are timings I get below.

bitmagnet_dev=# \timing
Timing is on.
bitmagnet_dev=# alter table "torrent_files" alter column "extension" set expression as 
    (substring(lower(path) from '[^/.]\.([a-z0-9]+)$')) ;
ALTER TABLE
Time: 104993.464 ms (01:44.993)
bitmagnet_dev=# alter table "torrent_files" drop column "extension";
alter table "torrent_files" add column "extension" text generated always as 
    (substring(lower(path) from '[^/.]\.([a-z0-9]+)$')) stored;
create index on torrent_files (extension);
ALTER TABLE
Time: 18.209 ms
ALTER TABLE
Time: 98311.817 ms (01:38.312)
CREATE INDEX
Time: 9606.450 ms (00:09.606)
bitmagnet_dev=# select count(*) from torrent_files;
  count  
---------
 5948038
(1 row)

Time: 282.374 ms

Closing PR - but not deleting branch. If there is a user where filenames that have a period before extension separator is important they can clone repo and merge branch associated with this closed PR.

@rraymondgh rraymondgh closed this May 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants