@bluegenes bluegenes commented Sep 4, 2025

Here, we add picklist arguments to sig collect so we can select on the collected sigs.

Is there an alternate way to do this for a standalone manifest?

example usage:

/usr/bin/time -v sourmash sig collect /group/ctbrowngrp5/wort/wort-sra/SOURMASH-MANIFEST.csv.gz --picklist wort-largest-10k.idents.txt:ident:ident -F csv -o wort-largest-10k.mf.csv --abspath


== This is sourmash version 4.9.5.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

picking column 'ident' of type 'ident' from 'wort-largest-10k.idents.txt'
loaded 10000 distinct values into picklist.
Loading signature information from /group/ctbrowngrp5/wort/wort-sra/SOURMASH-MANIFEST.csv.gz.
for given picklist, found 10000 matches to 10000 distinct values
saved 30000 manifest rows to 'wort-largest-10k.mf.csv'
        Command being timed: "sourmash sig collect /group/ctbrowngrp5/wort/wort-sra/SOURMASH-MANIFEST.csv.gz --picklist wort-largest-10k.idents.txt:ident:ident -F csv -o wort-largest-10k.mf.csv --abspath"
        User time (seconds): 256.04
        System time (seconds): 21.11
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:41.10
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 15760176
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 24
        Minor (reclaiming a frame) page faults: 4007107
        Voluntary context switches: 4229
        Involuntary context switches: 27503
        Swaps: 0
        File system inputs: 1404408
        File system outputs: 8304
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
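
The log above reports 10000 picklist matches but 30000 saved manifest rows, which is what you would expect if each identifier matches one row per k-mer size. Here is a minimal Python sketch of that selection, assuming an `ident` picklist matches on the first whitespace-delimited token of each row's `name` column. The rows, the identifiers, and the helper names (`ident_of`, `select_by_ident`) are made up for illustration; this is not sourmash's implementation, and a real manifest has more columns than shown here.

```python
import csv
import io

# Toy manifest: made-up rows with only a subset of manifest columns.
MANIFEST_CSV = """\
internal_location,ksize,moltype,scaled,name
a.sig,21,DNA,1000,sampleA toy description
a.sig,31,DNA,1000,sampleA toy description
a.sig,51,DNA,1000,sampleA toy description
b.sig,21,DNA,1000,sampleB toy description
b.sig,31,DNA,1000,sampleB toy description
b.sig,51,DNA,1000,sampleB toy description
"""


def ident_of(name):
    """Return the first whitespace-delimited token of a sketch name."""
    parts = name.split()
    return parts[0] if parts else ""


def select_by_ident(rows, picklist):
    """Keep rows whose ident is in the picklist; also return the idents that matched."""
    kept = [row for row in rows if ident_of(row["name"]) in picklist]
    matched = {ident_of(row["name"]) for row in kept}
    return kept, matched


rows = list(csv.DictReader(io.StringIO(MANIFEST_CSV)))
kept, matched = select_by_ident(rows, {"sampleA"})
print(f"found {len(matched)} matches; kept {len(kept)} manifest rows")
```

In this toy case one picklist value keeps three rows (k=21, 31, 51), the same 1:3 ratio as in the run above.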

...and checking the resultant standalone manifest:

sourmash sig summarize wort-largest-10k.mf.csv

== This is sourmash version 4.9.5.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from 'wort-largest-10k.mf.csv'
path filetype: StandaloneManifestIndex
location: wort-largest-10k.mf.csv
is database? yes
has manifest? yes
num signatures: 30000
** examining manifest...
total hashes: 329265656902
summary of sketches:
   10000 sketches with DNA, k=21, scaled=1000, abund  104490720670 total hashes
   10000 sketches with DNA, k=31, scaled=1000, abund  115363100334 total hashes
   10000 sketches with DNA, k=51, scaled=1000, abund  109411835898 total hashes
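
As a quick consistency check on the summary above, the per-k-size counts should add up to the reported totals. A short Python sketch with the values copied from the output; nothing here calls sourmash.

```python
# Values copied from the `sig summarize` output above: ksize -> (sketches, hashes).
per_ksize = {
    21: (10000, 104490720670),
    31: (10000, 115363100334),
    51: (10000, 109411835898),
}
reported_signatures = 30000
reported_total_hashes = 329265656902

n_sketches = sum(n for n, _ in per_ksize.values())
n_hashes = sum(h for _, h in per_ksize.values())

# Both checks pass: the per-ksize lines account for every row and every hash.
assert n_sketches == reported_signatures
assert n_hashes == reported_total_hashes

for ksize, (n, hashes) in per_ksize.items():
    print(f"k={ksize}: about {hashes / n:,.0f} hashes per sketch on average")
```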

codecov bot commented Sep 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.15%. Comparing base (dab49a4) to head (bd890dc).

Additional details and impacted files
@@           Coverage Diff           @@
##           latest    #3805   +/-   ##
=======================================
  Coverage   88.15%   88.15%           
=======================================
  Files         137      137           
  Lines       22610    22616    +6     
  Branches     2303     2305    +2     
=======================================
+ Hits        19931    19937    +6     
  Misses       2366     2366           
  Partials      313      313           
Flag Coverage Δ
hypothesis-py 25.22% <0.00%> (-0.02%) ⬇️
python 92.60% <100.00%> (+<0.01%) ⬆️
rust 81.63% <ø> (ø)


@bluegenes bluegenes changed the title WIP: add picklist to sig collect MRG: add picklist to sig collect Sep 4, 2025
@bluegenes
Contributor Author

@ctb ready for review. If there's an alternate way to do this, just lmk -- I wasn't sure how and wanted the capacity.

bluegenes commented Sep 5, 2025

Hmm, this doesn't do exactly what I wanted, which is to select on the existing manifest (keeping the internal_location paths within), rather than create abspath references to the existing manifest. It does work for what I need, though!
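
The difference described above can be shown with a toy model. This is only a sketch under assumptions: `select_rows` and `collect_rows` are hypothetical names, a manifest row is modeled as a plain dict, and the "collect-style" behavior is taken from the description in this comment (each output row references the absolute path of the source that was collected). It is not sourmash code.

```python
import os.path


def ident_of(name):
    """Return the first whitespace-delimited token of a sketch name."""
    parts = name.split()
    return parts[0] if parts else ""


def select_rows(rows, picklist):
    """Select matching rows, leaving each internal_location untouched."""
    return [dict(row) for row in rows if ident_of(row["name"]) in picklist]


def collect_rows(rows, picklist, source_path):
    """Select matching rows, but point each one at the collected source file."""
    location = os.path.abspath(source_path)
    return [dict(row, internal_location=location)
            for row in select_rows(rows, picklist)]


# Made-up rows; real manifests carry more columns.
rows = [
    {"internal_location": "sigs/sampleA.sig", "ksize": 31, "name": "sampleA toy"},
    {"internal_location": "sigs/sampleB.sig", "ksize": 31, "name": "sampleB toy"},
]
picklist = {"sampleA"}

# Selection keeps the original path; collection replaces it with the source's abspath.
print(select_rows(rows, picklist)[0]["internal_location"])
print(collect_rows(rows, picklist, "MANIFEST.csv")[0]["internal_location"])
```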

bluegenes commented Sep 8, 2025

Ah, what I was really looking for was sig check -m:

/usr/bin/time -v sourmash sig check data/wort-sra.abspath.SOURMASH-MANIFEST.csv --picklist data/wort-largest-10k.idents.txt:ident:ident -F csv -m data/wort-largest-10k.check-mf.csv --v5 --abspath --picklist-require-all

== This is sourmash version 4.9.4. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

picking column 'ident' of type 'ident' from 'data/wort-largest-10k.idents.txt'
loaded 10000 distinct values into picklist.
loaded 15449955 signatures.
for given picklist, found 10000 matches to 10000 distinct values
wrote 30000 matching manifest rows to 'data/wort-largest-10k.check-mf.csv'
        Command being timed: "sourmash sig check data/wort-sra.abspath.SOURMASH-MANIFEST.csv --picklist data/wort-largest-10k.idents.txt:ident:ident -F csv -m data/wort-largest-10k.check-mf.csv --v5 --abspath --picklist-require-all"
        User time (seconds): 446.50
        System time (seconds): 33.96
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 8:01.88
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16782336
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 2
        Minor (reclaiming a frame) page faults: 4632109
        Voluntary context switches: 3091
        Involuntary context switches: 48428
        Swaps: 0
        File system inputs: 5906208
        File system outputs: 10120
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
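
For a rough side-by-side of the two `/usr/bin/time -v` reports in this thread, the snippet below converts the elapsed times to seconds and the peak resident set sizes to GiB. The numbers are copied from the outputs above; the helper name `wall_clock_seconds` is made up, and the conversion assumes GNU time's "kbytes" are units of 1024 bytes. The two runs used different sourmash versions (4.9.5.dev0 and 4.9.4) and different input manifest files, so this is not a controlled benchmark.

```python
def wall_clock_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss.ss' elapsed-time string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


KIB_PER_GIB = 1024 ** 2

# Elapsed time and maximum resident set size, copied from the two reports.
runs = {
    "sig collect (4.9.5.dev0)": {"elapsed": "4:41.10", "max_rss_kb": 15760176},
    "sig check -m (4.9.4)": {"elapsed": "8:01.88", "max_rss_kb": 16782336},
}

for label, run in runs.items():
    secs = wall_clock_seconds(run["elapsed"])
    gib = run["max_rss_kb"] / KIB_PER_GIB
    print(f"{label}: {secs:.2f} s wall clock, {gib:.2f} GiB peak RSS")
```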

picklist might still be useful for other uses of collect, though?
