-
Notifications
You must be signed in to change notification settings - Fork 59
Description
Just wanted to report a behavior which I assume is intentional but might be misleading (or at least it was for me).
When using Cooler.pixels().fetch("<some_genomic_region>")
, an instance of the class RangeSelector1D
is returned. This selector returns all the pixels where the first bin id belongs to the provided interval. As an example, using the yeast.10kb.cool
file in the tests/data
folder:
from cooler import Cooler
handle = Cooler("yeast.10kb.cool")
print(handle.pixels().fetch("chrII"))
returns:
bin1_id bin2_id count
36311 33 33 2433
36312 33 34 7410
36313 33 35 3444
36314 33 36 475
36315 33 37 1741
... ... ... ...
128531 114 1209 1
128532 114 1221 2
128533 114 1223 2
128534 114 1224 2
128535 114 1225 12
[92225 rows x 3 columns]
which checks out since:
print(handle.extent("chrII")) # results in (33, 115)
The issue (in my opinion) is that it does not return all pixels with at least one bin in the region. There can be pixels where the second bin belongs to the region and yet they are not returned.
print(handle.pixels()[:].query("bin2_id >= 33 and bin2_id < 115 and bin1_id < 33"))
returns:
bin1_id bin2_id count
28 0 33 3
29 0 34 41
30 0 35 48
31 0 36 9
32 0 37 61
... ... ... ...
36110 32 65 1
36111 32 75 1
36112 32 90 2
36113 32 101 1
36114 32 113 1
[2460 rows x 3 columns]
I understand why this might be the case. There is no easy way of indexing the bin2_id
column, therefore finding these pixels would require iterating and filtering all pixels prior to the retrieved slice. This would clash with the lazy fetching and most likely be slow.
Understanding why this behavior might not change, I believe it should be clearly stated in the documentation (I did not find anything about it but I might have missed it), since one could get misleading results. I would be happy to open a PR to either change or document the behavior if needed.