Skip to content

Unclear Cooler.pixels().fetch behavior #455

@StefanoCretti

Description

@StefanoCretti

Just wanted to report a behavior which I assume is intentional but might be misleading (or at least it was for me).

When using Cooler.pixels().fetch("<some_genomic_region>"), an instance of the class RangeSelector1D is returned. This selector returns all the pixels where the first bin id belongs to the provided interval. As an example, using the yeast.10kb.cool file in the tests/data folder:

from cooler import Cooler

handle = Cooler("yeast.10kb.cool")
print(handle.pixels().fetch("chrII"))

returns:

        bin1_id  bin2_id  count
36311        33       33   2433
36312        33       34   7410
36313        33       35   3444
36314        33       36    475
36315        33       37   1741
...         ...      ...    ...
128531      114     1209      1
128532      114     1221      2
128533      114     1223      2
128534      114     1224      2
128535      114     1225     12

[92225 rows x 3 columns]

which checks out since:

print(handle.extent("chrII"))  # results in (33, 115)

The issue (in my opinion) is that it does not return all pixels with at least one bin in the region. There can be pixels where the second bin belongs to the region and yet they are not returned.

print(handle.pixels()[:].query("bin2_id >= 33 and bin2_id < 115 and bin1_id < 33"))

returns:

       bin1_id  bin2_id  count
28           0       33      3
29           0       34     41
30           0       35     48
31           0       36      9
32           0       37     61
...        ...      ...    ...
36110       32       65      1
36111       32       75      1
36112       32       90      2
36113       32      101      1
36114       32      113      1

[2460 rows x 3 columns]

I understand why this might be the case. There is no easy way of indexing the bin2_id column, therefore finding these pixels would require iterating and filtering all pixels prior to the retrieved slice. This would clash with the lazy fetching and most likely be slow.

Understanding why this behavior might not change, I believe it should be clearly stated in the documentation (I did not find anything about it but I might have missed it), since one could get misleading results. I would be happy to open a PR to either change or document the behavior if needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions