Commit 7e0580e

Expand README.md and update the in-python help with results from the test file.
1 parent 489c1a9 commit 7e0580e

2 files changed

+118
-5
lines changed

README.md

Lines changed: 114 additions & 1 deletion
@@ -1 +1,114 @@
1-
This is an in-development python wrapper around lib2bit.
1+
[![Build Status](https://travis-ci.org/dpryan79/py2bit.svg?branch=master)](https://travis-ci.org/dpryan79/py2bit)
2+
3+
# py2bit
4+
5+
A python extension, written in C, for quick access to [2bit](https://genome.ucsc.edu/FAQ/FAQformat.html#format7) files. The extension uses [lib2bit](https://github.com/dpryan79/lib2bit) for file access.
6+
7+
Table of Contents
8+
=================
9+
10+
* [Installation](#installation)
11+
* [Usage](#usage)
12+
* [Load the extension](#load-the-extension)
13+
* [Open a 2bit file](#open-a-2bit-file)
14+
* [Access the list of chromosomes and their lengths](#access-the-list-of-chromosomes-and-their-lengths)
15+
* [Print file information](#print-file-information)
16+
* [Fetch a sequence](#fetch-a-sequence)
17+
* [Fetch per-base statistics](#fetch-per-base-statistics)
18+
* [A note on coordinates](#a-note-on-coordinates)
19+
20+
# Installation
21+
22+
You can install the extension directly from github with:
23+
24+
pip install git+https://github.com/dpryan79/py2bit
25+
26+
# Usage
27+
28+
Basic usage is as follows:
29+
30+
## Load the extension
31+
32+
>>> import py2bit
33+
34+
## Open a 2bit file
35+
36+
This will work if your working directory is the py2bit source code directory.
37+
38+
>>> tb = py2bit.open("test/foo.2bit")
39+
40+
Note that if you would like to include information about soft-masked bases, you need to manually specify that:
41+
42+
>>> tb = py2bit.open("test/foo.2bit", True)
43+
44+
## Access the list of chromosomes and their lengths
45+
46+
`TwoBit` objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the `chroms()` method.
47+
48+
>>> tb.chroms()
49+
{'chr1': 150L, 'chr2': 100L}
50+
51+
You can directly access a particular chromosome by specifying its name.
52+
53+
>>> tb.chroms("chr1")
54+
150L
55+
56+
The lengths are stored as a "long" integer type, which is why there's an `L` suffix under Python 2. If you specify a nonexistent chromosome, `None` is returned, so the interpreter prints nothing.
57+
58+
>>> tb.chroms("foo")
59+
>>>
60+
61+
## Print file information
62+
63+
The `info()` method returns the following information about a 2bit file and its contents:
64+
65+
* file size, in bytes (`file size`)
66+
* number of chromosomes/contigs (`nChroms`)
67+
* total sequence length, in bases (`sequence length`)
68+
* total number of hard-masked (N) bases (`hard-masked length`)
69+
* total number of soft-masked (lower case) bases (`soft-masked length`).
70+
71+
Note that `soft-masked length` will only be present if `open("file.2bit", True)` is used, since handling soft-masking increases memory requirements and decreases performance.
72+
73+
>>> tb.info()
74+
{'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}
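As a rough illustration of what the two masked-length fields count, the sketch below tallies `N` and lower-case bases in a plain Python string. The helper name `masked_lengths` is hypothetical and not part of py2bit, and the input is only the `chr1` sequence from the test file, so the result covers one chromosome rather than the file-wide totals that `info()` reports.

```python
def masked_lengths(seq):
    """Count hard-masked (N) and soft-masked (lower-case) bases in a string.

    Hypothetical helper for illustration only; py2bit reads these totals
    from the 2bit file rather than from a Python string.
    """
    hard = seq.count("N")                        # hard-masked bases are N
    soft = sum(1 for base in seq if base.islower())  # soft-masked bases are lower case
    return {"hard-masked length": hard, "soft-masked length": soft}


# chr1 from test/foo.2bit: 50 N, 50 called bases (8 of them soft-masked), 50 N.
chr1 = "N" * 50 + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC" + "N" * 50
print(masked_lengths(chr1))
```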
75+
76+
## Fetch a sequence
77+
78+
The sequence of a full or partial chromosome/contig can be fetched with the `sequence()` method.
79+
80+
>>> tb.sequence("chr1")
81+
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'
82+
83+
By default, the whole chromosome/contig is returned. A specific range can also be requested.
84+
85+
>>> tb.sequence("chr1", 24, 74)
86+
NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC
87+
88+
The first number is the (0-based) position on the chromosome/contig where the sequence should begin. The second number is the (1-based) position on the chromosome where the sequence should end.
89+
90+
If the file was opened with soft-masking enabled, lower-case bases may be present in the returned sequence. Specifying a nonexistent chromosome/contig raises a runtime error.
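The `start` and `end` arguments behave like the bounds of a Python slice. The sketch below shows this on a plain string and involves no 2bit file or py2bit call; the `chr1` string is rebuilt from the test-file sequence printed earlier in this section.

```python
# chr1 from test/foo.2bit: 50 N, 50 called bases, 50 N (150 bases in total).
chr1 = "N" * 50 + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC" + "N" * 50

start, end = 24, 74
# A slice is 0-based and half-open, like the range passed to sequence().
region = chr1[start:end]

print(len(region))  # equals end - start
print(region)
```

The slice length is always `end - start`, which is a quick way to check that a requested range covers the number of bases you expect.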
91+
92+
## Fetch per-base statistics
93+
94+
It is often useful to compute the fraction of one or more bases in a chromosome. This can be done with the `frequency()` method.
95+
96+
>>> tb.frequency("chr1")
97+
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}
98+
99+
This returns a dictionary with bases as keys and their frequencies as values. Note that the values will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 `N` in this case). One can also request this information over a particular region.
100+
101+
>>> tb.frequency("chr1", 24, 74)
102+
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}
103+
104+
The start and end positions work as in the `sequence()` method described above.
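To make the arithmetic behind these numbers concrete, here is a minimal pure-Python sketch that reproduces them from a string. It is not how the extension computes frequencies internally; `base_frequency` is a hypothetical helper, and folding soft-masked bases to upper case is an assumption inferred from the example output above (12 of 150 bases are `A` only if lower-case `a` is counted).

```python
from collections import Counter


def base_frequency(seq):
    """Return the fraction of A, C, G and T in seq.

    Hypothetical helper for illustration. N bases count toward the length
    but get no key, so the values need not sum to 1. Lower-case bases are
    folded to upper case (an assumption based on the README output).
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq.upper())
    return {base: counts[base] / len(seq) for base in "ACGT"}


# chr1 from test/foo.2bit: 50 N, 50 called bases, 50 N.
chr1 = "N" * 50 + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC" + "N" * 50

print(base_frequency(chr1))         # whole chromosome
print(base_frequency(chr1[24:74]))  # same region as frequency("chr1", 24, 74)
```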
105+
106+
## Close a file
107+
108+
A `TwoBit` object can be closed with the `close()` method.
109+
110+
>>> tb.close()
111+
112+
# A note on coordinates
113+
114+
0-based half-open coordinates are used by this python module. So to access the value for the first base on `chr1`, one would specify the starting position as `0` and the end position as `1`. Similarly, bases 100 to 115 would have a start of `99` and an end of `115`. This is simply for the sake of consistency with most other bioinformatics packages.
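If your positions come from a 1-based, inclusive source (for example, coordinates copied from a genome browser), the conversion described above can be sketched as follows. The function name `to_half_open` is hypothetical and not provided by py2bit.

```python
def to_half_open(first, last):
    """Convert a 1-based inclusive range to a 0-based half-open (start, end).

    Hypothetical helper for illustration; not part of py2bit.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
    return first - 1, last


print(to_half_open(1, 1))      # first base only
print(to_half_open(100, 115))  # bases 100 to 115
```

The returned pair can be passed as the start and end arguments of `sequence()` or `frequency()`.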

py2bit.h

Lines changed: 4 additions & 4 deletions
@@ -43,15 +43,15 @@ To store soft-masking information:\n\
4343
* The file size, in bytes ('file size').\n\
4444
* The number of chromosomes/contigs ('nChroms').\n\
4545
* The total sequence length ('sequence length').\n\
46-
* The total hard-masked length ('hard-masked').\n\
47-
* The total soft-masked length, if available ('soft-masked').\n\
46+
* The total hard-masked length ('hard-masked length').\n\
47+
* The total soft-masked length, if available ('soft-masked length').\n\
4848
\n\
4949
A base is hard-masked if it is an N and soft-masked if it's lower case. Note that soft-masking is ignored by default (you must specify 'storeMasked=True' when you open the file).\n\
5050
\n\
5151
>>> import py2bit\n\
5252
>>> tb = py2bit.open(\"some_file.2bit\")\n\
5353
>>> tb.info()\n\
54-
{'file size': 160L, 'nChroms': 2L, 'sequence length': 250L, 'hard-masked': 150L, 'soft-masked': 8L}\n\
54+
{'file size': 160L, 'nChroms': 2L, 'sequence length': 250L, 'hard-masked length': 150L, 'soft-masked length': 8L}\n\
5555
>>> tb.close()\n"},
5656
{"close", (PyCFunction)py2bitClose, METH_VARARGS,
5757
"Close a 2bit file.\n\
@@ -130,7 +130,7 @@ bases.\n\
130130
>>> import py2bit\n\
131131
>>> tb = py2bit.open(\"test/test.2bit\")\n\
132132
>>> tb.frequency(\"chr1\")\n\
133-
{'A': 0.08, 'C': 0.08, 'T': 0.08, 'G': 0.08}\n\
133+
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}\n\
134134
>>> tb.frequency(\"chr1\", 24, 74)\n\
135135
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}\n\
136136
>>> tb.close()"},
