Commit cc2ffe1

README.md: lots of updates
1 parent b632426 commit cc2ffe1

1 file changed

+112
-43
lines changed

README.md

Lines changed: 112 additions & 43 deletions
@@ -1,12 +1,78 @@
11
# Transferase
2+
23
The transferase system for retrieving methylomes from methbase.
34

45
The transferase program is `xfrase`, which is quicker to type and will
5-
be used below. There are several commands within `xfrase` and the best way to
6-
start is to understand the `dnmtools roi` command, as the
7-
functionality provided by `xfrase` is the same. If you need to learn
8-
about `dnmtools roi` you can find the docs
9-
[here](https://dnmtools.readthedocs.io/en/latest/roi/)
6+
be used below. There are several commands within `xfrase` and the best
7+
way to start is to understand the `dnmtools roi` command, as the
8+
functionality provided by `xfrase` is the same. If you
9+
need to learn about `dnmtools roi` you can find the docs
10+
[here](https://dnmtools.readthedocs.io/en/latest/roi)
11+
12+
## Installing transferase
13+
14+
### Install the pre-compiled binary
15+
16+
If you are on a reasonably recent Linux (i.e., no older than 10
17+
years), then you can install the binary distribution. First
18+
download it like this:
19+
```console
20+
wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.2.0/transferase-0.2.0-Linux.sh
21+
```
22+
23+
Then run the downloaded installer (likely you want to first install it
24+
beneath your home dir):
25+
```console
26+
sh ./transferase-0.2.0-Linux.sh --prefix=${PREFIX}
27+
```
28+
29+
This will prompt you to accept the license, and then it will install
30+
the `xfrase` binaries in `${PREFIX}/bin`, along with some config files
31+
in `${PREFIX}/share`. If you want to install it system-wide, and have
32+
the admin privs, you can do:
33+
```console
34+
sudo sh ./transferase-0.2.0-Linux.sh --prefix=/usr/local
35+
```
36+
37+
If you are on Debian or Ubuntu, and have admin privileges, you can use
38+
the Debian package and then transferase will be tracked in the package
39+
management system. Get the file like this:
40+
```console
41+
wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.2.0/transferase-0.2.0-Linux.deb
42+
```
43+
44+
And then install it like this:
45+
```console
46+
sudo dpkg -i ./transferase-0.2.0-Linux.deb
47+
```
48+
49+
### Building the source
50+
51+
Not recommended unless you know what you are doing. You will need the
52+
following:
53+
54+
* A compiler that can handle most of C++23, one of
55+
- GCC >= [14.2.0](https://gcc.gnu.org/pub/gcc/releases/gcc-14.2.0/gcc-14.2.0.tar.gz)
56+
- Clang >= [20.0.0](https://github.com/llvm/llvm-project.git) (no release as of 12/2024)
57+
* Boost >= [1.85](https://archives.boost.io/release/${BOOST_VERSION}/source/boost_1_85.tar.bz2)
58+
* CMake >= [3.30](https://github.com/Kitware/CMake/releases/download/v3.30.2/cmake-3.30.2.tar.gz)
59+
* ZLib, any version, just install it with `apt install zlib1g-dev`,
60+
`mamba install -c conda-forge zlib`, etc. From source is fast and
61+
easy.
62+
63+
However you install these, remember where you put them and update your
64+
paths accordingly.
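
If you want to confirm the minimum versions listed above before
configuring, here is a minimal sketch. The helper name `version_ge` is
my own (it is not part of transferase), and it assumes your `sort`
supports version sort (`-V`, as in GNU coreutils). The CMake check
assumes the first line of `cmake --version` looks like
`cmake version X.Y.Z`.

```shell
# version_ge HAVE WANT: succeed if version HAVE is at least WANT.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the installed CMake against the 3.30 minimum.
if command -v cmake >/dev/null 2>&1; then
    have=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$have" 3.30; then
        echo "cmake $have: ok"
    else
        echo "cmake $have: too old, need >= 3.30"
    fi
fi
```

The same function can be pointed at the version reported by your
compiler or Boost install, as long as you extract the version string
yourself.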
65+
66+
Since transferase uses CMake to generate the build system, there are
67+
multiple ways to do it, but I like this:
68+
```shell
69+
tar -xf transferase-0.2.0-Source.tar.gz
70+
cd transferase-0.2.0-Source
71+
cmake -B build \
72+
-DCMAKE_BUILD_TYPE=Release # for a faster xfrase
73+
cmake --build build -j64 # i.e., if you have 64 cores
74+
cmake --install build --prefix=${HOME} # or wherever
75+
```
1076

1177
## Make an index file
1278

@@ -16,10 +82,12 @@ reference file name is `hg38.fa`, then do this:
1682
```console
1783
xfrase index -v -g hg38.fa -x hg38.cpg_idx
1884
```
85+
1986
If `hg38.fa` is roughly 3.0G in size, then you should expect the index
2087
file `hg38.cpg_idx` to be about 113M in size. This command will also
2188
create an "index metadata" file `hg38.cpg_idx.json`, which is named by
22-
just adding the `.json` extension to the provided output file.
89+
just adding the `.json` extension to the provided output file. You can
90+
also start with a gzip format file like `hg38.fa.gz`.
2391

2492
## Make a methylome file
2593

@@ -33,52 +101,53 @@ used by `dnmtools`. We again assume that the reference genome is
33101
`SRX012345.xsym.gz` file. This ensures a correspondence between the
34102
`SRX012345.xsym.gz` file and the `hg38.cpg_idx` file we will need.
35103
Here is how to convert such a file into the xfrase format:
36-
37104
```console
38105
xfrase compress -v -x hg38.cpg_idx -m SRX012345.xsym.gz -o SRX012345.m16
39106
```
40107

41-
If the chromosomes appear out-of-order in `hg38.cpg_idx` and
42-
`SRX012345.xsym.gz` an error will be reported. As with the `hg38.cpg_idx`
43-
index file, the methylome file `SRX012345.m16` will be accompanied by
44-
a metadata file with an additional json extension: `SRX012345.m16.json`.
108+
As with the `hg38.cpg_idx` index file, the methylome file
109+
`SRX012345.m16` will be accompanied by a metadata file with an
110+
additional json extension: `SRX012345.m16.json`.
45111

46-
If you begin with a counts format file, for example `SRX012345.sym`,
47-
created using the `dnmtools counts` and then `dnmtools sym` commands,
48-
then you will need to first convert it into `.xsym` format. You can do
49-
this as follows:
112+
If you begin with a
113+
[counts](https://dnmtools.readthedocs.io/en/latest/counts) format
114+
file, for example `SRX012345.sym`, created using the `dnmtools counts`
115+
and then `dnmtools sym` commands, then you will need to first convert
116+
it into `.xsym` format (whether gzipped or not). You can do this as
117+
follows:
50118
```console
51119
dnmtools xcounts -z -o SRX012345.xsym.gz -c hg38.fa -v SRX012345.sym
52120
```
53-
Once again, be sure to always use the same `hg38.fa` file. A hash
54-
function will be used internally to `xfrase` to ensure that the index
55-
and methylome files correspond to the same reference genome file.
56121

57-
## Run the `lookup` command locally
122+
Once again, be sure to always use the same `hg38.fa` file. A hash is
123+
generated and used internally by `xfrase` to ensure that the index and
124+
methylome files correspond to the same reference genome file.
58125

59-
This step is to make sure everything is sensible. You will need a set
60-
of genomic intervals of interest. In this example these will be named
61-
`intervals.bed`. You also need the index file and the methylome file
62-
explained in the above steps.
63-
```console
64-
xfrase lookup local --log-level debug -x hg38.cpg_idx -m SRX012345.m16 -o intervals_local_output.bed -i intervals.bed
65-
```
66-
The index file `hg38.cpg_idx` and the methylome file `SRX012345.m16`
67-
are the same as explained above. At the time of writing, the
68-
`intervals.bed` file must be 6-column bed format, so if yours is only
69-
3 columns, you can use a simple awk command to pad it out as follows:
126+
## Run the `intervals` command locally
127+
128+
This step is to make sure everything is sensible. Or you might just
129+
want to keep using this tool for your own analysis (it's fast). You
130+
will need a set of genomic intervals of interest. In this example
131+
these will be named `intervals.bed`. You also need the index file and
132+
the methylome file explained in the above steps.
70133
```console
71-
awk -v OFS="\t" '{print $1,$2,$3,"X",0,"+"}' intervals.bed3 > intervals.bed
134+
xfrase intervals local -v debug -x hg38.cpg_idx -m SRX012345.m16 -o local_output.bed -i intervals.bed
72135
```
73136

74-
The output in the `intervals_local_output.bed` file should be
75-
consistent with the information in the command:
76-
137+
The index file `hg38.cpg_idx` and the methylome file `SRX012345.m16`
138+
are the same as explained above. The `intervals.bed` file may contain
139+
any number of columns, but the first 3 columns must be in 3-column BED
140+
format: chrom, start, stop for each interval. The output in the
141+
`local_output.bed` file should be consistent with the information in
142+
the command:
77143
```console
78144
dnmtools roi -o intervals.roi intervals.bed SRX012345.xsym.gz
79145
```
80-
The format of the output might be different, but the methylation
146+
147+
The formats of these output files are different, but the methylation
81148
levels on each line (i.e., for each interval) should be identical.
149+
Note that `dnmtools roi` can fail if intervals are nested, while
150+
the `xfrase intervals` command will still work.
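
Since only the first 3 columns of `intervals.bed` matter, a quick
sanity check on those columns can save a confusing failure later. This
is a minimal sketch, not something `xfrase` provides: the function
name `check_bed3` is made up for illustration, and it assumes the file
is tab-separated, has no header or comment lines, and follows the
usual BED convention that start is not greater than stop.

```shell
# check_bed3 FILE: succeed if every non-empty line has at least 3
# tab-separated fields with non-negative integer start/stop and
# start <= stop. Hypothetical helper, not part of xfrase.
check_bed3() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF == 0 { next }   # skip empty lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            printf "line %d: not valid 3-column BED\n", NR > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

You would run it as `check_bed3 intervals.bed` before the `xfrase
intervals` command; it reports each offending line number on stderr
and returns a non-zero status if any are found.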
82151

83152
## Run the `server` command
84153

@@ -94,25 +163,25 @@ relevant genome assembly. For now, using the above examples, we
94163
would have a single index and a single methylome. I will assume these
95164
are in subdirectories, named `indexes` and `methylomes` respectively,
96165
of the current directory. Here is a command that will start the server:
97-
98166
```console
99167
xfrase server -v debug -s localhost -p 5000 -m methylomes -x indexes
100168
```
169+
101170
Note that this will fail with an appropriate error message if port 5000
102171
is already in use, and you can just try 5001, etc., until one is
103-
free. The `-v debug` will ensure you see info beyond just the errors. This
104-
information is logged to the terminal by default.
172+
free. The `-v debug` will ensure you see info beyond just the
173+
errors. This information is logged to the terminal by default.
105174

106-
## Run the `lookup` command remotely
175+
## Run the `intervals` command remotely
107176

108177
We will assume for now that the "remote" server is running on the local
109178
machine (localhost) and using port 5000 (the default). The
110-
following command should give output identical to the earlier `lookup` command:
111-
179+
following command should give output identical to the earlier `intervals` command:
112180
```console
113-
xfrase lookup remote -v debug -s localhost -x indexes/hg38.cpg_idx \
114-
-o intervals_remote_outout.bed -a SRX012345 -i intervals.bed
181+
xfrase intervals remote -v debug -s localhost -x indexes/hg38.cpg_idx \
182+
-o remote_output.bed -a SRX012345 -i intervals.bed
115183
```
184+
116185
Note that `SRX012345` is not a file this time. Rather, it is a
117186
methylome name or accession, and should be available on the server. If
118187
the server can't find the named methylome, it will respond indicating
