1
1
# Transferase
2
+
2
3
The transferase system for retrieving methylomes from methbase.
3
4
4
5
The transferase program is ` xfrase ` which is quicker to type and will
5
- be used below. There are several commands within ` xfrase ` and the best way to
6
- start is to understand the ` dnmtools roi ` command, as the information
7
- functionality provided by ` xfrase ` is the same. If you need to learn
8
- about ` dnmtools roi ` you can find the docs
9
- [ here] ( https://dnmtools.readthedocs.io/en/latest/roi/ )
6
+ be used below. There are several commands within ` xfrase ` and the best
7
+ way to start is to understand the ` dnmtools roi ` command, as the
8
+ functionality provided by ` xfrase ` is the same. If you
9
+ need to learn about ` dnmtools roi ` you can find the docs
10
+ [here](https://dnmtools.readthedocs.io/en/latest/roi).
11
+
12
+ ## Installing transferase
13
+
14
+ ### Install the pre-compiled binary
15
+
16
+ If you are on a reasonably recent Linux (i.e., no older than 10
17
+ years), then you can install the binary distribution. First
18
+ download it like this:
19
+ ``` console
20
+ wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.2.0/transferase-0.2.0-Linux.sh
21
+ ```
22
+
23
+ Then run the downloaded installer (likely you want to first install it
24
+ beneath your home dir):
25
+ ``` console
26
+ ./transferase-0.2.0-Linux.sh --prefix=${PREFIX}
27
+ ```
28
+
29
+ This will prompt you to accept the license, and then it will install
30
+ the ` xfrase ` binaries in ` ${PREFIX}/bin ` , along with some config files
31
+ in ` ${PREFIX}/share ` . If you want to install it system-wide, and have
32
+ the admin privs, you can do:
33
+ ``` console
34
+ ./transferase-0.2.0-Linux.sh --prefix=/usr/local
35
+ ```
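
After installing under a custom prefix, the shell still has to be told where the binaries are. The following is a minimal sketch, assuming `PREFIX` holds the same directory that was passed to `--prefix`; the fallback `${HOME}/.local` is only an illustrative default and not something the installer chooses for you.

```shell
# Assumption: PREFIX is the directory given to --prefix above;
# "${HOME}/.local" is only an illustrative fallback.
PREFIX="${PREFIX:-${HOME}/.local}"

# Put the installed binaries first on the search path for this session.
export PATH="${PREFIX}/bin:${PATH}"

# Report whether the shell can now find xfrase.
if command -v xfrase >/dev/null 2>&1; then
    echo "xfrase found at: $(command -v xfrase)"
else
    echo "xfrase not found in ${PREFIX}/bin; check the install prefix"
fi
```

To make this permanent, the `export` line would normally go in your shell startup file.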
36
+
37
+ If you are on Debian or Ubuntu, and have admin privileges, you can use
38
+ the Debian package and then transferase will be tracked in the package
39
+ management system. Get the file like this:
40
+ ``` console
41
+ wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.2.0/transferase-0.2.0-Linux.deb
42
+ ```
43
+
44
+ And then install it like this:
45
+ ``` console
46
+ sudo dpkg -i ./transferase-0.2.0-Linux.deb
47
+ ```
48
+
49
+ ### Building the source
50
+
51
+ Not recommended unless you know what you are doing. You will need the
52
+ following:
53
+
54
+ * A compiler that can handle most of C++23, one of
55
+   - GCC >= [14.2.0](https://gcc.gnu.org/pub/gcc/releases/gcc-14.2.0/gcc-14.2.0.tar.gz)
56
+   - Clang >= [20.0.0](https://github.com/llvm/llvm-project.git) (no release as of 12/2024)
57
+ * Boost >= [1.85](https://archives.boost.io/release/${BOOST_VERSION}/source/boost_1_85.tar.bz2)
58
+ * CMake >= [3.30](https://github.com/Kitware/CMake/releases/download/v3.30.2/cmake-3.30.2.tar.gz)
59
+ * ZLib, any version, just install it with ` apt install zlib1g-dev ` ,
60
+ ` mamba install -c conda-forge zlib ` , etc. From source is fast and
61
+ easy.
62
+
63
+ However you install these, remember where you put them and update your
64
+ paths accordingly.
65
+
66
+ Since transferase uses CMake to generate the build system, there are
67
+ multiple ways to do it, but I like this:
68
+ ``` shell
69
+ tar -xf transferase-0.2.0-Source.tar.gz
70
+ cd transferase-0.2.0-Source
71
+ cmake -B build \
72
+     -DCMAKE_BUILD_TYPE=Release # for a faster xfrase
73
+ cmake --build build -j64 # i.e., if you have 64 cores
74
+ cmake --install build --prefix=${HOME} # or wherever
75
+ ```
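
The `-j64` above only makes sense on a 64-core machine. As a hedged sketch, the job count can instead be derived from the machine itself; the variable name `njobs` is just illustrative, and the build step is only attempted if a configured `build` directory is already present.

```shell
# Pick a job count from the number of available cores, falling back to 1.
njobs="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
echo "building with ${njobs} parallel jobs"

# Only attempt the build when a configured build directory exists.
if [ -f build/CMakeCache.txt ]; then
    cmake --build build -j"${njobs}"
fi
```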
10
76
11
77
## Make an index file
12
78
@@ -16,10 +82,12 @@ reference file name is `hg38.fa`, then do this:
16
82
``` console
17
83
xfrase index -v -g hg38.fa -x hg38.cpg_idx
18
84
```
85
+
19
86
If ` hg38.fa ` is roughly 3.0G in size, then you should expect the index
20
87
file ` hg38.cpg_idx ` to be about 113M in size. This command will also
21
88
create an "index metadata" file ` hg38.cpg_idx.json ` , which is named by
22
- just adding the ` .json ` extension to the provided output file.
89
+ just adding the ` .json ` extension to the provided output file. You can
90
+ also start with a gzip format file like ` hg38.fa.gz ` .
23
91
24
92
## Make a methylome file
25
93
@@ -33,52 +101,53 @@ used by `dnmtools`. We again assume that the reference genome is
33
101
` SRX012345.xsym.gz ` file. This ensures a correspondence between the
34
102
` SRX012345.xsym.gz ` file and the ` hg38.cpg_idx ` file we will need.
35
103
Here is how to convert such a file into the xfrase format:
36
-
37
104
``` console
38
105
xfrase compress -v -x hg38.cpg_idx -m SRX012345.xsym.gz -o SRX012345.m16
39
106
```
40
107
41
- If the chromosomes appear out-of-order in ` hg38.cpg_idx ` and
42
- ` SRX012345.xsym.gz ` an error will be reported. As with the ` hg38.cpg_idx `
43
- index file, the methylome file ` SRX012345.m16 ` will be accompanied by
44
- a metadata file with an additional json extension: ` SRX012345.m16.json ` .
108
+ As with the ` hg38.cpg_idx ` index file, the methylome file
109
+ ` SRX012345.m16 ` will be accompanied by a metadata file with an
110
+ additional json extension: ` SRX012345.m16.json ` .
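
Since both the index and the methylome come with a `.json` metadata file, it can be handy to confirm that each pair is complete before going further. This is a small sketch; `has_metadata` is a hypothetical helper, not part of `xfrase`, and it checks only that the files exist, not that their contents agree.

```shell
# Hypothetical helper: succeed only if both a data file and its
# accompanying ".json" metadata file exist.
has_metadata() {
    data_file="$1"
    [ -f "${data_file}" ] && [ -f "${data_file}.json" ]
}

# File names follow the examples used in this document.
for f in hg38.cpg_idx SRX012345.m16; do
    if has_metadata "${f}"; then
        echo "${f}: data and metadata present"
    else
        echo "${f}: missing data file or ${f}.json"
    fi
done
```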
45
111
46
- If you begin with a counts format file, for example ` SRX012345.sym ` ,
47
- created using the ` dnmtools counts ` and then ` dnmtools sym ` commands,
48
- then you will need to first convert it into ` .xsym ` format. You can do
49
- this as follows:
112
+ If you begin with a
113
+ [counts](https://dnmtools.readthedocs.io/en/latest/counts) format
114
+ file, for example ` SRX012345.sym ` , created using the ` dnmtools counts `
115
+ and then ` dnmtools sym ` commands, then you will need to first convert
116
+ it into ` .xsym ` format (whether gzipped or not). You can do this as
117
+ follows:
50
118
``` console
51
119
dnmtools xcounts -z -o SRX012345.xsym.gz -c hg38.fa -v SRX012345.sym
52
120
```
53
- Once again, be sure to always use the same ` hg38.fa ` file. A hash
54
- function will be used internally to ` xfrase ` to ensure that the index
55
- and methylome files correspond to the same reference genome file.
56
121
57
- ## Run the ` lookup ` command locally
122
+ Once again, be sure to always use the same ` hg38.fa ` file. A hash is
123
+ generated and used internally by ` xfrase ` to ensure that the index and
124
+ methylome files correspond to the same reference genome file.
58
125
59
- This step is to make sure everything is sensible. You will need a set
60
- of genomic intervals of interest. In this example these will be named
61
- ` intervals.bed ` . You also need the index file and the methylome file
62
- explained in the above steps.
63
- ``` console
64
- xfrase lookup local --log-level debug -x hg38.cpg_idx -m SRX012345.m16 -o intervals_local_output.bed -i intervals.bed
65
- ```
66
- The index file ` hg38.cpg_idx ` and the methylome file ` SRX012345.m16 `
67
- are the same as explained above. At the time of writing, the
68
- ` intervals.bed ` file must be 6-column bed format, so if yours is only
69
- 3 columns, you can use a simple awk command to pad it out as follows:
126
+ ## Run the ` intervals ` command locally
127
+
128
+ This step is to make sure everything is sensible. Or you might just
129
+ want to keep using this tool for your own analysis (it's fast). You
130
+ will need a set of genomic intervals of interest. In this example
131
+ these will be named ` intervals.bed ` . You also need the index file and
132
+ the methylome file explained in the above steps.
70
133
``` console
71
- awk -v OFS="\t" '{print $1,$2,$3,"X",0,"+"}' intervals.bed3 > intervals.bed
134
+ xfrase intervals local -v debug -x hg38.cpg_idx -m SRX012345.m16 -o local_output.bed -i intervals.bed
72
135
```
73
136
74
- The output in the ` intervals_local_output.bed ` file should be
75
- consistent with the information in the command:
76
-
137
+ The index file ` hg38.cpg_idx ` and the methylome file ` SRX012345.m16 `
138
+ are the same as explained above. The ` intervals.bed ` file may contain
139
+ any number of columns, but the first 3 columns must be in 3-column BED
140
+ format: chrom, start, stop for each interval. The output in the
141
+ ` local_output.bed ` file should be consistent with the information in
142
+ the command:
77
143
``` console
78
144
dnmtools roi -o intervals.roi intervals.bed SRX012345.xsym.gz
79
145
```
80
- The format of the output might be different, but the methylation
146
+
147
+ The formats of these output files differ, but the methylation
81
148
levels on each line (i.e., for each interval) should be identical.
149
+ Note that ` dnmtools roi ` can fail if intervals are nested, while
150
+ the ` xfrase intervals ` command will still work.
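
Before comparing the two tools, it may help to check the intervals file itself. The sketch below uses two hypothetical helpers that are not part of either tool: `check_bed3` verifies that the first three columns look like chrom, start, stop, and `list_nested` prints intervals fully contained in another interval on the same chromosome, which are the ones that might trip up `dnmtools roi`. It assumes whitespace-separated columns and treats start greater than stop as invalid.

```shell
# Hypothetical helper: succeed only if every line has at least three
# columns with non-negative integer start and stop, and start <= stop.
check_bed3() {
    awk '
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            printf "line %d is not valid BED3: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Hypothetical helper: print intervals contained in another interval
# on the same chromosome. Sorting by start (ascending) and then stop
# (descending) puts any containing interval before the ones it contains.
list_nested() {
    LC_ALL=C sort -k1,1 -k2,2n -k3,3nr "$1" |
    awk '
        $1 != chrom { chrom = $1; max_stop = -1 }
        $3 + 0 <= max_stop { print }
        $3 + 0 > max_stop { max_stop = $3 + 0 }
    '
}

# Run both checks on the example file name, if it is present.
if [ -f intervals.bed ]; then
    check_bed3 intervals.bed && list_nested intervals.bed
fi
```

An empty result from `list_nested` means no interval is contained in another, so both commands should be able to process the file.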
82
151
83
152
## Run the ` server ` command
84
153
@@ -94,25 +163,25 @@ relevant genome assembly. For now, using the above examples, we
94
163
would have a single index and a single methylome. I will assume these
95
164
are in subdirectories, named ` indexes ` and ` methylomes ` respectively,
96
165
of the current directory. Here is a command that will start the server:
97
-
98
166
``` console
99
167
xfrase server -v debug -s localhost -p 5000 -m methylomes -x indexes
100
168
```
169
+
101
170
Note that this will fail with an appropriate error message if port 5000
102
171
is already in use, and you can just try 5001, etc., until one is
103
- free. The ` -v debug ` will ensure you see info beyond just the errors. This
104
- informtion is logged to the terminal by default.
172
+ free. The ` -v debug ` will ensure you see info beyond just the
173
+ errors. This information is logged to the terminal by default.
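
Rather than trying ports by hand, the search can be scripted. This is only a sketch: `first_free_port` is a hypothetical helper, it relies on the `/dev/tcp` pseudo-device that bash provides (so it needs to run under bash), and a port that looks free can still be taken by another process before the server starts.

```shell
# Hypothetical helper: print the first port between $1 and $2 on which
# a local TCP connection is refused, i.e. nothing appears to be listening.
# Assumption: run under bash, which provides /dev/tcp.
first_free_port() {
    port="$1"
    while [ "${port}" -le "$2" ]; do
        # The subshell tries to open a connection; failure means the port is free.
        if ! (exec 3<>"/dev/tcp/localhost/${port}") 2>/dev/null; then
            echo "${port}"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}

# Suggest a server command using the first free port in a small range.
if free_port="$(first_free_port 5000 5010)"; then
    echo "try: xfrase server -v debug -s localhost -p ${free_port} -m methylomes -x indexes"
else
    echo "no free port found between 5000 and 5010"
fi
```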
105
174
106
- ## Run the ` lookup ` command remotely
175
+ ## Run the ` intervals ` command remotely
107
176
108
177
We will assume for now that the "remote" server is running on the local
109
178
machine (localhost) and using port 5000 (the default). The
110
- following command should give identical earlier ` lookup ` command:
111
-
179
+ following command should give output identical to the earlier ` intervals ` command:
112
180
``` console
113
- xfrase lookup remote -v debug -s localhost -x indexes/hg38.cpg_idx \
114
- -o intervals_remote_outout .bed -a SRX012345 -i intervals.bed
181
+ xfrase intervals remote -v debug -s localhost -x indexes/hg38.cpg_idx \
182
+     -o remote_output.bed -a SRX012345 -i intervals.bed
115
183
```
184
+
116
185
Note that ` SRX012345 ` is not a file this time. Rather, it is a
117
186
methylome name or accession, and should be available on the server. If
118
187
the server can't find the named methylome, it will respond indicating