Find all the connected components in the graph that contain the start nodes.
The graph used as input is generated by the SGA - String Graph Assembler. SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads. link: https://github.com/jts/sga#sga---string-graph-assembler
The Python version used is 3.6.5.
The graph generated by SGA is saved in ASQG format, which contains all the information about vertices and edges. Only the edge information is needed, and the unnecessary data can be filtered out with the zgrep command on Linux.
In a Linux terminal, run the following commands:
- zgrep "ED.*" filename > output_filename
- gzip output_filename
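
If zgrep is not available, the same filtering can be approximated in Python. This is only a minimal sketch and not part of the scripts in this repository: the function name `filter_edge_records` is hypothetical, and it assumes that every edge record in the ASQG file starts with the tag `ED` followed by a tab. Unlike `zgrep "ED.*"`, which keeps any line containing `ED` anywhere, the sketch keeps only lines that begin with that tag.

```python
import gzip


def filter_edge_records(asqg_gz_path, out_gz_path):
    """Copy only the edge records of a gzipped ASQG file into a new gzipped file.

    Returns the number of records that were kept.
    """
    kept = 0
    with gzip.open(asqg_gz_path, "rt") as src, gzip.open(out_gz_path, "wt") as dst:
        for line in src:
            # Assumption: edge records begin with "ED" followed by a tab.
            if line.startswith("ED\t"):
                dst.write(line)
                kept += 1
    return kept
```

For example, `filter_edge_records("graph.asqg.gz", "example.gz")` would write the filtered, compressed edge list in one step.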
If you have enough memory to hold a huge graph, use search.py for better time efficiency: the whole graph is loaded into memory for further processing. A graph that contains two million nodes needs at least 13 GB of memory.
Usage: search.py graph_file sam_file
Example: search.py example.gz example.sam
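
The in-memory approach can be illustrated with a breadth-first search over an adjacency map. This is a sketch of the general technique, not the actual code of search.py. The names `load_graph` and `components_of` are hypothetical, edges are treated as undirected, and each edge line is assumed to hold the tag `ED` followed by the two vertex IDs as the first whitespace-separated fields.

```python
import gzip
from collections import defaultdict, deque


def load_graph(edge_gz_path):
    """Load the filtered, gzipped edge file into an in-memory adjacency map."""
    adjacency = defaultdict(set)
    with gzip.open(edge_gz_path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Assumption: fields[0] is "ED", fields[1] and fields[2] are vertex IDs.
            if len(fields) < 3 or fields[0] != "ED":
                continue
            a, b = fields[1], fields[2]
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def components_of(adjacency, start_nodes):
    """Return the connected components that contain at least one start node."""
    seen = set()
    components = []
    for start in start_nodes:
        # Skip starts already covered by an earlier component, and starts
        # that have no edges (they never appear in an edge-only file).
        if start in seen or start not in adjacency:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components
```

Because every node is visited at most once, two start nodes that lie in the same component yield a single component rather than a duplicate.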
- If you don't have enough memory, use db_build.py to build a SQLite database on your hard disk first.
Usage: db_build.py graph_file graph_database
Example: db_build.py example.gz my_database.db
- After the database is built, use search_d.py to do the search.
Usage: search_d.py graph_database sam_file
Example: search_d.py my_database.db example.sam
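
The database-backed variant trades speed for memory: the edges stay on disk and the search asks SQLite for the neighbours of one node at a time. The sketch below shows the idea with the standard sqlite3 module. It is not the code of db_build.py or search_d.py, and the table name `edges`, its columns `src` and `dst`, and the function names are all hypothetical.

```python
import sqlite3
from collections import deque


def build_edge_db(edges, db_path):
    """Store (vertex, vertex) pairs in a SQLite table and index them by source."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT)")
    # Insert each edge in both directions so that a single indexed lookup
    # on src returns all neighbours of a node (edges treated as undirected).
    rows = ((x, y) for a, b in edges for x, y in ((a, b), (b, a)))
    conn.executemany("INSERT INTO edges (src, dst) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_edges_src ON edges (src)")
    conn.commit()
    return conn


def component_from_db(conn, start):
    """Breadth-first search that reads neighbours from the database on demand."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        rows = conn.execute(
            "SELECT dst FROM edges WHERE src = ?", (node,)
        ).fetchall()
        for (neighbour,) in rows:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Only the set of visited nodes is kept in memory, so the footprint grows with the size of the component being explored rather than with the size of the whole graph. A start node that has no edges in the database comes back as a component containing only itself.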
- Change to the directory that contains your data:
    cd your_dir
- Do the preprocessing. Note that graph.asqg.gz is the output of SGA, and gzip is used to compress the filtered file to save space. gzip appends .gz to the file name, so the commands below produce example.gz:
    zgrep "ED.*" graph.asqg.gz > example
    gzip example
- If you have enough memory:
    python search.py example.gz example.sam
  Then check the result in result.txt.
- If you do not have enough memory:
    python db_build.py example.gz my_database.db
    python search_d.py my_database.db example.sam
  Then check the result in result.txt.