ynyxlxx/Graph_Traversal


Graph_Traversal

Find all the connected components of the graph that contain the start nodes.

The input graph is generated by SGA (String Graph Assembler). SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads. Link: https://github.com/jts/sga#sga---string-graph-assembler

The Python version used is 3.6.5.

Preprocessing

The graph generated by SGA is saved in ASQG format, which contains all the information about vertices and edges. Only the edge information is needed, and the unnecessary data can be filtered out easily with the zgrep command on Linux.

In a Linux terminal, run the following commands:

  1. zgrep "ED.*" filename > output_filename
  2. gzip output_filename
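
For systems without zgrep, the same filtering can be sketched in Python. This is a minimal illustration, not part of the repository: the function name `extract_edge_lines` is hypothetical, and it assumes that edge records in the ASQG file begin with the tag `ED`. Unlike `zgrep "ED.*"`, which keeps any line containing `ED` anywhere, this sketch keeps only lines that start with `ED`, which is a slightly stricter choice.

```python
import gzip


def extract_edge_lines(asqg_gz_path, out_gz_path):
    """Copy only the ED (edge) records from a gzipped ASQG file
    into a new gzipped file and return how many lines were kept.

    Assumption: every edge record starts with the two characters "ED".
    """
    kept = 0
    # Stream line by line so the whole file is never held in memory.
    with gzip.open(asqg_gz_path, "rt") as src, gzip.open(out_gz_path, "wt") as dst:
        for line in src:
            if line.startswith("ED"):
                dst.write(line)
                kept += 1
    return kept
```

A call such as `extract_edge_lines("graph.asqg.gz", "example.gz")` would then do both preprocessing steps (filtering and compression) in one pass.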

Usage

search.py

If you have enough memory to handle a huge graph, use search.py for better time efficiency: the whole graph is loaded into memory for further processing. A graph with two million nodes needs at least 13 GB of memory.

Usage: search.py graph_file sam_file

Example: search.py example.gz example.sam
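
The in-memory approach can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual code of search.py: the names `load_graph` and `components_containing` are hypothetical, each edge line is assumed to have the tag `ED` followed by the two vertex IDs as its first whitespace-separated fields, the graph is treated as undirected, and the start nodes are assumed to be already available as a list of vertex IDs (how they are derived from the SAM file is not shown here).

```python
import gzip
from collections import defaultdict, deque


def load_graph(edge_gz_path):
    """Read a gzipped file of ED lines into an undirected adjacency map."""
    adjacency = defaultdict(set)
    with gzip.open(edge_gz_path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Assumed layout: ED <vertex1> <vertex2> <overlap fields...>
            if len(fields) < 3 or fields[0] != "ED":
                continue
            u, v = fields[1], fields[2]
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def components_containing(adjacency, start_nodes):
    """Return the connected components that contain at least one start node.

    Each component is found once with a breadth-first search, even if
    several start nodes fall into it. Start nodes that have no edges in
    the graph are skipped.
    """
    seen = set()
    components = []
    for start in start_nodes:
        if start in seen or start not in adjacency:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components
```

Keeping a single `seen` set across all start nodes means every vertex is visited at most once, so the total work is linear in the size of the components reached.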

db_build.py and search_d.py

  1. If you don't have enough memory, use db_build.py to build a SQLite database on your hard disk first.

Usage: db_build.py graph_file graph_database

Example: db_build.py example.gz my_database.db

  2. After the database is built, use search_d.py to do the search.

Usage: search_d.py graph_database sam_file

Example: search_d.py my_database.db example.sam
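
The disk-backed variant can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea, not the schema or code used by db_build.py and search_d.py: the table name `edges`, its columns, and the functions `build_db` and `component_of` are all hypothetical, and the edge-line layout (tag `ED` followed by the two vertex IDs) is an assumption.

```python
import gzip
import sqlite3
from collections import deque


def _edge_rows(edge_gz_path):
    """Yield each edge in both directions so lookups by `src` find all neighbours."""
    with gzip.open(edge_gz_path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Assumed layout: ED <vertex1> <vertex2> <overlap fields...>
            if len(fields) >= 3 and fields[0] == "ED":
                yield fields[1], fields[2]
                yield fields[2], fields[1]


def build_db(edge_gz_path, db_path):
    """Store the edge list in an indexed SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT)")
    # executemany consumes the generator lazily, so the graph is streamed to disk.
    conn.executemany("INSERT INTO edges VALUES (?, ?)", _edge_rows(edge_gz_path))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_edges_src ON edges (src)")
    conn.commit()
    return conn


def component_of(conn, start):
    """Breadth-first search that fetches neighbours from the database on demand."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        rows = conn.execute("SELECT dst FROM edges WHERE src = ?", (node,))
        for (neighbour,) in rows:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

In this sketch only the set of visited vertices is held in memory while the edges stay on disk, which trades lookup speed for a much smaller memory footprint. The index on `src` is what keeps each neighbour query from scanning the whole table.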

Example

  1. Change to the directory that contains your data.

     cd your_dir
    
  2. Do the preprocessing. Note that graph.asqg.gz is the output of SGA; gzip is used to compress the filtered file to save space.

     zgrep "ED.*" graph.asqg.gz > example.txt
     gzip example.txt
    
  3. If you have enough memory:

     python search.py example.gz example.sam
    

Then check the result in result.txt.

  4. If you do not have enough memory:

     python db_build.py example.gz my_database.db
     python search_d.py my_database.db example.sam
    

Then check the result in result.txt.
