This is the signature genome modeling platform developed in the Frank Alber lab, Department of Microbiology, Immunology and Molecular Genetics, University of California Los Angeles. The source code is in the `igm` folder.
A population of single-cell genome structures (single-copy single chromosomes or whole diploid genomes) is generated that fully recapitulates a variety of experimental genomic and/or imaging data. The platform does NOT preprocess raw data.
Structures can be further processed using the [analysis package](https://github.com/alberlab/genome3danalysis), which is also available.
We are currently working on an extensive, user-friendly tutorial on GitHub.io to help users navigate parameter choices, set up their configuration file, and run the code.
We are also in the process of updating our supporting documentation. For the time being, please refer to the IGM1.0 documentation.
If you use genome structures generated with this platform, or you use the platform to generate your own structures, please consider citing our work:
Boninsegna, L., Yildirim, A., Polles, G., Zhan, Y., Quinodoz, SA., Finn, EH., Guttman, M., Zhou, XJ., Alber, F. Integrative genome modeling platform reveals essentiality of rare contact events in 3D genome organizations. Nat Methods 19, 938–949 (2022)
We strongly advise against installing the software on macOS. In our experience, installation steps were not transferable from one macOS version to the next, so we removed that information from this file.
August 25
The current version improves upon IGM 1.0 by allowing the following data to be used in the modeling:
- volume confinement from imaged single cell nuclear laminas, nucleoli, speckles
- lamina DamID when using imaged single cell nuclear laminas, nucleoli, speckles
- single cell chromatin tracing data (e.g., DNA MERFISH, DNA seqFISH+):
  - tracing data as target (x, y, z) locations for selected loci, and/or
  - single cell pairwise distances, and/or
  - a chromatin fiber model that is compatible with the tracing data
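To make the relationship between the two tracing inputs concrete, target pairwise distances follow directly from imaged (x, y, z) locations. A minimal illustration (hypothetical array names and values, not the IGM input format):

```python
import numpy as np

# Hypothetical example: (x, y, z) positions of 4 traced loci in one cell (nm).
# This is NOT the IGM input format, only an illustration of how pairwise
# target distances follow from imaged locations.
traced_xyz = np.array([
    [0.0, 0.0, 0.0],
    [100.0, 0.0, 0.0],
    [100.0, 100.0, 0.0],
    [0.0, 100.0, 100.0],
])

# Pairwise Euclidean distance matrix between all traced loci
diff = traced_xyz[:, None, :] - traced_xyz[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

print(pairwise[0, 1])  # 100.0
```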
Implementation changes:
- Single chromosome genome structures can also be generated, in addition to whole genome diploid structures
- The Hi-C iterative correction can be turned off by setting a flag to 0
- Chromatin bead radius can be selected by the user, instead of being derived from a given chromatin-to-nucleus occupancy value
- Intra-chromosomal and inter-chromosomal Hi-C contacts are handled as two separate restraints
- Logging has been much improved to clearly show the number of violations (and the structure displaying the most violations)
- Initialization of structures has been greatly expanded: selected loci can be initialized at pre-determined locations, and linear interpolation is used to prime the remaining loci
- Violations are recorded and printed even after the initial relaxation step (which involves no actual data)
- Remember that a version of the Molecular Dynamics software LAMMPS with the required fixes is necessary.
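The interpolation-based initialization listed above can be sketched as follows (names and shapes are illustrative, not the actual IGM code):

```python
import numpy as np

# Illustrative sketch of interpolation-based priming: a chromosome of 11 beads
# where beads 0, 5, and 10 have pre-determined (x, y, z) positions, and the
# remaining beads are primed by linear interpolation between those anchors.
n_beads = 11
anchor_idx = np.array([0, 5, 10])
anchor_xyz = np.array([
    [0.0, 0.0, 0.0],
    [500.0, 0.0, 0.0],
    [500.0, 500.0, 0.0],
])

all_idx = np.arange(n_beads)
# np.interp works per coordinate; stack the x, y, z columns
coords = np.column_stack([
    np.interp(all_idx, anchor_idx, anchor_xyz[:, k]) for k in range(3)
])

# bead 2 sits 2/5 of the way between the bead-0 and bead-5 anchors
print(coords[2])
```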
- `igm`: full IGM code
- `bin`: IGM run master files. In particular, refer to `igm-run.sh` (actual submission script) and `igm-report.sh` (automated post-processing script)
- `demo`: example inputs (`.hcs`, `.json` files) for the demo run
- `HPC_scripts`: scripts to create the ipyparallel environment and submit the actual IGM run on a Sun Grid Engine (SGE) scheduler-based HPC cluster (the kind run on the UCLA Hoffman2 cluster)
- `igm-run_scheme.pdf`: a schematic that breaks down the different computing levels of IGM and explains how the different parts of the code relate to one another
- `IGM_documentation.pdf`: documentation (in progress)
- `igm-config_TBD.json`: the most comprehensive configuration file, showing parameters for all data sets that can be accommodated [update in progress]
- `config_schema.json`: detailed explanation of each dictionary entry in the configuration file
IGM no longer supports Python 2, so you will need a Python 3 environment. The package depends on a number of other libraries, most of them publicly available on pip. In addition, some other packages are required:
- `alabtools` (github.com/alberlab/alabtools)
- a modified version of LAMMPS (forked at github.com/alberlab/lammpgen) with fixes implementing user-defined forces (e.g., HarmonicLowerBound, HarmonicUpperBound, volumetric_restraint, etc.)
Many of the alabtools and IGM dependencies can be installed with a few commands if you are using conda (https://www.anaconda.com/distribution/)
Please note that we are running conda versions from 2019; more recent versions might cause compatibility issues.
```
# optional - create a new environment for igm
conda create -n igm python=3.6
source activate igm
# install dependencies
conda install pandas swig cython cgal==4.14 hdf5 h5py numpy scipy matplotlib \
    tornado ipyparallel cloudpickle
```
- It looks like the `cgal` version needs to be 4.14; there are compatibility issues with the latest 5.0 version.
If you really do not want to use conda, most of the packages can be installed with pip, but it is up to you to download and build cgal and hdf5, and to set the correct include/library paths during installation if needed.
- Install alabtools (github.com/alberlab/alabtools):

```
pip install git+https://github.com/alberlab/alabtools.git
```

Note: on Windows, conda CGAL generates the library, but the name depends on the build, e.g. CGAL-vc140-mt-4.12.lib. Go to /Library/lib/ and copy the CGAL library to CGAL.lib before pip-installing alabtools.
- Install IGM:

```
pip install git+https://github.com/alberlab/igm.git
```
- Download and build a serial binary of the modified LAMMPS version:

```
git clone https://github.com/alberlab/lammpgen.git
cd lammpgen/src
make yes-user-genome
make yes-molecule
make serial
# create a user defaults file with the path of the executable
mkdir -p ${HOME}/.igm
echo "[DEFAULT]" > ${HOME}/.igm/user_defaults.cfg
echo "optimization/kernel_opts/lammps/lammps_executable = "$(pwd)/lmp_serial >> ${HOME}/.igm/user_defaults.cfg
```

(Note that after `cd lammpgen/src`, the executable path is `$(pwd)/lmp_serial`.)
- If all the dependencies have been installed correctly, installation of the code itself should only take a few minutes.
- If the `igm` installation is successful, typing `igm` from the command line + `tab` should show the different options (`igm-run`, `igm-report`, etc.)
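The `user_defaults.cfg` file created during the LAMMPS build step is a plain INI file. As an illustrative sanity check (using only the Python standard library; the executable path below is a placeholder), it can be parsed like this:

```python
import configparser

# Contents matching what the echo commands above write (path is a placeholder)
cfg_text = """[DEFAULT]
optimization/kernel_opts/lammps/lammps_executable = /path/to/lammpgen/src/lmp_serial
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)

# The slash-separated option name is an ordinary INI key
print(cfg["DEFAULT"]["optimization/kernel_opts/lammps/lammps_executable"])
```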
- IGM works mostly through the file system. This design was motivated by the details of our local cluster, by data persistence, and by minimizing the memory required by the scheduler and workers. In short, this means that the scheduler, the workers, and the node executing the igm-run script all need access to a shared filesystem where all the files will be located.
- Over the last 10+ years of simulating genome structures, we have come to accept that preprocessing the experimental data can be an art. For example, Hi-C raw counts need to be transformed into probability matrices. Some of these processes have yet to be completely and exhaustively documented publicly. We are working on it, but in the meantime please email us if you need help.
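As a toy illustration of the kind of preprocessing meant here (not the actual pipeline, which involves more careful normalization), raw Hi-C counts can be mapped to a symmetric matrix of values in [0, 1]:

```python
import numpy as np

# Toy raw Hi-C count matrix (symmetric). Real preprocessing (e.g. iterative
# correction) is more involved; this only shows the counts -> probabilities idea.
counts = np.array([
    [0, 80, 10],
    [80, 0, 40],
    [10, 40, 0],
], dtype=float)

# Simple scaling so that entries lie in [0, 1]
probs = counts / counts.max()

print(probs[0, 1])  # 1.0
```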
In order to generate a population of structures, the code has to be run in parallel, ideally on HPC clusters. The scripts to do so on SGE scheduler-based HPC resources are provided in the `HPC_scripts` folder. As a rough estimate, using 250 independent cores allows generating a population of 1,000 structures at 200 kb resolution in 10-15 hours of computing time, which can vary depending on the number of different data sources used and on the number of iterations one decides to run.
Populations of 5 or 10 structures at 200 kb resolution (the highest resolution we have simulated so far) could in principle be generated serially on a "normal" desktop computer, but they have little statistical relevance. For example, 10 structures would only allow deconvolution of Hi-C contacts with probability larger than 10%, which is not sufficient for generating realistic populations. Serial executions are appropriate only at much lower resolution, where the computing burden is also much lower (an example is provided in the `demo` folder; see also Software demo).
Due to the necessity of HPC resources, we strongly recommend installing and running the software in a Linux environment. ALL the populations we have generated and analyzed were produced in a Linux environment. Again, please understand that we cannot guarantee full functionality on macOS or Windows.
In order to run IGM to generate a population using a given combination of data sources, the `igm-config.json` file needs to be edited accordingly, by specifying the input files and adding/removing the parameters for each data source where applicable (a detailed description of the available entries is given under `igm/core/defaults`). Then the software can be run using `igm-run igm-config.json`. Specifically:
- Go into the `igm-config.json` file (or your config file) and edit `optimization/kernel_opts/lammps/lammps_executable` so that it points to the actual LAMMPS executable that was installed (see Installation on Linux).
- To run serially (as a test), go into the `igm-config.json` file (or your config file) and set `parallel/controller` to "serial". Then execute IGM (from the command line or by submitting a serial job to the HPC cluster) by typing `igm-run config_file.json >> output.txt`.
- To run in parallel (for actual calculations), go into the `igm-config.json` file and set `parallel/controller` to "ipyparallel", then follow the steps detailed in the `HPC_scripts/steps_to_submit_IGM.txt` file and in the documentation, which rely on scripts also found in the `HPC_scripts` folder. Specifically: create a running ipcluster environment (`bash create_ipcluster_environment.sh` followed by `qsub submit_engines.sh`) and only then submit the actual IGM calculation (`qsub submit_igm.sh`), which executes the `igm-run igm-config.json` command, i.e.

```
bash create_ipcluster_environment.sh
qsub submit_engines.sh
qsub submit_igm.sh
```
[Commands and syntax will need to be adapted if a scheduler other than SGE is used]
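For orientation, the slash-separated option paths used in this README correspond to nested entries in the JSON configuration file. A minimal fragment covering the two keys discussed above might look like the following (paths are placeholders; see `config_schema.json` for the full set of entries):

```json
{
  "parallel": {
    "controller": "ipyparallel"
  },
  "optimization": {
    "kernel_opts": {
      "lammps": {
        "lammps_executable": "/path/to/lammpgen/src/lmp_serial"
      }
    }
  }
}
```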
- A successful run should generate `igm.log` and `stepdb.sqlite` files, a number of temporary files from the Assignment Steps, and finally a sequence of intermediate .hss genome populations, each resulting from a different A/M iteration (see `IGM_documentation.pdf`). The file `igm-model.hss` will contain the optimized population at the end of the pipeline. The .hss files can be read conveniently using the `alabtools` package mentioned above.
- An unsuccessful run (for whatever reason) should produce an `err_igm` file with details about why the run crashed. If a run crashes accidentally (e.g., a node goes down), resubmitting the calculation using `qsub submit_igm.sh` (assuming the ipcluster environment is still up and running) will pick up exactly where the previous run left off. However, if a fresh calculation has to start from the top, please make sure all the temporary files (including the database `stepdb.sqlite`) and the `tmp` folder are removed before submitting.
In order to get familiar with the configuration file and the code execution, we provide a demo configuration file, `config_file.json`, for running a 2 Mb resolution WTC11 population using Hi-C data only; it is found in the `demo` folder.
A comprehensive configuration file, `igm-config_all.json`, for running an HFF population with all data types (Hi-C, lamina DamID, SPRITE, and 3D HIPMap FISH) is also provided here as a reference/template. Clearly, each user must specify their own input files.
Sample files are provided to simulate a Hi-C-only population of WTC11 (spherical nucleus) at 2 Mb resolution, to get familiar with the basics of the code:
- Enter the `demo` folder: data and scripts for a 2 Mb IGM calculation with Hi-C restraints are provided. The `.hcs` file is a 2 Mb resolution Hi-C contact map; `config_file.json` is the .json configuration file with all the parameters needed for the calculation. In particular, we generate 100 structures, which means the lowest contact probability we can target is 0.01 (1%). For different setups, we recommend using different names for the configuration file to avoid confusion. Whatever name is chosen, it will have to be updated when running the scripts.
- Run IGM as detailed in the previous section (`igm-run config_file.json`), either serially or in parallel; the serial calculation (on a normal computer) all the way down to 1% probability should complete in a few hours.
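The relation quoted above between population size and the lowest targetable contact probability (100 structures → 1%) is simply 1/N:

```python
# A contact imposed in k structures out of N is realized with frequency k/N,
# so with N structures the smallest non-zero target probability is 1/N.
def min_target_probability(n_structures: int) -> float:
    return 1.0 / n_structures

print(min_target_probability(100))   # 0.01  (demo population)
print(min_target_probability(1000))  # 0.001
```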