ChemGenome 2.0: An ab-initio Gene Prediction Software



About ChemGenome2.1 Downloadable Version

ChemGenome is a physico-chemical method that accepts a DNA sequence in FASTA format and predicts genes based on the hydrogen bonding energy, stacking energy and protein-nucleic acid interaction parameter of each trinucleotide (codon).
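As an illustration of the input side only, the following minimal Python sketch reads FASTA-formatted text and splits a sequence into trinucleotides. The function names read_fasta and codons and the toy sequence are assumptions made for this example; the sketch does not reproduce ChemGenome's energy parameters or its scoring.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def codons(sequence: str, offset: int = 0) -> List[str]:
    """Split a sequence into complete trinucleotides, starting at offset."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


# Toy input (hypothetical, not taken from any genome).
example = ">toy_sequence\nATGGCTAAA\nTTTTGA\n"
for name, seq in read_fasta(example):
    print(name, codons(seq))
```

Changing offset to 1 or 2 gives the codons of the other two reading frames on the same strand.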

ChemGenome is an ab-initio method and has been tested on 372 prokaryotic genomes. Averaged over 356208 genes and an equal number of frame-shifted genes (non-genes), its sensitivity, specificity and correlation coefficient are 97.5%, 97.20% and 94.25%, respectively. The software can be downloaded from the following link.
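The formulas behind these three figures are not given here. The sketch below assumes the usual confusion-matrix definitions, with genes as positives and frame-shifted non-genes as negatives, and the Matthews form of the correlation coefficient. Note that some gene-finding papers define specificity as TP/(TP+FP) instead. The counts in the example are hypothetical and are not ChemGenome results.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true genes that were predicted as genes."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-genes that were rejected (assumed definition)."""
    return tn / (tn + fp)


def correlation_coefficient(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient (assumed definition)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts, chosen only to exercise the functions.
tp, fn, tn, fp = 95, 5, 90, 10
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
print(f"correlation = {correlation_coefficient(tp, tn, fp, fn):.3f}")
```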

Click here to download the ChemGenome software and the ReadMe file containing instructions for using the program

ReadMe File

Follow these steps to run ChemGenome 2.1 on Linux (ChemGenome2.1 is compiled for Linux).

Installing and Running ChemGenome2.1
ChemGenome was written and compiled in a Linux environment. The following instructions are meant to be run on a Linux system.

1. Installation
To install ChemGenome, download the file ChemGenome2.1.tar from the website. The size of the file is 2.9 MB. Copy the tar file into your current directory and extract it with this command:

$ tar -xvf ChemGenome2.1.tar

The ChemGenome2.1 archive contains five items: a shell script, Protein_Score.exe, Chemgenome2.0, the data directory and readme.txt

To run ChemGenome2.1 properly, copy the data directory into your current directory before running it. After ChemGenome2.1 finishes, all result files are placed in the current directory.

2. Running
ChemGenome2.1 is run through its shell script. Pass the chromosome file in FASTA format as the first argument and 1 or 2 as the second argument, depending on the organism (1 for a sequence from a prokaryotic organism, 2 for a sequence from a eukaryotic organism or of unknown origin).

$ sh <script_file> <genome_file_name> <1 or 2>
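For batch runs, the call can be assembled programmatically. The sketch below only builds the argument list. The script file name is not given in this ReadMe, so run_script.sh is a placeholder, and the helper build_command and its organism labels are likewise assumptions made for this example.

```python
from typing import List

# Second argument as described above: 1 for prokaryotic sequences,
# 2 for eukaryotic sequences or sequences of unknown origin.
ORGANISM_CODES = {"prokaryote": "1", "eukaryote": "2", "unknown": "2"}


def build_command(script: str, fasta_path: str, organism: str) -> List[str]:
    """Return the argument list for one run through the shell script."""
    if organism not in ORGANISM_CODES:
        raise ValueError(f"organism must be one of {sorted(ORGANISM_CODES)}")
    return ["sh", script, fasta_path, ORGANISM_CODES[organism]]


# "run_script.sh" and "genome.fasta" are hypothetical names.
print(build_command("run_script.sh", "genome.fasta", "prokaryote"))
```

The returned list can be passed to subprocess.run(..., check=True) from the directory that holds the data directory.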

For advanced features, the user can modify the script file. In it, the first executable program is Chemgenome2.0, called with the following parameters:

$ ./Chemgenome2.0 <genome_file_name> <orf_length> <method> <Start Codon (ATG OR|AND CTG OR|AND GTG OR|AND TTG) >

3. Arguments
ORF Length: For a small genome, you can specify a lower threshold value to find smaller genes. For a large genome, you can specify a higher threshold value to weed out false positives.

Start Codon: You can specify which start codon or codons (ATG, CTG, GTG, TTG) are used when searching for genes.
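To make the effect of these two arguments concrete, here is a minimal sketch of a plain open-reading-frame scan with a length threshold and a configurable set of start codons. It is not ChemGenome's physico-chemical method. The function find_orfs, its coordinate convention and the use of the standard stop codons TAA, TAG and TGA are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_length: int,
              start_codons: Iterable[str] = ("ATG",)) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for ORFs of at least min_length nucleotides.

    frame is 1, 2 or 3; start is 0-based and end is exclusive, with the
    stop codon included in the reported span.
    """
    starts = set(start_codons)
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if i + 3 - orf_start >= min_length:
                    orfs.append((frame + 1, orf_start, i + 3))
                orf_start = None
    return orfs


toy = "ATGAAACCCTAAGTGTTTTGA"  # hypothetical sequence
print(find_orfs(toy, min_length=9))
print(find_orfs(toy, min_length=9, start_codons=("ATG", "GTG")))
print(find_orfs(toy, min_length=10, start_codons=("ATG", "GTG")))
```

Allowing GTG as a start codon adds a second, shorter candidate in the toy sequence, and raising the threshold removes it again. Applying find_orfs to reverse_complement(toy) scans the three frames of the complementary strand.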

Method:
DNA Space: The method takes the complete or partial genome sequence of a prokaryotic species in FASTA format as its input file. It searches for genes based on the physico-chemical properties of double-helical deoxyribonucleic acid (DNA).

Protein Space: The method takes the result generated by the DNA space method as its input file and works as a filter, based on stereochemical properties of protein sequences, to reduce false positives.

Swissprot Space: The method takes the result generated by the protein space method as its input file and calculates the standard deviation of a query nucleotide sequence (predicted gene sequence) from the Swissprot proteins, based on the frequency of occurrence of amino acids. A threshold standard deviation is chosen to keep false positives at a minimum and precision at a maximum.
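The exact formula used by the Swissprot space filter is not given here. As a rough sketch of the idea, the code below compares the amino-acid frequencies of one protein sequence with a reference frequency table and applies a threshold. The root-mean-square form of the deviation, the function names and the uniform reference table are all assumptions made for this example. A real reference table would have to be computed from Swissprot itself.

```python
import math
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(protein: str) -> Dict[str, float]:
    """Relative frequency of each standard amino acid in a sequence."""
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def frequency_deviation(protein: str, reference: Dict[str, float]) -> float:
    """Root-mean-square deviation of query frequencies from the reference."""
    query = amino_acid_frequencies(protein)
    squared = [(query[aa] - reference[aa]) ** 2 for aa in AMINO_ACIDS]
    return math.sqrt(sum(squared) / len(AMINO_ACIDS))


def passes_filter(protein: str, reference: Dict[str, float],
                  threshold: float) -> bool:
    """Keep a predicted gene only if its deviation is within the threshold."""
    return frequency_deviation(protein, reference) <= threshold


# Placeholder reference: uniform frequencies, NOT real Swissprot values.
uniform_reference = {aa: 1 / 20 for aa in AMINO_ACIDS}
print(passes_filter("MKTAYIAKQRQISFVKSHFSRQ", uniform_reference, threshold=0.1))
```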

4. Output of the Program
The output of Chemgenome2.0 is passed through protein-based filters to produce the final output. The version available online provides graphical output. The downloadable version creates the following files.

The output files are
1. 1main_orfs.txt - Genes predicted in 1st main reading frame
2. 2main_orfs.txt - Genes predicted in 2nd main reading frame
3. 3main_orfs.txt - Genes predicted in 3rd main reading frame
4. 1complementary_orfs.txt - Genes predicted in 1st complementary reading frame
5. 2complementary_orfs.txt - Genes predicted in 2nd complementary reading frame
6. 3complementary_orfs.txt - Genes predicted in 3rd complementary reading frame
7. Gene_sequences.txt - Gene sequences of the predicted genes along with position
8. Protein_sequences.txt - Protein sequences of the predicted genes along with position
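After a run, a quick check that all eight files listed above were written can be sketched as follows. The file names are taken from the list. The helper summarize_outputs is an assumption made for this example, and it inspects only the presence and size of each file, since the internal format of the files is not described here.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

EXPECTED_OUTPUTS = [
    "1main_orfs.txt", "2main_orfs.txt", "3main_orfs.txt",
    "1complementary_orfs.txt", "2complementary_orfs.txt",
    "3complementary_orfs.txt",
    "Gene_sequences.txt", "Protein_sequences.txt",
]


def summarize_outputs(workdir: str = ".") -> Dict[str, Optional[int]]:
    """Map each expected output file to its size in bytes, or None if absent."""
    base = Path(workdir)
    return {
        name: (base / name).stat().st_size if (base / name).is_file() else None
        for name in EXPECTED_OUTPUTS
    }


# Demonstration in a temporary directory with one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1main_orfs.txt").write_text("placeholder\n")
    report = summarize_outputs(tmp)
    missing = [name for name, size in report.items() if size is None]
    print(f"{len(missing)} of {len(EXPECTED_OUTPUTS)} expected files are missing")
```

In practice, summarize_outputs() would be called with the directory from which ChemGenome2.1 was run.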

5. Speed
The time taken by the program depends on the genome size and the speed of the system on which it is run. It usually takes 1-2 minutes for a 1 MB genome on a Pentium 4 (2.40 GHz CPU, 248 MB RAM) with the Swissprot space method.


[1] "Prokaryotic Gene Finding based on Physicochemical Characteristics of Codons Calculated from Molecular Dynamics Simulations", Poonam Singhal, B Jayaram, Surjit B. Dixit and David L. Beveridge, Biophys. J., 2008, 94(11), 4173-4183.

[2] "A Physico-Chemical model for analyzing DNA sequences", Dutta S, Singhal P, Agrawal P, Tomer R, Kritee, Khurana E and Jayaram B, J. Chem. Inf. Model., 2006, 46(1), 78-85.

[3] "Beyond the Wobble: The rule of conjugates", Jayaram B, J. Mol. Evol., 1997, 45, 704-705.

In case of any Suggestions/Exceptions, Please contact us at