CGAT instructions

This is the Comparative Genomic Analysis Tools (CGAT) help file.

Contents

  1. Starting out.
  2. Running automated analyses using dna_to_fplot.
  3. Using the results from dna_to_fplot.
  4. Running individual analyses separately.
  5. Importing data into an analysis database.
  6. Using RepeatMasker to mask repetitive sequences.

1. Starting out.

Make sure that the configuration is complete. The blast_off.config file needs to contain your email address. Your path should contain the CGAT directory. Installation and configuration instructions are available here.
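
The two requirements above can be checked before starting a long run. The following is only a sketch: the function name check_cgat_setup is made up for this example, and it assumes that blast_off.config is a plain-text file with your email address somewhere in it. Pass the location of your own config file as the argument.

```shell
# Sanity-check a CGAT setup before running any analysis.
# Assumption: the config file is plain text and holds an email address;
# the exact layout of blast_off.config is not checked here.
check_cgat_setup() {
    config_file=$1
    rc=0

    # The CGAT directory must be on PATH so the programs can be found.
    if ! command -v dna_to_fplot >/dev/null 2>&1; then
        echo "dna_to_fplot not found on PATH" >&2
        rc=1
    fi

    # Look for anything shaped like an email address in the config file.
    if ! grep -q '[^ @]@[^ @]' "$config_file" 2>/dev/null; then
        echo "no email address found in $config_file" >&2
        rc=1
    fi

    return $rc
}
```

The function returns 0 when both checks pass and 1 otherwise, printing the reason for any failure.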


2. Running automated analyses using dna_to_fplot.

To run the automated analysis, use the program dna_to_fplot. The DNA file can be in FASTA format, a GenBank file, or a file containing only DNA sequence. Automated analysis can be run on genomic DNA (the default) or on a set of cDNA sequences concatenated together in one file. To run cDNA analysis, use the '-c' parameter.

The command is:

dna_to_fplot DNA.fasta

This may take a long time to run, depending on the response time of the BLAST server and how quickly email is received. A larger DNA sequence, such as 100 kb, can take a day to complete.

The results are a set of files, all with the same base name. The base name will be the name of the DNA file, or a shortened version of it.

In this example, the resulting files would be:

DNA.grail.mask			-DNA with repetitive sequence masked.
DNA.grail.repeats		-List of repeats and simple seq. repeats.
DNA.grail2.exons		-GRAIL 2 exon predictions.
DNA.grail2.CpG.pol2.polyA	-GRAIL 1.3 CpG islands, Pol II promoters, 
				 and polyA sites.
DNA.mzef.exons			-MZEF exon predictions.

For each database searched using BLAST, three files are generated:

DNA.db.BLASTN			-The raw BLAST output.
DNA.db.BLASTN.parse		-A report of good BLAST matches.
DNA.db.BLASTN.fplot		-An fplot database of the good BLAST matches.

In this example, these BLAST results files are generated:

DNA.nr.BLASTN
DNA.nr.BLASTN.parse
DNA.nr.BLASTN.fplot
DNA.dbest.BLASTN
DNA.dbest.BLASTN.parse
DNA.dbest.BLASTN.fplot
DNA.gss.BLASTN
DNA.gss.BLASTN.parse
DNA.gss.BLASTN.fplot
DNA.htgs.BLASTN
DNA.htgs.BLASTN.parse
DNA.htgs.BLASTN.fplot

The results are imported into an 'fplot' format flat-file database (a
formatted text file):

DNA.fplot

A set of HTML files is also generated:

DNA.html
DNA.Legend.html
DNA.gif.html

To view the HTML files, point a browser at DNA.html. Alternatively, download the three files to your machine, then either drag and drop DNA.html onto a browser window or open it using the 'Open file' menu.
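
Because a run can take a day, it can be handy to check which of the result files listed in this section have appeared so far. The sketch below builds the list from the file names given above. The function names are made up for this example, and the database list (nr, dbest, gss, htgs) is taken from the example run, so adjust it if your run searches other databases.

```shell
# Print the result file names described in this section for a base name.
expected_results() {
    base=$1
    for suffix in grail.mask grail.repeats grail2.exons \
                  grail2.CpG.pol2.polyA mzef.exons; do
        echo "$base.$suffix"
    done
    # Three files per BLAST database (list taken from the example run).
    for db in nr dbest gss htgs; do
        echo "$base.$db.BLASTN"
        echo "$base.$db.BLASTN.parse"
        echo "$base.$db.BLASTN.fplot"
    done
    echo "$base.fplot"
    echo "$base.html"
    echo "$base.Legend.html"
    echo "$base.gif.html"
}

# Print only the expected files that do not exist in the current directory.
missing_results() {
    expected_results "$1" | while read -r f; do
        [ -e "$f" ] || echo "$f"
    done
}
```

Running 'missing_results DNA' in the directory holding the results prints one line for each file that has not been written yet.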


3. Using the results from dna_to_fplot.

After examining the automatically generated results, you may want to add annotations, or re-parse the BLAST result to eliminate matches which aren't useful, or allow weaker matches to be viewed.

To re-parse the BLAST output, use the parse program. You may want to change the p value, HSP, or % identity cutoff defaults, or use positive or negative filtering (include or eliminate database sequences with certain keywords in the description).

After the BLAST results have been re-parsed, import them into the fplot database by using the preplot program. Delete the lines containing the old BLAST search output, and import the re-parsed output.

Annotations can be added by editing the fplot database file. Arrows, gene labels, and confirmed exons can be added. Genes can be color coded to make it clear which exons belong to which genes. The fplot help file has instructions on the fplot format and on editing the fplot file.


4. Running individual analyses separately.

To run a database search separately, or to re-run a search performed before, blast_off can be used.

To use blast_off, first make sure that the blast_off.config file contains your email address. The config file is a text file, and it can be edited with any text editor.

The '-d' option specifies the database to be searched, and the '-t' option specifies the type of BLAST search (that is, the BLAST program, whether BLASTN, or BLASTP, etc.).

Here's an example searching the DNA contained in the file 'test.fasta' against the dbEST database using BLASTN:

blast_off -dbest -tBLASTN -otest.dbest.BLASTN < test.fasta

Searches are run using the BLAST email server, and may take quite a while. When the service is being used heavily, or is having problems, a search can often take a day to finish.
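
To search the same sequence against several databases, the example command can be wrapped in a loop. This is only a sketch. The function name blast_all is made up, and it assumes that the database names seen in the dna_to_fplot output file names (nr, dbest, gss, htgs) are also accepted by the '-d' option. Only dbest is confirmed by the example above.

```shell
# Search one query file against several databases, one after another.
# Assumption: nr, gss and htgs are valid '-d' values like dbest.
blast_all() {
    query=$1
    base=${query%.*}    # test.fasta -> test
    for db in nr dbest gss htgs; do
        blast_off -d"$db" -tBLASTN -o"$base.$db.BLASTN" < "$query"
    done
}
```

Calling 'blast_all test.fasta' runs the searches in sequence and names each output file in the same style as the example (test.dbest.BLASTN and so on).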

You may want to run this program in the background, so you don't have to wait for it to finish.

In this case, the command would be:

blast_off -dbest -tBLASTN -otest.dbest.BLASTN < test.fasta &

On some versions of UNIX, you can log out and the job will continue to run, while on others it gets terminated automatically when you log out. One way to keep it from getting terminated when you log out is to use nohup, available on some systems.

In this case, the command would be:

nohup blast_off -dbest -tBLASTN -otest.dbest.BLASTN < test.fasta &
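
When nohup is used from a terminal, it normally writes any program messages to a file called nohup.out. If you would rather keep the messages from each search in their own file, the output can be redirected. The helper below is a sketch with a made-up name (run_detached), and the log file name in the usage line is just a placeholder.

```shell
# Start a command under nohup in the background, reading from an input
# file and writing all messages (stdout and stderr) to a log file.
# Prints the process ID of the background job.
run_detached() {
    log=$1
    input=$2
    shift 2
    nohup "$@" < "$input" > "$log" 2>&1 &
    echo $!
}

# Usage, with the same search as above:
#   run_detached test.dbest.log test.fasta \
#       blast_off -dbest -tBLASTN -otest.dbest.BLASTN
```

The printed process ID can be used later with ps or kill to check on or stop the search.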


5. Importing data into an analysis database.

Use preplot. This program is menu-based: select the import menu, then import GRAIL repeats, RepeatMasker repeats, other GRAIL features, MZEF exon predictions, or lineplot DNA/DNA comparisons. BLAST output is parsed and converted to fplot format by the parse program.

This BLAST fplot file can be imported into the analysis database using preplot (select 'Import fplot file'). This procedure can also be used to import the analysis database of an overlapping clone.


6. Using RepeatMasker to mask repetitive sequences.

By default, dna_to_fplot uses the GRAIL repeat service to identify repetitive sequences (repetitive elements and simple sequence repeats). If you wish, RepeatMasker can be used to mask repetitive sequences instead. I have heard reports that RepeatMasker finds some repeats missed by the GRAIL service, although the GRAIL service has proved satisfactory to me (JL). Also, if the GRAIL service is not functioning, the RepeatMasker server provides another option for identifying and masking repeats.

The RepeatMasker server can be accessed over the web, or through email. The repeat_now program provides easy access to the RepeatMasker email service and will return a list of repeats as well as the masked DNA.

To have dna_to_fplot use a masked DNA file you've prepared, place the masked DNA in a file named 'DNA_base.mask', where 'DNA_base' is the base name being used by dna_to_fplot. The masked DNA in this file will be used instead of the GRAIL repeat service.
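
The naming rule can be sketched as a small helper. The function name install_mask is made up for this example. It assumes the base name is simply the DNA file name with its extension removed (DNA.fasta gives DNA). Since dna_to_fplot may use a shortened base name, check the names of its other output files if in doubt.

```shell
# Copy a prepared masked sequence to the file name dna_to_fplot looks for.
# Assumption: base name = DNA file name without its extension.
install_mask() {
    dna_file=$1
    masked_file=$2
    base=${dna_file%.*}          # DNA.fasta -> DNA
    cp "$masked_file" "$base.mask"
    echo "$base.mask"            # report the file that was written
}
```

For example, 'install_mask DNA.fasta my_masked.txt' copies my_masked.txt (a placeholder name) to DNA.mask in the current directory.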



Updated 7/99

Written by Jim Lund in the lab of Roger Reeves, Johns Hopkins University