lineplot

Dotplots with lineplot

This file assumes you know what a dotplot is. Lineplot is designed for analyzing large genomic DNA sequences. It only keeps track of the lines along which the two sequences match, so when run with reasonable parameters it requires only a fraction of the memory of a full dot plot, which needs NM bits (N and M being the lengths of the two DNA sequences being compared). All three of these programs only work on DNA sequences. Dotplots can be done on protein sequences also, and are very useful for analyzing proteins, but these programs aren't designed for that.
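As a rough worked example of the NM-bit figure, the sketch below computes the storage a full dot plot would need at one bit per cell. The helper name full_dotplot_bytes is made up for illustration and is not part of lineplot; the 50 kb lengths are simply the sequence size this file uses in its timing note.

```python
def full_dotplot_bytes(n, m):
    """Bytes needed to store a full N x M dot plot at one bit per cell."""
    bits = n * m
    return (bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    size = full_dotplot_bytes(50_000, 50_000)
    print(size)                    # 312500000 bytes
    print(round(size / 2**20, 1))  # 298.0 (MiB)
```

So two 50 kb sequences would need roughly 300 MB for the full matrix, which is why storing only the matching lines matters.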

'lineplot' makes an averaged representation of the dot plot graph, smoothing out certain background or noise features. Its output is a graph in the form of a text file of matches and a GIF or Postscript graphic file.

You can also rerun lineplot, giving it the text output file generated previously as input, and it will redraw the graph in a few seconds. It can regraph the output at a higher '-c' value, or draw only a subset of the previous graph. You can also edit the sequence 'features' or colors in the text output file before lineplot redraws the graph, and the changes will appear in the new graph.

DNA sequence file format: This program uses a different format from the others. DNA can be bare or in FASTA format, just as with the other programs. Optionally, this program allows 'feature' description lines to be present in the DNA files. Feature lines must come after the FASTA description line, if there is one, and before the DNA sequence. They start with a ';' and then have 3 numbers, which can be followed by optional description info. The first and second numbers are the start and end of the 'feature' in the DNA. The third number is the color the 'feature' box will be drawn in on the GIF graph. There are 27 colors, 0-26. 0 is white, 1 is black, and 2-7 are primary colors. To see what all the colors look like, look at colors.gif in the help directory or on the PPC 7200 desktop. The colors go from 0-26 along both axes.

Example using different formats to show what's OK:

>c143 human cosmid
;8048,,,,8140,,,,,2,,,,lan exon 1
;11029,12126,3, Tsk 1
;15411,16487,4,Tsk 2
;18721,19000,5, Dgsi exon 10
;19495,19610,5, Dgsi exon 9
;26582 26677 5 Dgsi exon 3
;31044, 31297, 6, Gscl exon 2
;31432 31681 6, Gscl exon1
GGATCCACACAGCAAGAAGTCCTCGCTCTTTGTCTCTCAGTATAGCTAATACCCAAGGAGCAAGAGTCACCAAACA
TGAGGCCCAACCAGATGGCTCTGCTTCCTCCCTCCTCCAACTCCCATCCCTATAGAATGACTACAAGGACCTCTCC
TTTTTGCCACCTCCATTAGGGCAGCTGCAAGGCAGATGGCTAAAGGAACTACTGTCCTTTGGGTTAAACTAAAACA
+more DNA...
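If you generate or check feature lines with a script, the layout shown in the example can be read roughly as follows. This is only a sketch in Python, not lineplot's own parser: the function name parse_feature and the rule that any mix of commas and spaces separates the three numbers are assumptions inferred from the example lines.

```python
import re

# Assumed layout: ';' then three integers separated by any mix of commas
# and whitespace, then an optional free-text description.
FEATURE_RE = re.compile(r"^;\s*(\d+)[\s,]+(\d+)[\s,]+(\d+)[\s,]*(.*)$")


def parse_feature(line):
    """Return (start, end, color, description), or None if not a feature line."""
    match = FEATURE_RE.match(line.strip())
    if match is None:
        return None
    start, end, color = (int(match.group(i)) for i in (1, 2, 3))
    if not 0 <= color <= 26:  # the file documents 27 colors, numbered 0-26
        raise ValueError("color must be in 0-26, got %d" % color)
    return start, end, color, match.group(4).strip()


if __name__ == "__main__":
    print(parse_feature(";8048,,,,8140,,,,,2,,,,lan exon 1"))
    # (8048, 8140, 2, 'lan exon 1')
    print(parse_feature(";31432 31681 6, Gscl exon1"))
    # (31432, 31681, 6, 'Gscl exon1')
```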

The 'features' are written in the text output file and are used in writing the graphic file when the program is run using the '-a' option. The 'features' in the text output file can be edited before running lineplot on it, but don't add or remove any extra blank lines or mess with the file format otherwise.


lineplot

This C program does a dotplot comparison of two DNA sequences, comparing them by examining a stretch of length 'window' at a time, and recording a hit if they are at least 'cutoff' percent identical for 'line_min' bases.
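To make the window, cutoff and minimum-length idea concrete, here is a small Python sketch. It is not the program's actual code: it looks only at the main diagonal (a real dotplot repeats this for every diagonal), it uses 0-based positions, and the function names and the rule for joining neighboring windows into one line are assumptions made for illustration.

```python
def window_identities(seq1, seq2, window):
    """Percent identity of every window along the main diagonal."""
    n = min(len(seq1), len(seq2))
    if window <= 0 or n < window:
        return []
    matches = [1 if seq1[i] == seq2[i] else 0 for i in range(n)]
    count = sum(matches[:window])
    result = [100.0 * count / window]
    for start in range(1, n - window + 1):
        # Slide the window by one base: add the new base, drop the old one.
        count += matches[start + window - 1] - matches[start - 1]
        result.append(100.0 * count / window)
    return result


def matching_lines(seq1, seq2, window, cutoff, min_len):
    """Join runs of consecutive passing windows into (start, end) lines.

    Positions are 0-based and 'end' is exclusive. Each line spans the
    full length of the windows in its run.
    """
    lines = []
    run_start = None
    identities = window_identities(seq1, seq2, window)
    # The trailing -1.0 is a sentinel that closes a run reaching the end.
    for start, identity in enumerate(identities + [-1.0]):
        if identity >= cutoff:
            if run_start is None:
                run_start = start
        elif run_start is not None:
            end = (start - 1) + window
            if end - run_start >= min_len:
                lines.append((run_start, end))
            run_start = None
    return lines


if __name__ == "__main__":
    print(matching_lines("ACGTACGTAC", "ACGTTTTTAC", 4, 75, 4))
    # [(0, 5), (6, 10)]
```

In the example, the first line covers positions 0 to 4 even though position 4 is a mismatch, because it belongs to a window that is still at or above the cutoff.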

There are two modes of use--computing the lineplot, or redrawing the GIF.

Usage, Mode 1:

lineplot dna1 dna2 -w# -s# -c# [-gFILE_NAME.gif -tFILE_PS.ps -b -fHORZ,VERT -p >matches.file]

nohup lineplot dna1 dna2 -w# -s# -c# [-gFILE_NAME.gif -tFILE_PS.ps -b -fHORZ,VERT -p >matches.file] &

Usage, Mode 2:

lineplot -a -gFILE_NAME.gif [-b -c# -fHORZ,VERT -lseq1st,seq1end,seq2st,seq2end] <INPUT_FILE

OR

lineplot -a -tFILE_PS.ps [-b -c# -fHORZ,VERT -lseq1st,seq1end,seq2st,seq2end] <INPUT_FILE

-w# specifies the window over which identity is computed. Window length should be >10 bp. I use up to 100 bp. Bigger window sizes smooth the resulting dotplot, but with a big window you run the risk of missing a small but significant feature, like a small exon.

-s# specifies the minimum length a stretch of identity must have to be included in the output (text and GIF). This can be used as a filter to remove short, and presumably insignificant, matches from the output, allowing the longer matches to be seen more clearly.

-c# is the cutoff--the % identity the sequences must reach over the window of comparison to be considered matching. Must be from 0 to 100. The cutoff needs to be high enough to reduce the background of random matches. 50-70% is a good minimum. Most conserved mouse and human exons show up using a 67% cutoff.

-p print a text file of the matches to standard output. Usually, you would want to redirect it into a file (e.g., >file.text.output).

-gFILE_NAME.gif graph the output and save as a GIF format picture with the name FILE_NAME.gif

-tFILE_NAME.ps graph the output and save as a Postscript format picture with the name FILE_NAME.ps

-b print a projection of the homology lines on the GIF graph along the axes. They will be printed as black boxes inside the gridlines.

-a specifies mode 2. lineplot is given the text output file from a previous run of the program on standard input, and the output is a GIF or Postscript graph.

-lseq1st,seq1end,seq2st,seq2end This is an optional parameter that magnifies the original lineplot. Only the part of the original sequence from seq1st to seq1end in DNA seq. 1, and from seq2st to seq2end in DNA seq. 2 is graphed. Think of the 4 numbers as defining a square on the original graph, and magnifying it. The second number doesn't have to be specified as a number; you can say end and the program will fill in the number, e.g. -l1000,end,1,3000.

-fHORZ,VERT This optional parameter specifies the size of the GIF or Postscript picture. Default is 800x1000. You can make it bigger if you want. This will smooth out the lines on the graph, but can be a pain to work with.

<INPUT_FILE INPUT_FILE is the text file that lineplot writes as output. The < is the UNIX input redirection operator. This argument must come last.
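For scripts that build '-l' arguments, the four-number region described under '-l' can be checked with a sketch like the one below. It is written in Python for illustration and does not show how lineplot itself parses the option. The name parse_region is hypothetical, and it assumes that 'end' is accepted only as the second number and stands for the length of DNA seq. 1.

```python
def parse_region(spec, seq1_length):
    """Turn 'seq1st,seq1end,seq2st,seq2end' into four integers.

    Only the second field may be the word 'end' (assumed here to mean
    the length of DNA sequence 1).
    """
    fields = [field.strip() for field in spec.split(",")]
    if len(fields) != 4:
        raise ValueError("expected seq1st,seq1end,seq2st,seq2end")
    if fields[1].lower() == "end":
        fields[1] = str(seq1_length)
    seq1st, seq1end, seq2st, seq2end = (int(field) for field in fields)
    if seq1st > seq1end or seq2st > seq2end:
        raise ValueError("a start position must not exceed its end position")
    return seq1st, seq1end, seq2st, seq2end


if __name__ == "__main__":
    print(parse_region("1000,end,1,3000", 50000))
    # (1000, 50000, 1, 3000)
```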

Usage, Mode 1--computing a lineplot:

The first two options are the two DNA sequences to be compared. They must come first, before the other command line switches.

The program takes over an hour to run when comparing two 50 kb DNA sequences on a multi-user Sun, so running it using 'nohup' and in the background '&' is a good idea.

If the '-p' switch is set, the results are printed to standard output at the indicated cutoff level of identity and for increasing identity, five percent at a time until there are no more lines of identity (or 100% identity is reached). This is useful for large DNA sequences, as after the original run you can rerun it using the '-a' parameter to redraw the graph and examine a blow up, or add new known features.

If the '-gFILE_NAME.gif' or '-tFILE_PS.ps' switch is set, the lines at the given level of homology are printed as a graphic file with the given name. Both '-gFILE_NAME.gif' and '-tFILE_PS.ps' can be used at the same time. The lines printed extend over the entire length of the windows giving the match, which means that the line ends may extend into bases that don't match but are part of a window above the cutoff. This can be seen most clearly with short sequences, like a mouse and human cDNA being compared using a large (100 bp) window. If the '-b' flag is set, projections of the homology lines will be printed inside the gridlines. This allows you to compare the known features to the lines on the graph.

Features indicated in the sequence files will be drawn on the graph outside the gridlines in the indicated colors. The colors are numbered 0-26.

Example 1: lineplot dna1 dna2 -w50 -s100 -c67 -gdna1.2.gif

Example 2: lineplot dna1 dna2 -w50 -s100 -c67 -p -gdna1.2.gif >out.dna1.dna2

Example 3: lineplot dna1 dna2 -w50 -s100 -c67 -p -f1500,2000 -gdna1.2.gif >out.dna1.dna2

Example 4: nohup lineplot dna1 dna2 -w50 -s100 -c67 -p -gdna1.2.gif >out.dna1.dna2 &

Example 1 does a dotplot on the DNA in files dna1 and dna2 using a window of 50 bp and reporting matches of 67% or greater identity for at least 100 bp.

The output is a GIF file, dna1.2.gif.

Example 2 does the same, but prints a text file of the matches to the file out.dna1.dna2.

Example 3 is the same as example 2, but makes the GIF 1500x2000 pixels in size, instead of the default 800x1000.

Example 4 is the same as example 2, but is run in the background and will stay running after you log out, which is useful if you are comparing long sequences (more than a few kb each).

Usage, Mode 2--redrawing the GIF graph:

In this mode, the program reads in the text output file from lineplot, and redraws the graphic. It will graph the output of lineplot at any of the levels of identity given in the text output file (the original level of identity specified to lineplot using the '-c' parameter, and at 5% higher increments up to 100%, always including 100%). Additionally, the '-l' parameter can be used to indicate that only a portion of the original sequences be redrawn, giving a more detailed graph of a portion of the matching sequences. The '-fHORZ,VERT' parameter can also be used, just as in mode 1, to change the size of the GIF file.
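To see which '-c' values are candidates for a redraw, the levels described above can be listed with a short Python sketch. The function name identity_levels is made up for illustration. Note also that the text file may stop before 100% if no lines of identity remain at the higher levels.

```python
def identity_levels(cutoff, step=5):
    """Levels written for an original cutoff: cutoff, cutoff+5, ..., 100."""
    if not 0 <= cutoff <= 100:
        raise ValueError("cutoff must be from 0 to 100")
    levels = list(range(cutoff, 101, step))
    if levels[-1] != 100:
        levels.append(100)  # 100% is always included
    return levels


if __name__ == "__main__":
    print(identity_levels(67))
    # [67, 72, 77, 82, 87, 92, 97, 100]
```

With an original run at -c67, values such as -c72 and -c82 fall on these levels.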

The input file should be modified only to add or change the 'feature' lines. Otherwise it should be exactly as written by lineplot. Run lineplot, and then if you want a graph plotted at a higher level of homology, or a magnified view of part of the sequence, run it again using the '-a' switch.

Example 1: lineplot -a -c72 -gDNA1.DNA2.72.gif <output.dna1.dna2

Example 2: lineplot -a -c82 -l1,1000,2500,4000 -gDNA1.DNA2.82.gif -b <output.dna1.dna2

Example 3: lineplot -a -c82 -l1,1000,2500,4000 -f2000,3000 -gDNA1.DNA2.82.gif -tdna1.dna2.82.ps -b <output.dna1.dna2

Example 2 redraws the lineplot at 82% identity using the matches in the file output.dna1.dna2. Only the portion of the graph from bp 1-1000 in DNA1 and from 2500-4000 in DNA2 is regraphed, with the new GIF file being named DNA1.DNA2.82.gif. The '-b' flag is set, so projections of the graph lines will be printed along the axes.

Example 3 is the same as example 2, except that the GIF file is 2000x3000 pixels and a postscript file is also written.

Written by Jim Lund in the lab of Roger Reeves, Johns Hopkins University