fplot help file
This is the fplot help file.

fplot is a Perl 5 program that reads a text file of sequence features from
standard input and writes to standard output a PostScript file containing a
graph of the sequence features drawn proportionally along the sequence. The
program's #! line assumes Perl 5 is installed at /usr/local/bin/perl.

11/25/97
Written by Jim Lund, jiml@stanford.edu

Contents
1. Fplot usage and options
2. Formatting the feature text file
3. Generating the feature text file
4. Manipulating the feature text file
5. Feature item colors
6. Feature item types
7. Adding new feature types
8. Two example feature text files


1. Fplot usage and options

fplot

This Perl 5 program reads a text file of sequence features from standard
input, and writes either a PostScript file to standard output or a set of
files which make up an HTML page. The output is a graph of sequence features
drawn proportionally along the sequence.

Usage:
fplot [-a# -b# -c# -d#,# -fFONT -hHtml.base.name -i -l -m#,# -n# -s#,# -z#] < FEATURE_FILE > plot.ps

Switches:
-a#     Set the number of points each plotted feature line will take up.
        Default is 8.
-b#     Set the height of features drawn on each feature line. Default is 4;
        should be 1/2 or less of 'points per line' (-a#) to keep features on
        adjacent lines from overlapping.
-c#     Set label space, the room reserved on the plot's right side for the
        text labels to fit into. If you use less of the space for labels,
        more is available for the plot. Default is 150 points.
-d#,#   Draw features of only part of the DNA sequence in the text file, the
        part in the specified range (-dSTART_BP,END_BP). The end base pair
        can be indicated by number or as 'end', capitalized or not.
-fFONT  Set the font in which text is printed. Default is Times-Roman, a
        PostScript font on most printers.
-hHtml.base.name
        HTML output is generated. A set of three or four files is made; all
        file names start with Html.base.name. The image appears in one
        window, and the legend in a second window.
        Putting the mouse over matches pops up the relevant feature info in
        the other frame. Works only on Netscape browsers.
-i      HTML output will be displayed in one browser window, with the legend
        in one frame and the image in the other.
-l      Print the plot in landscape orientation.
-m#,#   Set page margins (-mX_MARGIN,Y_MARGIN). Default margins are 36
        points.
-n#     Set the number of base pairs to be plotted per line. Default is the
        entire sequence on a single line.
-z#     Set font size. Default is 9 point, which looks good with lines spaced
-s#,#   Set page size in points (-sX_SIZE,Y_SIZE). Defaults are 612 points
        wide by 792 points high.
-q      Print out help page.

Typically, options -n, -l, and -d are the most used. The other options are
used more rarely, when you want to fiddle with how the image looks. If you
want the other parameters to be different all the time, I recommend changing
the defaults where they are defined, in the global variables section of the
program.

Example:
fplot < dna.fplot > dna.ps
        dna.fplot is a text file formatted to be read by fplot. The image is
        written to the file dna.ps.

fplot -n10000 -l -d50001,200000 < dna.fplot > dna.ps
        The image is in landscape orientation, with 10000 bp/line scaling.
        The part of the DNA sequence from bp 50001 to bp 200000 is plotted.

fplot -hDNA.summary < dna.fplot


2. Formatting the feature text file

The first token is ">", which starts the sequence name line. The second token
is "Length:" followed by a number, the length of the DNA sequence.

Next come the sections for each plot feature line, one per feature line. A
section starts with the token "Line:" followed by a description of what's on
the line. The description gets written on the plot to the right of the
feature line.

Optionally, a "Fields:" token can appear, followed by a comma-separated list
of field descriptors. The "Fields:" token is used when HTML is output, to
give titles to the feature info fields in the HTML legend. "Fields:" tokens
can appear anywhere after the line with the "Line:" token. More than one
"Fields:" line can appear in a plotted line.
The following plotted line will use the last "Fields:" declaration of the
previous line, until a new "Fields:" declaration is made.

Individual feature items come next. A feature item line begins with a
semicolon, ";", followed by 4 numbers and, optionally, comma-separated info
fields. The numbers are feature start, feature stop, color, and feature type.
Feature types 13 and 14 (line and block height) take a parameter. This must
follow the color number, and range from 0 to 1.0. Numbers greater than 1.0
will be drawn, but they will extend over the normal feature height. This is a
feature. Color is a number from 0 to 27. Feature types currently range from 0
to 26. The 4 numbers can be separated by any non-number character (except f,
F, r, or R, which indicate exon type features), and after the 4 numbers
anything can be written (I usually have a description of the feature). See
more about color and feature types below.

After a section begins (with the "Line:" token), the feature item lines that
follow are drawn on the current plot feature line. When fplot encounters a
new "Line:" token, a new plot feature line begins. Lines that don't begin
with a token are ignored by the program. A "Line:" token without a
description means that no axis will be drawn for this plotting line.

Features that come later are drawn over earlier ones. Keep this in mind. It
has its uses; in plotting dot plot results, I first list the low stringency
matches in the feature text file in one color, and then the smaller regions
of better matches come afterward and get drawn over part of the weak match.
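
The token rules described above can be sketched as a small reader. This is a
minimal illustration in Python, not fplot's own Perl code. It assumes each
token starts its own line, it skips "Fields:" lines, and it does not handle
the extra height parameter of feature types 13 and 14. The function name
parse_feature_text and the dictionary layout are hypothetical.

```python
import re

# A feature item: ';' then start, stop, color, and type. Separators may be any
# non-digit except f/F/r/R, which stand for the exon feature types.
ITEM_RE = re.compile(
    r"^;(\d+)[^0-9FfRr]+(\d+)[^0-9FfRr]+(\d+)[^0-9FfRr]+(\d+|[FfRr])(.*)$"
)
STRAND_TYPES = {"F": 1, "R": 2}  # F/f and R/r are aliases for types 1 and 2


def parse_feature_text(text):
    """Return the sequence name, length, and plot lines found in `text`."""
    plot = {"name": None, "length": None, "lines": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            plot["name"] = line[1:].strip()
        elif line.startswith("Length:"):
            plot["length"] = int(line[len("Length:"):])
        elif line.startswith("Line:"):
            description = line[len("Line:"):].strip()
            plot["lines"].append({"description": description, "items": []})
        elif line.startswith(";") and plot["lines"]:
            match = ITEM_RE.match(line)
            if match is None:
                continue  # malformed item; simply skipped in this sketch
            start, stop, color, ftype, note = match.groups()
            if ftype.isalpha():
                type_number = STRAND_TYPES[ftype.upper()]
            else:
                type_number = int(ftype)
            plot["lines"][-1]["items"].append({
                "start": int(start),
                "stop": int(stop),
                "color": int(color),
                "type": type_number,
                "note": note.lstrip(", "),
            })
        # "Fields:" lines and lines without a token are ignored here.
    return plot


sample = """\
>BAC bD3-6 sequence analysis
Length:127359
Line: Genes, ESTs, and features from Genbank search
;13541,14040,1,0,- matches MHC II IE intron 1 at 89%
;21888,22150,16,F GRAIL 2 excellent exon, Kelch exon 1, T1
31578,31645,4,1, no leading semicolon, so this row is ignored
"""
plot = parse_feature_text(sample)
print(plot["name"], plot["length"], len(plot["lines"][0]["items"]))
```

A reader like this is handy for checking a feature text file before plotting,
for example to count how many items each "Line:" section will draw.
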
-------------------Start example feature text file------------------------
>DGS cosmids 103a2 to 24b, bp 40001-80000
Length:1110702
This version has the numerous poor DGS-F matches deleted.

The 1110702 bp query DNA sequence: 103a2
Searched against the database: DGSdb9-26-97
Matches with a HSP score of at least 150 are reported.
Matches separated by 20 or less bps are listed as a single longer match.

    Query        Database     Seq.  Percent
Start   Stop   Start   Stop   Len.  Identity  Description
Line:BLAST of 9-26-97 DGS db
Fields: Start bp in DB seq, Stop bp in DB seq, Length of DB seq., Percent Identical bl in match, Description
-------------------------------------------------------------------------------
;4139,4241,2,1, 104 2 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;4240,4281,2,1, 94 135 1284 88% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;21340,21509,2,1, 108 277 1284 92% X91348:H.sapiens predicted non cod
...and so on
-------------------End example feature text file--------------------------


3. Generating the feature text file

This program makes a plot of features in a DNA sequence (or any other
proportional plotting needs you can think up, really). The sequence analysis
is done with other programs, and the output from these programs needs to be
combined and formatted so fplot can parse it.

How to do this? The output from the analysis programs can be copied and
concatenated into one file using pretty much any word processor. The header
and section headings can be added by hand easily. Getting the feature item
lines formatted is the only hard part. This can be done in several ways.

The UNIX text editor vi (in line editing mode) does the job well if you are
familiar with regular expressions, but is hard to learn to use. I use vi, and
do a search and replace on all the feature item lines from a particular
source at once, the replacement making the sequence position numbers come
first, then a color and feature type.
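
The same kind of search and replace can be scripted instead of typed into vi.
The sketch below uses Python rather than vi, and it assumes a hypothetical
input layout: whitespace-separated rows whose first two columns are the start
and stop positions. Real analysis output will differ, so the pattern would
need adjusting. The name to_feature_items is made up for this example.

```python
import re

# Assumed row layout (hypothetical): optional leading blanks, then the start
# and stop positions as the first two whitespace-separated columns.
ROW_RE = re.compile(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+", re.MULTILINE)


def to_feature_items(table_text, color, feature_type):
    """Rewrite each data row so it begins with ';start,stop,color,type, '."""
    replacement = rf";\g<1>,\g<2>,{color},{feature_type}, "
    return ROW_RE.sub(replacement, table_text)


rows = """\
4139 4241 104 2 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
Start Stop header text does not begin with two numbers, so it is left alone
"""
print(to_feature_items(rows, color=2, feature_type=1))
```

Rows that do not start with two numbers are left unchanged, so header and
comment lines pass through and are later ignored by fplot because they do not
begin with a token.
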
Alternatively, you can try using the search and replace capabilities of your
favorite word processor. I don't know of a good way to do this in every case;
these programs often have trouble recognizing the beginning of lines.

Doing it by hand is always an option, but for large, feature-rich sequences
this can be time consuming, and makes fplot harder to use. The easier it is
to format other programs' output for fplot, the more useful the program is,
so give some thought to optimizing this step if you expect to make a lot of
use of fplot. Learning vi is a *good thing* in any case. :)

Keep an eye on the output analysis programs generate. GRAIL, for example,
indicates reverse strand exons as end_of_exon_bp start_of_exon_bp, and this
order needs to be reversed for fplot to draw the exons.


4. Manipulating the feature text file

Simple changes in the feature text file can be used to refine the plot that
gets drawn.

Two plotted feature lines can be combined into one by removing the "Line:"
token on the second one. The different types of information can be
discriminated by using different feature types.

In a section representing database matches, there may be spurious matches
that clutter up the plot. They can be removed from the drawn image by
deleting the semicolon at the beginning of their lines. This is preferable to
just deleting them, as you may want them drawn in when presenting the data
for another purpose or doing other analysis, and deleting the token is less
permanent.


5. Feature item colors

There are 28 colors recognized by the program. Color is indicated in the
feature text file by a number from 0 to 27. Colors 2-7 are the primary
colors.

 0  White
 1  Black
 2  Red
 3  Yellow
 4  Green
 5  Light Blue
 6  Blue
 7  Fuchsia
 8  Maroon
 9  Forest green
10  Olive
11  Orange
12  Spring green
13  Navy
14  Royal purple
15  Hot pink
16  Gray-blue
17  Gray
18  Peach
19  Sea green
20  Pale green
21  Pale yellow
22  Purple
23  Teal blue
24  Gray purple
25  Pink
26  Baby blue
27  Black


6. Feature item types

I use a few conventions when planning what sequence feature gets paired with
what symbol. Generally, features that are drawn using a symbol centered on
the line are strand neutral, or strand independent, items, such as repetitive
DNA sequence. Items drawn above the line are forward strand items; items
hanging from the line are reverse strand items.

Feature
number  Description                         Used for:
-----------------------------------------------------------------------------
0       Strand neutral box                  Strand neutral feature
1,F,f   Forward strand box                  Exon, forward strand
2,R,r   Reverse strand box                  Exon, reverse strand
3       Strand neutral 1/2 height box
4       Forward strand 1/2 height box
5       Reverse strand 1/2 height box
6       Forward strand caret                GRAIL poly A site forward strand
7       Reverse strand caret                GRAIL poly A site reverse strand
8       Triangle forward strand             GRAIL polII promoter forward strand
9       Triangle reverse strand             GRAIL polII promoter reverse strand
10      Arc                                 GRAIL CpG island
11      Tick mark forward strand            Restriction enzyme site, ??
12      Tick mark reverse strand
13      Height bar                          Lineplot match, percent repetitive
14      Height block                        Lineplot match, percent repetitive
15      Strand neutral dotted line type 1
16      Forward strand dotted line type 1   Connect 'exons' in BLAST results
17      Reverse strand dotted line type 1   Connect 'exons' in BLAST results
18      Strand neutral dotted line type 2
19      Forward strand dotted line type 2
20      Reverse strand dotted line type 2
21      Arrow type 1                        thick arrow on the line pointing right
22      Arrow type 2                        thick arrow on the line pointing left
23      Arrow type 3                        arrow on the line pointing right
24      Arrow type 4                        arrow on the line pointing left
25      Arrow type 5                        forward strand
26      Arrow type 6                        reverse strand
27      Small text centered
28      Small text left justified
29      Small text right justified
30      Large text centered
31      Large text left justified
32      Large text right justified
33      Giant text centered
34      Giant text left justified
35      Giant text right justified


7. Adding new feature types

If you know a little Perl programming and a little PostScript programming,
you can add new feature types to the program. Here are directions on doing
so:

1. In the feature subroutine, in the line
   "if (($shape > 26) || ($shape < 0))"
   increase 26 by one to allow your new feature type to be recognized.

2. Copy the elsif section for an existing feature type, and insert it after
   the last feature. For example, copy:

   elsif ($shape == 6)
       {printf("gsave %f %f %f setrgbcolor %d %d %d %d Tri grestore\n",$color_r
       ,$color_g,$color_b,$x2+1,$y_pos,$x1-1,(($y_pos+($rect_height/2))+1));
       last FEATURE_SW;
       }

   and insert it after the last feature block (feature 26 right now).

3. Give it a new number; for example, the next feature would be 27. Change
   "$shape == 6" to "$shape == 27".

4. Each feature block writes PostScript code to standard output to draw one
   feature. Use the variables $color_r, $color_g, and $color_b to set the
   color. $x1 is the position of the starting bp, $x2 is the position of the
   end bp, and $y_pos is the position of the feature plot line. Your feature
   should stay between $y_pos+($rect_height/2) and $y_pos-($rect_height/2) to
   keep from bumping into the neighboring feature lines. Remember, a feature
   sticking up from the line has y coordinate values less than $y_pos.

5. You can use the existing PostScript functions I've written in the program:

   "Tri" draws a solid right triangle given the coordinates of the
   hypotenuse. The first point is the one on the feature line, the second
   point is the one that sticks up (or down). It is called by
   "x1 y1 x2 y2 Tri".

   "Rec" draws a solid rectangle given the coordinates of two corners. It is
   called by "x1 y1 x2 y2 Rec".

   "Tic" draws a line centered on the feature line given the center point and
   the number of points it extends in each direction. It is called by
   "x1 y1 y2 Tic". x1, y1 is the center; y2 is added to and subtracted from
   y1 to give the extension.

6. Put any new variables in the variable section of the FEATURE subroutine.
   Put new PostScript variables or functions in the PostScript header
   section.

7. That's it. If you add something, please email it to me so I can see!

Things I've thought of adding but haven't: open rectangle, open triangle,
striped box, horizontal arrows (open or closed), vertical arrows, thinner or
thicker boxes, or tick markers.


8. Two example feature text files

In the first example file:

Note the FASTA sequence name, "DGS cosmids 103a2 to 24b, bp 40001-80000", the
"Length:" token, and the first feature plot line, whose description is
"BLAST of 9-26-97 DGS db".

The first feature item line starts ";4139,4241,2,1, 104 2 1284 100% X91...",
and will draw a red box on top of the line from bp 4139 to 4241 to indicate
an exon of DGCR5.

Note that this blast match,
"31578,31645,4,1, 1442 1509 2309 75% L77571:Homo sapiens DGS-A mRNA, 3' end."
will not appear in the plot because it doesn't have a ";" token at the
beginning of the line.

For what it's worth, this blast match table was generated from the blast
output by another short Perl program, parse. Then the semicolons and the
color and feature type numbers were added using a vi search and replace
command.

-----------------Start example feature file-------------------------------
>DGS cosmids 103a2 to 24b, bp 40001-80000
Length:1110702
This version has the numerous poor DGS-F matches deleted.

The 1110702 bp query DNA sequence: 103a2
Searched against the database: DGSdb9-26-97
Matches with a HSP score of at least 150 are reported.
Matches separated by 20 or less bps are listed as a single longer match.

    Query        Database     Seq.  Percent
Start   Stop   Start   Stop   Len.  Identity  Description
Line:BLAST of 9-26-97 DGS db
-------------------------------------------------------------------------------
;4139,4241,2,1, 104 2 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;4240,4281,2,1, 94 135 1284 88% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;21340,21509,2,1, 108 277 1284 92% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;22244,22320,2,1, 184 108 1284 74% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;24234,24441,2,1, 276 483 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;25328,25534,2,1, 477 683 1284 99% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;26096,26338,3,2, 245 2 245 94% U84528:Human velo-cardio-facial syndrome 22q11 region mRNA sequence.
;30729,31540,4,1, 595 1407 2309 88% L77571:Homo sapiens DGS-A mRNA, 3' end.
31578,31645,4,1, 1442 1509 2309 75% L77571:Homo sapiens DGS-A mRNA, 3' end.
;31830,31875,4,1, 1655 1700 2309 89% L77571:Homo sapiens DGS-A mRNA, 3' end.
;47869,48025,2,1, 682 838 1284 99% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;51454,53762,4,1, 1 2309 2309 99% L77571:Homo sapiens DGS-A mRNA, 3' end.
;56104,56550,5,1, 1 447 447 99% L77559:Homo sapiens DGS-B partial mRNA.
;64092,64544,2,1, 833 1284 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;69599,70081,6,2 4398 3916 4398 97% X84076:H.sapiens mRNA for DGCR2.
;69616,69752,6,2 3987 3836 3999 72% D78641:Mouse mRNA for Membrane Glyc
-----------------End example feature file---------------------------------

-----------------Start example feature file #2-------------------------------
>BAC bD3-6 sequence analysis
Length:127359
Line: Genes, ESTs, and features from Genbank search
;13541,14040,1,0,- matches MHC II IE intron 1 at 89%
;21888,22150,16,F GRAIL 2 excellent exon, Kelch exon 1, T1
;29207,29372,16,F GRAIL 2 excellent exon, Kelch exon 2
;33905,34693,16,F GRAIL 2 excellent exon, Kelch exon 3 +ESTs
;35955,36318,3,F Mouse EST MUSF076A, T2
;41708,41900,16,F GRAIL 2 excellent exon, Kelch exon 4 +EST
;46644,46927,16,F GRAIL 2 excellent exon, Kelch exon 5 +ESTs
;49987,50246,16,F GRAIL 2 excellent exon, Kelch exon 6 +ESTs
;54985,55377,11,R 2 mouse ESTS, T3
;57032,57490,26,F Mouse EST W53987, T4
;60011,60112,22,F GRAIL 2 excellent exon, KIAA0149 match 1: 60013-111
;60271,60786,22,F KIAA0149 match 2
;60877,61095,22,F GRAIL 2 excellent exon, KIAA0149 match 3: 60878-1096
;61227,61358,22,F KIAA0149 match 4
;61801,61906,22,F KIAA0149 match 5
;62304,62419,22,F GRAIL 2 excellent exon, T5 exon 1 bp 62338-62419 T5 matches 5 ESTs
;63846,63998,22,F GRAIL 2 excellent exon, T5 exon 2 KIAA0149 match 6 & 7: 63845-901, 63961-991
;64245,64677,22,F T5 exon 3
;65228,65795,7,R 11 mouse ESTs, T6
;72742,73329,1,0,- ZNF74-1 ZN finger protein homology 76-82%
;97847,100269,8,R, DGCR2 exon 10, T7
;101552,101460,8,R, GRAIL 2 excellent exon, DGCR2 exon 9 bp 101456 101699
;102623,102471,8,R GRAIL 2 excellent exon, DGCR2 exon 8 bp 102471 102625
;107356,107080,8,R GRAIL 2 excellent exon, DGCR2 exon 7 bp 107097 107284
;114237,114404,8,R DGCR2 exon 6
;114826,114750,8,R GRAIL 2 excellent exon, DGCR2 exon 5 bp 114749 114826
;116028,115857,8,R GRAIL 2 excellent exon, DGCR2 exon 4 bp 115856 116067
;117023,116901,8,R GRAIL 2 excellent exon, DGCR2 exon 3 bp 116900 117026
Line:Dotplot vs. Human seq. using a 100bp window
Dotplot vs. Human cosmid 103a2 to U30597
67% is yellow, 77% is cyan, 87%+ is purple

The first DNA seq. is 127359 bp, in file /disk2/people/jiml/s/bD3-6.mask2.a:
-- >mouse bac bD3-6 00061
Comparing the DNA seqs. using a 100 bp window, returning regions of 67% or
greater homology sustained over at least 100 bp.

   DNA seq. 1      DNA seq. 2
 Line    Line    Line    Line    Match
 Start   End     Start   End     Length
---------------------------------------------
;91051,91203,3,0, 62835 62987 153
;91106,91205,3,0, 62890 62989 100
;91160,91260,3,0, 62945 63045 101
;91163,91314,3,0, 62948 63099 152
;91217,91317,3,0, 63002 63102 101
;91265,91403,3,0, 63045 63183 139
;91405,91577,3,0, 63182 63354 173
;91548,91650,3,0, 63331 63433 103
;91553,91779,3,0, 63336 63562 227
;92928,93047,3,0, 65300 65419 120
;93102,93232,3,0, 65493 65623 131
;94029,94132,3,0, 66162 66265 104
;94040,94144,3,0, 66173 66277 105
;94052,94152,3,0, 66185 66285 101
;94266,94365,3,0, 66429 66528 100
;94271,94372,3,0, 66434 66535 102
;94744,94953,3,0, 67063 67272 210
;98944,99065,3,0, 70940 71061 122
;99974,100304,3,0, 72146 72476 331
;101424,101765,3,0, 74337 74678 342
;102390,102663,3,0, 75041 75314 274
;105612,105776,3,0, 80024 80188 165
;107070,107323,3,0, 81740 81993 254
;114202,114361,3,0, 90254 90413 160
;114296,114446,3,0, 90347 90497 151
;114709,114870,3,0, 96465 96626 162
;115808,116146,3,0, 98103 98441 339
;116851,117034,3,0, 101352 101535 184

Regions of 72% or greater homology:
---------------------------------------------
;91067,91176,3,0, 62851 62960 110
;91083,91182,3,0, 62867 62966 100
;91085,91188,3,0, 62869 62972 104
;91178,91307,3,0, 62963 63092 130
;91278,91397,3,0, 63058 63177 120
;91411,91570,3,0, 63188 63347 160
;91563,91662,3,0, 63346 63445 100
;91573,91672,3,0, 63356 63455 100
;91610,91710,3,0, 63393 63493 101
;91620,91722,3,0, 63403 63505 103
;91625,91732,3,0, 63408 63515 108
;91638,91740,3,0, 63421 63523 103
;91643,91761,3,0, 63426 63544 119
;92933,93039,3,0, 65305 65411 107
;93118,93220,3,0, 65509 65611 103
;94753,94942,3,0, 67072 67261 190
;94845,94944,3,0, 67164 67263 100
;94847,94946,3,0, 67166 67265 100
;99980,100296,3,0, 72152 72468 317
;101430,101757,3,0, 74343 74670 328
;102399,102657,3,0, 75050 75308 259
;105628,105732,3,0, 80040 80144 105
;107076,107315,3,0, 81746 81985 240
;114208,114354,3,0, 90260 90406 147
;114302,114402,3,0, 90353 90453 101
;114309,114434,3,0, 90360 90485 126
;114717,114860,3,0, 96473 96616 144
;114764,114863,3,0, 96520 96619 100
;115815,116141,3,0, 98110 98436 327
;116859,117026,3,0, 101360 101527 168

Regions of 77% or greater homology:
---------------------------------------------
;91419,91551,5,0 63196 63328 133
;94777,94876,5,0 67096 67195 100
;94779,94883,5,0 67098 67202 105
;94802,94902,5,0 67121 67221 101
;94813,94912,5,0 67132 67231 100
;94821,94936,5,0 67140 67255 116
;99994,100094,5,0 72166 72266 101
;99998,100193,5,0 72170 72365 196
;100096,100219,5,0 72268 72391 124
;100129,100282,5,0 72301 72454 154
;100187,100286,5,0 72359 72458 100
;101442,101732,5,0 74355 74645 291
;102408,102650,5,0 75059 75301 243
;107083,107309,5,0 81753 81979 227
;114218,114347,5,0 90270 90399 130
;114725,114852,5,0 96481 96608 128
;115822,116134,5,0 98117 98429 313
;116867,117017,5,0 101368 101518 151

Regions of 82% or greater homology:
---------------------------------------------
;91425,91538,5,0 63202 63315 114
;91442,91541,5,0 63219 63318 100
;100033,100154,5,0 72205 72326 122
;100158,100271,5,0 72330 72443 114
;101450,101556,5,0 74363 74469 107
;101486,101724,5,0 74399 74637 239
;102416,102644,5,0 75067 75295 229
;107092,107303,5,0 81762 81973 212
;114226,114339,5,0 90278 90391 114
;114732,114833,5,0 96488 96589 102
;114741,114840,5,0 96497 96596 100
;114745,114844,5,0 96501 96600 100
;115831,116124,5,0 98126 98419 294
;116873,117007,5,0 101374 101508 135

Regions of 87% or greater homology:
---------------------------------------------
;101516,101616,14,0 74429 74529 101
;101519,101619,14,0 74432 74532 101
;101522,101704,14,0 74435 74617 183
;102435,102534,14,0 75086 75185 100
;102438,102632,14,0 75089 75283 195
;102536,102636,14,0 75187 75287 101
;107097,107198,14,0 81767 81868 102
;107101,107206,14,0 81771 81876 106
;107110,107225,14,0 81780 81895 116
;107137,107279,14,0 81807 81949 143
;107190,107290,14,0 81860 81960 101
;115837,116002,14,0 98132 98297 166
;115917,116029,14,0 98212 98324 113
;115939,116038,14,0 98234 98333 100
;115947,116056,14,0 98242 98351 110
;115965,116085,14,0 98260 98380 121
;116001,116116,14,0 98296 98411 116
;116878,116988,14,0 101379 101489 111

Regions of 92% or greater homology:
---------------------------------------------
;102446,102612,14,0 75097 75263 167
;115846,115945,14,0 98141 98240 100
Line:ORFs 100-150aa yellow,150-200 green,200+ blue
;519,824,3,R, Frame -3
;2041,2376,3,R, Frame -1
;5575,5955,3,R, Frame -1
;9600,10148,3,F, Frame 3
;9605,10228,3,R, Frame -2
;11227,11547,3,F, Frame 1
;11664,12029,3,F, Frame 3
;12284,12730,3,F, Frame 2
;14103,14519,3,F, Frame 3
-----------------End example feature file #2---------------------------------

End of help file.
Updated 2/99
Written by Jim Lund in the lab of Roger Reeves, Johns Hopkins University