This Perl 5 script cuts a subsequence from a DNA sequence. It requires two command-line arguments: the first and last bp of the subsequence. By default it reads the DNA sequence from standard input and writes to standard output.

Usage: cutdna [-d -l# -mX -n -q -u] start_bp end_bp <DNA.file >output.DNA

-mX Change the masking character. The default masking character is 'N'. The masking character is treated as a DNA bp. To specify no masking character, use -m alone, without a following character.

-n Do not append "Subseq from START to STOP" to the FASTA description line. Without this switch, the text is appended whenever a subsequence is generated and the input sequence has a FASTA line.

-q Prints this help information.

-l# Print out DNA prefaced by line numbers, with # bp per line.

-d Use full degenerate DNA code plus masking character. Default is to consider (a, c, g, t, n) and masking character to be part of the DNA.

-u Output DNA is upper case (default is lower case).
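
As a rough illustration of how the -d and -m switches change which characters count as DNA, here is a minimal Python sketch. It is not taken from the cutdna source: the names dna_alphabet and keep_dna are hypothetical, and accepting both upper- and lower-case input is an assumption. The degenerate set is the standard IUPAC nucleotide code.

```python
# Illustrative sketch only; not the cutdna implementation.
DEFAULT_BASES = "acgtn"
IUPAC_DEGENERATE = "acgtrykmswbdhvn"  # standard IUPAC nucleotide codes


def dna_alphabet(degenerate=False, mask="N"):
    """Return the set of characters counted as DNA bp.

    degenerate=True mimics -d; mask mimics -mX; mask="" mimics -m alone.
    Accepting both cases of each base is an assumption.
    """
    bases = IUPAC_DEGENERATE if degenerate else DEFAULT_BASES
    chars = set(bases) | set(bases.upper())
    if mask:
        chars.add(mask)
    return chars


def keep_dna(text, alphabet):
    """Drop every character that is not in the alphabet."""
    return "".join(ch for ch in text if ch in alphabet)


print(keep_dna("10 acgt nnXX", dna_alphabet()))          # X is dropped
print(keep_dna("10 acgt nnXX", dna_alphabet(mask="X")))  # X is kept
```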

The DNA sequence can be in FASTA format, in GenBank format, or unformatted. Characters that are not part of the DNA code, such as line numbers, are ignored and don't interfere. The masking character is counted as a DNA bp. The output will have 'Subsequence from bp #-#' appended to the FASTA description line. If the input doesn't have a FASTA description line, one will be added.
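
The filtering described above can be sketched in Python as follows. This is a hedged illustration, not the script's actual code: read_dna is a hypothetical name, the sketch covers only FASTA and unformatted (possibly line-numbered) input, and it does not attempt GenBank parsing. Lines starting with ';' are skipped so their text is not mistaken for sequence.

```python
# Illustrative sketch only; covers FASTA and unformatted input, not GenBank.
def read_dna(lines, alphabet=frozenset("acgtnACGTN")):
    """Return (fasta_header_or_None, sequence) from an iterable of lines."""
    header = None
    pieces = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and header is None:
            header = line              # FASTA description line
        elif line.startswith(";"):
            continue                   # comment line, not sequence
        else:
            # Keep only DNA characters; digits and spaces fall away.
            pieces.append("".join(ch for ch in line if ch in alphabet))
    return header, "".join(pieces)


header, seq = read_dna([">chr test", "1 acgtacgtac", "11 ggnn"])
print(header)
print(seq)
```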

The DNA can also have the comment/feature lines used by the programs lineplot and fplot. These lines look like this:
;11029,12126,3, Tsk 1
;15411,16487,4,Tsk 2
;18721,19000,5, Dgsi exon 10
;19495,19610,5, Dgsi exon 9

The lines start with a ; and then have 4 or more fields. The first two are the region of the feature in the sequence, the third is the color, and the fourth is the shape. After the third comma, a description can optionally follow. cutdna copies to its output the feature lines whose features lie entirely within the output DNA, with their coordinates renumbered to match the subsequence.
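
The renumbering of feature lines might look like the Python sketch below. It is an assumption-laden illustration rather than the script's code: adjust_feature is a hypothetical name, only a forward (non-reversed) cut is handled, and everything after the second comma is copied through unchanged.

```python
# Illustrative sketch only; handles a forward cut, not a reverse complement.
def adjust_feature(line, cut_start, cut_end):
    """Renumber one ';start,end,...' line, or return None if the feature
    is not entirely inside the 1-based inclusive range cut_start..cut_end."""
    fields = line[1:].split(",", 2)    # start, end, everything else
    start, end = int(fields[0]), int(fields[1])
    if start < cut_start or end > cut_end:
        return None
    offset = cut_start - 1             # bp removed from the front
    parts = [str(start - offset), str(end - offset)] + fields[2:]
    return ";" + ",".join(parts)


print(adjust_feature(";11029,12126,3, Tsk 1", 11000, 13000))
print(adjust_feature(";15411,16487,4,Tsk 2", 11000, 13000))
```

With a cut from 11000 to 13000, the first feature is kept and shifted down by 10999 bp, while the second lies outside the cut and is dropped.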

If the subsequence arguments are given in reverse order (larger number first), the output is the reverse complement.

'end' is recognized as the last DNA bp of the input sequence and can be used as an argument in place of that bp's number.
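
A minimal Python sketch of the cutting rules described above: 1-based inclusive coordinates, reverse complement when the larger number comes first, and 'end' standing for the last bp. The names resolve and cut are hypothetical, only the default (a, c, g, t, n) code is complemented, and matching 'end' case-insensitively is an assumption based on the usage examples.

```python
# Illustrative sketch only; complements the default code, not degenerate bases.
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def resolve(pos, seq_len):
    """Turn 'end' into the last bp number; otherwise parse an integer."""
    return seq_len if str(pos).lower() == "end" else int(pos)


def cut(seq, start, end):
    """Return bp start..end (1-based, inclusive).

    If start > end, return the reverse complement of bp end..start.
    Characters without a complement (e.g. a mask such as '#') pass through.
    """
    start, end = resolve(start, len(seq)), resolve(end, len(seq))
    if start <= end:
        return seq[start - 1:end]
    return seq[end - 1:start].translate(COMPLEMENT)[::-1]


print(cut("aaccggtt", 3, 6))
print(cut("aacgt", "end", 1))
```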

Example: cutdna 1000 2000 <DNA.file >output.file

Example giving the reverse complement: cutdna 3000 1 <DNA.file >output.file

Example using '#' as the masking character: cutdna -m# 3000 end <DNA.file >output.file

Example that outputs the reverse complement of the entire file: cutdna End 1 <DNA.file >output.file

This script can also be used to filter the line numbers out of a DNA sequence:

Example: cutdna 1 end <DNA.with.numbers >filtered.output

Last modified 10/28/98

Written by Jim Lund in the lab of Roger Reeves, Johns Hopkins University