Description of A2M alignment format

The A2M format is used as the primary format for multiple alignments of protein or nucleic-acid sequences in the SAM suite of tools. It is a small modification of FASTA format for sequences and is compatible with most tools that read FASTA.

The main advantages of A2M format over other multiple-alignment formats are

its compatibility with FASTA for programs that just need sequences, not a multiple alignment,
its ability to represent unaligned residues in any sequence,
its compactness, and
its ease of parsing.

A file consists of any number of sequences, each of which starts with an identifying line. The identifying line must have a ">" character in the first column, followed immediately by an identifier for the sequence. The identifier is terminated by white space or a comma, and it should be unique for each sequence. The rest of the line is treated as a comment, but is preserved by many of the tools in the SAM suite.
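The rule for identifying lines can be illustrated with a minimal Python sketch. The function name parse_id_line is hypothetical and is not part of the SAM suite; rejecting an empty identifier is an assumption based on the requirement that the identifier follow the ">" immediately.

```python
import re

def parse_id_line(line):
    """Split an A2M identifying line into (identifier, comment).

    Hypothetical helper for illustration; not part of the SAM suite.
    """
    if not line.startswith(">"):
        raise ValueError("identifying line must have '>' in the first column")
    body = line[1:].rstrip()
    # The identifier ends at the first white-space character or comma.
    parts = re.split(r"[\s,]", body, maxsplit=1)
    identifier = parts[0]
    if not identifier:
        # Assumption: an identifier must follow the '>' immediately.
        raise ValueError("missing identifier after '>'")
    # Everything after the terminator is kept as a comment.
    comment = parts[1].strip() if len(parts) > 1 else ""
    return identifier, comment
```

For the last identifying line of the example alignment at the end of this description, this would give the identifier gi|2500706|sp|P55928|SCKB_PANIM and treat the rest of the line as the comment.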

After the identifying line, the sequence is given. For proteins, the legal alphabet is

ACDEFGHIKLMNPQRSTVWY for amino acids
X for any amino acid
B for N or D
Z for Q or E
O for creating a free-insertion module (FIM)

For nucleic acids, the legal alphabet is

ACGTU for nucleotides (with T and U considered equivalent)
Y for C or T
R for A or G
N for any nucleotide
O for creating a free-insertion module (FIM)

Unknown letters (including the other nucleic-acid wildcards) are handled like the general wildcard: X for proteins, N for nucleic acids. White space (including line breaks) and periods are ignored.
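As a rough illustration of these rules, the following Python sketch drops white space and periods and maps unknown letters to the general wildcard. The name normalize is hypothetical, and two details are assumptions rather than documented SAM behavior: the case of a replaced letter is preserved, and non-letter characters such as "-" are passed through unchanged.

```python
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYXBZO")
NUCLEIC_ALPHABET = set("ACGTUYRNO")

def normalize(seq, nucleic=False):
    """Remove ignorable characters and replace unknown letters by a wildcard.

    Hypothetical helper for illustration; not part of the SAM suite.
    """
    legal = NUCLEIC_ALPHABET if nucleic else PROTEIN_ALPHABET
    wildcard = "N" if nucleic else "X"
    out = []
    for ch in seq:
        if ch.isspace() or ch == ".":
            continue  # white space and periods carry no information
        if ch.isalpha() and ch.upper() not in legal:
            # Assumption: keep the case, since case encodes alignment state.
            ch = wildcard if ch.isupper() else wildcard.lower()
        out.append(ch)
    return "".join(out)
```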

The alignment information is encoded using uppercase and lowercase characters, and the special gap character "-". Uppercase characters and "-" represent alignment columns, and there must be exactly the same number of alignment columns in each sequence. Lowercase characters (and spaces or ".") represent insertion positions between alignment columns or at the ends of the sequence. The spaces or periods in the multiple alignments are only for human readability, and may be omitted.
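The column rule lends itself to a simple consistency check. The sketch below, with the hypothetical names count_columns and check_alignment, counts uppercase letters and "-" in each sequence and verifies that every sequence has the same number of alignment columns. Treating an empty list of sequences as an error is an assumption.

```python
def count_columns(seq):
    """Count alignment columns: uppercase letters and the gap character '-'."""
    return sum(1 for ch in seq if ch.isupper() or ch == "-")

def check_alignment(seqs):
    """Return the common number of alignment columns, or raise ValueError.

    Hypothetical helper for illustration; not part of the SAM suite.
    """
    counts = {count_columns(s) for s in seqs}
    if len(counts) != 1:
        # Also raised for an empty list (an assumption of this sketch).
        raise ValueError("sequences differ in alignment columns: %s"
                         % sorted(counts))
    return counts.pop()
```

Lowercase letters, spaces, and periods do not affect the count, so a dotted and a dotless version of the same alignment give the same result.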

The multiple-alignment output from our web servers usually omits the dots from the alignments, since they carry no information and can increase the size of the output many-fold. (Also, some e-mail software has trouble dealing with lines that start with a dot.) Some conversion programs misinterpret dotless a2m files, so conversion to other multiple-alignment formats can be difficult. The SAM tool suite includes the prettyalign program, which can add the dots to a dotless a2m file:

    prettyalign foo.a2m -f > foo.a2m_with_dots

Most conversion programs have no trouble with the a2m_with_dots format.
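To make the relationship between the dotless and dotted forms concrete, here is a minimal Python sketch of one way to add the dots. It is not the prettyalign implementation, and the names split_slots and add_dots are hypothetical. Each sequence is split into alignment columns and the insertion runs between them, and every insertion run is padded with dots to the longest run found in that position. Placing the dots after the lowercase residues is an arbitrary choice of this sketch.

```python
def split_slots(seq):
    """Split a sequence into alignment columns and insertion runs.

    inserts[i] is the lowercase run before column i; the last entry
    is the run after the final column.
    """
    columns, inserts, current = [], [], []
    for ch in seq:
        if ch.isspace() or ch == ".":
            continue  # existing padding is discarded and rebuilt
        if ch.isupper() or ch == "-":
            inserts.append("".join(current))
            current = []
            columns.append(ch)
        else:
            current.append(ch)
    inserts.append("".join(current))
    return columns, inserts

def add_dots(seqs):
    """Pad insertion runs with '.' so that all sequences have equal length."""
    parsed = [split_slots(s) for s in seqs]
    n_cols = len(parsed[0][0])
    if any(len(cols) != n_cols for cols, _ in parsed):
        raise ValueError("sequences differ in number of alignment columns")
    # Width of each insertion slot is the longest insertion in any sequence.
    widths = [max(len(ins[i]) for _, ins in parsed) for i in range(n_cols + 1)]
    result = []
    for cols, ins in parsed:
        pieces = []
        for i in range(n_cols):
            pieces.append(ins[i].ljust(widths[i], "."))  # dots after residues
            pieces.append(cols[i])
        pieces.append(ins[n_cols].ljust(widths[n_cols], "."))
        result.append("".join(pieces))
    return result
```

Because the padding is rebuilt from scratch, the function can be applied to either a dotless or an already dotted alignment.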

Here is an example of a small multiple alignment:

>2crd
.XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS.
>gi|786430|bbs|159192 potassium channel blocking toxin 15-1 [Leiurus quinquestriatus=scorpions, ssp. hebreus, venom, Peptide Partial, 32 aa]
.-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS.
>gi|2500706|sp|P55928|SCKB_PANIM POTASSIUM CHANNEL BLOCKING TOXIN PITX-K-BETA
t----ISCTNEKQCYPHCKKETGYPNAKCMNRKCKCFGr
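
A program that only needs the sequences can read such a file much as it would read FASTA. The following Python sketch uses the hypothetical names read_a2m and unaligned; it collects the sequence lines under each identifying line and then recovers the unaligned sequence by dropping "-" and "." and converting the residues to uppercase. The comment on the second identifying line is abridged here.

```python
EXAMPLE = """\
>2crd
.XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS.
>gi|786430|bbs|159192 potassium channel blocking toxin 15-1
.-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS.
>gi|2500706|sp|P55928|SCKB_PANIM POTASSIUM CHANNEL BLOCKING TOXIN PITX-K-BETA
t----ISCTNEKQCYPHCKKETGYPNAKCMNRKCKCFGr
"""

def read_a2m(text):
    """Yield (header, sequence) pairs; the header excludes the leading '>'.

    Hypothetical helper for illustration; not part of the SAM suite.
    """
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # White space inside sequence lines is ignored.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)

def unaligned(seq):
    """Drop gaps and padding dots, and uppercase the inserted residues."""
    return "".join(ch for ch in seq if ch not in "-.").upper()

if __name__ == "__main__":
    for header, seq in read_a2m(EXAMPLE):
        print(header.split()[0], unaligned(seq))
```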