Sequence Alignment and Modeling System

SAM-T02 HMM WWW Servers

SAM 3.5 (July 2005) is available!
The SAM documentation (the 175-page manual is also available in PDF and PS) discusses the changes from previous versions.

If you are a college, university, U.S. government lab, or nonprofit, you can download the software from the SAM distribution page. If you are interested in SAM for commercial use, please request more information from sam-info@cse.ucsc.edu.

Martin Madera and Julian Gough have written a Perl converter between SAM and HMMer 2.0 formats. You can get it from them (be sure to read their excellent documentation!) or download a 10/24/2000 copy.

Please read the ISMB99 tutorial on using HMMs.

A linear hidden Markov model is a sequence of nodes, each corresponding to a column in a multiple alignment. In our HMMs, each node has a match state (square), insert state (diamond) and delete state (circle). Each sequence uses a series of these states to traverse the model from start to end. Using a match state indicates that the sequence has a character in that column, while using a delete state indicates that the sequence does not. Insert states allow sequences to have additional characters between columns. In many ways, these models correspond to profiles.
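The way a sequence traverses the model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SAM code: the function name `render_path`, the `(kind, node)` path representation, and the output convention (upper case for match, lower case for insert, `-` for delete) are all chosen here for the example.

```python
from typing import List, Tuple

# One step of a path through the model: (state kind, node index).
# Kinds (hypothetical encoding): "M" = match, "I" = insert, "D" = delete.
Step = Tuple[str, int]


def render_path(sequence: str, path: List[Step]) -> str:
    """Turn a state path into one aligned row.

    A match state consumes one character and places it in its column
    (shown in upper case). A delete state consumes nothing and leaves
    a '-' in its column. An insert state consumes one extra character
    that sits between columns (shown in lower case).
    """
    out = []
    pos = 0  # index of the next unread character of the sequence
    for kind, _node in path:
        if kind in ("M", "I"):
            if pos >= len(sequence):
                raise ValueError("path emits more characters than the sequence has")
            ch = sequence[pos]
            out.append(ch.upper() if kind == "M" else ch.lower())
            pos += 1
        elif kind == "D":
            out.append("-")
        else:
            raise ValueError(f"unknown state kind: {kind!r}")
    if pos != len(sequence):
        raise ValueError("path does not consume the whole sequence")
    return "".join(out)


# A three-node model: match in node 1, an insertion after it,
# node 2 skipped via its delete state, match in node 3.
example_path = [("M", 1), ("I", 1), ("D", 2), ("M", 3)]
print(render_path("ACG", example_path))  # prints Ac-G
```

In the printed row, the upper-case letters and the `-` each occupy one model column, while the lower-case letter is an insertion between columns 1 and 2.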

The primary advantage of these models over standard methods of sequence search is their ability to characterize an entire family of sequences. Each position has its own distribution of characters, and the transitions between states have their own probabilities. That is, these linear HMMs have position-dependent character distributions and position-dependent insertion and deletion gap penalties. Aligning each member of a family to a trained model automatically yields a multiple alignment among those sequences.
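As a rough illustration of what a position-dependent character distribution is, the sketch below estimates one distribution per column from a small alignment. It assumes a DNA alphabet and uses a simple add-one pseudocount as a stand-in for a proper regularizer; it is not how SAM trains a model, and the name `column_distributions` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def column_distributions(
    rows: List[str],
    alphabet: str = "ACGT",      # assumption: DNA; a protein alphabet works the same way
    pseudocount: float = 1.0,    # assumption: add-one smoothing, not a Dirichlet mixture
) -> List[Dict[str, float]]:
    """Estimate one character distribution per alignment column.

    Characters outside the alphabet (such as '-' for a deletion)
    are ignored when counting.
    """
    if not rows:
        raise ValueError("need at least one aligned row")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")

    distributions = []
    for col in range(width):
        counts = Counter(row[col] for row in rows if row[col] in alphabet)
        total = sum(counts[c] for c in alphabet) + pseudocount * len(alphabet)
        distributions.append(
            {c: (counts[c] + pseudocount) / total for c in alphabet}
        )
    return distributions


# Three aligned sequences, three match columns; '-' marks a deletion.
dists = column_distributions(["ACG", "ACG", "A-T"])
for i, dist in enumerate(dists, start=1):
    print(i, {c: round(p, 3) for c, p in dist.items()})
```

Each column ends up with a different distribution, which is the sense in which the model is position-dependent; a trained HMM additionally carries per-position transition probabilities that play the role of gap penalties.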

The SAM software system is a collection of tools for creating and using these models.

The algorithms and methods used by SAM and other HMM systems were initially described in several papers from the University of California, Santa Cruz. These papers, several of which are described below, are available in the UCSC Computational Biology group's Protein FTP directory.

The complete SAM documentation is available in compressed (.gz) postscript and as a series of WWW pages. We also have a 2-page overview of SAM in postscript.

SAM runs on Unix workstations. Building a model with SAM can take from minutes to several hours, depending on the length of the model, the number of sequences, and other factors.

SAM makes use of UCSC's Dirichlet mixture regularizer research.

The creation and distribution of SAM has been supported in part by NSF grants CDA-9115268, IRI-9123692, DBI-9408579 and DBI-9808007; ONR grant N00014-91-J-1162; NIH grants GM17129 and 1 R01 GM068570-01; DOE grant DE-FG03-95ER62112; a grant from the Danish Natural Science Research Council; and the UCSC Center for Biomolecular Science and Engineering.

Sean Eddy has written another program suite based on these methods called HMMER, which may also be of interest. SAM includes conversion programs between the two systems' formats.

Hidden Markov models are used extensively in speech recognition.

UCSC Specific papers of interest (Click here to see abstracts as well)

Hidden Markov models in computational biology: Applications to protein modeling. A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Journal of Molecular Biology, 235:1501--1531, February 1994. The original journal article.
Hidden Markov models for sequence analysis: Extension and analysis of the basic method. R. Hughey and A. Krogh, CABIOS 12(2): 95-107, 1996. (HTML version) or (POSTSCRIPT version)
Experimental evaluation of noise methods and regularizers, with discussions of surgery, the parallel SAM code, and finding motifs.
Hidden Markov Models for Detecting Remote Protein Homologies. K. Karplus, C. Barrett, and R. Hughey, Bioinformatics 14(10):846--856, 1998. (HTML version) or (postscript).
Detailed discussion of the SAM-T98 method we applied to CASP3 to predict protein structure.
Predicting protein structure using hidden Markov models. K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, C. Sander, Proteins: Structure, Function, and Genetics. Pp. 134--139, Supplement 1, 1997 (HTML version)
Discussion of our CASP2 methods for using hidden Markov models to predict protein structure.
Weighting Hidden Markov Models for Maximum Discrimination. R. Karchin and R. Hughey, Bioinformatics, 14(9):772--782, 1998. (HTML version with mangled table headings) and postscript.
Adding internal weighting to SAM to create SAM Version 2.0. Includes a comparison of SAM to HMMer, Meta-MEME, and Probabilistic Smith-Waterman (from the Agarwal and States paper) based on 67 discrimination tests from Pearson.
C. Tarnas and R. Hughey. Reduced space hidden Markov model training. Bioinformatics 14(5):401--406, 1998. Also available in postscript and pdf.
Discussion and analysis of the implementation of the checkpoint method (see Grice, below) in SAM.
Transparencies from our CASP2 talk; at CASP2, UCSC's hidden Markov model methods had some of the top overall scores among threading-based predictions of protein structure.
Scoring Hidden Markov Models. C. Barrett, R. Hughey, and K. Karplus, CABIOS 13(2):191-199, 1997. Available in postscript and compressed (.gz) postscript as well. Experimental evaluation of several different scoring methods using both SAM and HMMer.
Tutorial: Stochastic Modeling Techniques: Understanding and using hidden Markov models. L. Grate, R. Hughey, K. Karplus, K. Sjölander. University of California, Santa Cruz, CA, June 1996. SAM and HMMER tutorial used at ISMB in June 1996. (compressed postscript (.ps.Z))
"A Flexible Motif Search Technique based on Generalized Profiles" (compressed postscript) Philipp Bucher, Kevin Karplus, Nicolas Moeri, and Kay Hoffman, Computers and Chemistry Jan 1996, 20(1) 3--24. (postscript). An evaluation of search techniques for linear hidden Markov models and generalized profiles.
J. Alicia Grice, Richard Hughey, and Don Speck. Reduced Space Sequence Alignment. CABIOS 13(1):45-53, 1997. To be part of SAM2.0, this checkpoint method has many advantages over the divide-and-conquer method.
SAM : Sequence alignment and modeling software system. R. Hughey and A. Krogh, Technical Report UCSC-CRL-95-7, University of California, Santa Cruz, CA, January 1995. (Regularly updated.) The SAM documentation.
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology. Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. The most up-to-date discussion of Dirichlet Mixtures. The method is an option in SAM.
Using Dirichlet mixture priors to derive hidden Markov models for protein families. M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjolander, and D. Haussler. In L. Hunter, D. Searls, and J. Shavlik, editors, Proc. of First Int. Conf. on Intelligent Systems for Molecular Biology, pages 47--55, Menlo Park, CA, July 1993. AAAI/MIT Press. The original Dirichlet paper.
Massively parallel biosequence analysis. R. Hughey. Technical Report UCSC-CRL-93-14, University of California, Santa Cruz, CA, April 1993. (HTML version) or (POSTSCRIPT version)
Parallel sequence analysis on specialized hardware, and the parallel SAM code.

Other papers and pointers of interest (please email new pointers!)

"Profile Hidden Markov Models" Sean R. Eddy (1998) Bioinformatics 14(9), review of HMMs.
"Maximum Discrimination Hidden Markov Models of Sequence Consensus" Sean R. Eddy, Graeme Mitchison, and Richard Durbin (1995). J. Computational Biology 2:9-23. PostScript; 30 pages. Describes an alternative to maximum likelihood parameter optimization for HMMs which compensates for the biased sequence representation caused by phylogenetic relationships.
"Multiple Alignment Using Hidden Markov Models" Sean R. Eddy (1995). Proc. Third Int. Conf. Intelligent Systems for Molecular Biology, C. Rawlings et al., eds. AAAI Press, Menlo Park. pp. 114-120. PostScript; 7 pages. Describes a simulated annealing algorithm for HMM training and a probabilistic suboptimal alignment algorithm. Compares HMM-based multiple alignment to CLUSTALW.
Parameterization studies for the SAM and HMMER methods of hidden Markov model generation. Marcella A. McClure, Chris Smith, and Pete Elton. Proc. Fourth Int. Conf. Intelligent Systems for Molecular Biology, D. States et al., eds. AAAI Press, Menlo Park. pp. 155-164. A detailed comparison of HMM training methods for constructing multiple alignments.
"Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Timothy L. Bailey and Charles Elkan, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, (28-36), AAAI Press, 1994, and an associated MEME server.
"Meta-MEME: Motif-based Hidden Markov Models of Protein Families". Grundy, William N., Timothy L. Bailey, Charles P. Elkan and Michael E. Baker. Computer Applications in the Biosciences, 13(4):397-406, 1997, and an associated Meta-MEME server.
Searching for statistically significant regulatory modules. Timothy L. Bailey and William Stafford Noble, Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 200 and an associated MCAST server.

sam-info@cse.ucsc.edu
UCSC Computational Biology Group