batchseq2.gp
ClustalW (EBI)
JalView Reference card
JalView Docs
More JalView Docs
ClustalW/X default colors
ClustalW original paper
ClustalW/X 2.0 paper

LAB4: Multiple Sequence Alignment

The primary objectives are:

Learn to do multiple sequence alignments.
Learn to interpret multiple sequence alignments.
Use tools for viewing multiple sequence alignments.

The most widely used multiple alignment algorithm is CLUSTAL, whose current incarnations are CLUSTALX and CLUSTALW. CLUSTALW can be used via the internet, or CLUSTALX can be downloaded to your computer.

Question 4 deals with material briefly mentioned in the book that can best be found by reading the CLUSTALW manuscript (and the short Clustal 2.0 manuscript). CLUSTAL is an example of a heuristic algorithm for progressive multiple alignment. Even with dynamic programming, finding the optimal alignment of 10 sequences much over 100 residues long is computationally intractable, so a quicker heuristic algorithm is used.
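To see why the exact approach breaks down, it helps to count cells. A minimal sketch, assuming the exact method fills a full lattice with one axis per sequence and that all sequences have the same length (the function name `dp_cells` is made up for this illustration):

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in a full dynamic-programming lattice: one axis of (length + 1) per sequence."""
    return (length + 1) ** num_sequences

print(dp_cells(2, 100))            # pairwise alignment: 10201 cells
print(f"{dp_cells(10, 100):.2e}")  # ten sequences: roughly 1.1e20 cells
```

Each added sequence multiplies the work by about the sequence length, which is why a pairwise alignment is easy and a ten-way exact alignment is not.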

In outline, each sequence is first aligned pairwise to every other sequence. The multiple alignment is then built by aligning the most closely related pair first; each remaining sequence is added to the growing alignment in turn, and new gaps may be introduced at each step. CLUSTALW/X adds some refinements: nearly identical sequences are downweighted, since they presumably don't represent the entire evolutionary spectrum, and gap penalties are adjusted dynamically so that it is easier to add to an existing gap than to open a new one.
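The "closest pair first, then add the rest" idea can be sketched in a few lines. This is a toy illustration, not CLUSTAL's actual procedure: it uses unit-cost edit distance in place of CLUSTAL's pairwise scores, a simple greedy ordering in place of a guide tree, and it only reports the order in which sequences would be added. The function names and the toy sequences are made up for the example.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment distance (Levenshtein) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # gap in b
                            curr[j - 1] + 1,             # gap in a
                            prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = curr
    return prev[-1]

def joining_order(seqs: dict[str, str]) -> list[str]:
    """Names in the order a greedy progressive scheme would add them."""
    # Step 1: all-against-all pairwise comparison.
    dist = {frozenset(pair): edit_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Step 2: start from the most closely related pair.
    first = min(dist, key=dist.get)
    order = sorted(first)
    remaining = set(seqs) - first
    # Step 3: repeatedly add the sequence closest to anything already aligned.
    while remaining:
        nxt = min(sorted(remaining),
                  key=lambda n: min(dist[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

toy = {"A": "MKTAYIAK", "B": "MKTAYIAR", "C": "MKSAYLAR", "D": "GGTPWLQR"}
print(joining_order(toy))  # ['A', 'B', 'C', 'D']
```

Here A and B differ at one position, so they are paired first; C is then closer to the A/B pair than D is, so the most divergent sequence is added last.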

1. We will evaluate, using a "stepwise diversity" approach, some of the prominent issues that can be addressed by multiple alignment. I will try to guide you through these manipulations using what I think is the easiest method. The proteins we will examine are members of the yeast SIR2 family. In mammals these proteins are called sirtuins and regulate processes in the cell through NAD+-dependent deacetylation of histones and other proteins.

Open the batchseq2.gp file. It contains several protein sequences in GenBank format. The Clustal web program wants sequences in FASTA format. There are several ways to convert the sequences to FASTA; using Readseq is the easiest (http://thr.cit.nih.gov/molbio/readseq/).
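If you prefer to script the conversion, the following is a minimal sketch of what such a converter does. It assumes the usual GenBank flat-file layout (a LOCUS line carrying the record name, an ORIGIN line before the sequence, and // at the end of each record) and ignores everything else in the record; the function name and the toy record are made up for the example.

```python
def genbank_to_fasta(text: str) -> str:
    """Convert GenBank/GenPept flat-file records to FASTA text."""
    records, name, residues, in_origin = [], None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]        # second field is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True              # sequence lines follow
        elif line.startswith("//"):       # end of record: emit it and reset
            seq = "".join(residues).upper()
            wrapped = "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))
            records.append(f">{name}\n{wrapped}")
            name, residues, in_origin = None, [], False
        elif in_origin:
            # Lines look like "       61 mktayiakqr ...": keep only the letters.
            residues.extend(ch for ch in line if ch.isalpha())
    return "\n".join(records) + "\n"

example = """LOCUS       TOY_1                     12 aa            linear
ORIGIN
        1 mktayiakqr gs
//
"""
print(genbank_to_fasta(example))
```

To use it on the lab file, read batchseq2.gp into a string, pass it to `genbank_to_fasta`, and write the result to a new file; the output can then be pasted into the Clustal web form.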

Use Clustal to align the proteins, using the default options. Clustal produces several output files: an output file with scores, a guide tree file, and the alignment.

Use JalView to view Clustal alignments. Click on the 'JalView' button near the top of the page and a new window will open.

a. Align ONLY the three mammalian sequences using ClustalW (use defaults except set ITERATION to "alignment" and NUMITER to "3"). One of the sequences has a shorter N-terminal region. What is the most N-terminal amino acid at which all three sequences begin to align (in alignment coordinates)?
b. In the middle of the alignment there are a few imperfectly conserved aa. Describe one of these residues (aa and alignment position, given like G177 if the answer is Glycine residue 177). Is this a conservative substitution? (These three proteins are very similar. What does this alignment tell us? Can you make a good hypothesis for the location of the active site?)
c. Add the chicken sequence to the alignment. You will probably notice that the alignment is still extensive. There is an insertion in the chicken sequence at alignment position 537. If you were to examine the crystal structure of the chicken enzyme, do you believe that this serine would likely be on the outside of the protein or tucked away within the protein's core? Outside vs inside, one sentence rationale.
d. Add the fish sequence to the alignment. Does this alignment seem informative enough to be able to make judgments about probable active site residues? Yes/no, simple rationale (If, for example, your grad student had the patience to mutate only 10 residues in a search for the active site, would you know which ones to tell her to change?)

2. Add the remaining proteins: C. elegans (worm), D. melanogaster (fly), S. cerevisiae and Ashbya gossypii (yeast). Now you should see a much more limited set of identically-conserved amino acids. You might guess that the longest uninterrupted stretch of these is part of the active site, for example. Use the "color" options of JalView to examine the alignment.

a. Explore the different color options. What color scheme makes it easiest to find conserved regions in this alignment?
b. There is a large insertion in the middle of the protein found in three sequences. In which organisms is this insertion found? Briefly give a hypothesis to explain this.
c. The alignment shows two highly conserved regions in these proteins. Give the alignment aa numbers of the two regions you select. Example: aa's 101-151. Use the 'Edit' menu tools to remove the sequence before and after the more C-terminal of the two highly conserved regions. Paste a screenshot of this region (as much as will fit on the screen). If you were doing a careful alignment of these proteins you would now take this smaller conserved region and re-run Clustal. You do not need to do this for this lab.
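The idea of "identically conserved" columns and the longest uninterrupted stretch of them can also be expressed in code. This is only a sketch for checking what you see in JalView: it assumes the alignment is given as equal-length strings with "-" for gaps, and the function names and toy alignment are made up for the example.

```python
def identical_columns(alignment: list[str]) -> list[int]:
    """1-based alignment columns where every sequence has the same residue (no gaps)."""
    return [i for i, col in enumerate(zip(*alignment), start=1)
            if len(set(col)) == 1 and col[0] != "-"]

def longest_run(columns: list[int]):
    """(start, end) of the longest stretch of consecutive column numbers, or None if empty."""
    best = None
    start = prev = None
    for c in columns:
        if prev is None or c != prev + 1:
            start = c                     # a new stretch begins here
        prev = c
        if best is None or c - start > best[1] - best[0]:
            best = (start, c)
    return best

toy = ["MK-AYHGCLA",
       "MKTAYHGCIA",
       "MR-AYHGCLA"]
cols = identical_columns(toy)
print(cols)               # [1, 4, 5, 6, 7, 8, 10]
print(longest_run(cols))  # (4, 8)
```

Note that this counts only strict identity; a column of I, L and V would be skipped even though it is conserved in character, which is what the color schemes help you spot.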

Go back to the full alignment (so the class is all using the same aa numbering). For question parts d-g examine the conserved region you selected in part c.

d. Look for highly conserved hydrophobic amino acids. What are the most highly conserved hydrophobic amino acids?
e. Use the Zappo color scheme to look at the types of amino acids present in this alignment. Find examples of conserved amino acid substitutions seen in this alignment, one acidic, one basic, one hydrophobic, one hydrophilic. Answer like this D48->E48.
f. Use the Zappo color scheme to look at the types of amino acids present in this alignment. Look for residues conserved in most of the sequences, but changed in one or two sequences. Give the three changes likely to be disruptive. Answer with sequence ID:AA pairs like this, mouse:G48.
g. A friend conducts an affinity labeling experiment of the mouse, fish, and S. cerevisiae enzymes with a substrate analog. This sort of experiment normally reacts the affinity label with active site residues critical to enzyme function. The affinity label reacted with a cysteine residue in these three enzymes. Which cysteine residue or residues (given like C101) in the mouse sequence do you think is the most likely labeled residue?
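One way to think about parts e and f is as a lookup: two residues that fall in the same physicochemical group make a conservative substitution, and residues from different groups are more likely to be disruptive. The sketch below assumes the Zappo groupings as I recall them from the JalView documentation (check the JalView docs linked above before relying on them); the function name is made up for the example, and real judgments should also consider where the residue sits in the protein.

```python
# Assumed Zappo-style groups; verify against the JalView documentation.
ZAPPO_GROUPS = {
    "aliphatic/hydrophobic": "ILVAM",
    "aromatic": "FWY",
    "positive": "KRH",
    "negative": "DE",
    "hydrophilic": "STNQ",
    "conformationally special": "PG",
    "cysteine": "C",
}
GROUP_OF = {aa: name for name, members in ZAPPO_GROUPS.items() for aa in members}

def classify_substitution(a: str, b: str) -> str:
    """Label a pair of aligned characters by whether they share a group."""
    if "-" in (a, b):
        return "gap"
    if a == b:
        return "identical"
    if GROUP_OF[a] == GROUP_OF[b]:
        return "conservative"
    return "non-conservative"

print(classify_substitution("D", "E"))  # conservative (both negative)
print(classify_substitution("K", "R"))  # conservative (both positive)
print(classify_substitution("G", "W"))  # non-conservative
```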

3. Functional information about these proteins.

a. It is known that the SIR2/sirtuins bind NAD+ and a Zn atom. The Zn is bound by two pairs of cysteines, both CXXC. The sites are at aa 371 and 395 in the human sequence. How well conserved is the Zn binding site?
b. The NAD binding site is formed by several loops of the protein. In the human protein, residues G261, V264, S265, G440, R466, D481, and C482 are known from crystal structures to surround the NAD+. Which of these aa are perfectly conserved?
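The residue numbers in this question are positions in the human sequence, while the alignment has its own column numbering because of gaps. A minimal sketch of the conversion, assuming each aligned sequence is a string with "-" for gaps (the function names and the toy alignment are made up for the example):

```python
def residue_to_column(aligned_seq: str, residue_number: int) -> int:
    """1-based alignment column that holds the given 1-based residue of one sequence."""
    seen = 0
    for column, ch in enumerate(aligned_seq, start=1):
        if ch != "-":
            seen += 1                     # count residues, skipping gaps
            if seen == residue_number:
                return column
    raise ValueError("sequence has fewer residues than requested")

def column_residues(alignment: list[str], column: int) -> str:
    """All characters found in one 1-based alignment column, top to bottom."""
    return "".join(seq[column - 1] for seq in alignment)

toy_human = "--MK-TAY"
toy_other = "GSMKQTAF"
col = residue_to_column(toy_human, 3)                 # third residue (T) of the toy sequence
print(col)                                            # 6
print(column_residues([toy_human, toy_other], col))   # TT
```

A column whose characters are all the same, as in the toy output, is perfectly conserved at that position.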

4. Questions about the way Clustal builds alignments.

a. The options for CLUSTAL include a choice of type of scoring matrix but not a choice of an exact matrix. Why is this? How does CLUSTAL choose a scoring matrix to use?

b. The iteration option was introduced in Clustal version 2.0. Briefly explain the difference in how Clustal builds alignments when this option is used. Provide a brief answer in your own words.

BIO520

Site maintained by Jim Lund