BIO520 Exam 1 Spring 2010
Please email this exam to Yeshi (tgyeshi@uky.edu) with the subject line "BIO520 Exam 1" and name the document like so: "LundJ_exam1", or hand in written answers. Fill in your name on the exam!
You may use any books, notes, web pages, software programs, or related materials to complete this exam. You MAY NOT consult with any person regarding the exam's intellectual content.
1. The caspase 12 gene is found in two variants in humans. The most common form is a pseudogene with a premature stop codon, but a second, ancestral form encoding a full-length active protein is found at a frequency of ~20% in Africans. It is thought that the inactive form provides protection from bacterial sepsis.
- a. (1 pt) Examine the linked Genbank entry for human caspase 12. Is this a curated or automatically generated RefSeq entry?
- b. (1 pt) Which human chromosome is the caspase 12 gene located on?
- c. (1 pt) How large is this gene? Approximately how many bp does it span on the chromosome? (Answers accurate to within 10% of the correct value are sufficient.)
- d. (1 pt) Find the RefSeq protein entry for the mouse ortholog. ANSWER=Genbank accession.
- e. (1 pt) What is the official gene name or gene symbol for the mouse caspase 12?
- f. (1 pt) Give the name of a protein domain found in this protein.
2. Scoring matrices
- a. (2 pts) Use the linked BLOSUM45 scoring matrix with gap creation and extension costs of -15, -2 to score the alignment shown below:
YDCGSLK
YE-GTLK
- b. (1 pt) Which scoring matrix would you use to search for bacterial proteins related to human opsins?
- c. (1 pt) Give an example of an aspect of sequence similarity not captured with either the PAM or BLOSUM scoring matrices. That is, a way for proteins to be more (or less) similar than indicated by the alignment score.
- d. (1 pt) PAM stands for 'point accepted mutation', and the number (e.g. '70' in PAM70) indicates it was made from aligned proteins with 70 changes out of 100 amino acids. How can the PAM250 matrix then have 250 mutations in 100 amino acids?
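For illustration only, the kind of hand calculation asked for in 2a can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `score_alignment`, the `open_includes_first` switch, and the tiny `TOY_MATRIX` are all assumptions made here, and the matrix values are invented placeholders rather than real BLOSUM45 entries. Programs differ on whether the gap creation cost already covers the first gapped position, so the sketch exposes that choice as a parameter instead of picking one convention silently.

```python
def score_alignment(top, bottom, matrix, gap_open, gap_extend,
                    open_includes_first=False):
    """Score a pairwise alignment given as two equal-length gapped strings.

    matrix maps (residue, residue) pairs to substitution scores.
    gap_open and gap_extend are (negative) scores, not positive penalties.
    If open_includes_first is False, a gap of length k scores
    gap_open + k * gap_extend; if True, gap_open + (k - 1) * gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    gap_side = None  # which sequence the currently open gap is in, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("column with gaps in both sequences")
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != gap_side:
                # A new gap starts here: charge the creation cost once.
                total += gap_open
                if not open_includes_first:
                    total += gap_extend
            else:
                # The same gap continues: charge only the extension cost.
                total += gap_extend
            gap_side = side
        else:
            # Substitution matrices are symmetric, so try both orders.
            total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            gap_side = None
    return total


# Hypothetical toy values, NOT taken from BLOSUM45.
TOY_MATRIX = {("A", "A"): 5, ("A", "G"): 0, ("G", "G"): 7}

print(score_alignment("AAGA", "A--A", TOY_MATRIX, -15, -2))
print(score_alignment("AAGA", "A--A", TOY_MATRIX, -15, -2,
                      open_includes_first=True))
```

The two calls differ by exactly one extension cost, which shows why it is worth stating which gap convention you used when reporting a score.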
3. BLAST
- a. (2 pts) The BLAST algorithm uses word matches between sequences to seed alignments. Megablast uses a longer word than blastn. How does this difference affect 1) the BLAST search and 2) the matches found by BLAST?
- b. (2 pts) Why is low complexity filtering used by default in BLAST searches?
- c. (1 pt) What is an HSP?
- d. (2 pts) The BLAST E-value depends on several factors. You perform a BLAST search, using the same query, against the “non-redundant protein sequences (nr)” database and the “Protein Data Bank proteins (pdb)” database. Will you get the same E-value? Explain.
- e. (1 pt) To find the corresponding gene (ortholog) in other vertebrates is it better to use the genomic sequence, mRNA, or protein as BLAST input? Explain.
4. BLAST search. Refer to the linked BLAST search of the human forkhead box P2 gene (NP_055306) to answer this question. link.
- a. (1 pt) Which BLAST program was used?
- b. (1 pt) Which database was searched?
- c. (1 pt) What organisms is this gene found in, i.e. what is its phylogenetic range?
- d. (2 pts) Examine the match to the zebrafish gene with accession NP_001025253.1. What percent identity and percent positives are found in this alignment? (Update: I re-ran this BLAST search with slightly different parameters and the answer for this part has changed. Give either the original or new values for your answer.)
- e. (1 pt) Give the E-value for this alignment to NP_001025253.1 and state whether it represents a strong, moderate, or weak match.
- f. (1 pt) Do the results for this search show all the matches to this human gene in this database? Give a yes or no answer and a brief explanation.
5. PSSMs.
- a. (2 pts) Why are pseudocounts in a PSSM given less weight when more input sequences are present in the alignment?
- b. (2 pts) Why is it better to weight pseudocounts by their similarity to the consensus rather than give them each the same value?
- c. (1 pt) What determines how many columns a PSSM has (how long it is)?
- d. (2 pts) If you wanted to find all the members of the SIR2 protein family, why would searching databases using a PSSM be more sensitive than using one or two members of the family (for example, human SIRT1 and yeast SIR2)? Are there family members that might be found using a well-chosen (or lucky) simple search that a search using a PSSM would not discover?
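As background for the PSSM questions above, here is a minimal Python sketch of how a log-odds PSSM can be built from a gapless alignment with pseudocounts. It is an illustration under stated assumptions, not the method used by any particular program: the function name `build_pssm`, the `pseudo_weight` parameter, and the uniform background frequencies are all choices made for this sketch. Real tools typically use observed background frequencies and may derive pseudocounts from a substitution matrix instead of spreading them evenly.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, background=None, pseudo_weight=1.0):
    """Build a log-odds PSSM from equal-length, gapless aligned sequences.

    Each column estimate is
        (count + pseudo_weight * background) / (n_sequences + pseudo_weight)
    and each score is log2(estimate / background).
    """
    n = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    if background is None:
        # Assumption for this sketch: uniform background over 20 residues.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudo_weight * background[aa]) / (n + pseudo_weight)
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm


few = build_pssm(["AC", "AC"])
many = build_pssm(["AC"] * 20)
# Score of an unobserved residue (W) in column 0 for each alignment size.
print(few[0]["W"], many[0]["W"])
```

Comparing the two printed scores for the same unobserved residue shows how the fixed pseudocount total is diluted as the number of aligned sequences grows.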
7. Refer to the linked CLUSTALW multiple alignment of a set of genes containing the ancient T-BOX domain for the questions below. link, CLUSTAL input sequences
- a. (2 pts) Examine the CLUSTAL alignment output and list in order the first three sequences (or sub-alignments) that were aligned in building the final multiple sequence alignment.
- b. (1 pt) The guide tree shows anemone (an invertebrate) and horse in a sub-cluster rather than horse more tightly clustered with the vertebrates lamprey and human. Give a reason why this wrong guide tree cluster could have occurred.
- c. (1 pt) In position 86 of the alignment (95 His in the human sequence) both His and Tyr are found. How would you characterize the substitutions at this position?
- d. (1 pt) Describe the nature of the protein structure likely to be found at alignment position 81 to 83.
8. (2 pts) In an ultrametric tree, branch length is directly proportional to time. The assumptions underlying ultrametric trees are not always valid. Describe two reasons or cases where observed sequence changes are not proportional to time.
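As a reminder of what the ultrametric assumption means for a distance matrix, the following Python sketch tests the three-point condition: for every triple of taxa, the two largest of the three pairwise distances must be equal. The function name `is_ultrametric`, the nested-dictionary input format, and the example distances are assumptions made for this illustration; the numbers are invented, not taken from real data.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Return True if every triple of taxa satisfies the three-point condition.

    dist is a nested dict: dist[a][b] is the distance between taxa a and b.
    """
    for a, b, c in combinations(list(dist), 3):
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        # The two largest distances must be equal (within tolerance).
        if abs(d[2] - d[1]) > tol:
            return False
    return True


# Hypothetical distances consistent with a constant rate of change.
clock_like = {
    "A": {"B": 2, "C": 6},
    "B": {"A": 2, "C": 6},
    "C": {"A": 6, "B": 6},
}

# Same taxa, but lineage B has accumulated extra change relative to A.
rate_varies = {
    "A": {"B": 2, "C": 6},
    "B": {"A": 2, "C": 9},
    "C": {"A": 6, "B": 9},
}

print(is_ultrametric(clock_like), is_ultrametric(rate_varies))
```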
9. Answer true or false for the following questions:
- a. (1 pt) The Minimum Evolution method produces a rooted tree.
- b. (1 pt) The Maximum Parsimony (MP) method is based on finding the most parsimonious tree by maximizing the number of evolutionary changes within the tree.
- c. (1 pt) The Neighbor Joining algorithm is an extremely fast algorithm compared to the Maximum Likelihood (ML) algorithm.
- d. (1 pt) By using the Maximum Likelihood algorithm we are guaranteed to find the "true" biological tree, that is, one that accurately represents the evolutionary history of the taxa.
- e. (1 pt) The Neighbor Joining method is a distance based method.
- f. (1 pt) For the Maximum Likelihood method increasing the number of taxa from 25 to 50 would increase the running time more than increasing the alignment length from 250 to 500 amino acids.