BIO520 Final Exam part 1 Spring 2009


This exam is due Friday, May 8 at noon.
Please email this exam to Jim Lund (jiml@uky.edu) with the subject line "BIO520 Final Exam" and name the document like so: "LundJ_final", or hand in written answers.

Fill in your name on the exam!

You may use any books, notes, web pages, software programs, or related materials to complete this exam. You MAY NOT consult with any person regarding the exam's intellectual content.

1. (18 pts) Fill in the blank with one, two, or (rarely) three words.

a. The NCBI __________ sequence database is a curated, derivative database.

b. The ________________, calculated by the BLAST program, is a measure of the likelihood of finding an HSP with the given score in a sequence database.

c. ______________ multiple alignments are often computationally prohibitive, so ______________ multiple alignment programs are much more commonly used.

d. __________________ are used to compensate for zero counts (i.e., missing values) when calculating scores in a PSSM.

e. ORFs in prokaryotic genomes have no ___________________ and intergenic regions are typically small, making de novo gene finding easier than in eukaryotes.

f. ORFs with an unusual bp composition, flanked by _________________, are likely to be scored as internal exons by eukaryotic de novo gene finding programs.

g. An advantage of NMR is that __________________ are determined and they can provide insight into protein dynamics.

h. Finding voids (holes) in an experimentally determined protein structure is __________________.

i. ____________________ is a method for finding the structure of proteins with no BLAST matches in the protein databases.

j. In an RNA structure, bases with many different pairings in the set of predicted structures are likely to be __________________.

k. The number of observed changes in a DNA or protein sequence ________________________ the number of mutations that have occurred in the sequence since its divergence from an ancestral sequence.

l. _____________________ is used to assess support for nodes in a phylogenetic tree.

m. In ___________________ genome sequencing, an entire genome is sheared into small fragments that are then cloned and sequenced from one or both ends.

n. In genome sequencing, _________________ is the total number of bps sequenced divided by the total genome size.

o. In __________________ clustering, at each step the two genes or clusters most similar are joined.

p. The goal of _______________________ genomics projects is gathering information on the location, expression pattern, enzymatic activity, or binding partner of a large set of genes.

q. Genes found in the same order on the same chromosomal segment in a pair of species show _______________________.


2. (4 pts) Briefly outline the major steps in the BLAST algorithm. Explicitly number the steps in the algorithm.


3. Scoring matrices

a. (2 pts) Use the linked BLOSUM80 scoring matrix with gap creation and extension costs of -11 and -1 to score the alignment shown below:

      ILDAG-SR
      VLECGLSR

b. (1 pt) Which scoring matrix would you use to search for proteins in Archaea related to human myoglobin?

c. (1 pt) Which family of scoring matrices is constructed from ungapped protein alignments?

d. (1 pt) PAM stands for 'point accepted mutation' and the number, e.g. '40' in PAM40, indicates it was made from aligned proteins with 40 out of 100 mutated amino acids. How can the PAM250 matrix then have 250 mutations in 100 amino acids?


4a. (1 pt) Why is low complexity sequence filtered out of BLAST input sequences by default?

4b. (1 pt) Describe a situation where you would not filter out low complexity sequences.


5a. (2 pts) One of the final steps in experimentally determining a protein structure is energy minimization. Why is this done?

5b. (1 pt) Give a different context in which energy minimization is used on a protein structure.


6. (3 pts) Give three reasons why gene finding in mammals is difficult and error-prone.


7. (3 pts) There is one Tree of Life, the phylogeny describing the relationships of every species. After biologists have determined it, will phylogenetic methods retain any usefulness? List three uses of phylogenetic methods in this context.


8. (2 pts) You have been funded to sequence the blue whale genome. After the initial phase of 7X BAC and small clone high throughput sequencing, automated assembly is done and contigs are generated. What deficiencies would you expect in an assembly of this sequence?


9. You are working on the panda genome project. As an early part of the annotation process you will have to find and annotate repetitive elements.

a. (2 pts) For known repetitive elements this process is well established. Briefly describe it.

b. (2 pts) There may be repetitive elements found only in pandas or other Ursidae. How would you discover these new repetitive elements?


10. (2 pts) Microarray transcription analysis allows a biologist to quickly determine the expression level (or relative expression level) of every known gene. There are limitations to microarray analysis. Describe four aspects of gene and protein expression that typical microarrays fail to capture.


11. (2 pts) Describe two things that can be learned from comparing genomes that diverged 100 million years ago (horse-seal, for example) that can’t be learned from comparing recently diverged genomes (chimp-human) or very old splits (tobacco-algae).


12. (2 pts) Why are replicate microarray samples normalized before they are combined for analysis? Describe the consequences and benefits of this process.


University of Kentucky  BIO520
