cdc15 pub
Cluster pub
cdc15_50.txt
Cluster v2.11
Java TreeView
Java TreeView home page
Cluster/Treeview manuals
Eisen lab software
Yeast MA data
Saccharomyces genome database (SGD)
LAB 10: Transcriptome and Microarray analysis
The primary objectives are:
- Basic questions about gene discovery.
- Understanding microarray data properties and basic manipulations.
- A look at microarray expression data using clustering tools.
For this lab we'll explore data analysis using clustering tools. The software used for this lab is freely available: Cluster, written by the Eisen lab, and Java TreeView. These programs are already installed on the lab computers.
1. Microarray data visualization and quality assessment. We'll use the data in the file cdc15_50.txt for this analysis.
This is the data file from a yeast spotted microarray experiment. Yeast cells synchronized by alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant were collected, mRNA was prepared, fluorescently labeled with Cy5 (red), and hybridized to spotted microarrays containing nearly all the yeast genes. A common reference mRNA sample labeled with Cy3 (green) was hybridized along with all the time point experimental samples. A few time point samples were labeled with Cy3 and Cy5 and compared directly.
The cdc15_50.txt file contains the data from one of the microarrays, with the 15 min. sample in channel 1 and the 50 min. sample in channel 2. Open the data file in Excel. It is in tab-delimited format. Column CH1D gives channel 1 intensity, CH2D gives channel 2 intensity, and CH2DN gives normalized channel 2 intensity. Intensity is usually a 16-bit value (0 - 65,535) and is often plotted as log2(intensity), which ranges from 0 to 16. It will be easier to do this data analysis if you copy these three columns to a new Workbook or delete the other data columns. Include screenshots of each graph you prepare for this question in your answers.
1a. Plot intensity in channel 1 vs. channel 2 as an 'XY plot'. To create the graph, select the columns to be graphed, then go to the menu Insert->Chart. Observe that most of the plotted points have low intensity values and fall in the bottom left of the graph.
1b. Let's explore the data distribution by making a histogram. First set up the bins. In a new column enter the values 100, 200, 300...4900, 5000. In Excel, select a column, go to menu Data->Analysis->Data Analysis, and select 'Histogram'. Do this twice, once for channel 1 and once for channel 2, then plot the histogram data as a bar graph. As you can see, most of the values fall below 1000. Two-channel microarray data must be normalized to equalize total expression between the two channels. The simplest normalization method is linear scaling--multiplying channel 2 intensity values by a normalization factor. There are also better but more complicated methods. Normalization allows data from different microarrays to be combined.
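As an illustration of linear scaling, here is a minimal Python sketch. It assumes the normalization factor is simply the ratio of the two channel totals; the CH2DN column in cdc15_50.txt may have been computed differently. The function names and the intensity values are made up for the example.

```python
def linear_scale_factor(ch1, ch2):
    """Factor that makes the channel 2 total equal the channel 1 total.

    Assumption: 'equalizing total expression' means matching the sums.
    """
    total_ch2 = sum(ch2)
    if total_ch2 == 0:
        raise ValueError("channel 2 has zero total intensity")
    return sum(ch1) / total_ch2


def normalize_ch2(ch1, ch2):
    """Multiply every channel 2 intensity by the scaling factor."""
    factor = linear_scale_factor(ch1, ch2)
    return [value * factor for value in ch2]


# Made-up intensities, not values from cdc15_50.txt.
ch1 = [200.0, 400.0, 1000.0, 2400.0]
ch2 = [100.0, 200.0, 500.0, 1200.0]
ch2_scaled = normalize_ch2(ch1, ch2)
```

After scaling, the two channels have the same total intensity, which is all that this simplest method guarantees.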
1c. To spread out the microarray data, plot each channel as log2(intensity). To do this in Excel, enter an equation like this: '=log(A2,2)' if the data value is in column A, row 2. Then copy this equation for every data value to log transform the entire column. Then plot this data as an XY scatter plot. Genes equally expressed in channel 1 and 2 fall along a central line--you can see that most genes are not differentially expressed.
1d. To view microarray data and assess data quality, an M vs. A plot is often used. Construct M vs. A plots for the chip cdc15_50 using the unnormalized and normalized expression intensity values. This plots the log ratio of the two columns, M = log2(CH1D / CH2D), vs. the average log intensity, A = 0.5*log2(CH1D * CH2D). Make two graphs, one using CH1D and CH2D and another using the normalized values, CH1D and CH2DN. What is the main difference between the two graphs--how has normalization affected the data?
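The M and A values can also be computed outside Excel. The following is a minimal Python sketch of the two formulas above. The function name `ma_values` is hypothetical, the intensities are made up, and skipping spots with zero or negative intensity is an assumption about how to handle values whose logarithm is undefined.

```python
import math


def ma_values(ch1, ch2):
    """Return (M, A) lists for paired channel intensities.

    M = log2(ch1 / ch2), A = 0.5 * log2(ch1 * ch2).
    Spots with a non-positive intensity are skipped (assumption).
    """
    m_vals, a_vals = [], []
    for c1, c2 in zip(ch1, ch2):
        if c1 <= 0 or c2 <= 0:
            continue
        m_vals.append(math.log2(c1 / c2))
        a_vals.append(0.5 * math.log2(c1 * c2))
    return m_vals, a_vals


# Made-up intensities; the third spot is skipped because channel 1 is 0.
m, a = ma_values([400.0, 100.0, 0.0], [100.0, 100.0, 50.0])
```

To compare unnormalized and normalized data, call the function once with CH1D and CH2D and once with CH1D and CH2DN.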
The Cluster and TreeView programs used in the next part of the lab can be found in the program menus under the 'Instructional Software' menu.
Download the yeast microarray data file. This file contains 54 yeast microarray hybridizations on spotted microarrays containing nearly all the yeast genes. The experiments are yeast cell cycle, sporulation, and diauxic shift, the shift from anaerobic fermentation of glucose to aerobic respiration of ethanol. More detail on the data set is available here: YeastCols.
This data set has been normalized and log-transformed.
2a. Open the yeast data file ('Yeast MA data', file yeastall_public.txt) in Excel. Clustering programs require specific formats, usually tab-delimited text files with particular gene and array annotations. Notice the EWEIGHT row and GWEIGHT column. What are these for?
The EWEIGHT and GWEIGHT values are rarely used--in most microarray clustering they are left unaltered. They are also a useful indicator that a data file has been prepared for clustering.
2b. Now start Cluster and load in the yeast data file. Cluster has tools for data transformation and filtering and supports several clustering algorithms. Most genes change relatively little in a microarray experiment, so including them in a cluster isn't very informative. Spotted microarray data often has missing spots. Let's start by only considering genes with at least 80% good (non-missing) data. How many genes are left?
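To make the 80% filter concrete, here is a minimal Python sketch of the idea. It is not Cluster's own code: it assumes missing spots are represented as None and that a gene is kept when at least 80% of its values are present. The gene names and values are made up.

```python
def fraction_present(values):
    """Fraction of a gene's measurements that are not missing (None)."""
    return sum(v is not None for v in values) / len(values)


def filter_by_presence(rows, min_fraction=0.8):
    """Keep genes whose fraction of present values is at least min_fraction."""
    return {
        gene: values
        for gene, values in rows.items()
        if fraction_present(values) >= min_fraction
    }


# Hypothetical log ratios for three genes on five arrays.
rows = {
    "geneA": [0.1, -0.3, 1.2, 0.0, 0.4],     # 5 of 5 present
    "geneB": [0.5, None, 0.2, -1.1, 0.9],    # 4 of 5 present
    "geneC": [None, None, 0.7, 0.3, -0.2],   # 3 of 5 present
}
kept = filter_by_presence(rows)
```

In this example geneA and geneB pass the filter and geneC is dropped.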
2c. Now let's remove genes with little expression change over the set of microarrays. Select for genes with at least 4X change on 2 arrays. Remember to translate 4X change into its log value. Click 'Accept' to make this the current data file, and save it with a new name to a folder you create on the Desktop. How many genes are selected? Reload the new file you created to set the 'Job Name' (the name given to new output files) so that it reflects your new filtered data set. Use this filtered set of genes for clustering in all the following questions.
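The same filter can be sketched in Python. This sketch rests on two assumptions that should be checked against the data set description: that the values are log2 ratios, and that "change" means the absolute value of the log ratio on a given array. The function name and the example values are hypothetical.

```python
import math


def passes_change_filter(values, fold_change=4.0, min_arrays=2):
    """True if at least min_arrays values show the requested fold change.

    Assumes values are log2 ratios; missing values (None) are ignored.
    """
    threshold = math.log2(fold_change)  # fold change expressed in log2 units
    hits = sum(1 for v in values if v is not None and abs(v) >= threshold)
    return hits >= min_arrays


# Made-up log2 ratios for two genes.
strong = [2.1, -2.5, 0.3, None]   # two arrays beyond the threshold
weak = [2.1, 0.5, -1.9, 0.0]      # only one array beyond the threshold
```

Both increases and decreases count toward the total, because the test uses the absolute value.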
2d. Cluster these genes using Hierarchical Clustering with Correlation (centered) as the similarity measure and Complete Linkage Clustering. Cluster on genes only; the arrays are in experimental order already. The Cluster program saves two new files, .cdt and .gtr. An .atr file is made if the arrays are clustered as well. What are the four columns in the .gtr file?
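For intuition about these two settings, here is a minimal Python sketch, not Cluster's implementation. It assumes the centered correlation is the ordinary Pearson correlation, converts it to a distance as 1 - r, and defines the complete linkage distance between two clusters as the largest pairwise distance. The profiles are made up and missing values are not handled.

```python
import math


def pearson(x, y):
    """Ordinary (centered) Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def complete_linkage_distance(cluster_a, cluster_b):
    """Largest pairwise (1 - correlation) distance between two clusters."""
    return max(1.0 - pearson(p, q) for p in cluster_a for q in cluster_b)


# Made-up expression profiles over four arrays.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same shape as g1, larger amplitude
g3 = [4.0, 3.0, 2.0, 1.0]   # mirror image of g1
distance = complete_linkage_distance([g1], [g2, g3])
```

Here g1 and g2 are perfectly correlated despite their different amplitudes, while g3 is perfectly anti-correlated with g1, so the complete linkage distance between {g1} and {g2, g3} is set by g3.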
2e. Start TreeView and open the cluster file (.cdt). You will have to move the 'invisible dividers' between the tree and the sections to see the full set of arrays. By default genes are linked to SGD, the yeast genome database. Clicking on a gene in the rightmost panel will bring up the entry for this gene in the yeast genome database. Red/green are the most common colors used for viewing clusters. The intensity of the colors can be changed in the Settings menu, which is why color scale bars are needed when displaying cluster images. Grey is typically used to indicate missing data points. What do red, green, and black indicate?
2f. Find the gene ENO2. In which biological process is it involved?
2g. Select the cluster containing ENO2 and genes with very similar expression. Summarize the expression signature for this cluster--how many genes, how does their expression change and in what experiments, and do they have annotations in common?
2h. One Hierarchical Clustering option is to use 'Absolute correlation'. How does this differ from 'Correlation'?
3. K-means clustering.
3a. Use the genes you selected in 2c and cluster the genes using k-means. Cluster genes using k-medoids. Create 10 nodes and run 50 iterations. Approximately how many genes were moved on the last iteration?
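To see what "genes moved" means, here is a minimal Python sketch of a basic k-means loop that records how many profiles change cluster on each iteration. It is an illustration only: it assumes Euclidean distance and caller-supplied starting centers, which may differ from what Cluster does internally. All names and values are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_profile(profiles):
    """Column-wise mean of a list of profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def kmeans(profiles, centers, max_iterations=50):
    """Basic k-means; returns (assignments, moves_per_iteration)."""
    centers = [list(c) for c in centers]
    assignments = [None] * len(profiles)
    moves_per_iteration = []
    for _ in range(max_iterations):
        moved = 0
        # Assignment step: put each profile in the cluster of its nearest center.
        for i, profile in enumerate(profiles):
            nearest = min(
                range(len(centers)),
                key=lambda k: squared_distance(profile, centers[k]),
            )
            if nearest != assignments[i]:
                assignments[i] = nearest
                moved += 1
        moves_per_iteration.append(moved)
        if moved == 0:
            break  # converged: nothing changed cluster
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(profiles, assignments) if a == k]
            if members:
                centers[k] = mean_profile(members)
    return assignments, moves_per_iteration


# Four made-up profiles and two starting centers.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
assignments, moves = kmeans(profiles, centers=[[0.0, 0.0], [4.0, 4.0]])
```

In this toy example every profile is assigned on the first pass and none move on the second, so the loop stops early. With real data, a large number of moves on the final iteration suggests the clusters have not yet stabilized.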
3b. Now let's visualize the k-means clusters. The output file created by the k-means method has a line containing 'NONE' between the clusters. We can visualize this output by modifying the file so that Cluster can read it, then having Cluster make a Hierarchical Cluster with no clustering. To do this, open the k-means output file in Excel. Find lines containing 'NONE' and make each such line look like a gene--"None", "None", and 1 in the first three columns. Save this file as a tab-delimited text file. Now open it in Cluster. Go to Hierarchical Clustering, unselect Arrays->Cluster and unselect Genes->Cluster. Then run the clustering and it will create .cdt and .gtr output files. Open the output files in TreeView. Examine the k-means clusters driven by sporulation. Are these genes typically also cell cycle genes?
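If you prefer to script the 'NONE' edit instead of doing it by hand in Excel, the following Python sketch shows one way. It assumes the file is tab-delimited and that a separator line has a field that is exactly 'NONE'; check your own output file before relying on it. The gene names and values in the example are made up.

```python
def fix_separator_lines(lines):
    """Rewrite 'NONE' separator lines so they look like gene rows.

    The first three columns become "None", "None", "1"; any further
    columns on that line are kept as they are.
    """
    fixed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if "NONE" in fields:
            fields = ["None", "None", "1"] + fields[3:]
        fixed.append("\t".join(fields))
    return fixed


# Hypothetical excerpt of a k-means output file.
example = [
    "GENE1\tfirst gene\t1\t0.5\t-0.2",
    "NONE",
    "GENE2\tsecond gene\t1\t1.1\t0.9",
]
result = fix_separator_lines(example)
```

The rewritten lines can then be saved as a tab-delimited text file and opened in Cluster as described above.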
3c. Find the k-means cluster containing ENO2. This is now a larger cluster. What functions are common for genes in this cluster? For the purpose of this question it is enough to examine the provided annotations; you don't have to look every gene up in SGD.
The following question does not require the Cluster/TreeView programs.
4. The horse genome is being sequenced to 6.8X using WGS and the first assembly was released in 2008. You are designing a long oligo spotted microarray for the horse. You base your design on a set of 323,000 EST sequences along with gene predictions made from the assembled horse genome sequence. Your goal is to design an array with one oligo per gene. Briefly describe the steps you would take to design this array. Give your answer as a numbered series of steps.
BIO520