How to use Cluster and Treeview
May 6, 2009
One of my colleagues was interested in visualizing some data based on CpG data using heatmap approaches adopted by researchers in gene expression microarray. I pointed to him to Cluster and Treeview, one of the the earliest free standalone softwares developed for heatmap visualization in gene expression studies and developed in Mike Eisen’s group. You can apply the same approach to any normally distributed variable instead of CpG or gene expression data.
Here is a quick tutorial on using the data.
1. Download and install Cluster and Treeview from http://rana.lbl.gov/EisenSoftware.htm. For Cluster, you will need to download the zip file and decompress it to find the SETUP.EXE file.
2. Format your input dataset. Here is a simple example. The file needs to be tab-separated with missing values coded as empty cells. (see the File Format Help on the Cluster software for further info).
| UID | NAME | GWEIGHT | asthma1 | asthma2 | asthma3 | healthy1 | healthy2 | healthy3 | healthy4 |
| EWEIGHT | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||
| CPG1 | AAA | 1 | 0.23 | -1.79 | -1.29 | -1.56 | -0.27 | -0.38 | |
| CPG2 | BBB | 1 | 0.41 | -0.89 | -1.06 | -1.6 | -1.84 | -1.6 | |
| CPG3 | CCC | 1 | 0.61 | -0.07 | -1.29 | -1.29 | -2 | -1.84 | -2.25 |
| CPG4 | DDD | 1 | 0.16 | -0.15 | -0.76 | -1.25 | -1.89 | -1.74 | -1.6 |
| CPG5 | EEE | 1 | 0.03 | 1.39 | -0.84 | -1.64 | -2.84 | -2.47 | -2.4 |
| CPG6 | FFF | 1 | -0.18 | -0.18 | -0.62 | -1.32 | -1.69 | -1.43 | -1.7 |
3. Launch Cluster (try START -> Cluster -> Cluster). Press ‘Load file’. Check the number of rows and columns matches the number of CpG islands and subjects.
4. (optional) In the Cluster software, you can “filter” out potentially uninteresting CpG islands by some criteria (e.g. missingness, variance) if you wish.
5. (optional) If your input file is arranged by asthmatics followed by non-asthmatics, then you should untick the cluster arrays in the “heirarchical clustering” tab.
6. Press the ‘Average Linkage Clustering’ button (or complete or single linkage) at the bottom of “Hierarchical Clustering” tab. This should produce 3 files (including cdt, gtr).
7. Start Treeview (try START -> EisenSoftare -> Treeview). Load the cdt file to see the plot. Click on the dendrograms, CpG islands to navigate and zoom etc.
8. (optional) You might want to change the X, Y pixel sizes (Settings -> Options) to get a bigger picture.
I appears that these softwares are no longer being actively developed anymore but that is fine since they do a limited amount of analysis extremely well.
Alternatives options:
- You can use R to generate similar plots (but not zoomable and requires command line programming) or any other main statistical software
- I have heard good stuff about the dChip software but I have not tried it myself.
- There are also a couple of free webtools where you can upload your data to generate these plots. For example [1], [2]
Automating backward selection – an alternative stepAIC() followed by iterative dropterm() and update() functions
November 20, 2008
The stepAIC() function from the R package MASS can automate the submodel selection process. The authors state, on page 176 of their bookModern Applied Statistics with S (ISBN 0387954570), that “… selecting terms on basis of of AIC can be somewhat permissive in its choice of termsm being roughly equivalent to choosing an F-cutoff of 2″, and thus one have to proceed manually with iterative application of the dropterm() and update() function.
There is no doubt that the current implementation inspires good practice in model checking and thinking, but there are times where I want to completely automate the process. An example is when I want to quickly select the minimal submodel for many tens of phenotypes.
Here is a function that is capable of doing this. The speed is comparable with stepAIC. Use it at your own risk.
backstepAIC.glm <- function(fmla, data, family, AIC.p.cut=0.05, verbose=TRUE, …){
## this will only work reliably on datasets without missing values (same issue with stepAIC)
data <- data[ , all.vars(fmla)]
data[ complete.cases(data), ]
dt <- data.frame(Pr=1) # to initiate the loop
while( max(dt$Pr, na.rm=T) > AIC.p.cut ){
fit <- eval( substitute( glm( fmla, data=data, family=family, …) ), parent.frame() )
if( length( coefficients(fit) ) == 0 ) break()
## intercept only model, cannot drop anymore term so exit while loop
dt <- dropterm(fit , test=”Chisq” )
if(verbose) cat(“Attempting to drop term”, rownames(dt)[ which.max(dt$Pr) ], “\n”)
nm <- setdiff( rownames(dt)[ -which.max(dt$Pr) ], “<none>” )
nm <- c( “1″, nm )
fmla <- as.formula(paste(as.character(fmla)[2], “~”, paste(nm, collapse= “+”)))
}
return( fit )
}
For Cox models, replace the model fitting line with this:
fit <- eval( substitute( coxph( fmla, data=data, x=T, y=T, … ) ), parent.frame() )
Hello world!
November 26, 2006
Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!