Category Archives: Uncategorized

How to use Cluster and Treeview

One of my colleagues was interested in visualizing some data based on CpG data using heatmap approaches adopted by researchers in gene expression microarray. I pointed to him to Cluster and Treeview, one of the the earliest free standalone softwares developed for heatmap visualization in gene expression studies and developed in Mike Eisen’s group. You can apply the same approach to any normally distributed variable instead of CpG or gene expression data.

Here is a quick tutorial on using the data.

1. Download and install Cluster and Treeview from For Cluster, you will need to download the zip file and decompress it to find the SETUP.EXE file.

2. Format your input dataset. Here is a simple example. The file needs to be tab-separated with missing values coded as empty cells. (see the File Format Help on the Cluster software for further info).

UID NAME GWEIGHT asthma1 asthma2 asthma3 healthy1 healthy2 healthy3 healthy4
EWEIGHT 1 1 1 1 1 1 1
CPG1 AAA 1 0.23 -1.79 -1.29 -1.56 -0.27 -0.38
CPG2 BBB 1 0.41 -0.89 -1.06 -1.6 -1.84 -1.6
CPG3 CCC 1 0.61 -0.07 -1.29 -1.29 -2 -1.84 -2.25
CPG4 DDD 1 0.16 -0.15 -0.76 -1.25 -1.89 -1.74 -1.6
CPG5 EEE 1 0.03 1.39 -0.84 -1.64 -2.84 -2.47 -2.4
CPG6 FFF 1 -0.18 -0.18 -0.62 -1.32 -1.69 -1.43 -1.7

3. Launch Cluster (try START -> Cluster -> Cluster). Press ‘Load file’. Check the number of rows and columns matches the number of CpG islands and subjects.

4. (optional) In the Cluster software, you can “filter” out potentially uninteresting CpG islands by some criteria (e.g. missingness, variance) if you wish.

5. (optional) If your input file is arranged by asthmatics followed by non-asthmatics, then you should untick the cluster arrays in the “heirarchical clustering” tab.

6. Press the ‘Average Linkage Clustering’ button (or complete or single linkage) at the bottom of “Hierarchical Clustering” tab. This should produce 3 files (including cdt, gtr).

7. Start Treeview (try START -> EisenSoftare -> Treeview). Load the cdt file to see the plot. Click on the dendrograms, CpG islands to navigate and zoom etc.

8. (optional) You might want to change the X, Y pixel sizes (Settings -> Options) to get a bigger picture.

I appears that these softwares are no longer being actively developed anymore but that is fine since they do a limited amount of analysis extremely well.

Alternatives options:

  1. You can use R to generate similar plots (but not zoomable and requires command line programming) or any other main statistical software
  2. I have heard good stuff about the dChip software but I have not tried it myself.
  3. There are also a couple of free webtools where you can upload your data to generate these plots. For example [1], [2]

Automating backward selection – an alternative stepAIC() followed by iterative dropterm() and update() functions

The stepAIC() function from the R package MASS can automate the submodel selection process. The authors state, on page 176 of their bookModern Applied Statistics with S (ISBN 0387954570), that “… selecting terms on basis of of AIC can be somewhat permissive in its choice of termsm being roughly equivalent to choosing an F-cutoff of 2”, and thus one have to proceed manually with iterative application of the dropterm() and update() function.

There is no doubt that the current implementation inspires good practice in model checking and thinking, but there are times where I want to completely automate the process. An example is when I want to quickly select the minimal submodel for many tens of phenotypes.

Here is a function that is capable of doing this. The speed is comparable with stepAIC. Use it at your own risk.

backstepAIC.glm <- function(fmla, data, family, AIC.p.cut=0.05, verbose=TRUE, …){

## this will only work reliably on datasets without missing values (same issue with stepAIC)
data <- data[ , all.vars(fmla)]
data[ complete.cases(data), ]

dt  <- data.frame(Pr=1)   # to initiate the loop

while( max(dt$Pr, na.rm=T) > AIC.p.cut ){

fit <- eval( substitute( glm( fmla, data=data, family=family, …) ), parent.frame() )

if( length( coefficients(fit) ) == 0 ) break()
## intercept only model, cannot drop anymore term so exit while loop

dt <- dropterm(fit , test=”Chisq” )
if(verbose) cat(“Attempting to drop term”, rownames(dt)[ which.max(dt$Pr) ], “\n”)
nm <- setdiff( rownames(dt)[ -which.max(dt$Pr) ], “<none>” )
nm <- c( “1”, nm )

fmla <- as.formula(paste(as.character(fmla)[2], “~”, paste(nm, collapse= “+”)))


return( fit )

For Cox models, replace the model fitting line with this:

fit <- eval( substitute( coxph( fmla, data=data, x=T, y=T, … ) ), parent.frame() )

Nice animation of the fall of man

Hello world!

Welcome to This is your first post. Edit or delete it and start blogging!