How to use Cluster and Treeview

One of my colleagues was interested in visualizing some CpG data using the heatmap approaches adopted by researchers working with gene expression microarrays. I pointed him to Cluster and Treeview, among the earliest free standalone programs for heatmap visualization in gene expression studies, both developed in Mike Eisen’s group. You can apply the same approach to any normally distributed variable, not just CpG or gene expression data.

Here is a quick tutorial on using them.

1. Download and install Cluster and Treeview. For Cluster, you will need to download the zip file and decompress it to find the SETUP.EXE file.

2. Format your input dataset. Here is a simple example. The file needs to be tab-separated, with missing values coded as empty cells (see the File Format help in the Cluster software for further information).

UID NAME GWEIGHT asthma1 asthma2 asthma3 healthy1 healthy2 healthy3 healthy4
EWEIGHT 1 1 1 1 1 1 1
CPG1 AAA 1 0.23 -1.79 -1.29 -1.56 -0.27 -0.38
CPG2 BBB 1 0.41 -0.89 -1.06 -1.6 -1.84 -1.6
CPG3 CCC 1 0.61 -0.07 -1.29 -1.29 -2 -1.84 -2.25
CPG4 DDD 1 0.16 -0.15 -0.76 -1.25 -1.89 -1.74 -1.6
CPG5 EEE 1 0.03 1.39 -0.84 -1.64 -2.84 -2.47 -2.4
CPG6 FFF 1 -0.18 -0.18 -0.62 -1.32 -1.69 -1.43 -1.7
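
If your data already sit in R, a file in this layout can be written with a few lines of code. This is only a minimal sketch: the matrix x, the function name write_cluster_input() and the output file name are made up for illustration, and the NAME column simply repeats the row names. The EWEIGHT row leaves the NAME and GWEIGHT cells empty.

```r
# Hypothetical example data: rows are CpG sites, columns are subjects
set.seed(1)
x <- matrix(round(rnorm(6 * 7), 2), nrow = 6,
            dimnames = list(paste0("CPG", 1:6),
                            c(paste0("asthma", 1:3), paste0("healthy", 1:4))))
x[1, 4] <- NA   # a missing value, to be written as an empty cell

# Illustrative helper (not part of Cluster): write a matrix in the
# tab-separated layout shown above
write_cluster_input <- function(x, file, name = rownames(x)) {
  # Header row: UID, NAME, GWEIGHT, then one column per subject
  header  <- c("UID", "NAME", "GWEIGHT", colnames(x))
  # EWEIGHT row: NAME and GWEIGHT cells stay empty, every subject gets weight 1
  eweight <- c("EWEIGHT", "", "", rep("1", ncol(x)))
  # Data values as text, with NA turned into empty cells
  vals <- matrix(as.character(x), nrow = nrow(x))
  vals[is.na(vals)] <- ""
  # One row per CpG site: identifier, name, row weight of 1, then the values
  body <- cbind(rownames(x), name, "1", vals)
  out  <- rbind(header, eweight, body)
  writeLines(apply(out, 1, paste, collapse = "\t"), file)
}

outfile <- file.path(tempdir(), "cpg_for_cluster.txt")
write_cluster_input(x, outfile)
```

Open the resulting file in a text editor before loading it into Cluster, to check that the header and EWEIGHT rows line up with the data columns.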

3. Launch Cluster (try START -> Cluster -> Cluster). Press ‘Load file’. Check that the number of rows and columns matches the number of CpG islands and subjects.

4. (optional) In the Cluster software, you can “filter” out potentially uninteresting CpG islands by some criteria (e.g. missingness, variance) if you wish.

5. (optional) If your input file is already arranged with asthmatics followed by non-asthmatics and you want to keep that order, untick the option to cluster arrays in the “Hierarchical Clustering” tab.

6. Press the ‘Average Linkage Clustering’ button (or complete or single linkage) at the bottom of the “Hierarchical Clustering” tab. This should produce 3 files (including cdt, gtr).

7. Start Treeview (try START -> EisenSoftware -> Treeview). Load the cdt file to see the plot. Click on the dendrograms or CpG islands to navigate, zoom, etc.

8. (optional) You might want to change the X, Y pixel sizes (Settings -> Options) to get a bigger picture.

It appears that these programs are no longer being actively developed, but that is fine since they do a limited amount of analysis extremely well.

Alternative options:

  1. You can use R, or any other major statistical package, to generate similar plots (though they are not zoomable and require command-line programming).
  2. I have heard good things about the dChip software, but I have not tried it myself.
  3. There are also a couple of free web tools where you can upload your data to generate these plots. For example [1], [2]
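
For the R route, here is a minimal sketch using only base R. The data are simulated purely for illustration, and the distance is Euclidean (the default of heatmap()), which may differ from the similarity metric you choose in Cluster. Average linkage is requested explicitly to mirror the button used in the tutorial, and Colv = NA keeps the subjects in their input order.

```r
# Hypothetical data: 6 CpG sites (rows) by 7 subjects (columns)
set.seed(1)
x <- matrix(rnorm(6 * 7), nrow = 6,
            dimnames = list(paste0("CPG", 1:6),
                            c(paste0("asthma", 1:3), paste0("healthy", 1:4))))

# Average linkage clustering of the rows
avg_linkage <- function(d) hclust(d, method = "average")

hm <- heatmap(x,
              Colv      = NA,            # do not reorder the subjects
              hclustfun = avg_linkage,   # linkage method for the row dendrogram
              scale     = "none",        # plot the values as they are
              col       = colorRampPalette(c("green", "black", "red"))(64))

# heatmap() invisibly returns the row and column ordering used in the plot
hm$rowInd
```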

WGA Viewer software from Duke University

A genome-wide association (GWA) study often involves analyzing the effects of 100,000s of single nucleotide polymorphisms (SNPs) on a disease outcome or trait. Visualizing such high density data can often prove tricky, especially if the investigator is interested in specific regions.

I have recently discovered a free tool called WGAviewer from Duke University that can greatly help with the visualization part (it does not perform any analysis). The software is based on Java, so it should be platform independent (I have only used and tested it on Windows so far). Some of the key features include:

  1. QQ plots
  2. Manhattan plots with *interactive zoom* in and out
  3. Zoom to a region by gene name or region easily and visualize results
  4. Ability to select and annotate the top N SNPs
  5. Automatic update of annotation on Ensembl and HapMap data
  6. Calculate LD linkage for a particular region etc
  7. Take publication quality snapshot pictures

One of the hassles I found was formatting the data for input. The manual suggests several ways of preparing the input using MAP files etc. However, the easiest way I found was to simply create a space-separated ASCII file (using R or even Excel) with the following columns: rsid, chromosome (1-22, X, Y, XY, M), Map (coordinate on the chromosome) and -logP (negative log base 10 of the p-value).

SNP chromosome Map -logP
MitoA10045G M 10045 2.04858284222835
MitoT9900C M 9900 0.233064674990652
MitoT9951C M 9951 0.0641728753170715
rs1000000 12 125456933 1.16139248878691
rs10000010 4 21227772 0.149317624784192
rs10000023 4 95952929 1.15832462919552
rs10000030 4 103593179 0.106028436059944
rs10000041 4 165841405 0.221366644208304
rs1000007 2 237416793 0.213983677592946
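
A file like this can be written from R in a few lines. The sketch below assumes a data frame of association results with raw p-values. The data frame res, its column p and the output file name are hypothetical; only the four column headings follow the example above.

```r
# Hypothetical association results; in practice these come from your own analysis
res <- data.frame(
  SNP        = c("rs1000000", "rs10000010", "MitoA10045G"),
  chromosome = c("12", "4", "M"),
  # integer positions, so large coordinates are never written in scientific notation
  Map        = c(125456933L, 21227772L, 10045L),
  p          = c(0.069, 0.709, 0.0089),
  stringsAsFactors = FALSE
)

out <- res[, c("SNP", "chromosome", "Map")]
out[["-logP"]] <- -log10(res$p)   # negative log base 10 of the p-value

outfile <- file.path(tempdir(), "wgaviewer_input.txt")
write.table(out, outfile, sep = " ", quote = FALSE, row.names = FALSE)
```

Setting quote = FALSE and row.names = FALSE keeps the file to plain space-separated columns with a single header line.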

You will need to create and load one file per analysis, which is a bit annoying if you have many analyses to visualize. I hope they add new features to visualize and (even better) compare different results in the near future. Imagine being able to superimpose Manhattan plots from two different studies or techniques!

The Web is 20 years old and its impact on journal reference management

I got this email from a certain Anderson Brown on the BioConductor mailing list. It actually lists some useful tools after a slightly wordy introduction and before a sales pitch (actually WizFolio offers a free account limited to 100MB or 200 items). Enjoy!


March 2009 marks the 20th anniversary of the invention of the Web.   Like all great inventions, it arises out of an unmet need that badly needed a solution.  Tim Berners-Lee foresaw the great potential that can be unlocked by connecting data across disparate operating systems.  You can see the full talk by Tim Berners-Lee as he explains it at:

Fast forward to 2004 when the term “Web 2.0” was first coined.  This term now generally has the connotation of instant “read-write” and increased connectivity on the Web as exemplified by applications like Facebook, MySpace and Twitter. How will such technology impact upon the busy scientists’ workflow in terms of searching, compiling, organizing, sharing and analyzing peer reviewed journal articles?  A new crop of journal reference management applications have emerged within the last 18 months.  The term “journal reference management” is used here as opposed to the older term “bibliographic management” to emphasize the importance of managing and linking the bibliographic data with the PDFs.

The biggest frustration for the busy scientists is the difficulty of locating and managing the PDFs from a set of bibliographic data.  I have listed a number of recently released journal reference management applications that addresses to a certain degree this frustration.

Zotero – A research tool that helps you gather, organize, and analyze sources.

Labmeeting – Organize, search, and store your paper collection and lab protocols.

Pubget – Similar to Pubmed, except you get the PDFs right away.

Mendeley – Academic software for managing & sharing your research papers.

At WizFolio, we started 2 years ago with a vision of creating a web based application that would manage bibliographic data and PDFs with the same ease that you would MP3 files.  Tightly coupled with the application is a citation tool that the user can customize on-the-fly with instantaneous preview.  We invite you to give WizFolio Web 2.0 a try at and appreciate any feedback and comments that will make the application better.

Automating backward selection – an alternative to stepAIC() followed by iterative dropterm() and update() functions

The stepAIC() function from the R package MASS can automate the submodel selection process. The authors state, on page 176 of their book Modern Applied Statistics with S (ISBN 0387954570), that “… selecting terms on basis of AIC can be somewhat permissive in its choice of terms, being roughly equivalent to choosing an F-cutoff of 2”, and thus one has to proceed manually with iterative application of the dropterm() and update() functions.
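
To make the manual procedure concrete, one round of it might look like the sketch below. The model is purely illustrative: it uses the birthwt data that ships with MASS and an arbitrary choice of predictors. The p-value column of the dropterm() table is located by its "Pr" prefix rather than by an exact name.

```r
library(MASS)  # provides dropterm() and the birthwt data

# Illustrative starting model (the choice of predictors is arbitrary)
fit <- glm(low ~ age + lwt + smoke + ht, data = birthwt, family = binomial)

# One manual round: test each term, then drop the least significant one
dt    <- dropterm(fit, test = "Chisq")
pcol  <- grep("^Pr", names(dt))                 # p-value column, found by prefix
worst <- rownames(dt)[which.max(dt[[pcol]])]    # term with the largest p-value
fit2  <- update(fit, as.formula(paste(". ~ . -", worst)))

# Repeat dropterm() and update() on fit2 until every remaining term
# falls below the chosen p-value cut-off
```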

There is no doubt that the current implementation inspires good practice in model checking and thinking, but there are times when I want to completely automate the process. An example is when I want to quickly select the minimal submodel for many tens of phenotypes.

Here is a function that is capable of doing this. The speed is comparable with stepAIC. Use it at your own risk.

library(MASS)  # for dropterm()

backstepAIC.glm <- function(fmla, data, family, AIC.p.cut = 0.05, verbose = TRUE, ...) {

  ## this will only work reliably on datasets without missing values (same issue with stepAIC)
  data <- data[ , all.vars(fmla)]
  data <- data[ complete.cases(data), ]   # the result must be assigned back to take effect

  dt <- data.frame(Pr = 1)   # to initiate the loop

  while( max(dt$Pr, na.rm = TRUE) > AIC.p.cut ){

    fit <- eval( substitute( glm( fmla, data = data, family = family, ... ) ), parent.frame() )

    ## no terms left besides the intercept, cannot drop any more terms so exit while loop
    ## (the second check is needed for glm, which always keeps an intercept coefficient)
    if( length( coefficients(fit) ) == 0 ||
        length( attr(terms(fit), "term.labels") ) == 0 ) break

    dt <- dropterm(fit, test = "Chisq")
    if(verbose) cat("Attempting to drop term", rownames(dt)[ which.max(dt$Pr) ], "\n")

    ## keep every term except the least significant one
    nm <- setdiff( rownames(dt)[ -which.max(dt$Pr) ], "<none>" )
    nm <- c( "1", nm )

    fmla <- as.formula(paste(as.character(fmla)[2], "~", paste(nm, collapse = "+")))
  }

  return( fit )
}

For Cox models, replace the model fitting line with this:

fit <- eval( substitute( coxph( fmla, data=data, x=TRUE, y=TRUE, ... ) ), parent.frame() )

JabRef – a free and powerful citation manager

JabRef is a clean and powerful BibTeX manager that works well with LaTeX writing. There are even some guides on how to use it with Word 2003 or Word 2007, and dedicated software for Windows.

One of the best features is that it can fetch records from Medline (function key F5) when connected to the internet. You just need to enter the PubMed ID and it automagically extracts and stores the data in BibTeX format. All you have to do then is give it a key.

And best of all, it is free as in beer (i.e. zero cost) and free as in speech (i.e. open source).

Review of IBM Lenovo X41 tablet PC

I have been using my tablet PC laptop, an IBM Lenovo X41, for a couple of months now and thought I should write a review of it. I believe the X61 sports a similar design except with a faster dual core processor.


  • Weighing around 1.6kg, it is classified as an ultraportable and it certainly feels that way.
  • The ink and pen experience is fantastic. I am not too worried about scratching the screen now.
  • Good build design and solid feel. The swivel hinge is solid even when using it on a moving vehicle. The extra girth is good for gripping in tablet mode.
  • Full sized keyboard and very responsive.
  • Built-in utilities software for backing up, restoring etc. I haven’t used it much but many do compliment IBM on this.
  • (minor) Does not run too hot which is nice especially when using it in tablet mode.
  • (minor) The microphone is of a decent quality and works well (except in tablet mode where the lid covers the mic)


  • No mousepad! Why? Can anyone explain to me why IBM / Lenovo abandoned the track pad in favour of only the track point (aka the “nipple”), thus alienating a large number of laptop users?
  • When the laptop is under heavy use, the cursor randomly jumps about.
  • The speakers are placed at the bottom of the laptop. Solution: place the laptop on top of a book or something hard, so that the sound is projected better.
  • No integrated webcam. OK, I can understand that my refurbished X41 was released in 2005, when integrated webcams were not standard. However, Lenovo could have added this feature to the X61, which was released in 2007.
  • No in-built optical CD or DVD drive. Not a big deal if you are willing to work from the USB or network drives.
  • Hard disk clicking problem – this is a well known issue with the X series.
  • (minor) The keyboard layout is such that the bottom leftmost key is the function (FN) key instead of the Control (CTRL) key, and there is no Windows key. Solution: there are programs to remap the keyboard.

Verdict: If Lenovo integrates a track pad (with or without the track point) and a webcam, fixes the keyboard layout deficiencies and perhaps adds an integrated DVD burner in the next version of their tablet PC, then I will be placing an order when it comes out.

Nice animation of the fall of man