How to use Cluster and Treeview
May 6, 2009
One of my colleagues was interested in visualizing some CpG data using the heatmap approaches adopted by researchers working with gene expression microarrays. I pointed him to Cluster and Treeview, two of the earliest free standalone programs for heatmap visualization in gene expression studies, both developed in Mike Eisen’s group. You can apply the same approach to any normally distributed variable, not just CpG or gene expression data.
Here is a quick tutorial on using the software.
1. Download and install Cluster and Treeview from http://rana.lbl.gov/EisenSoftware.htm. For Cluster, you will need to download the zip file and decompress it to find the SETUP.EXE file.
2. Format your input dataset. The file needs to be tab-separated, with missing values coded as empty cells (see the File Format Help in the Cluster software for further information). Here is a simple example.
| UID | NAME | GWEIGHT | asthma1 | asthma2 | asthma3 | healthy1 | healthy2 | healthy3 | healthy4 |
| EWEIGHT | | | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| CPG1 | AAA | 1 | 0.23 | -1.79 | -1.29 | -1.56 | -0.27 | -0.38 | |
| CPG2 | BBB | 1 | 0.41 | -0.89 | -1.06 | -1.6 | -1.84 | -1.6 | |
| CPG3 | CCC | 1 | 0.61 | -0.07 | -1.29 | -1.29 | -2 | -1.84 | -2.25 |
| CPG4 | DDD | 1 | 0.16 | -0.15 | -0.76 | -1.25 | -1.89 | -1.74 | -1.6 |
| CPG5 | EEE | 1 | 0.03 | 1.39 | -0.84 | -1.64 | -2.84 | -2.47 | -2.4 |
| CPG6 | FFF | 1 | -0.18 | -0.18 | -0.62 | -1.32 | -1.69 | -1.43 | -1.7 |
3. Launch Cluster (try START -> Cluster -> Cluster). Press ‘Load file’. Check that the number of rows and columns matches the number of CpG islands and subjects.
4. (optional) In the Cluster software, you can “filter” out potentially uninteresting CpG islands by some criteria (e.g. missingness, variance) if you wish.
5. (optional) If your input file is already arranged with asthmatics followed by non-asthmatics and you want to keep that order, untick the option to cluster arrays in the “Hierarchical Clustering” tab.
6. Press the ‘Average Linkage Clustering’ button (or complete or single linkage) at the bottom of the “Hierarchical Clustering” tab. This should produce 3 files (including cdt, gtr).
7. Start Treeview (try START -> EisenSoftware -> Treeview). Load the cdt file to see the plot. Click on the dendrograms or the CpG islands to navigate, zoom, etc.
8. (optional) You might want to change the X, Y pixel sizes (Settings -> Options) to get a bigger picture.
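If your data already live in R, the input file from step 2 can be written programmatically. The following is only a minimal sketch: the function name `write_cluster_input`, the tiny matrix and the gene names are made up for illustration, and the layout (a header row, an EWEIGHT row with blank NAME and GWEIGHT cells, then one row per CpG with missing values as empty cells) is my reading of the example table above.

```r
# Hypothetical helper: writes a numeric matrix (rows = CpGs, columns = subjects)
# as a tab-separated text file in the layout shown in step 2.
write_cluster_input <- function(values, gene_names, path) {
  header  <- c("UID", "NAME", "GWEIGHT", colnames(values))
  eweight <- c("EWEIGHT", "", "", rep("1", ncol(values)))
  # Missing values become empty cells; ifelse() keeps the matrix shape
  cells <- ifelse(is.na(values), "", as.character(values))
  body  <- cbind(rownames(values), gene_names, "1", cells)
  lines <- c(paste(header, collapse = "\t"),
             paste(eweight, collapse = "\t"),
             unname(apply(body, 1, paste, collapse = "\t")))
  writeLines(lines, path)
  invisible(path)
}

# Made-up example values, including two missing measurements
values <- matrix(
  c(0.23, -1.79,    NA,
    0.41,    NA, -1.06),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("CPG1", "CPG2"), c("asthma1", "asthma2", "healthy1"))
)

out_file <- tempfile(fileext = ".txt")
write_cluster_input(values, gene_names = c("AAA", "BBB"), path = out_file)
cluster_lines <- readLines(out_file)
```

Open the resulting file in a text editor (or in Cluster itself) to check that the row and column counts are what you expect before clustering.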
It appears that these programs are no longer being actively developed, but that is fine since they do a limited amount of analysis extremely well.
Alternative options:
- You can use R, or any other major statistical software, to generate similar plots (but they are not zoomable and require command-line programming)
- I have heard good stuff about the dChip software but I have not tried it myself.
- There are also a couple of free webtools where you can upload your data to generate these plots. For example [1], [2]
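For the R route, here is a rough sketch using only base R. It takes the four complete rows (CPG3 to CPG6) from the example table above, clusters the CpGs with average linkage, and leaves the subjects in their original order, similar to unticking array clustering in step 5. The correlation-based distance is my assumption and may not match the similarity metric you pick in Cluster.

```r
# Complete rows from the example table (CPG3 to CPG6)
x <- matrix(
  c( 0.61, -0.07, -1.29, -1.29, -2.00, -1.84, -2.25,
     0.16, -0.15, -0.76, -1.25, -1.89, -1.74, -1.60,
     0.03,  1.39, -0.84, -1.64, -2.84, -2.47, -2.40,
    -0.18, -0.18, -0.62, -1.32, -1.69, -1.43, -1.70),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("CPG", 3:6),
                  c("asthma1", "asthma2", "asthma3",
                    "healthy1", "healthy2", "healthy3", "healthy4"))
)

# Assumption: distance between CpGs = 1 - Pearson correlation of their profiles
d  <- as.dist(1 - cor(t(x)))
hc <- hclust(d, method = "average")   # average linkage clustering of the rows

# Colv = NA keeps the subjects in input order; scale = "none" plots raw values
heatmap(x, Rowv = as.dendrogram(hc), Colv = NA, scale = "none")
```

Unlike Treeview, the resulting plot is static, so for large datasets you may want to write it to a big PDF or PNG device instead of the screen.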
WGA Viewer software from Duke University
March 18, 2009
A genome-wide association (GWA) study often involves analyzing the effects of hundreds of thousands of single nucleotide polymorphisms (SNPs) on a disease outcome or trait. Visualizing such high density data can often prove tricky, especially if the investigator is interested in specific regions.
I have recently discovered a free tool called WGAViewer from Duke University (http://people.genome.duke.edu/~dg48/WGAViewer/) that can greatly help with the visualization part (it does not perform any analysis). The software is based on Java so it should be platform independent (I have only used and tested it on Windows so far). Some of the key features include:
- QQ plots
- Manhattan plots with *interactive zoom* in and out
- Zoom to a region by gene name or region easily and visualize results
- Ability to select and annotate the top N SNPs
- Automatic update of annotation on Ensembl and HapMap data
- Calculate linkage disequilibrium (LD) for a particular region etc
- Take publication quality snapshot pictures
One of the hassles I found was formatting the data for input. The manual suggests several ways of preparing the input, using MAP files etc. However, the easiest way I found was to simply create a space-separated ASCII file (using R or even Excel) with the following columns: rsid, chromosome (1-22, X, Y, XY, M), Map (coordinate on the chromosome) and -logP (negative log base 10 of the p-values).
SNP chromosome Map -logP
MitoA10045G M 10045 2.04858284222835
MitoT9900C M 9900 0.233064674990652
MitoT9951C M 9951 0.0641728753170715
rs1000000 12 125456933 1.16139248878691
rs10000010 4 21227772 0.149317624784192
rs10000023 4 95952929 1.15832462919552
rs10000030 4 103593179 0.106028436059944
rs10000041 4 165841405 0.221366644208304
rs1000007 2 237416793 0.213983677592946
...
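In R, a file like this can be produced with a few lines. This is just a sketch: the `results` data frame, its SNP names and its p-values are invented for illustration, and in practice you would start from your own association results.

```r
# Made-up association results (SNP names, positions and p-values are hypothetical)
results <- data.frame(
  SNP        = c("rs0000001", "rs0000002", "rs0000003"),
  chromosome = c("1", "X", "M"),
  Map        = c(1000L, 2000L, 3000L),
  p          = c(0.05, 1e-4, 0.5),
  stringsAsFactors = FALSE
)

# -logP is the negative log base 10 of the p-value
results[["-logP"]] <- -log10(results$p)

# Keep the four columns in the order described above and write them space-separated
out <- results[, c("SNP", "chromosome", "Map", "-logP")]
wga_file <- tempfile(fileext = ".txt")
write.table(out, wga_file, sep = " ", quote = FALSE, row.names = FALSE)
```

Setting `quote = FALSE` and `row.names = FALSE` keeps the file free of quotation marks and of an extra leading column, so the header line reads exactly `SNP chromosome Map -logP`.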
You will need to create and load one file per analysis, which is a bit annoying if you have many analyses to visualize. I hope they add new features to visualize and (even better) compare different results in the near future. Imagine being able to superimpose Manhattan plots from two different studies or techniques!
I got this email from a certain Anderson Brown on the BioConductor mailing list. It actually lists some useful tools after a slightly wordy introduction and before a sales pitch (actually WizFolio offers a free account limited to 100MB or 200 items). Enjoy!
Hi,
March 2009 marks the 20th anniversary of the invention of the Web. Like all great inventions, it arises out of an unmet need that badly needed a solution. Tim Berners-Lee foresaw the great potential that can be unlocked by connecting data across disparate operating systems. You can see the full talk by Tim Berners-Lee as he explains it at: http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
Fast forward to 2004 when the term “Web 2.0” was first coined. This term now generally has the connotation of instant “read-write” and increased connectivity on the Web as exemplified by applications like Facebook, MySpace and Twitter. How will such technology impact upon the busy scientists’ workflow in terms of searching, compiling, organizing, sharing and analyzing peer reviewed journal articles? A new crop of journal reference management applications have emerged within the last 18 months. The term “journal reference management” is used here as opposed to the older term “bibliographic management” to emphasize the importance of managing and linking the bibliographic data with the PDFs.
The biggest frustration for the busy scientists is the difficulty of locating and managing the PDFs from a set of bibliographic data. I have listed a number of recently released journal reference management applications that addresses to a certain degree this frustration.
Zotero – A research tool that helps you gather, organize, and analyze sources. www.zotero.org
Labmeeting – Organize, search, and store your paper collection and lab protocols. www.labmeeting.com
Pubget – Similar to Pubmed, except you get the PDFs right away. www.pubget.com
Mendeley – Academic software for managing & sharing your research papers. www.mendeley.com
At WizFolio, we started 2 years ago with a vision of creating a web based application that would manage bibliographic data and PDFs with the same ease that you would MP3 files. Tightly coupled with the application is a citation tool that the user can customize on-the-fly with instantaneous preview. We invite you to give WizFolio Web 2.0 a try at www.wizfolio.com and appreciate any feedback and comments that will make the application better.
Automating backward selection – an alternative to stepAIC() followed by iterative dropterm() and update() functions
November 20, 2008
The stepAIC() function from the R package MASS can automate the submodel selection process. The authors state, on page 176 of their book Modern Applied Statistics with S (ISBN 0387954570), that “… selecting terms on the basis of AIC can be somewhat permissive in its choice of terms, being roughly equivalent to choosing an F-cutoff of 2”, and thus one has to proceed manually with iterative application of the dropterm() and update() functions.
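To make the manual procedure concrete, one round of dropterm() followed by update() might look like the sketch below. The model and the built-in mtcars data are chosen purely for illustration and are not from any real analysis of mine.

```r
library(MASS)  # provides dropterm()

# Illustrative starting model on the built-in mtcars data
fit <- glm(mpg ~ wt + hp + qsec + drat, data = mtcars, family = gaussian)

# Test each term for removal; the "<none>" row has no p-value (NA)
dt <- dropterm(fit, test = "Chisq")

# Locate the p-value column by prefix rather than hard-coding its exact name
pcol  <- grep("^Pr", names(dt), value = TRUE)[1]
worst <- rownames(dt)[which.max(dt[[pcol]])]   # least significant term

# Refit without that term; in practice, only do this if its p-value is above your cutoff
fit2 <- update(fit, as.formula(paste(". ~ . -", worst)))
```

You would then call dropterm() on `fit2` and repeat until every remaining term is below the chosen cutoff.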
There is no doubt that the current implementation inspires good practice in model checking and thinking, but there are times where I want to completely automate the process. An example is when I want to quickly select the minimal submodel for many tens of phenotypes.
Here is a function that is capable of doing this. The speed is comparable with stepAIC. Use it at your own risk.
library(MASS)  # provides dropterm()

backstepAIC.glm <- function(fmla, data, family, AIC.p.cut = 0.05, verbose = TRUE, ...) {
  ## this will only work reliably on datasets without missing values (same issue with stepAIC),
  ## so keep only the model variables and drop incomplete rows
  data <- data[, all.vars(fmla), drop = FALSE]
  data <- data[complete.cases(data), , drop = FALSE]
  dt <- data.frame(Pr = 1)  # to initiate the loop
  while (max(dt$Pr, na.rm = TRUE) > AIC.p.cut) {
    fit <- eval(substitute(glm(fmla, data = data, family = family, ...)), parent.frame())
    ## no coefficients left, so no more terms can be dropped; exit the while loop
    if (length(coefficients(fit)) == 0) break
    dt <- dropterm(fit, test = "Chisq")  # dt$Pr partially matches the "Pr(Chi)" column
    if (verbose) cat("Attempting to drop term", rownames(dt)[which.max(dt$Pr)], "\n")
    ## keep every term except the least significant one, plus the intercept
    nm <- setdiff(rownames(dt)[-which.max(dt$Pr)], "<none>")
    nm <- c("1", nm)
    fmla <- as.formula(paste(as.character(fmla)[2], "~", paste(nm, collapse = "+")))
  }
  return(fit)
}
For Cox models, load the survival package and replace the model fitting line with this:
fit <- eval( substitute( coxph( fmla, data=data, x=TRUE, y=TRUE, ... ) ), parent.frame() )
JabRef – a free and powerful citation manager
October 15, 2008
JabRef is a clean and powerful BibTeX manager that works well with LaTeX writing. There are even some guides on how to use it with Word 2003 or Word 2007, as well as dedicated software for Windows.
One of the best features is that it can fetch records from Medline (function key F5) when connected to the internet. You just need to enter the PubMed ID and it automagically extracts and stores the data in the BibTeX format. All you have to do then is to give it a key.
And best of all, it is free as in beer (i.e. zero cost) and free as in speech (i.e. open source).
Review of IBM Lenovo X41 tablet PC
October 14, 2008
I have been using my tablet PC laptop – IBM Lenovo X41 – for a couple of months now and thought I should write a review of it. I believe the X61 sports a similar design except with a faster dual core processor.
PROS:
- Weighing around 1.6kg, it is classified as an ultraportable and it certainly feels that way.
- The ink and pen experience is fantastic. I am not too worried about scratching the screen now.
- Good build design and solid feel. The swivel hinge is solid even when using it on a moving vehicle. The extra girth is good for gripping in tablet mode.
- Full sized keyboard and very responsive.
- Built-in utilities software for backing up, restoring etc. I haven’t used it much but many do compliment IBM on this.
- (minor) Does not run too hot which is nice especially when using it in tablet mode.
- (minor) The microphone is of a decent quality and works well (except in tablet mode where the lid covers the mic)
CONS:
- No track pad! Why? Can anyone explain to me why IBM / Lenovo abandoned the track pad in favour of only the track point (aka the “nipple”), thus alienating a large number of laptop users?
- When the laptop is under heavy use, the cursor randomly jumps about.
- The speakers are placed at the bottom of the laptop. Solution: place the laptop on top of a book or something hard, so that the sound is projected better.
- No integrated webcam. OK, I can understand that my refurbished X41 was released in 2005, when integrated webcams were not standard. However, Lenovo could have added this feature in the X61, which was released in 2007.
- No in-built optical CD or DVD drive. Not a big deal if you are willing to work from the USB or network drives.
- Hard disk clicking problem – this is a well known issue with the X series.
- (minor) The keyboard layout is such that the bottom-left key is the Function (FN) key instead of the Control (CTRL) key, and there is no Windows key. Solution: There are programs to remap the keyboard.
Verdict: If Lenovo integrates a track pad (with or without the track point) and a webcam, fixes the keyboard layout deficiencies and perhaps adds an integrated DVD burner in the next version of their tablet PC, then I will be placing an order when it comes out.
Randy Pausch – Time Management
April 12, 2008
As inspiring as Randy’s Last Lecture was, I found his lecture on time management even more useful, perhaps because I had such poor time management skills. The lecture contains lots of nuggets of wisdom that are practical and easy to follow. You might also want to download the PowerPoint slides to follow along while watching the lecture.
Randy Pausch – Living life to the fullest
April 12, 2008
Randy Pausch is a lecturer at Carnegie Mellon University. He is full of life, exuberance and humor. And he is dying from pancreatic cancer. Watch his uplifting and inspiring “Last Lecture” to find out about his message on living life to the fullest, a video that has been downloaded several million times over the web. Alternatively, you can watch the shortened version of Randy’s talk on the Oprah Winfrey show.
Ant and Grasshopper – The Malaysian version
March 19, 2008
I got this interesting forward from a friend:
Older Version
The Ant works hard in the withering heat all summer building its house and laying up supplies for the winter. The Grasshopper thinks the Ant is a fool and laughs, dances and plays the summer away. Come winter, the Ant is warm and well fed. The grasshopper has no food or shelter so he dies out in the cold.
Modern Version
The Ant works hard in the withering heat all summer building its house and laying up supplies for the winter. The Grasshopper thinks the Ant is a fool and laughs, dances and plays the summer away.
Come winter, the shivering Grasshopper calls a press conference and demands to know why the Ant should be allowed to be warm and well fed while others are cold and starving.
TV1, TV2 & TV3 show up to provide pictures of the shivering Grasshopper next to a video of the Ant in his comfortable home with a table filled with food.
The majority of the Malaysian Parliament is stunned by the sharp contrast. How can it be that this poor Grasshopper is allowed to suffer so?
Khairy stages a demonstration in front of the Ant’s house. Nazri goes on a fast along with other Grasshoppers demanding that Grasshoppers be relocated to warmer climates during winter. Most of the related people criticize the Malaysian Government for not upholding the fundamental rights of the Grasshopper.
The local newspaper & the Internet are flooded with online petitions seeking support for the Grasshopper (many promising Heaven and Everlasting Peace for prompt support or the wrath of God for non-compliance).
Deputy Minister immediately passes a law preventing Ants from working hard in the heat so as to bring about equality of poverty among Ants and Grasshoppers.
Hishammudin makes ‘More Special Reservation’ for Grasshoppers in Educational Institutions & in Government Services.
The Ant is fined for failing to comply with 30% sharing and having nothing left to pay his retroactive taxes; its home is confiscated by the Government and handed over to the Grasshopper in a ceremony covered by maju ———.
Prime Minister announces to the whole Malaysia that this is part of the NEP and all have to respect, no questions asked and have to follow it.
Many years later…..
The Ant has since migrated to the US and set up a multi-billion dollar company
100s of Grasshoppers still die of starvation despite reservation somewhere in Malaysia and because of losing a lot of hard-working Ants and feeding the Grasshoppers, Malaysia is still a developing country!
All because the ANTS are still doing their work.