Category Archives: Statistics

WGA Viewer software from Duke University

A genome-wide association (GWA) study often involves analyzing the effects of hundreds of thousands of single nucleotide polymorphisms (SNPs) on a disease outcome or trait. Visualizing such high-density data can prove tricky, especially if the investigator is interested in specific regions.

I recently discovered a free tool called WGAviewer from Duke University that can greatly help with the visualization part (it does not perform any analysis). The software is based on Java, so it should be platform independent (I have only used and tested it on Windows so far). Some of the key features include:

  1. QQ plots
  2. Manhattan plots with *interactive zoom* in and out
  3. Zoom easily to a region by gene name or coordinates and visualize the results
  4. Ability to select and annotate the top N SNPs
  5. Automatic update of annotations from Ensembl and HapMap data
  6. Calculate linkage disequilibrium (LD) for a particular region
  7. Take publication-quality snapshots

One of the hassles I found was formatting the data for input. The manual suggests several ways of preparing the input, e.g. using MAP files. However, the easiest way I found was to simply create a space-separated ASCII file (using R or even Excel) with the following columns: rsid, chromosome (1-22, X, Y, XY, M), Map (coordinate on the chromosome) and -logP (negative log base 10 of the p-value).

SNP chromosome Map -logP
MitoA10045G M 10045 2.04858284222835
MitoT9900C M 9900 0.233064674990652
MitoT9951C M 9951 0.0641728753170715
rs1000000 12 125456933 1.16139248878691
rs10000010 4 21227772 0.149317624784192
rs10000023 4 95952929 1.15832462919552
rs10000030 4 103593179 0.106028436059944
rs10000041 4 165841405 0.221366644208304
rs1000007 2 237416793 0.213983677592946
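As a minimal sketch of how such a file can be produced from R (the data frame, its values and the file name below are invented for illustration):

```r
## Sketch: write GWA results into the simple space-separated format
## described above. The data frame and file name are made up.
res <- data.frame(
  SNP        = c("rs1000000", "rs10000010", "MitoA10045G"),
  chromosome = c("12", "4", "M"),
  Map        = c(125456933, 21227772, 10045),
  pvalue     = c(0.069, 0.709, 0.0089)
)
res$"-logP" <- -log10(res$pvalue)   # WGAviewer expects -log10 p-values
write.table(res[, c("SNP", "chromosome", "Map", "-logP")],
            file = "wga_input.txt", quote = FALSE,
            row.names = FALSE, sep = " ")
```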

You will need to create and load one file per analysis, which is a bit annoying if you have many analyses to visualize. I hope they add new features to visualize and (even better) compare different results in the near future. Imagine being able to superimpose Manhattan plots from two different studies or techniques!


Minimum salary required to maintain a family in Malaysia

What is the minimum monthly salary a Malaysian person needs to earn to sustain a young family? Here are my estimates assuming the young family consists of a working father, a housewife and two dependent children.

  • RM1200 – Meals (30 days x 4 people x RM10 per person per day)
  • RM750 – Monthly house repayment (low cost housing for 30 years)
  • RM500 – Monthly car repayment (Proton Saga for 7 years)
  • RM300 – Petrol and road tax
  • RM300 – Children’s education fees
  • RM200 – Monthly insurance and medical bills
  • RM100 – Electricity and water bill
  • RM100 – Phone bill
  • RM150 – Miscellaneous

That totals RM3600 per month. Assuming an EPF rate of 11% and approximately 9% income tax, the working father has to earn a gross salary of RM4500 to take home RM3600.
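The arithmetic can be checked in a couple of lines of R (the 11% EPF and 9% tax figures are the rough assumptions stated above):

```r
## Check the budget total and the gross-salary arithmetic from above.
expenses <- c(meals = 1200, house = 750, car = 500, petrol = 300,
              education = 300, insurance = 200, utilities = 100,
              phone = 100, misc = 150)
total <- sum(expenses)                  # RM3600 needed per month
gross <- 4500
takehome <- gross * (1 - 0.11 - 0.09)   # after 11% EPF and ~9% tax
```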

When trying to figure out the income distribution, I came across this article from Aliran published in 2004 that tries to address the same question I posed here. While I believe their estimates are more conservative and simplistic, the conclusion is similar.

Using their household income pyramid with my estimates shows that only slightly over 10% of the population earns RM4500 and above.

Visualizing missing values

Much statistical analysis of data containing missing values requires the assumption that the values are missing at random. Yet it is common to simply assume this without any checks. Checking can be difficult with high-dimensional data, where eyeballing a small section at a time could miss a consistent pattern.

Here I will show an easy way to quickly visualize the missing values of a matrix or data frame in R. It also helps to identify any alignment issues in the data.

plot.missing <- function(mat, sort=FALSE, main="Location of missing values", ...){

image2 <- function(m, ...) image( t(m)[ , nrow(m):1 ], ... ) # plot row 1 at the top

mat <- 1 * is.na(mat) # 0/1 indicator of missingness
if(sort) mat <- mat[ order(rowSums(mat)), order(colSums(mat)) ]

image2( mat, col=c(0,1), xaxt="n", yaxt="n", main=main, ... )
box(); grid(col=4)
ticks <- c(0, 0.2, 0.4, 0.6, 0.8, 1.0)
axis( side=1, at=ticks, labels=round(quantile(1:ncol(mat), ticks)) )
axis( side=2, at=rev(ticks), labels=round(quantile(1:nrow(mat), ticks)) )
}

Next, we create some artificial data to run the code on.

m <- matrix( rnorm(500000, mean=1.5), nc=50 ) # 10,000 individuals and 50 questions
m[ m < 0 ] <- NA
m[ sample(1:nrow(m), 7000), 11 ] <- NA
m[ sample(1:nrow(m), 8000), 35 ] <- NA
for(i in sample(1:nrow(m), 500)) m[ i , sample(1:ncol(m), 40) ] <- NA

and execute the codes:

plot.missing(m, sub="Original positions", ylab="Individuals", xlab="Questionnaire number")
plot.missing(m, sort=TRUE, sub="A sorted view", ylab="Individuals", xlab="Questions asked")

The picture on the left shows the locations of the missing values (denoted as black ticks) with respect to the matrix that was used to generate it. Here you can clearly see that questions 11 and 35 have large amounts of missingness; perhaps these are of a sensitive nature. The horizontal strips show that some individuals were less responsive than others. The causes of missingness should be investigated. Whatever the reasons, the poor responses will interfere with downstream analysis, and one may want to remove them from further analysis.

Location of missingness

After identifying which variables and samples are least responsive, the next question is what the missingness pattern looks like after removing these. That question is answered by the second (sorted) picture. Note that the axis numbers have been removed, as they are meaningless here.
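As a follow-up, here is a minimal sketch of one way to drop the least responsive questions and individuals before re-plotting. The small simulated matrix and the 50% cut-off are arbitrary illustrations, not recommendations:

```r
## Drop questions and individuals with excessive missingness.
## The simulated data and the 50% cut-off are arbitrary illustrations.
set.seed(1)
m <- matrix(rnorm(1000), ncol = 10)
m[sample(length(m), 300)] <- NA            # sprinkle 30% missingness

miss.col <- colMeans(           # fraction missing per question
miss.row <- rowMeans(           # fraction missing per individual
m2 <- m[miss.row < 0.5, miss.col < 0.5, drop = FALSE]
```

One could then re-run plot.missing() on the reduced matrix m2 to see the remaining pattern.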

Using hexbin to better visualize a dense two-dimensional plot

Anyone working with high-dimensional data will have tried to plot two variables against each other at some point. The problem is that even if only a small proportion of the data is noisy, this translates to a large number of points, which can obscure the visualization. Here is an example:

x <- runif(100000)
y <- -10 + 20*x + rnorm(100000, sd=2)
y[1:20000] <- rnorm(20000, sd=10) # 20% noise
plot(x, y, pch="O")

which produces the following


which I don't think is very informative, as the very strong linear trend is now obscured by the 20% of noisy data. An alternative for better visualizing such a plot is the hexbin package.

library(hexbin)
plot( hexbin(x, y, xbins=50) )


The hexbin package estimates the density (the number of points in) the neighbourhood of predefined grid centres and uses varying shades of grey to represent the density. You can install the hexbin package in R using the following command:

install.packages("hexbin")

Generating an ordered run of binary events in R

I needed to generate a sequence of possible binary events in an ordered fashion. For example, I needed to generate the following sequence for three binary variables: 000, 100, 010, 001, 110, 101, 011, 111

I tried to accomplish this in R using the following command:

> expand.grid( rep( list( 0:1 ), 3 ) )
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1

but notice that the result is not ordered for my needs. So here is my little function to generate what I want:

expand.grid.binary <- function(N){
## generates an ORDERED sequence of N binary events
## with all the possible combinations
## expand.grid( rep( list( 0:1 ), N ) ) does not order the results desirably

f <- function(n, r){
## one row for each way of choosing r of the n positions to be 1
v <- t( combn( n, r ) )
ind <- cbind( row=rep( 1:nrow(v), ncol(v) ), column=as.vector(v) )
out <- matrix( 0, nrow=nrow(v), ncol=n )
out[ind] <- 1
out
}

"rbind", lapply( 0:N, f, n=N ) )
}

> expand.grid.binary(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 0 1
[5,] 1 1 0
[6,] 1 0 1
[7,] 0 1 1
[8,] 1 1 1

The next step is how to generalize this to situations when the variables are non-binary. Any ideas?
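For what it is worth, here is one possible approach, sketched below: build the full grid with expand.grid() and order the rows by their sum, so events with fewer (or smaller) entries come first. Since order() is stable, ties keep their original sequence, and for binary variables this reproduces the ordering above; for other level sets it gives one reasonable (though not unique) ordering. The function name is mine:

```r
## One possible generalization: order the rows of expand.grid() by row sum.
## For binary variables this reproduces expand.grid.binary(); the function
## name is made up for this sketch.
expand.grid.ordered <- function(N, levels = 0:1){
  g <- expand.grid(rep(list(levels), N))
  g[order(rowSums(g)), , drop = FALSE]
}
```

For example, expand.grid.ordered(2, levels = 0:2) gives one ordering of the nine ternary events.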

Sorting Excel data by by anniversary (i.e. date and month ignoring year)

OK, so you have an Excel file with dates of birth (DOB), or any other kind of day-month-year dates, for many people, and you want to sort them to find the order of the anniversaries or birthdays. Sounds easy enough. Except that when you try to sort by the DOB column, Excel sorts it chronologically: by year, then month, then day (e.g. 30th July 1950 comes before 25th July 1990).

One way around this problem is to “standardize” the year, so that Excel effectively ignores the year. Here are more details.

  1. Make sure your dates are stored as the Date type, not as text, in Excel (highlight the column and check Format Cells).
  2. Compute the birthdays (i.e. with the year made irrelevant) for all cases. You can do this in many ways, for example by creating an adjacent column with the formula =DATE(2400,MONTH(A1),DAY(A1)), assuming the first date of birth is stored in cell A1. It does not matter whether this column is formatted as general, number or date.
  3. Sort the entire sheet using the birthdays column.

Notice that I have standardized all the years to 2400 AD. Why? Because 2400 is a leap year and thus allows for 29th February. And why did I choose 2400 rather than 2000 or some other leap year? As a simple precaution in case someone accidentally uses it as a true date of birth!
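The same trick can be sketched in R, which handles the standardized year just as well (the example dates are invented):

```r
## Sort dates by anniversary (month and day only) by mapping every date
## onto the leap year 2400, the same trick as the Excel formula above.
dob <- as.Date(c("1990-07-25", "1950-07-30", "1988-02-29"))
anniv <- as.Date(format(dob, "2400-%m-%d"))   # standardize the year
dob[order(anniv)]                             # 29 Feb, 25 Jul, 30 Jul
```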

Reasons to use R

I use a statistical software package called R. I would highly recommend anyone who needs to manipulate, analyse and visualise data to check it out.

Here are some reasons to use R over other statistical software:

  1. It is completely free to download and use without restrictions.
  2. It is also open source which means you can read the source code of all functions and modify it yourself.
  3. It has good documentation for most functions. There are a number of well-written introductory R books, tutorials, reference cards, reference manuals, vignettes and newsletters.
  4. Sensible default values and error messages are available for most functions. This is because the majority of the functions and packages are written by experts in the field who are aware of the common pitfalls that a naive user might fall into.
  5. A large number of packages deal with many areas of application. A lot of statistical procedures are already available, so you do not need to waste time re-inventing the wheel every time. Plus, the many different ways to store, manipulate, analyse and visualise your data, though a bit overwhelming to new users at first, are intended to make you think about your data.
  6. An active, responsive and vibrant R community, as seen in the helpful R mailing lists. But please do check the posting guide before posting in order to elicit informative responses.
  7. The ability to work in an interactive, integrated development environment means you can debug errors and visualise results on the fly, without having to first compile and then execute your code.
  8. Cross-platform compatibility means you can develop in one environment and then run the code on many different platforms with nominal changes (if any). I usually develop and test the code on either Windows or Mac with small datasets before running it on bigger datasets or bigger simulations on large UNIX or Linux clusters.
  9. Various benchmarks show that R is as fast as or faster than many statistical packages for various tasks. Here is an example of such a benchmark test, which is outdated but still indicative. The experience of many people who have used multiple statistical packages suggests that R code is much easier to understand (though this depends on how one writes it).
  10. One can use Sweave to automate reports that need to be compiled on a regular or frequent basis.

BioConductor is a collection of many R packages that specifically deal with the analysis of genomic data (e.g. SAGE, SNP, sequence, microarray, array CGH and proteomics data, as well as biological annotations and ontologies). It is the first choice of tools for many statisticians and leading experts in the field, and thus where the software for many newly proposed methods first becomes available.