# Visualizing missing values

Much statistical analysis of data containing missing values relies on the assumption that the values are missing at random. Yet many people simply assume this without checking. Checking can be difficult with high-dimensional data, where eyeballing a small section at a time can miss a consistent pattern.

Here I will show an easy way to quickly visualize the missing values in a matrix or data frame in R. It also helps to identify any alignment issues in the data.

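    plot.missing <- function(mat, sort=FALSE, main="Location of missing values", ...){

      # image() puts rows along the x axis; transpose and flip so the plot matches the matrix layout
      image2 <- function(m, ...) image( t(m)[ , nrow(m):1 ], ... )

      mat <- 1*is.na(mat)   # 1 = missing, 0 = observed
      # optionally order rows and columns by their number of missing values
      if(sort) mat <- mat[ order(rowSums(mat)), order(colSums(mat)) ]

      image2( mat, col=c(0,1), xaxt="n", yaxt="n", main=main, ... )
      box(); grid(col=4)

      # row and column numbers are only meaningful when the original order is kept
      if(!sort){
        ticks <- c(0,0.2,0.4,0.6,0.8,1.0)
        axis( side=1, at=ticks, labels=round(quantile(1:ncol(mat), ticks)) )
        axis( side=2, at=rev(ticks), labels=round(quantile(1:nrow(mat), ticks)) )
      }
    }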
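The transpose-and-flip step inside `image2` is the least obvious part of the function, so here is a minimal sketch of what it does on a tiny matrix. The 2 x 3 `toy` matrix and the names `ind` and `flipped` are made up for illustration and are not part of the original function.

```r
# Hypothetical 2 x 3 toy matrix with one missing value in row 1, column 3
toy <- matrix(c(1, 2, NA,
                4, 5, 6), nrow = 2, byrow = TRUE)

ind <- 1 * is.na(toy)   # 0/1 indicator with the same shape as toy

# image() uses the first index as x and the second as y, with y increasing upwards.
# Transposing and then reversing the columns puts row 1 of toy at the top
# and column 1 of toy at the left of the plot.
flipped <- t(ind)[, nrow(ind):1]

dim(flipped)    # 3 2: three x positions (columns of toy), two y positions (rows of toy)
flipped[3, 2]   # 1: the NA ends up at the rightmost x position and the top y position
```

Without this step the plot would appear rotated relative to how the matrix prints on screen, which makes it harder to map a black cell back to a row and column.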

Next, we create some artificial data to run the code on.

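    m <- matrix( rnorm(500000, mean=1.5), ncol=50 )  # 10,000 individuals and 50 questions
    m[ m < 0 ] <- NA                                 # scattered missing values
    m[ sample(1:nrow(m), 7000), 11 ] <- NA           # question 11 is mostly unanswered
    m[ sample(1:nrow(m), 8000), 35 ] <- NA           # question 35 is mostly unanswered
    # 500 individuals who skipped 40 of the 50 questions
    for(i in sample(1:nrow(m), 500)) m[ i , sample(1:ncol(m), 40) ] <- NA
    mean(is.na(m))                                   # overall proportion of missing values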
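The overall proportion from `mean(is.na(m))` can be broken down by row and by column, which gives a numeric counterpart to the plot. The following is a small sketch on a made-up 20 x 10 matrix; `toy`, `col_missing` and `row_missing` are illustrative names and the seed is arbitrary.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
toy <- matrix(rnorm(200), ncol = 10)   # hypothetical 20 individuals x 10 questions
toy[sample(nrow(toy), 15), 4] <- NA    # question 4 is mostly unanswered
toy[7, ] <- NA                         # individual 7 answered nothing

col_missing <- colMeans(is.na(toy))    # proportion missing per question
row_missing <- rowMeans(is.na(toy))    # proportion missing per individual

which.max(col_missing)                 # 4
which.max(row_missing)                 # 7
```

Sorting these two vectors in decreasing order gives a ranked list of the least responsive questions and individuals.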

and run the code:

    par(mfrow=c(1,2))   # two plots side by side
    plot.missing(m, sub="Original positions", ylab="Individuals", xlab="Question number")
    plot.missing(m, sort=TRUE, sub="A sorted view", ylab="Individuals", xlab="Questions asked")

The picture on the left shows the missing values (drawn in black) at their original positions in the matrix. Here you can clearly see that questions 11 and 35 have a large amount of missingness; perhaps these questions are of a sensitive nature. The horizontal strips show that some individuals were less responsive than others. The causes of missingness should be investigated. Whatever the reasons, the poor responses will interfere with downstream analysis, and some may want to remove them from further analysis.
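For readers who do decide to remove the poorly answered questions and unresponsive individuals, one possible approach is a simple threshold on the proportion missing. This is only a sketch: the helper `drop_sparse` and its 50% default cut-offs are assumptions made for illustration, not something the plot itself prescribes.

```r
# Hypothetical helper: drop rows and columns whose proportion of
# missing values exceeds the given cut-offs (defaults are arbitrary)
drop_sparse <- function(mat, max.row = 0.5, max.col = 0.5) {
  na <- is.na(mat)
  keep.rows <- rowMeans(na) <= max.row
  keep.cols <- colMeans(na) <= max.col
  mat[keep.rows, keep.cols, drop = FALSE]
}

toy <- matrix(1:40, nrow = 8)   # made-up 8 individuals x 5 questions
toy[1:6, 2] <- NA               # question 2 is 75% missing
toy[3, ] <- NA                  # individual 3 answered nothing

cleaned <- drop_sparse(toy)
dim(cleaned)                    # 7 4
mean(is.na(cleaned))            # 0
```

Note that both proportions are computed on the original matrix before anything is dropped, so the result does not depend on whether rows or columns are filtered first.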

After identifying which variables and samples are least responsive, the next question is what the missingness pattern looks like in the rest of the data. That question is answered by the picture on the right, where the rows and columns are sorted by their number of missing values so that the least responsive ones are pushed to the edges. Note that the axis numbers have been removed as they are meaningless here.