metAnnotate

By DoxeyLab. Search, classify, and compare metagenomes.


MetAnnotate output files can be further analyzed offline using some of the following R scripts.

Loading and processing output files


Set your working directory. For example, this may be the directory containing the .tsv MetAnnotate results file. E.g.,
setwd("~/Desktop/MetAnnotateResults/")
Load your dataset (here, it is a tab-separated file called data.tsv):
tb <- read.delim("data.tsv",sep='\t',header=T)
You may now want to look at the counts (# hits for all HMMs) for each dataset. To normalize these counts, divide them by the total number of reads (DNA) in each dataset.
sort(table(tb[,1]))
If there are datasets with very few (e.g., 50 or less) counts, you may want to remove them:
print("Removing...")
which(table(tb[,1]) < 50)

for (d in names(which(table(tb[,1]) < 50)))
{
	tb <- tb[-which(tb[,1] == d),]
}

tb[,1] <- as.character(tb[,1])

Plotting

Barplots

The first thing you need to decide on is the data column you want to analyze. For order level taxon assignments predicted using the tree-based method ...
k <- "Closest.Homolog.Order" # can be Species, Genus, Family, Order, Class...
Now run the following code to make your barplot

# a color scheme
pal12 = c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", 
"#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", 
"#FFFF99", "#B15928") 
cols <- colorRampPalette(pal12)(length(levels(as.factor(tb[,k]))))

par(mfrow=c(1,2))

tb.2 <- t(table(tb[,c(1,which(colnames(tb) == k))]))
for (i in 1:ncol(tb.2))
{
	tb.2[,i] <- tb.2[,i] / sum(tb.2[,i])
}

tb.2 <- t(na.omit(t(tb.2)))


## suppose you want to reorder the barplot and show only a subset of datasets in a specific order. Uncomment the following two lines
# datasetOrder <- c("4446153","4477804","4537195","SRR908273a")  # a hypothetical list of datasets
# tb.2 <- tb.2[,match(datasetOrder,colnames(tb.2))]

barplot(tb.2,col=cols,las=3, cex.names=0.5)
plot.new()

legend("left", inset=.05, title="Class",legend =
  	c(levels(as.factor(tb[,k]))), fill = cols,cex=0.2)

Clustered Heatmap

You will need to have the pheatmap package installed.
library(pheatmap)
As before, you need to pick a taxonomic level. This time it is called sColumn
k <- "Closest.Homolog.Genus" #species column for which to do the species breakdown/analysis
Since showing all taxa may be overwhelming, we can choose to show only the most frequent species -- those present greater than some threshold in at least one dataset. This is done with the mpt parameter
mpt <- 0.02 #max fraction of a taxa in any dataset required for inclusion in heat map
The following parameter can also be set to less to reduce the size of labels in the heatmap
cexVal <- 1
Now run...

matr <- table(tb[,k],tb[,1])
matr <- sweep(matr, 2, colSums(matr), "/")
maxPerTaxa <- apply(matr,1,max)
cexVal <- 1 #sets size of labels
ll <- pheatmap((matr[which(maxPerTaxa > mpt),]),cex=cexVal)

Principal coordinates analysis (PCOA)

PCOA is really easy in R. First load the vegan and ape packages
library(vegan)
library(ape)
Just run:
D <- vegdist(t(matr[which(maxPerTaxa > mpt),]))   ## works better when rare species are removed
biplot(pcoa(D))

Generating an OTU table

otu.tb1 <- table(tb[,k],tb[,2])
otu.tb2 <- data.frame(rownames(otu.tb1),as.data.frame.matrix(otu.tb1))
colnames(otu.tb2)[1] <- "OTU"
write.table(otu.tb2,file="otutable.tsv",sep="\t",row.names=F)