By DoxeyLab. Search, classify, and compare metagenomes.

MetAnnotate output files can be further analyzed offline using some of the following R scripts.

Loading and processing output files

Set your working directory. For example, this may be the directory containing the .tsv MetAnnotate results file. E.g.,
Load your dataset (here, it is a tab-separated file called data.tsv):
tb <- read.delim("data.tsv",sep='\t',header=T)
You may now want to look at the counts (# hits for all HMMs) for each dataset. To normalize these counts, divide them by the total number of reads (DNA) in each dataset.
If there are datasets with very few (e.g., 50 or less) counts, you may want to remove them:
which(table(tb[,1]) < 50)

for (d in names(which(table(tb[,1]) < 50)))
	tb <- tb[-which(tb[,1] == d),]

tb[,1] <- as.character(tb[,1])



The first thing you need to decide on is the data column you want to analyze. For order level taxon assignments predicted using the tree-based method ...
k <- "Closest.Homolog.Order" # can be Species, Genus, Family, Order, Class...
Now run the following code to make your barplot

# a color scheme
pal12 = c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", 
"#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", 
"#FFFF99", "#B15928") 
cols <- colorRampPalette(pal12)(length(levels(as.factor(tb[,k]))))


tb.2 <- t(table(tb[,c(1,which(colnames(tb) == k))]))
for (i in 1:ncol(tb.2))
	tb.2[,i] <- tb.2[,i] / sum(tb.2[,i])

tb.2 <- t(na.omit(t(tb.2)))

## suppose you want to reorder the barplot and show only a subset of datasets in a specific order. Uncomment the following two lines
# datasetOrder <- c("4446153","4477804","4537195","SRR908273a")  # a hypothetical list of datasets
# tb.2 <- tb.2[,match(datasetOrder,colnames(tb.2))]

barplot(tb.2,col=cols,las=3, cex.names=0.5)

legend("left", inset=.05, title="Class",legend =
  	c(levels(as.factor(tb[,k]))), fill = cols,cex=0.2)

Clustered Heatmap

You will need to have the pheatmap package installed.
As before, you need to pick a taxonomic level. This time it is called sColumn
k <- "Closest.Homolog.Genus" #species column for which to do the species breakdown/analysis
Since showing all taxa may be overwhelming, we can choose to show only the most frequent species -- those present greater than some threshold in at least one dataset. This is done with the mpt parameter
mpt <- 0.02 #max fraction of a taxa in any dataset required for inclusion in heat map
The following parameter can also be set to less to reduce the size of labels in the heatmap
cexVal <- 1
Now run...

matr <- table(tb[,k],tb[,1])
matr <- sweep(matr, 2, colSums(matr), "/")
maxPerTaxa <- apply(matr,1,max)
cexVal <- 1 #sets size of labels
ll <- pheatmap((matr[which(maxPerTaxa > mpt),]),cex=cexVal)

Principal coordinates analysis (PCOA)

PCOA is really easy in R. First load the vegan and ape packages
Just run:
D <- vegdist(t(matr[which(maxPerTaxa > mpt),]))   ## works better when rare species are removed

Generating an OTU table

otu.tb1 <- table(tb[,k],tb[,2])
otu.tb2 <- data.frame(rownames(otu.tb1),
colnames(otu.tb2)[1] <- "OTU"