The ToolsForCoDa package was originally created in order to provide functions for a canonical correlation analysis with compositional data (Graffelman et al. (2018)), based on the centred logratio (clr) transformation of the compositions. Posteriorly, it has been extended with additional tools for the multivariate analysis of compositional data in the R environment. Currently, this package (version 1.1.0) provides functionality for
Both CCO and LDA rely on the inversion of a covariance matrix. The
covariance matrix of the clr transformed compostions is structurally
singular. The programs lrcco
and lrlda
resolve
this with the use of a generalized inverse. Functionality for the
analysis of compositional data in the R environment can be found in the
packages compositions (Boogaart,
Tolosana-Delgado, and Bren (2024)),
robCompositions (Templ, Hron,
and Filzmoser (2024)), easyCODA (Greenacre (2024)) and
zCompositions (Palarea-Albaladejo and Martin-Fernandez (2024)).
For further reading on compositional data, see the seminal text on
compositional data by Aitchison (Aitchison
(1986)) and several recent statistical textbooks (Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado
(2015), Filzmoser, Hron, and Templ
(2018), Greenacre (2018), Boogaart and Tolosana-Delgado (2013)).
The remainder of this vignette shows an example session showing how to perform the three aforementioned types of analysis.
We consider the composition of 37 Pinot Noir samples, consisting of the concentrations of Cd, Mo, Mn, Ni, Cu, Al, Ba, Cr, Sr, Pb, B, Mg, Si, Na, Ca, P, K and an evaluation of the wine’s aroma. (Frank and Kowalski (1984)).
data(PinotNoir)
head(PinotNoir)
#> Cd Mo Mn Ni Cu Al Ba Cr Sr Pb B Mg Si
#> 1 0.005 0.044 1.510 0.122 0.830 0.982 0.387 0.029 1.230 0.561 2.63 128 17.3
#> 2 0.055 0.160 1.160 0.149 0.066 1.020 0.312 0.038 0.975 0.697 6.21 193 19.7
#> 3 0.056 0.146 1.100 0.088 0.643 1.290 0.308 0.035 1.140 0.730 3.05 127 15.8
#> 4 0.063 0.191 0.959 0.380 0.133 1.050 0.165 0.036 0.927 0.796 2.57 112 13.4
#> 5 0.011 0.363 1.380 0.160 0.051 1.320 0.380 0.059 1.130 1.730 3.07 138 16.7
#> 6 0.050 0.106 1.250 0.114 0.055 1.270 0.275 0.019 1.050 0.491 6.56 172 18.7
#> Na Ca P K Aroma
#> 1 66.8 80.5 150 1130 3.3
#> 2 53.3 75.0 118 1010 4.4
#> 3 35.4 91.0 161 1160 3.9
#> 4 27.5 93.6 120 924 3.9
#> 5 76.6 84.6 164 1090 5.6
#> 6 15.7 112.0 137 1290 4.6
Aroma <- PinotNoir[,c("Aroma")]
We apply closure to the chemical concentrations by division by their
total, and use lrpca
to do perform LR-PCA.
We study the decomposition of compositional variance, and the decay of the LR-PCA eigenvalues by means of a screeplot
round(out.lrpca$decom,2)
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
#> la 1.44 1.00 0.59 0.47 0.25 0.25 0.18 0.14 0.12 0.07 0.04 0.04 0.02 0.01 0.01
#> laf 0.31 0.22 0.13 0.10 0.05 0.05 0.04 0.03 0.03 0.01 0.01 0.01 0.00 0.00 0.00
#> lac 0.31 0.53 0.66 0.76 0.81 0.87 0.90 0.93 0.96 0.97 0.98 0.99 1.00 1.00 1.00
#> PC16 PC17
#> la 0 0
#> laf 0 0
#> lac 1 1
plot(1:ncol(out.lrpca$decom),
out.lrpca$decom[1,],type="b",main="Scree-plot",
xlab="PC",ylab="Eigenvalue")
We construct a covariance biplot, using jointlim
to
establish sensible limits for the x and y axes. Column markers for the
clr transformed variables are multiplied by a constant (2.5) for a
better visualization, and the amount of explained variance is indicated
on the coordinate axes.
lims <- jointlim(out.lrpca$Fs,2.5*out.lrpca$Gp)
bplot(out.lrpca$Fs,2.5*out.lrpca$Gp,rowlab="",collab=colnames(Xc),rowch=1,colch=NA,
xl=lims$xlim,yl=lims$ylim,main="Covariance biplot")
pc1lab <- paste("PC1 (",toString(round(100*out.lrpca$decom[2,1],1)),"%)",sep="")
pc2lab <- paste("PC2 (",toString(round(100*out.lrpca$decom[2,2],1)),"%)",sep="")
text(1,-0.1,pc1lab,cex=0.75)
text(0.1,1.5,pc2lab,cex=0.75,srt=90)
This biplot reveals that the logratio \(\ln{(Na/Pb)}\) has a large variance and is tightly correlated to the first principal component.
The variable Aroma
correlates with the first principal
components
and as the biplot suggests, Aroma
correlates positively
with the logratio \(\ln{(Cr/Sr)}\)
We note function lrpca
also calculates condition
indices, which may prove useful for detecting proportionality or
one-dimensional relationships (Graffelman
(2021)).
Two examples of LR-CCO are given below. The first example concerns a small artificial data set, where both the X and Y set are compositional, and is described in Section 3.1 of Graffelman et al. (2018). The second example concerns major oxides compositions of bentonites, where the X set is compositional and Y set is not.
We first load two artificial 3-part compositions.
data("Artificial")
Xsim.com <- Artificial$Xsim.com
Ysim.com <- Artificial$Ysim.com
colnames(Xsim.com) <- paste("X",1:3,sep="")
colnames(Ysim.com) <- paste("Y",1:3,sep="")
We make the ternary diagrams of the two sets of compositions
opar <- par(mfrow=c(1,2),mar=c(2,2,1,0)+0.5,pty="s")
par(mfg=c(1,1))
ternaryplot(Xsim.com,pch=1)
par(mfg=c(1,2))
ternaryplot(Ysim.com,pch=1)
We perform the compositional canonical analysis:
And we reproduce the results in Table 1 of Graffelman et al. (2018). The canonical correlations are obtained as
The canonical weights of the X set and the Y set are obtained by:
out.lrcco$A
#> [,1] [,2] [,3]
#> [1,] 0.0008130933 3.847198 -1.110223e-15
#> [2,] -0.7985815849 -3.446655 6.522560e-16
#> [3,] 0.7977684917 -0.400543 1.665335e-16
out.lrcco$B
#> [,1] [,2] [,3]
#> [1,] 0.7624647 -0.05038131 -9.714451e-17
#> [2,] -0.7165761 -0.52116661 3.608225e-16
#> [3,] -0.0458886 0.57154792 -2.775558e-16
The canonical loadings of the X set and the Y set are obtained by
out.lrcco$Rxu
#> [,1] [,2] [,3]
#> X1 -0.8857398 0.4641822 -0.9035794
#> X2 -0.9828511 -0.1844012 -0.4438392
#> X3 0.9940477 -0.1089461 0.6849272
out.lrcco$Ryv
#> [,1] [,2] [,3]
#> Y1 0.8522677 -0.5231058 0.2439545
#> Y2 -0.6097840 -0.7925676 0.9387752
#> Y3 -0.3033098 0.9528920 -0.8183778
The adequacy coefficients of the X set and the Y set:
out.lrcco$fitXs
#> [,1] [,2] [,3]
#> AdeX 0.9128873 0.08711271 0.4941914
#> cAdeX 0.9128873 1.00000000 1.4941914
out.lrcco$fitYs
#> [,1] [,2] [,3]
#> AdeY 0.3967312 0.6032688 0.5368516
#> cAdeY 0.3967312 1.0000000 1.5368516
The redundancy coefficients of the X set and the Y set
out.lrcco$fitXp
#> [,1] [,2] [,3]
#> RedX 0.8132984 0.001442577 0.06638809
#> cRedX 0.8132984 0.814740980 0.88112907
out.lrcco$fitYp
#> [,1] [,2] [,3]
#> RedY 0.3534509 0.009990066 0.1440308
#> cRedY 0.3534509 0.363441013 0.5074718
Finally, we make the full set of biplots for LR-CCO given in Figure 2 (Graffelman et al. (2018)). In each biplot, the canonical variates are multiplied by a convenient scalar to facilitate the visualization.
opar <- par(mfrow=c(2,2),mar=c(2,2,1,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
#
# Figure A
#
Z <- rbind(out.lrcco$Fs,out.lrcco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="A")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fs[,1],out.lrcco$Fs[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gp[,1],out.lrcco$Gp[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.15
points(fa*out.lrcco$U[,1],fa*out.lrcco$U[,2])
par(mfg=c(1,2))
#
# Figure B
#
Z <- rbind(out.lrcco$Fp,out.lrcco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="B")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fp[,1],out.lrcco$Fp[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gs[,1],out.lrcco$Gs[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*out.lrcco$V[,1],fa*out.lrcco$V[,2])
#
# Standardizing the clr transformed data
#
Xstan.clr <- scale(clrmat(Xsim.com))
Ystan.clr <- scale(clrmat(Ysim.com))
res.stan.cco <- canocov(Xstan.clr,Ystan.clr)
par(mfg=c(2,1))
#
# Figure C
#
Z <- rbind(res.stan.cco$Fs,res.stan.cco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="C")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fs[,1],res.stan.cco$Fs[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gp[,1],res.stan.cco$Gp[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.2
points(fa*res.stan.cco$U[,1],fa*res.stan.cco$U[,2])
circle()
par(mfg=c(2,2))
#
# Figure D
#
Z <- rbind(res.stan.cco$Fp,res.stan.cco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="D")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fp[,1],res.stan.cco$Fp[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gs[,1],res.stan.cco$Gs[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*res.stan.cco$V[,1],fa*res.stan.cco$V[,2])
circle()
Panel A shows the logratios \(\log{(x_2/x_3)}\) and \(\log{(y_1/y_2)}\) to have long links that run parallel to the first canonical variate with the largest canonical correlation; these logratios are highly correlated. The canonical biplot shows the association between the two sets of compositions, which is not visible in the ternary diagrams above.
In this subsection we treat the canonical analysis of bentonites. The X set concerns the concentrations of 9 major oxides, measured in 14 samples in the US (Cadrin et al. (1996)). A canonical analysis of this data set has been previously described (Reyment and Savazzi (1999)), and is extended here with biplots. The Y set concerns two isotopes, \(\delta D\) and \(\delta 18O\).
data("bentonites")
head(bentonites)
#> Si Al Fe Mn Mg Ca K Na H20 dD d18O
#> 1 51.17 19.18 2.09 0.001 4.54 1.30 2.30 0.93 9.88 93 13.5
#> 2 50.66 19.01 1.67 0.001 2.70 1.70 0.39 0.67 9.99 92 21.9
#> 3 54.38 20.03 2.04 0.001 3.54 2.04 0.10 0.20 10.26 93 21.9
#> 4 55.58 18.76 0.56 0.001 4.51 2.00 0.31 0.90 9.54 100 24.6
#> 5 54.43 22.58 0.69 0.001 3.75 1.47 0.75 0.15 9.14 111 21.7
#> 6 62.79 20.75 1.14 0.060 4.26 0.14 0.42 0.40 8.28 114 24.1
We clr-transform and column-center the major oxides, after deletion of MnO which is outlying and had many zeros, which were replaced with 0.001. We standardize the isotopes.
X <- bentonites[,1:9]
X <- X[,-4]
X <- X/rowSums(X)
Y <- scale(bentonites[,10:11])
Xclr <- clrmat(X)
cco <- canocov(Xclr,Y)
The two canonical correlations are large:
We construct a biplot of the data:
plot(cco$Fs[,1],cco$Fs[,2],col="red",pch=NA,xlab="First principal axis",
ylab="Second principal axis",xlim=c(-1,1),ylim=c(-1,1),asp=1)
textxy(cco$Fs[,1],cco$Fs[,2],colnames(X),cex=0.75)
arrows(0,0,cco$Gp[,1],cco$Gp[,2],angle=10,col="blue",length=0.1)
arrows(0,0,cco$Fs[,1],cco$Fs[,2],angle=10,col="red",length=0.1)
text(cco$Gp[,1],cco$Gp[,2],colnames(Y),pos=c(3,1))
fa <- 0.45
points(fa*cco$U[,1],fa*cco$U[,2])
textxy(fa*cco$U[,1],fa*cco$U[,2],1:14)
We overplot the biplot with the canonical X-variates, which allows one to inspect the original samples (Graffelman (2005)). For plotting, the canonical variate is scaled with a convenient scaling factor (here 0.45). This factor does not affect the interpretation of the biplot, but gives the samples a convenient spread.
The logratio \(\log{(Na/Mg)}\) (among others) almost coincides with the first canonical variate, which correlates with \(\delta 18O\). However, interpretation should proceed with care because of the small sample size.
We use archeological data from the UK (Tubb, Parker, and Nickless (1980)) to illustrate LR-LDA. This dataset consists of measurements of nine oxides on 48 archeological samples from three regions in the UK. We first prepare the data:
data(Tubb)
head(Tubb)
#> Sample Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO site
#> 1 GA1 18.8 9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015 G
#> 2 GA2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018 G
#> 3 GA3 18.2 7.64 1.82 0.77 0.40 3.07 0.98 0.087 0.014 G
#> 4 GA4 17.4 7.48 1.71 1.01 0.40 3.16 0.03 0.084 0.017 G
#> 5 GA5 16.9 7.29 1.56 0.76 0.40 3.05 1.00 0.063 0.019 G
#> 6 GB1 17.8 7.24 1.83 0.92 0.43 3.12 0.93 0.061 0.019 G
site <- factor(Tubb$site)
Oxides <- as.matrix(Tubb[,2:10])
rownames(Oxides) <- Tubb$Sample
Oxides <- Oxides/rowSums(Oxides)
Next, we carry out LR-LDA by passing the compositions in
Oxides
to the function lrlda
. Internally,
lrlda
applies the clr transformation of the data.
The group sizes are obtained with:
The group mean vectors of the clr transformed compositions are given by:
out.lrlda$Mclr
#> Group.1 Al2O3 Fe2O3 MgO CaO Na2O K2O
#> 1 G 3.011200 2.187485 0.7873925 0.09020565 -0.9765462 1.315820
#> 2 NF 4.348294 1.899077 1.0256504 -2.12623244 -1.5607332 2.175829
#> 3 W 2.807714 2.115195 1.8167997 -1.29407136 -1.3695227 1.622022
#> TiO2 MnO BaO
#> 1 -0.03735507 -2.485644 -3.892557
#> 2 1.47214116 -4.560993 -2.673032
#> 3 -0.08569854 -1.777495 -3.834943
The scores of the linear discriminant function are obtained by:
head(out.lrlda$LD)
#> LD1 LD2
#> GA1 1.09082644 4.920326
#> GA2 0.05755742 4.460343
#> GA3 0.67388420 4.715621
#> GA4 1.55665675 4.892781
#> GA5 -0.38244898 4.436551
#> GB1 0.03938665 4.029670
The confusion matrix for the training observations is:
Posterior probabilities for the classifications are obtained by
head(round(out.lrlda$prob.posterior,4))
#> G NF W
#> GA1 1 0 0
#> GA2 1 0 0
#> GA3 1 0 0
#> GA4 1 0 0
#> GA5 1 0 0
#> GB1 1 0 0
We extract biplot coordinates for group centers, individual observations and variables, and construct the LDA biplot.
Fp <- out.lrlda$Fp
Gs <- out.lrlda$Gs
LD <- out.lrlda$LD
colvec <- rep(NA,nrow(Oxides))
colvec[site=="G"] <- "red"
colvec[site=="NF"] <- "green"
colvec[site=="W"] <- "blue"
lims <- jointlim(LD,Fp)
opar <- par(bty="n",xaxt="n",yaxt="n")
plot(Fp[,1],Fp[,2],asp=1,pch=17,xlab="",ylab="",col=c("red","green","blue"),
xlim=lims$xlim,ylim=lims$ylim,cex=1.25)
points(LD[,1],LD[,2],col=colvec)
origin()
arrows(0,0,10*Gs[,1],10*Gs[,2],angle = 10, length = 0.1)
textxy(10*Gs[,1],10*Gs[,2],colnames(Oxides))
ld1lab <- paste("LD1 (",toString(round(100*out.lrlda$decom[2,1],1)),"%)",sep="")
ld2lab <- paste("LD2 (",toString(round(100*out.lrlda$decom[2,2],1)),"%)",sep="")
text(7,-0.25,ld1lab,cex=0.75)
text(0.25,7,ld2lab,cex=0.75,srt=90)
par(opar)
legend("topleft",c("G","NF","W"),col=c("red","green","blue"),pch=1,cex=0.5)
The LR-LDA biplot shows perfect separation of the three UK regions and suggests that a single logratio like \(\log{(MgO/Al2O3)}\) (among other possibilities) is capable of discriminating the three regions. A boxplot of this logratio confirms this.