Multivariate Analysis

Author

ANSALDI Lucile, BARREAU Julie, BONGIBAULT Baptiste, GUINEBRETIERE Aldric, MOREAU Cyrielle

Introduction to Multivariate Analysis

Biologists’ data, whether on individuals, populations or communities, are usually presented in the form of rectangular tables, with observations (n) in the rows and variables (p) in the columns.

The graphical representation of these n observations and p variables is easily achieved when there are only 2 or 3 variables (dimensions). However, when the number of variables increases, the graphical representation becomes complicated, and Multivariate Analyses come into their own!

The aim of these analyses is to reduce the number of dimensions and examine the structure of the data by answering the following questions:

  • Which observations are similar?

  • Are there observations that stand out? Subgroups?

  • Which variables are correlated?

  • Are there particular links between certain observations and/or variables?

Multivariate analysis methods are therefore used to describe the data and generate hypotheses that can then be tested.

PART 1: Principal Component Analysis (PCA)

Introduction

PCA can be used to process a measurement table with:

  • In rows: the n observations.

  • In columns: the p quantitative variables.

The 2 sets (observations and variables) are totally distinct and non-interchangeable. In other words, if you interchange the rows and columns (with the variables in the rows and the observations in the columns), the table no longer has the same meaning.

In PCA tables, the mean of a column has a meaning, while the mean of a row does not.

The aim of PCA is to achieve the best geometric representation of individuals and variables. To achieve this, we seek to reduce the dimension by finding the best projection plane (subspace) for “best” visualization of the point cloud in reduced space.

However, this reduction must:

  • Preserve distances between individuals: two individuals that resemble each other must be close in the representation space.

  • Preserve correlations between variables:

    • Two correlated variables must be represented by vectors forming an acute angle.

    • Two independent variables are represented by orthogonal vectors.

Mathematics

Principal Component Analysis (PCA) is a powerful method for simplifying the complexity of highly correlated data. This approach aims to reduce dimensionality in a linear manner, thereby providing a clearer perspective on the underlying structure of the data.

In essence, PCA performs a linear transformation of the original variables to new variables called principal components. These components are carefully selected to capture the majority of the variance present in the dataset, thereby maximizing the separation between different observations.

Each principal component can be expressed as:

\[ Z_p=\sum^p_{j=1}\phi_j\times X_j \]

Where:

  • \(Z_p\) is the principal component \(p\)

  • \(\phi_j\) is the \(j\)-th weight for principal component \(p\); together, the \(\phi_j\) form the weight vector, i.e., the coefficients of the linear combination of the original variables from which the principal component is constructed.

  • \(X_j\) is the standardized predictor, meaning it has a mean equal to 0 and a standard deviation of 1.

The process begins with a matrix \(Y\), where each row represents an observation and each column represents a continuous variable.

First, we standardize the observations as follows:

\[ Y_{std} = \frac{y_i-\overline{y}}{\sigma_y} \]

which is equivalent to centering:

\[ y_c= y_i-\overline{y} \]

then scaling:

\[ y_s = \frac{y_c}{\sigma_y} \]

We can then calculate the variance-covariance matrix:

\[ R = cov(Y_{std}) \]

The decomposition of this matrix provides the matrix \(U\) of eigenvectors, which contains the principal components.

The eigenvectors, defined as the principal components, have associated eigenvalues. The graphical representation of the distances of observations to these eigenvectors illustrates how each principal component captures the variation in the data.

The orthogonal aspect of the principal components becomes evident when observing the second principal component, which, in turn, maximizes the remaining variance in the data while being orthogonal to the first. These principal components, each aligned along an orthogonal axis, form a new coordinate system in which the information is more clearly structured.

In addition to representing the variance of the principal components, the eigenvectors also provide a relative understanding of their influence. This influence is calculated by dividing the eigenvalues by the total sum of these values, thus providing context on the contribution of each principal component \(v_k\) to the overall information.

\[ Explained\ variance\ of\ v_k = \frac{\lambda_{v_k}}{\sum_{i=1}^p\lambda_{v_i}} \]

The conclusion of the PCA involves obtaining the coordinate matrix \(F\). This matrix results from multiplying the standardized matrix \(Y_{std}\) by the matrix \(U\), enabling a rotation of the new data space. This process aligns the data along the principal components, providing a simplified yet informative view of the underlying structure of the initial data.
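
To make the algebra above concrete, here is a minimal R sketch (the toy matrix Y is an assumption) that reproduces each step: standardization, the matrix \(R\), its eigendecomposition, the explained variances and the coordinate matrix \(F\):

# Toy data: 10 observations and 5 variables (assumed example)
set.seed(1)
Y <- matrix(rnorm(50), nrow = 10, ncol = 5)

Y_std <- scale(Y)                       # center and scale each column
R     <- cov(Y_std)                     # variance-covariance matrix of Y_std
eig   <- eigen(R)                       # eigendecomposition of R
U     <- eig$vectors                    # eigenvectors = principal components
expl  <- eig$values / sum(eig$values)   # explained variance of each component
F_mat <- Y_std %*% U                    # coordinates of the observations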

Interpretation

Here’s an example of how PCA can be applied, using the decathlon2 dataset from the factoextra package.

a) Importing the dataset:

# Loading the "decathlon2" dataset from the factoextra package :
library(factoextra)
Loading required package: ggplot2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data("decathlon2") 
data=decathlon2[decathlon2$Competition=="OlympicG",c(1:10,12)]
summary(data)
     X100m         Long.jump        Shot.put       High.jump    
 Min.   :10.44   Min.   :6.990   Min.   :13.07   Min.   :1.880  
 1st Qu.:10.72   1st Qu.:7.303   1st Qu.:14.52   1st Qu.:1.940  
 Median :10.88   Median :7.475   Median :14.86   Median :2.045  
 Mean   :10.82   Mean   :7.474   Mean   :14.90   Mean   :2.017  
 3rd Qu.:10.94   3rd Qu.:7.688   3rd Qu.:15.29   3rd Qu.:2.112  
 Max.   :11.14   Max.   :7.960   Max.   :16.36   Max.   :2.150  
     X400m        X110m.hurdle       Discus        Pole.vault   
 Min.   :46.81   Min.   :13.97   Min.   :40.11   Min.   :4.400  
 1st Qu.:48.56   1st Qu.:14.07   1st Qu.:43.90   1st Qu.:4.625  
 Median :49.05   Median :14.23   Median :44.73   Median :4.900  
 Mean   :49.01   Mean   :14.29   Mean   :45.43   Mean   :4.843  
 3rd Qu.:49.41   3rd Qu.:14.36   3rd Qu.:47.66   3rd Qu.:5.000  
 Max.   :50.79   Max.   :14.95   Max.   :51.65   Max.   :5.400  
    Javeline         X1500m          Points    
 Min.   :51.53   Min.   :264.4   Min.   :7926  
 1st Qu.:55.43   1st Qu.:270.5   1st Qu.:8088  
 Median :58.11   Median :276.3   Median :8236  
 Mean   :59.58   Mean   :275.1   Mean   :8317  
 3rd Qu.:62.92   3rd Qu.:278.6   3rd Qu.:8396  
 Max.   :70.52   Max.   :287.6   Max.   :8893  

In this data set, we have 14 individuals (athletes) on whom 11 quantitative variables have been recorded: the performance in each of 10 sports disciplines, plus the total number of points:

  • X100m

  • Long.jump

  • Shot.put

  • High.jump

  • X400m

  • X110m.hurdle

  • Discus

  • Pole.vault

  • Javeline

  • X1500m

  • Points

If NAs are present in the data, they can be omitted from the analysis, so that the PCA runs without problems, with the following command:

# data=na.omit(object = data)

We want to find out whether particular links between certain observations and/or variables can be observed in this data set.

b) Study of correlations:

Obtain the correlation matrix between variables using the cor() function:

cor(data)
                   X100m  Long.jump    Shot.put   High.jump       X400m
X100m         1.00000000 -0.8350966 -0.21746312 -0.26333512  0.54802600
Long.jump    -0.83509656  1.0000000  0.31407521  0.24623935 -0.51919384
Shot.put     -0.21746312  0.3140752  1.00000000  0.77045776 -0.18546890
High.jump    -0.26333512  0.2462394  0.77045776  1.00000000 -0.17357206
X400m         0.54802600 -0.5191938 -0.18546890 -0.17357206  1.00000000
X110m.hurdle  0.38366645 -0.4914562  0.03157435  0.01126452  0.35162059
Discus       -0.51341585  0.5136697  0.88239852  0.71362867 -0.36823138
Pole.vault    0.07021566  0.1578300 -0.41715958 -0.65024394  0.19459315
Javeline     -0.06085398  0.2256900  0.53582882  0.20440490  0.04744169
X1500m       -0.47201188  0.5394643  0.10753385 -0.03189887  0.15026787
Points       -0.64440656  0.7809521  0.76009926  0.60096524 -0.52293108
             X110m.hurdle     Discus  Pole.vault    Javeline      X1500m
X100m          0.38366645 -0.5134159  0.07021566 -0.06085398 -0.47201188
Long.jump     -0.49145618  0.5136697  0.15783003  0.22568995  0.53946433
Shot.put       0.03157435  0.8823985 -0.41715958  0.53582882  0.10753385
High.jump      0.01126452  0.7136287 -0.65024394  0.20440490 -0.03189887
X400m          0.35162059 -0.3682314  0.19459315  0.04744169  0.15026787
X110m.hurdle   1.00000000 -0.1755752  0.05716187  0.23674145 -0.41440555
Discus        -0.17557519  1.0000000 -0.46586569  0.47246648  0.12824707
Pole.vault     0.05716187 -0.4658657  1.00000000  0.08916975  0.42631317
Javeline       0.23674145  0.4724665  0.08916975  1.00000000  0.01073007
X1500m        -0.41440555  0.1282471  0.42631317  0.01073007  1.00000000
Points        -0.23729481  0.8596594 -0.13740583  0.60192747  0.24700780
                 Points
X100m        -0.6444066
Long.jump     0.7809521
Shot.put      0.7600993
High.jump     0.6009652
X400m        -0.5229311
X110m.hurdle -0.2372948
Discus        0.8596594
Pole.vault   -0.1374058
Javeline      0.6019275
X1500m        0.2470078
Points        1.0000000

As a general rule, two variables are considered correlated if the absolute value of their correlation coefficient is greater than 0.9. Here, we can see that no pair of variables crosses this threshold.

Here’s another way of visualizing correlations between variables, useful when you have a lot of variables, as here:

abs(cor(data))>0.9
             X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
X100m         TRUE     FALSE    FALSE     FALSE FALSE        FALSE  FALSE
Long.jump    FALSE      TRUE    FALSE     FALSE FALSE        FALSE  FALSE
Shot.put     FALSE     FALSE     TRUE     FALSE FALSE        FALSE  FALSE
High.jump    FALSE     FALSE    FALSE      TRUE FALSE        FALSE  FALSE
X400m        FALSE     FALSE    FALSE     FALSE  TRUE        FALSE  FALSE
X110m.hurdle FALSE     FALSE    FALSE     FALSE FALSE         TRUE  FALSE
Discus       FALSE     FALSE    FALSE     FALSE FALSE        FALSE   TRUE
Pole.vault   FALSE     FALSE    FALSE     FALSE FALSE        FALSE  FALSE
Javeline     FALSE     FALSE    FALSE     FALSE FALSE        FALSE  FALSE
X1500m       FALSE     FALSE    FALSE     FALSE FALSE        FALSE  FALSE
Points       FALSE     FALSE    FALSE     FALSE FALSE        FALSE  FALSE
             Pole.vault Javeline X1500m Points
X100m             FALSE    FALSE  FALSE  FALSE
Long.jump         FALSE    FALSE  FALSE  FALSE
Shot.put          FALSE    FALSE  FALSE  FALSE
High.jump         FALSE    FALSE  FALSE  FALSE
X400m             FALSE    FALSE  FALSE  FALSE
X110m.hurdle      FALSE    FALSE  FALSE  FALSE
Discus            FALSE    FALSE  FALSE  FALSE
Pole.vault         TRUE    FALSE  FALSE  FALSE
Javeline          FALSE     TRUE  FALSE  FALSE
X1500m            FALSE    FALSE   TRUE  FALSE
Points            FALSE    FALSE  FALSE   TRUE

If TRUE (outside the diagonal), then the two variables are correlated. If two variables are correlated, one must be removed. The choice of which of the two correlated variables to delete is arbitrary and depends on the question being asked.

To remove a variable from the data set, use the following command: data=data[,-(column_of_the_variable_to_remove)]

c) Performing PCA:

To run the PCA, you need to load the following two packages: factoextra and FactoMineR.

Then, to perform the PCA in R, you can use the PCA() function. Remember to store the result of this PCA in a new variable, so that you can easily retrieve the PCA information you need for subsequent interpretation, as in the sketch below.
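
A minimal sketch of this step (the result is stored in an object named PCA so that the outputs shown below, such as PCA$eig, can be retrieved; this name shadows the PCA() function, but R still resolves the function call correctly):

# Load the packages and run the PCA without drawing the default graphs
library(FactoMineR)
library(factoextra)
PCA <- PCA(data, scale.unit = TRUE, graph = FALSE)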

d) Interpretation of outputs:

1) Inertia and choice of axes:

# Recover the eigenvalues of each axis and the inertia carried by the principal components
PCA$eig
          eigenvalue percentage of variance cumulative percentage of variance
comp 1  4.753038e+00           4.320944e+01                          43.20944
comp 2  2.418289e+00           2.198444e+01                          65.19388
comp 3  1.551340e+00           1.410310e+01                          79.29698
comp 4  9.863587e-01           8.966897e+00                          88.26388
comp 5  5.554931e-01           5.049937e+00                          93.31381
comp 6  3.216961e-01           2.924510e+00                          96.23832
comp 7  2.203570e-01           2.003246e+00                          98.24157
comp 8  1.149733e-01           1.045212e+00                          99.28678
comp 9  6.178365e-02           5.616696e-01                          99.84845
comp 10 1.666416e-02           1.514924e-01                          99.99994
comp 11 6.323601e-06           5.748728e-05                         100.00000
# Graphical display of the inertia of each axis : 
fviz_eig(PCA, addlabels = TRUE)

To determine the number of axes used in the PCA analysis, we need to identify the jump in variance explained by the different axes. Here we can see that the first axis represents 43.2% of the variance, the second axis 22%, the third axis 14.1%, …

We can see from the scree plot that the explained variance drops off sharply after the first two axes; the variance carried by the remaining axes is small in comparison.

For the PCA interpretation, we therefore retain the first two axes, which together explain 65.19% of the variance.

2) Interpretation of the biological meaning of the axes:

In PCA, the axes are interpreted according to the columns (i.e. variables) via the correlation circle and the absolute contributions of the variables.

To obtain this information in R, use the following commands:

# Obtain the absolute contributions of variables :
PCA$var$contrib
                 Dim.1       Dim.2       Dim.3      Dim.4       Dim.5
X100m        10.824776 10.03651625  1.63515357  0.3755928 23.39362182
Long.jump    12.418510 13.24601095  0.18397545  1.0039891  1.29845908
Shot.put     13.118558  8.89903353  2.86005522  2.7322519  1.12053773
High.jump    10.101229 10.94386007  1.34856164  9.6446306  5.75676065
X400m         6.050422  1.75575069 17.68011201 32.4590377  0.09123132
X110m.hurdle  2.265809 13.96580459 10.80010469  7.4256988 54.75065458
Discus       17.620323  2.99067030  0.06343601  0.5423102  0.71456580
Pole.vault    2.249144 16.38731995 23.04479888  4.8414968  0.03149013
Javeline      4.133763  3.25169636 31.78461582  9.3858284 10.87089875
X1500m        1.779257 18.51325372  8.53578854 28.5491707  1.51933507
Points       19.438208  0.01008361  2.06339817  3.0399930  0.45244507
# Obtain the circle of correlations :
fviz_pca_var(PCA)

# Graphical representation of absolute variable contribution values for an axis: 
# Axis 1 case : 
fviz_contrib(PCA, choice = "var", axes = 1)

# Axis 2 case: 
fviz_contrib(PCA, choice = "var", axes = 2)

The absolute contribution shows how the initial variable contributes to the formation of the axis. To find out from these results which variables contribute to the formation of the synthetic axis, we use the threshold of \(\frac{1}{nb\_variables}\times100\) (in general). For a variable to contribute, its contribution value must be greater than this threshold. This threshold value corresponds to the red dotted line on the graphs showing the absolute contributions of the variables along the axes.
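
As a quick check, the threshold and the variables contributing to axis 1 can be computed directly from the objects defined above (a sketch):

threshold <- 100 / ncol(data)                    # here 100/11, i.e. about 9.1%
names(which(PCA$var$contrib[, 1] > threshold))   # variables forming axis 1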

From these results, we can see that axis 1 is mainly represented by the following variables:

  • Points, Discus, Shot.put, Long.jump, X100m, High.jump

Axis 2 is represented by the variables:

  • X1500m, Pole.vault, X110m.hurdle, X100m

The correlation circle also shows that some variables are related and others independent. Two related variables are represented by vectors forming an acute angle. Two independent variables are represented by orthogonal vectors. Two negatively related variables are represented by two opposite vectors.

3) Projection of individuals to create a typology based on the axes:

For this, we use the relative contributions of the individuals (their squared cosines, cos2). These relative contributions can be used to create a typology of individuals based on the axes. The relative contribution of an individual corresponds to the share of the individual's information that is captured by the axis (in percentage), i.e., the quality of its representation on that axis.

To obtain the relative contributions of individuals on R, use the following commands:

# Relative contributions of individuals :
PCA$ind$cos2
                Dim.1        Dim.2        Dim.3        Dim.4        Dim.5
Sebrle     0.68963196 6.540767e-12 0.1505401735 3.421119e-03 0.0919816558
Clay       0.68982961 7.883438e-02 0.1191186204 1.593953e-04 0.0009981983
Karpov     0.74957187 2.476108e-02 0.1608118952 2.542731e-03 0.0008147725
Macey      0.12167773 7.112491e-01 0.0847580597 1.109719e-04 0.0239527703
Warners    0.03513216 7.035171e-01 0.2043199790 2.553907e-02 0.0041832424
Zsivoczky  0.01766196 7.411226e-01 0.0583141713 1.304529e-02 0.1454252042
Hernu      0.23531031 2.767503e-01 0.1923125481 1.038527e-01 0.0748120286
Bernard    0.03767052 2.359745e-02 0.4446099398 2.664423e-01 0.0658970688
Schwarzl   0.74103094 1.190975e-01 0.0087379756 1.268006e-06 0.0098608763
Pogorelov  0.12049107 2.472670e-02 0.0687218609 7.285021e-01 0.0045896049
Schoenbeck 0.47746580 2.724861e-02 0.2569700383 5.580082e-02 0.0190441531
Barras     0.32981991 4.127756e-01 0.0005470956 1.535366e-02 0.1765554931
Nool       0.23451399 1.310712e-01 0.2420363184 2.449685e-01 0.1242846825
Drews      0.53811419 2.784934e-01 0.1528997775 1.306683e-02 0.0056915740
# Obtaining the projection of individuals : 
fviz_pca_ind(PCA)

The individuals who most explain Axis 1 are :

  • Sebrle, Clay, Karpov, Schwarzl, Drews, Schoenbeck

Those who most explain axis 2 are :

  • Macey, Warners, Zsivoczky, Barras

e) Biological conclusion:

Thus, thanks to the results of the absolute contributions of the variables and the relative contributions of the individuals, we can represent the projection of the individuals and variables as follows:

fviz_pca_biplot(PCA)

Knowing the biological interpretation of the axes:

(Figure: annotated PCA interpretation)

We can therefore see that the individuals with the highest scores at this tournament are those who had the best results in the discus, shot put, long jump and high jump events. Note that running performances are recorded as times, so a low value on X100m is a good performance: the opposition between X100m and Points on the biplot means that slow 100m times go with lower overall scores.

Thanks to PCA, we can see that the final score is more linked to the score of certain disciplines than others. And therefore, that links between certain variables and individuals can be observed.

PART 2: Correspondence Analysis (CA)

Introduction

CA deals with tables that cross two categorical variables with several modalities, such as contingency tables, frequency tables, …

In these tables, rows and columns are interchangeable and the variables are all in the same units.

The aim of CA is to reveal structures and associations between these qualitative variables by representing them in a multidimensional space. It allows us to visualize relationships between categories of different variables, and to highlight trends, groupings or disparities. It is particularly useful for analyzing categorical data, when you want to understand the underlying structure of the data and identify associations between categories.

Unlike PCA, for a CA, we’re not interested in the distances between points, but in the distances between profiles of different modalities. A CA is therefore simply a PCA on the profiles, by equipping the space with a suitable distance: the \(\chi^2\) distance. With this distance, the weight of rows (or columns) is relativized, but not cancelled out, and the symmetry between rows and columns is preserved.

Mathematics

Correspondence Analysis (CA) represents a statistical method aimed at exploring the relationship between two sets of categories. In contrast to PCA, which applies to continuous variables, CA is designed for categorical variables, allowing the analysis of the structure of a contingency table.

In essence, Correspondence Analysis (CA) seeks to graphically represent associations between rows and columns in a contingency table. It helps reveal trends, groupings, or oppositions within categories, providing insight into the structure of dependence between variables.

The process begins with a contingency table showing the observed frequencies of co-occurrence between different categories. These data are then transformed to obtain a probability table where the frequency of each co-occurrence is calculated as follows:

\[ f_{ij} = \frac{X_{ij}}{n} \]

With \(f_{ij}\) representing the probability that events \(i\) and \(j\) occur jointly, \(X_{ij}\) the number of times this co-occurrence has been observed, and \(n\) the total number of observations. The marginal probabilities of the columns \((f_{.j})\) and rows \((f_{i.})\) are calculated as follows:

\[ f_{.j} = \sum_{i=1}^{I}f_{ij}, \qquad f_{i.}=\sum_{j=1}^{J}f_{ij} \]

- Profiles

  • Row profiles

    The row profile is \(L_i = (\frac{f_{i1}}{f_{i.}},...,\frac{f_{ij}}{f_{i.}})\). In matrix terms, we will set up \(D_n=diag(f_{i.})\) such that \(L = D_n^{-1}F\). We can then define the average row profile, obtained by summing the columns of our frequency table :

    \[ (\sum_{i=1}^nf_{i.}\frac{f_{i1}}{f_{i.}},...,\sum_{i=1}^nf_{i.}\frac{f_{ij}}{f_{i.}})=(f_{.1},...,f_{.j}) \]

  • Column profiles

    The column profile is \(C_j=(\frac{f_{1j}}{f_{.j}},...,\frac{f_{ij}}{f_{.j}})\). As with row profiles, it can be expressed and calculated in matrix terms by setting: \(D_p=diag(f_{.j})\), such that \(C=D_p^{-1}F^T\). We can then define the average column profile by summing the rows of our frequency table:

    \[ (\sum_{j=1}^pf_{.j}\frac{f_{1j}}{f_{.j}},...,\sum_{j=1}^pf_{.j}\frac{f_{ij}}{f_{.j}})=(f_{1.},...,f_{i.}) \]

- Distances between profiles

One can measure the distance between two row profiles by

\[ d^2(i,i')=\sum_{j=1}^p\left(\frac{f_{ij}}{f_{i.}}-\frac{f_{i'j}}{f_{i'.}}\right)^2 \]

But this distance does not take into account the importance of each column. A more sensible choice is to use the \(\chi^2\) distance, which does take the importance of each column into consideration.

\[ d^2(i,i')=\sum_{j=1}^p\frac{1}{f_{.j}}\left(\frac{f_{ij}}{f_{i.}}-\frac{f_{i'j}}{f_{i'.}}\right)^2 \]

Let’s use the notation \(x\) to refer to the vector of differences between two row profiles (\(L_1-L_2\)) or between two column profiles (\(C_1-C_2\)). Adopting matrix notation, the squared \(\chi^2\) distance is of the form:

\[ x^TD_p^{-1}x = \sum_{j=1}^p\frac{x^2_j}{f_{.j}} \]

for a row point \(x \in \mathbb{R}^p\).

- Independence

In hypothesis testing, the objective is to determine whether there is a link between the two qualitative variables. The null hypothesis \(H_0\) is the independence of the two variables (no effect). The alternative hypothesis \(H_1\) is the presence of a link between the two variables.

The association between the two categorical variables is assessed through the discrepancy between the observed data and the independence model. This independence model is based on the following equation:

\[ P(A\ \cap\ B) = P(A) \times P(B) \]

Therefore, under independence, the joint probability is the product of the marginal probabilities:

\[ f_{ij} = f_{i.} \times f_{.j} \]

One can then calculate the discrepancy between the observed data \((f_{ij})\) and the independence model:

  • The significance of the association is measured with a \(\chi^2\) test

    \[ \chi_{obs}^2 = \sum_{i=1}^I\sum_{j=1}^J\frac{(obs.headcount - th.headcount)^2}{th.headcount}=\sum_{i=1}^I\sum_{j=1}^J\frac{(n\times f_{ij} - n\times f_{i.}\times f_{.j})^2}{n\times f_{i.}\times f_{.j}} \]

  • The strength of the association \(\Phi^2\) is measured by the discrepancy between theoretical and observed probabilities.

\[ \Phi^2 = \sum_{i=1}^I\sum_{j=1}^J\frac{(obs.prob - th.prob)^2}{th.prob} \]
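
Both quantities are straightforward to compute by hand in R; a minimal sketch for an assumed toy contingency table tab:

tab  <- matrix(c(20, 10, 5, 15), nrow = 2)   # assumed toy contingency table
n    <- sum(tab)                             # total number of observations
f    <- tab / n                              # observed joint probabilities f_ij
fi   <- rowSums(f)                           # row marginal probabilities f_i.
fj   <- colSums(f)                           # column marginal probabilities f_.j
th   <- outer(fi, fj)                        # independence model f_i. * f_.j
chi2 <- n * sum((f - th)^2 / th)             # observed chi-square statistic
phi2 <- chi2 / n                             # strength of the association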

- Binary Correspondence Analysis

By conducting a correspondence analysis, we aim to simultaneously represent the row profiles belonging to \(\mathbb{R}^p\) and the column profiles belonging to \(\mathbb{R}^n\) of a frequency table. As this representation is done in the Cartesian plane, it will be obtained through a double principal component analysis.

The two principal component analyses are not directly performed on the variables but rather on the row profiles (direct analysis) and the column profiles (dual analysis) presented earlier. Moreover, binary correspondence analysis incorporates the notion of column (or row) weights and the \(\chi^2\) distance.

- Direct Analysis (Rows)

The direct analysis is performed on the row profiles. The rows of \(L = D_n^{-1}F\) are elements of \(\mathbb{R}^p\). We aim to represent them in this space using the distance function \(x^TD_p^{-1}x\).

In the direct analysis, we seek the vector \(u \in \mathbb{R}^p\) such that:

\[ (u^TD_p^{-1}F^TD_n^{-1})D_n(D_n^{-1}FD_p^{-1}u) \]

with \(F\) the observed frequency matrix, is maximal, given that \(u^TD_p^{-1}u=1\).

The solution is given by the principal eigenvector of

\[ D_pD_p^{-1}F^TD_n^{-1}FD_p^{-1}=F^TD_n^{-1}FD_p^{-1}\equiv S \]

- Dual Analysis (Columns)

Dual analysis is performed on the column profiles. The rows of \(C = D_p^{-1}F^T\) are elements of \(\mathbb{R}^n\). We aim to represent them in this space using the distance function \(x^TD_n^{-1}x\).

In dual analysis, we seek the vector \(v \in \mathbb{R}^n\) such that:

\[ (v^TD_n^{-1}FD_p^{-1})D_p(D_p^{-1}F^TD_n^{-1}v) \]

is maximal, given that \(v^TD_n^{-1}v=1\).

The solution is given by the principal eigenvector of

\[ D_n(D_n^{-1}FD_p^{-1}F^TD_n^{-1})=FD_p^{-1}F^TD_n^{-1}\equiv T \]

- Projection of both analyses onto the same plane

The usual goal of correspondence analysis is to produce a 2-dimensional graph that summarizes the information contained in the frequency table and highlights different interesting associations. In other words, we want to somehow overlay the figures resulting from dual and direct analyses.

To present both analyses in the same plane, it is essential to ensure that both PCAs project the data into the same dimensions. The transition relationships and the concept of the center of gravity presented in the following sections allow us to ensure that we are indeed projecting the “individuals” from dual and direct analyses into the same dimensions.

- Transition Relationships

The following relationships

\[ D_n^{-1}F\varphi_j=\sqrt{\lambda_j}\Psi_j\equiv \hat{\Psi}_j \]

\[ D_p^{-1}F^T\Psi_j=\sqrt{\lambda_j}\varphi_j\equiv \hat{\varphi}_j \]

With \(\varphi_j\) the eigenvector associated with the eigenvalue \(\lambda_j\) of the matrix \(F\); \(\Psi_j\) the modality coordinates of the row variable in the row space defined by the eigenvalues; \(\hat{\Psi}_j\) the contribution of component \(j\) to the row variable in the row space; and \(\hat{\varphi}_j\) the contribution of component \(j\) to the column variable in the column space,

establish an almost barycentric link between the two types of analysis, meaning that the coordinates of points in one space are proportional to the components of the factor from the other space corresponding to the same eigenvalue.

Verification: The first relationship is confirmed as follows.

\[ D_n^{-1}F\varphi_j=D_n^{-1}(FD_p^{-1}u_j) = D_n^{-1}(\sqrt{\lambda_j}v_j)=\sqrt{\lambda_j}\Psi_j \]

The second identity is demonstrated in a similar manner.

The consequences of the preceding transition relationships are that…

  1. \(1=\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0\)

  2. \((f_1,\dots,f_p)^T\) is an eigenvector associated with the eigenvalue \(\lambda_1=1\) of \(S=F^TD_n^{-1}FD_p^{-1}\).

    Note that since the first eigenvalue is always equal to one, many software packages only mention the other \(p-1\) eigenvalues.

- Center of Gravity

To project the observations onto the same plane, we need to determine the origin of the graph.

The center of gravity of the rows is defined by \(G_L=(g_1,\dots,g_p)^T\), where

\[ g_j=\sum_{i=1}^nf_i\frac{f_{ij}}{f_i}=\sum_{i=1}^nf_{ij}=f_j, \quad 1\leq j \leq p \]

Similarly, the center of gravity of the columns is defined by

\[ G_C=(f_1,\dots,f_n)^T \]

Centering of the rows is obtained by calculating

\[ \frac{f_{ij}}{f_i}-g_j=\frac{f_{ij}}{f_i}-f_j=\frac{f_{ij}-f_if_j}{f_i} \]

so that \(\sum_{j=1}^p\frac{f_{ij}-f_if_j}{f_i}=0\)

for all \(i \in \{1,\dots,n\}\).

The consequence of centering the rows is that the analysis is no longer conducted on \(S=F^TD_n^{-1}FD_p^{-1}\) but rather on \(S^*=(s^*_{jj'})\), where

\[ s^*_{jj'}=\sum_{i=1}^n\frac{(f_{ij}-f_if_j)(f_{ij'}-f_if_{j'})}{f_if_{j'}} \]

To detect associations between rows and columns, one must establish a connection with the statistic of the \(\chi^2\) test.

By definition, \(trace(S^*)=\sum_{j=1}^p\sum_{i=1}^n\frac{(f_{ij}-f_if_j)^2}{f_if_j}\)

Interpretation

Here’s an example of how CA can be applied, on a fictive hair-color × eye-color dataset created directly in R:

# Creating the dataset :
## Step 1: Create eye and hair color vectors
colors_of_eyes <- c("Blue", "Brown", "Green", "Gray", "Black", "Yellow")
hair_colors <- c("Black", "Brown", "Blond", "Redhead", "Chestnut", "Blues", "White", "Pink")

## Step 2: Create a data array with fixed values
fixed_values <- matrix(c(
  50, 35, 32, 24, 0, 0,
  15, 84, 34, 18, 16, 0,
  7, 4, 1, 20, 10, 0,
  0, 18, 0, 21, 34, 6,
  5, 3, 1, 0, 0, 3,
  8, 4, 2, 5, 13, 24,
  9, 10, 0, 12, 0, 10,
  1, 20, 5, 46, 2, 5
), nrow = length(hair_colors), ncol = length(colors_of_eyes))

## Store the matrix in an array variable : 
data <- as.data.frame(fixed_values)

## Name rows and columns
colnames(data) <- colors_of_eyes
rownames(data) <- hair_colors


data
         Blue Brown Green Gray Black Yellow
Black      50    34    10    5     2      0
Brown      35    18     0    3     5     10
Blond      32    16     0    1    13      1
Redhead    24     0    18    0    24     20
Chestnut    0     7     0    0     9      5
Blues       0     4    21    3    10     46
White      15     1    34    8     0      2
Pink       84    20     6    4    12      5

This table shows the number of people with each eye color (in columns) and hair color (in rows), for a total of 617 people sampled.

By performing a CA on this dataset, we seek to show the correspondences/oppositions between the different modalities of a variable.

a) Performing CA:

To run the CA, as with the PCA, you need to load the following two packages: factoextra and FactoMineR.

Then, to perform the CA in R, you can use the CA() function. Remember to store the result of this CA in a new variable, so that you can easily retrieve the CA information you need for subsequent interpretation, as in the sketch below.

In the CA output, rows are called row and columns col.
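
A minimal sketch of this step (the result is stored in an object named CA so that the outputs shown below, such as CA$eig, can be retrieved):

# Load the packages and run the CA on the contingency table
library(FactoMineR)
library(factoextra)
CA <- CA(data, graph = FALSE)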

b) Interpretation of outputs:

1) Inertia and choice of axes:

# Recover the eigenvalues of each axis and the inertia carried by the principal components
CA$eig
       eigenvalue percentage of variance cumulative percentage of variance
dim 1 0.415178377             55.3969063                          55.39691
dim 2 0.215052153             28.6942303                          84.09114
dim 3 0.073125570              9.7570841                          93.84822
dim 4 0.043902740              5.8579062                          99.70613
dim 5 0.002202466              0.2938732                         100.00000
# Graphical display of the inertia of each axis : 
fviz_eig(CA, addlabels = TRUE)

To determine the number of axes used in the CA analysis, we need to identify the jump in variance explained by the different axes. It’s the same as for PCA. Here we can see that the first axis represents 55.4% of the variance, the second axis 28.7%, the third axis 9.8%, …

We can see that the explained variance drops off sharply after the first two axes; the variance carried by the remaining axes is small in comparison.

For the CA interpretation, we therefore retain the first two axes, which together explain 84.09% of the variance.

2) Interpretation of the biological meaning of the axes:

In CA, in contrast to PCA, axes are interpreted either column-wise or row-wise. The choice depends on the initial biological question.

Here we’ll interpret the axes according to the rows, i.e. according to the hair colors. To find out which rows contribute to the formation of a synthetic axis, we use the threshold \(\frac{1}{nb\_rows}\times100\).

The threshold in our case is: \(\frac{1}{8}\times100 = 12.5\%\).

To obtain this information in R, use the following commands:

# Obtaining absolute row contribution values : 
Row_contrib=CA$row$contrib
Row_contrib>(1/8)*100
         Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
Black     TRUE FALSE  TRUE FALSE  TRUE
Brown    FALSE FALSE FALSE FALSE  TRUE
Blond    FALSE FALSE FALSE FALSE FALSE
Redhead  FALSE FALSE  TRUE FALSE  TRUE
Chestnut FALSE  TRUE FALSE  TRUE FALSE
Blues     TRUE FALSE  TRUE FALSE FALSE
White    FALSE  TRUE FALSE FALSE FALSE
Pink      TRUE FALSE FALSE  TRUE FALSE

According to these results, the rows explaining the first axis are:

  • Black, Blues, Pink

The rows explaining the second axis are:

  • Chestnut, White

3) Interpretation of relative contributions:

Relative contributions represent the quality of representation of the column/row by the axis.

Either the rows or the columns are interpreted, depending on the choice made in the previous step. In our case, we chose the columns, since we interpreted the axes with the absolute contributions of the rows.

# Obtain relative column contribution values :
CA$col$cos2
            Dim 1      Dim 2      Dim 3       Dim 4        Dim 5
Blue   0.85575259 0.01448471 0.00950636 0.120205825 5.051631e-05
Brown  0.57556811 0.05455375 0.17954769 0.189337969 9.924835e-04
Green  0.45208660 0.52672933 0.01096464 0.008743165 1.476265e-03
Gray   0.01081361 0.82960368 0.08394864 0.020223487 5.541059e-02
Black  0.06314083 0.46740988 0.41166914 0.056819078 9.610773e-04
Yellow 0.74157647 0.18279275 0.05802235 0.017605743 2.692628e-06

Blue and yellow eye colors are well represented by axis 1. Grey eye color is represented by axis 2.

# Relative line contribution values : 
CA$row$cos2
              Dim 1       Dim 2      Dim 3       Dim 4        Dim 5
Black    0.69781032 0.070862823 0.15315848 0.066726581 1.144180e-02
Brown    0.52754846 0.167378322 0.26792495 0.020035304 1.711296e-02
Blond    0.69002130 0.182751826 0.08463182 0.041935390 6.596638e-04
Redhead  0.49239795 0.083209727 0.41251121 0.007108659 4.772456e-03
Chestnut 0.03638017 0.564361623 0.02919618 0.365627649 4.434383e-03
Blues    0.87482351 0.049760467 0.07057401 0.004824928 1.708337e-05
White    0.15364853 0.827793121 0.00721218 0.010354558 9.916087e-04
Pink     0.78192585 0.000285946 0.02156349 0.195417315 8.073975e-04

This verifies that the modalities contributing most to the formation of the first two axes all have a good quality of representation (which is not always the case).

c) Biological conclusion:

For the question of links between modalities in rows or columns, you need to know that:

  • If we have a positive scalar product, i.e. less than a 90° angle between the arrows of two different modalities, then we have a conjunction representing an affinity between these two modalities.

  • If we have a negative scalar product, i.e. more than a 90° angle between the arrows of two different modalities, then we have an opposition between these two modalities. They reject each other.

  • If we have a zero scalar product, i.e. the arrows of two different modalities are perpendicular, then there is no link between these two modalities.

fviz_ca_biplot(CA)

On this graph, the rows are shown in blue and the columns in red. In other words, hair color in blue and eye color in red.

From this analysis, we can see :

  • A conjunction between black, pink, brown or blond hair and blue or brown eyes.

  • A conjunction between yellow eyes and blue hair

  • An opposition between these two groups on axis 1.

In an analysis where the data are not created randomly but come from different individuals from different populations, this could reveal a particular typology of one population in relation to another.

PART 3: Multiple Correspondence Analysis (MCA)

Introduction

MCA is a statistical method adapted to tables of the type “individuals x qualitative variables”. Its eigenvalues correspond to means of squared correlation ratios. It can be used as a pre-processing step before a classification or a co-inertia analysis.

Mathematics

The MCA stands out from other single-table multivariate analyses (PCA, CA, or a mix between MCA and PCA, etc.) because it only takes qualitative variables as input. Its goal is to find the relationships between modalities by visualizing the possible associations and producing a quantitative indicator of their relationships.

The data taken by an MCA is a table comprising \(J\) qualitative variables to describe \(I\) statistical individuals. All individuals have the same weight \(\frac{1}{I}\). The value \(V_{i,j}\) in the table corresponds to the modality taken by individual \(i\) for variable \(j\).

To obtain a complete disjunctive table (CDT) and carry out the statistical analysis, it is necessary to subdivide each variable \(j\) into \(n_{j}\) modalities corresponding to all of the responses \(V_{.,j}\) of the \(I\) individuals for this variable. The table will always have \(I\) rows but \(\sum_{j=1}^J {n_j}\) columns (we will denote K columns), in which each of the \(V_{i,k}\) values is either 1 or 0 depending on the presence or absence of the modality.

The value \(V_{i,k}\) is transformed again: the old value of \(V_{i,k}\) (1 or 0) is divided by the probability \(p_{k}\) of having this modality (i.e. \(p_k=\frac{\sum_{i=1}^I V_{i,k}}{I}\)), and 1 is subtracted from the result to center it:

\[V'_{i,k}=\frac{V_{i,k}}{p_k}-1\]

We define the distance \(D\) between 2 individuals \(i\) and \(i'\) as: \[D=\frac{1}{J}\sum_{k=1}^K \frac{1}{p_{k}}(V'_{i,k}-V'_{i',k})^2\]

So 2 individuals taking all the same modalities are at a distance D = 0; the more similar the responses are over a large number of modalities, the lower the distance D, and conversely. If two individuals share a rare modality, the distance is reduced to take this common specificity into account.

The inertia of the point cloud is \(N = \frac{K}{J} - 1\). \(\eta^2(x,y)\) is the correlation ratio between 2 variables \(x\) and \(y\), and \(\eta^2(F_s,k)\) is the correlation ratio between the modality \(k\) and the axis \(F_s\). The axis \(F_1\) is the one which maximizes its correlation ratio with all the modalities \(k\): \[F_1=\arg\max\left(\sum_{k=1}^K \eta^2(F_s,V_{.,k})\right)\]
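
The construction of the CDT and the centered transformation described above can be sketched in R (the toy data frame df is an assumption; tab.disjonctif() comes from FactoMineR):

library(FactoMineR)
df  <- data.frame(x = factor(c("a", "b", "a", "b")),
                  y = factor(c("u", "u", "v", "v")))
cdt <- as.matrix(tab.disjonctif(df))   # complete disjunctive table (0/1)
p_k <- colMeans(cdt)                   # probability p_k of each modality
Vp  <- sweep(cdt, 2, p_k, "/") - 1     # V'_{i,k} = V_{i,k} / p_k - 1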

Interpretation

This part is a simple example of MCA using R. Let’s have a look at the fictive data set we will be working on:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hippo <- read.table(file = "https://raw.githubusercontent.com/AnsaldiL/MODE_reproduciblescience/master/hp.csv", sep=';', header=TRUE)
hippo=hippo[,-1]
str(hippo)
'data.frame':   143 obs. of  4 variables:
 $ Age      : int  20 21 23 24 23 35 26 22 22 20 ...
 $ Sex      : chr  "F" "M" "F" "F" ...
 $ Frequency: chr  "regularly" "rarely" "regularly" "regularly" ...
 $ Lake     : chr  "G1" "G3" "G1" "G3" ...

The water consumption behaviour of hippopotamus was observed in Penjari National Park, Benin. Drinking frequency was evaluated by a technician and rated “rarely” or “regularly”. Sex is indicated with F for female and M for male. There are 3 lakes in the park, noted “G1”, “G2” and “G3”.

We perform the MCA with the package FactoMineR, as sketched below.
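
A plausible call (an assumption: Age, which is quantitative, is declared as a supplementary variable, so that only Sex, Frequency and Lake are active, which yields the 4 dimensions seen below):

library(FactoMineR)
library(factoextra)
# MCA needs factors as active variables; convert the character columns first
hippo <- hippo %>% mutate(across(where(is.character), as.factor))
# Age (column 1) is quantitative, hence declared supplementary (assumption)
res.mca <- MCA(hippo, quanti.sup = 1, graph = FALSE)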

Let’s have a look at the eigenvalues:

res.mca$eig
      eigenvalue percentage of variance cumulative percentage of variance
dim 1  0.4348479               32.61359                          32.61359
dim 2  0.3391468               25.43601                          58.04960
dim 3  0.3304681               24.78511                          82.83470
dim 4  0.2288706               17.16530                         100.00000
fviz_eig(res.mca, addlabels = TRUE)

The total variance of the data set is divided between four dimensions. The first dimension concentrates 32.6% of the total variance. The first plane (dimensions 1 and 2) gathers 58% of the variance. You can better visualize the repartition of variance with the plot.

Here, you can see the results for the individuals, their coordinates (\(coord\)), absolute contributions (\(contrib\)), and the quality of their projection (\(cos2\)).

head(res.mca$ind$coord)
       Dim 1      Dim 2        Dim 3      Dim 4
1 -0.2750115 -0.3282379  0.953675653  0.2177425
2  1.1910863 -0.2970654 -0.817927621  0.0463482
3 -0.2750115 -0.3282379  0.953675653  0.2177425
4 -0.5234147 -0.6299231 -0.447176405  0.1119000
5 -0.2750115 -0.3282379  0.953675653  0.2177425
6 -0.6770613  0.6031855  0.005825436 -0.2073689
head(res.mca$ind$cos2)
       Dim 1      Dim 2        Dim 3        Dim 4
1 0.06632695 0.09448565 7.976083e-01 0.0415790594
2 0.65134489 0.04051620 3.071527e-01 0.0009862576
3 0.06632695 0.09448565 7.976083e-01 0.0415790594
4 0.31017449 0.44925117 2.263977e-01 0.0141766670
5 0.06632695 0.09448565 7.976083e-01 0.0415790594
6 0.52978429 0.42047948 3.921931e-05 0.0496970050
head(res.mca$ind$contrib)
      Dim 1     Dim 2        Dim 3       Dim 4
1 0.1216265 0.2221539 1.924579e+00 0.144863948
2 2.2814613 0.1819619 1.415677e+00 0.006563565
3 0.1216265 0.2221539 1.924579e+00 0.144863948
4 0.4405736 0.8181846 4.231479e-01 0.038259024
5 0.1216265 0.2221539 1.924579e+00 0.144863948
6 0.7371954 0.7502017 7.181106e-05 0.131389632

Coordinates of the individuals are their positions on the first plane. \(cos2\) represents the quality of the representation of the individuals on the first plane. Contribution is how much each point contributes to the creation of the different axes.

You can access the same information for the variable (instead of the individuals):

res.mca$var$coord
                Dim 1       Dim 2       Dim 3      Dim 4
F         -0.47746388 -0.21701783  0.15112124 -0.3434173
M          1.14818694  0.52187620 -0.36341060  0.8258369
rarely     1.23505546 -0.11004338 -0.08731625 -0.8834510
regularly -0.53107385  0.04731865  0.03754599  0.3798839
G1         0.46448522 -0.40376182  1.45603286  0.2760402
G2        -0.33088584  1.22351803 -0.17862073 -0.3340851
G3        -0.02692834 -0.93083264 -0.95986346  0.1241336
res.mca$var$contrib
                Dim 1      Dim 2       Dim 3      Dim 4
F         12.34263168  3.2693937  1.62699252 12.1316114
M         29.68109047  7.8621135  3.91252964 29.1736370
rarely    35.15982729  0.3578914  0.23124469 34.1811065
regularly 15.11872574  0.1538933  0.09943522 14.6978758
G1         4.62603750  4.4819347 59.81574581  3.1042541
G2         3.05186316 53.5031214  1.17025431  5.9111247
G3         0.01982416 30.3716520 33.14379781  0.8003904
res.mca$var$cos2
                 Dim 1       Dim 2       Dim 3       Dim 4
F         0.5482177869 0.113256440 0.054919059 0.283606714
M         0.5482177869 0.113256440 0.054919059 0.283606714
rarely    0.6559056566 0.005207104 0.003278375 0.335608865
regularly 0.6559056566 0.005207104 0.003278375 0.335608865
G1        0.0837850567 0.063310139 0.823313280 0.029591524
G2        0.0625631068 0.855426501 0.018231638 0.063778755
G3        0.0004019773 0.480314347 0.510741640 0.008542035

Let’s have a look at the plot of this MCA analysis:

plot(res.mca)

As this is a fictive data set, some individuals are overlapping. We will not focus on this artefact of the data set.

How to interpret these graphs?

General description: The individuals are in black. The variables are in red. The percentage of variance of the first two axes is written on them. Individuals in the center of the cloud take an average value for all of their characteristics, unlike individuals far from the middle, which are specific individuals, very different from the others. Close individuals present close characteristics.

Interpretation: Here, we can see that the first axis separates individuals drinking regularly from individuals drinking rarely. The second axis separates individuals drinking at G2 from individuals drinking at G3. Males and females also seem to be separated by the first axis. As we can gather individuals sharing close properties, females seem to drink regularly and males more rarely. It is a bit less clear, but males seem to drink preferentially in lake G1.

Note that supplementary quantitative data can be added to the analysis to help interpretation. As they are quantitative, they do not participate in the creation of the axes, but they may be plotted on the final graph.

PART 4: Canonical Correspondence Analysis (CCA)

Introduction

Two-table analysis is used when the scientist has two data sets and expects the second to explain the first. The first one is called \(Y\), the response table, and the second one is called \(X\), the explanatory table. For example:

Table Y | Table X
a data set with temperature data | a data set with city characteristics such as height of buildings, concreted area, number of cars, …
a data set with the quantity of chemicals in soil | a data set with the % of pesticide applied
a data set with functional traits of an animal | a data set with the quantities of different food given to the animal

RDA is used when expecting a linear response of \(Y\) to \(X\). Only the variables of \(X\) explaining \(Y\) are kept. The canonical axes are linear combinations of the explanatory variables (\(X\)). The \(Y\) table is first ordinated with a PCA. Then, a multiple regression of \(Y\) on \(X\) is done and stored in a new table, \(Y'\). The residual matrix corresponds to \(Y-Y'\), i.e., what was not explained by \(X\).
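
For reference, such an RDA can be run with the vegan package; a hedged sketch with assumed response and explanatory tables Y_resp and X_expl:

library(vegan)
# RDA: a PCA of Y_resp constrained by (regressed on) the variables of X_expl
rda_fit <- rda(Y_resp ~ ., data = X_expl)
summary(rda_fit)   # constrained vs. unconstrained (residual) variance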

Interpretation

Here are the libraries you will need to perform the analysis; make sure you import them at the beginning of your working document.
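
Based on the functions used below, a plausible set is:

library(ade4)      # dudi.coa(), pcaiv(), randtest(), inertia.dudi(), s.corcircle()
library(vegan)     # cca(), vif.cca(), decostand(), ordistep()
library(tidyverse) # the %>% pipe and select() used to prepare the tables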

Let’s import the data; they are from @thilenius1963synecology. However, the abundance data have been changed to facilitate the interpretation. For access to the real data and metadata, please see the JDBakker Git page: appliedmultivariatestatistics/Oak_data_47x216.csv at main · jon-bakker/appliedmultivariatestatistics (github.com)

Oak <- read.csv("https://raw.githubusercontent.com/AnsaldiL/MODE_reproduciblescience/master/Oak_data.csv", sep=';', header = TRUE) 

Topo <- Oak %>%
select(LatAppx, LongAppx, Elev.m, Slope.deg) %>%
decostand("range")

Oak_sp=Oak[,28:83]

We can explore the tables

head(Oak_sp)
  Abgr.s Abgr.t Acar Acgld.t Acma.s Acma.t Acmi Adbi Agha Agre Agse Agte Aica
1      0      0    0       0      0      0    0    0    0    0    0    1    1
2      5      0    0       0      1      0    0    0    0    0    0    0    0
3      4      0    2       0      1      0    0    0    0    0    0    0    0
4      0      0    0       0      0      0    0    0    0    0    0    0    1
5      0      0    0       0      0      0    0    0    0    0    0    0    0
6      0      0    0       1      0      0    0    0    0    0    0    0    0
  ALL Alpr Amal.s Amal.t Apan Aqfo Arel Arme.s Arme.t Avfa Beaq.s Brco Brla
1   0    2      0      0    0    0    0      0      0    0      0    0    1
2   0    0      1      0    0    0    0      0      0    0      0    0    0
3   0    0      0      0    0    0    0      0      0    0      0    0    0
4   0    0      0      0    0    2    0      0      0    0      0    0    0
5   0    0      0      0    0    0    1      0      0    0      0    0    0
6   0    0      0      0    0    1    0      0      0    0      0    0    0
  Brpu Brri Brse Brst Brvu Caqu CAR Cato Cear Ceum Ceve.s Cipa Civu Coco.s
1    1    0    0    0    0    1   0    0    0    0      0    0    0      0
2    0    0    0    0    0    1   0    0    0    0      0    0    0      0
3    0    0    0    0    0    0   0    0    0    0      0    0    0      0
4    1    0    0    0    0    3   0    0    0    1      1    0    0      0
5    0    0    0    0    0    1   0    0    0    0      1    0    0      0
6    0    0    0    0    0    0   0    0    0    0      1    0    0      0
  Coco.t Cogr Conu.s Conu.t CORY.t Cost Crca Crdo.s Crdo.t Crox.s Cyec Cyfo
1      0    0      0      0      0    0    0      0      0      0    0    0
2      0    0      0      0      0    0    0      0      0      0    0    0
3      0    0      0      0      0    0    0      0      0      0    0    0
4      0    0      0      0      0    0    0      0      0      0    0    0
5      0    0      0      0      0    0    0      0      0      0    0    0
6      0    0      0      0      0    0    0      0      0      0    0    0
  Cygr Daca Dacar Dagl
1    0    0     0    0
2    0    0     0    0
3    0    0     0    0
4    0    0     0    0
5    0    0     0    0
6    0    0     0    0
head(Topo)
    LatAppx  LongAppx     Elev.m  Slope.deg
1 0.3882609 1.0000000 0.06578947 0.03448276
2 0.5073864 0.6239120 0.13157895 0.24137931
3 0.5750381 0.7687530 0.33333333 0.20689655
4 0.7392188 0.6721803 0.06578947 0.17241379
5 0.8261238 0.9377551 0.33333333 0.34482759
6 0.3309702 0.9046193 0.06578947 0.17241379

So here, the first table (“Oak_sp”) is a table of plant abundances for 47 sites (47 rows), with 56 different species. The second table (“Topo”) corresponds to 4 topographic variables (latitude and longitude of the sites, elevation and slope). As the variables are not in the same units, we have standardized these four variables to the range [0, 1] (decostand() with the “range” method).

As explained above, the CCA objective is to explain the plant abundances by the topographic variables. Here, we have a priori hypotheses on the effect of these environmental variables on the plants.

We have seen in the introduction of this part that CCA is used when the response is not linear but rather follows a unimodal (Gaussian) relationship.

So we begin by performing a CA on the species table:

caoak <- dudi.coa(Oak_sp,scannf=F,nf=2)

Then, we can perform the PCA with instrumental variables on this CA, which corresponds to the CCA, with the ade4 package:

ccaoak <- pcaiv(caoak,Topo,scannf=F,nf=2)

You can also realize the CCA with a single line of code, with the vegan package:

ccavegan=cca(Oak_sp,Topo,scan=F)

It is important to use the two methods, because each gives us different information when analyzing the results.

And we have already completed our analysis; it was not difficult! But now, we need to analyse the results and test their significance.

To test significance, we use a permutation test from the ade4 package. It is important to test whether the result of the CCA is not only due to a random combination of values, but reflects a real correlation between our different tables. To do this, we perform a permutation test with the function randtest().

randtest(ccaoak)#with ADE4
Monte-Carlo test
Call: randtest.pcaiv(xtest = ccaoak)

Observation: 0.1270957 

Based on 99 replicates
Simulated p-value: 0.01 
Alternative hypothesis: greater 

     Std.Obs  Expectation     Variance 
2.8647335686 0.0880325650 0.0001859366 
plot(randtest(ccaoak))

The outputs above correspond to the permutation test. We see that the number of permutations of columns and rows was 99 (default value).

The results of this test show that our result is significantly different from the permutation results at a threshold of 5% (p-value = 0.01).

Finally, the observed CCA statistic is plotted against the null distribution obtained by permutation. So we can conclude that our environmental variables explain a part of the plant distribution.

We now have to look at the collinearity between the environmental variables, i.e., the correlation between them. For that, we use the Variance Inflation Factor (VIF).

test=vif.cca(ccavegan) 
test
  LatAppx  LongAppx    Elev.m Slope.deg 
 1.072988  1.115539  1.169013  1.111402 

If all the variables have a VIF lower than 10, you can conclude that you do not have collinearity. If you do, you need to select some variables to remove, for example with the ordistep() function, as sketched below.
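
A hedged sketch of such a selection with vegan's ordistep() (not needed here, since all the VIFs are below 10):

mod0 <- cca(Oak_sp ~ 1, data = Topo)            # null model with no constraint
mod1 <- cca(Oak_sp ~ ., data = Topo)            # full model with all variables
sel  <- ordistep(mod0, scope = formula(mod1))   # stepwise selection by permutation tests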

Now that we have tested all the conditions, we can look at the results.

ccaoak
Canonical correspondence analysis
call: pcaiv(dudi = caoak, df = Topo, scannf = F, nf = 2)
class: caiv pcaiv dudi 

$rank (rank)     : 4
$nf (axis saved) : 2

eigen values: 0.3446 0.179 0.1231 0.05905

 vector length mode    content                
 $eig   4      numeric eigen values           
 $lw    47     numeric row weigths (from dudi)
 $cw    56     numeric col weigths (from dudi)

 data.frame nrow ncol content                             
 $Y         47   56   Dependant variables                 
 $X         47   4    Explanatory variables               
 $tab       47   56   modified array (projected variables)

 data.frame nrow ncol content                               
 $c1        56   2    PPA Pseudo Principal Axes             
 $as        2    2    Principal axis of dudi$tab on PAP     
 $ls        47   2    projection of lines of dudi$tab on PPA
 $li        47   2    $ls predicted by X                    

 data.frame nrow ncol content                                  
 $fa        5    2    Loadings (CPC as linear combinations of X
 $l1        47   2    CPC Constraint Principal Components      
 $co        56   2    inner product CPC - Y                    
 $cor       4    2    correlation CPC - X                      
print(paste("The pourcentage of variance explained by topographic variables is: ",sum(ccaoak$eig)/sum(caoak$eig),"%"))
[1] "The pourcentage of variance explained by topographic variables is:  0.127095660396168 %"

The first output (by ade4) allows us to see each part of the analysis that can be used to describe the results; we will especially use the last part afterwards. At the beginning, you also have the rank, which corresponds to the number of eigenvalues, with their values shown below it. We can then calculate the percentage of variance explained by the topographic variables.

ccavegan
Call: cca(X = Oak_sp, Y = Topo, scan = F)

              Inertia Proportion Rank
Total          5.5529     1.0000     
Constrained    0.7057     0.1271    4
Unconstrained  4.8471     0.8729   38
Inertia is scaled Chi-square 
17 species (variables) deleted due to missingness

Eigenvalues for constrained axes:
  CCA1   CCA2   CCA3   CCA4 
0.3446 0.1790 0.1231 0.0590 

Eigenvalues for unconstrained axes:
   CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
0.7256 0.5739 0.4521 0.4240 0.3563 0.2448 0.2239 0.1857 
(Showing 8 of 38 unconstrained eigenvalues)

Here is the output with the vegan package. We can see the total inertia of the CCA analysis (5.55), the inertia explained by the topographic variables (0.71) and by the residuals (4.85). We can therefore conclude that our topographic variables explain only about 13% of the variance in plant abundances (the same result as before). Below that, we have the eigenvalues for the constrained axes (the part of the variance explained by the topographic variables) and for the unconstrained axes (the part explained by the residuals).

We can now observe the complete plot of the CCA results:

plot(ccaoak)

Here we have the complete plot produced by the ade4 package.

  • The “Loadings” part represents the canonical coefficients. For each axis, the arrows explain the relative weight of the topographic variables in the multiple regression calculation.

  • The “Correlation” part shows the correlations of the topographic variables with each other and with the CCA axes.

  • The “Scores and Predictions” part allows us to understand, for each site, the real plant abundance (base of arrows) and the abundance predicted by multiple regressions with the topographic variables.

  • The “Species” part corresponds to the coordinates of each plant species in the CCA analysis.

But this can also be decomposed into different plots and outputs. We can detail each part and observe the biological results.

We can begin with the absolute contributions of the species:

contrib=inertia.dudi(ccaoak, col = TRUE, row = FALSE)
contrib$col.abs
               Axis1        Axis2
Abgr.s  6.021847e+00  1.212813788
Abgr.t  3.177359e+01 10.422451480
Acar    2.635049e+00  0.134855688
Acgld.t 7.493068e+00  1.910539707
Acma.s  2.515315e-02  1.689674506
Acma.t  8.936196e-01  1.898419377
Acmi    3.919403e-02  0.004891136
Adbi    9.970574e+00  5.143409094
Agha    8.209713e-01 17.598302739
Agre    8.571153e-01  1.890912149
Agse    1.299772e-01  8.924213094
Agte    4.233778e+00  1.403202784
Aica    3.007289e+00  0.000614369
ALL     1.076835e-01  6.374919825
Alpr    1.413603e-01  0.237212759
Amal.s  5.831519e-02  0.289243172
Amal.t  2.843090e-02  2.844815278
Apan    6.791621e+00  1.385883395
Aqfo    6.407529e+00 11.512960665
Arel    3.725579e+00  2.142662004
Arme.s  0.000000e+00  0.000000000
Arme.t  1.002592e-05  2.416236998
Avfa    1.389095e-01  1.919937151
Beaq.s  1.385684e+00  4.664898207
Brco    0.000000e+00  0.000000000
Brla    1.570857e+00  0.757779956
Brpu    4.750564e-01  0.266026605
Brri    8.163005e-01  1.287295464
Brse    1.292897e-01  0.143416542
Brst    0.000000e+00  0.000000000
Brvu    4.429482e+00  0.016067778
Caqu    1.841713e-02  1.290440906
CAR     6.964536e-01  0.036573989
Cato    1.206790e+00  0.522162462
Cear    9.936004e-01  0.303130230
Ceum    8.493802e-01  0.392276059
Ceve.s  1.019629e+00  2.122835911
Cipa    1.182233e-04  1.942686629
Civu    0.000000e+00  0.000000000
Coco.s  0.000000e+00  0.000000000
Coco.t  9.551865e-01  0.003571735
Cogr    5.754591e-02  1.919957503
Conu.s  0.000000e+00  0.000000000
Conu.t  0.000000e+00  0.000000000
CORY.t  0.000000e+00  0.000000000
Cost    0.000000e+00  0.000000000
Crca    0.000000e+00  0.000000000
Crdo.s  0.000000e+00  0.000000000
Crdo.t  5.203645e-02  0.143607303
Crox.s  0.000000e+00  0.000000000
Cyec    0.000000e+00  0.000000000
Cyfo    0.000000e+00  0.000000000
Cygr    4.350974e-02  2.829101566
Daca    0.000000e+00  0.000000000
Dacar   0.000000e+00  0.000000000
Dagl    0.000000e+00  0.000000000

This output gives the absolute contribution of each species to axis 1 and axis 2, which lets us identify the species that contribute the most to each axis.
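
To find the top contributors without scanning the whole table, the table can simply be sorted. A quick helper, assuming the contrib object computed above:

head(contrib$col.abs[order(contrib$col.abs$Axis1, decreasing = TRUE), ], 5)  # top 5 contributors to axis 1
head(contrib$col.abs[order(contrib$col.abs$Axis2, decreasing = TRUE), ], 5)  # top 5 contributors to axis 2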

ccaoak$fa
                      RS1           RS2
(Intercept) -1.448845e-16  1.304775e-16
LatAppx      5.546487e-01 -1.321836e-01
LongAppx     8.933635e-02  1.113464e-01
Elev.m      -2.378758e-02 -2.828042e-01
Slope.deg    2.306181e-02  3.754766e-01
s.corcircle(ccaoak$fa)

These lines give the canonical coefficients on each axis. The circle corresponds to the “Loadings” part of the main CCA plot above. On axis 1, the variable contributing most of the predictive power is \(LatAppx\). On axis 2, \(Slope.deg\) and \(Elev.m\) contribute the most.

ccaoak$co
               Comp1        Comp2
Abgr.s  -0.344559610 -0.111446061
Abgr.t  -1.499805193  0.619091450
Acar    -0.405170072 -0.066061105
Acgld.t -0.657447384 -0.239263999
Acma.s   0.098964466 -0.584591822
Acma.t  -0.527599532  0.554232875
Acmi     0.093384329 -0.023775904
Adbi    -1.970348894 -1.019944979
Agha    -0.376925736  1.257752946
Agre    -0.408495885 -0.437292359
Agse     0.318149791  1.899989382
Agte     0.560360412  0.232504722
Aica     0.510110670  0.005254841
ALL      0.204765928  1.135503753
Alpr     0.135452232  0.126461872
Amal.s  -0.213102669  0.342056435
Amal.t  -0.210430331 -1.517079157
Apan     0.566164741  0.184326417
Aqfo     0.455971525 -0.440508302
Arel     0.525654360  0.287308446
Arme.s   0.000000000  0.000000000
Arme.t   0.002281469 -0.807217082
Avfa     0.328900132  0.881271477
Beaq.s   0.555259619  0.734265751
Brco     0.000000000  0.000000000
Brla     0.451534247  0.226027947
Brpu     0.384680895  0.207471640
Brri     0.797302436  0.721614627
Brse     0.259080296 -0.196661887
Brst     0.000000000  0.000000000
Brvu     0.656643322  0.028503562
Caqu    -0.043729888 -0.263818130
CAR     -0.520749860  0.086007708
Cato     0.969424379  0.459588372
Cear     0.470186239 -0.187174556
Ceum     0.575087756 -0.281674227
Ceve.s  -0.476305048 -0.495325172
Cipa    -0.009595104 -0.886477227
Civu     0.000000000  0.000000000
Coco.s   0.000000000  0.000000000
Coco.t   0.862466323 -0.038010697
Cogr     0.172846131 -0.719558962
Conu.s   0.000000000  0.000000000
Conu.t   0.000000000  0.000000000
CORY.t   0.000000000  0.000000000
Cost     0.000000000  0.000000000
Crca     0.000000000  0.000000000
Crdo.s   0.000000000  0.000000000
Crdo.t   0.201303763 -0.241020771
Crox.s   0.000000000  0.000000000
Cyec     0.000000000  0.000000000
Cyfo     0.000000000  0.000000000
Cygr     0.184073508 -1.069770155
Daca     0.000000000  0.000000000
Dacar    0.000000000  0.000000000
Dagl     0.000000000  0.000000000
s.label(ccaoak$co,boxes=F)

\(co\) gives access to the coordinates of each species on the first factorial plane. The table lists the coordinates, and the plot helps you spot species that lie close together. On the first axis, two species have strongly negative values: Adbi (-1.97) and Abgr.t (-1.50). They are opposed to species such as Coco.t (0.86) and Cato (0.97). The same reading applies to the second axis.
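
The extremes of an axis can also be extracted programmatically rather than read off the plot. A minimal sketch, assuming ccaoak$co as above:

ord <- order(ccaoak$co$Comp1)
head(ccaoak$co[ord, ], 2)  # most negative on axis 1 (Adbi, Abgr.t)
tail(ccaoak$co[ord, ], 2)  # most positive on axis 1 (Coco.t, Cato)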

Here we can highlight a group formed by Agse (coord = 0.32; 1.90 on axes 1 and 2 respectively), Agha (coord = -0.38; 1.26), Avfa (coord = 0.33; 0.88) and ALL (coord = 0.20; 1.14). Another group contains two outliers, Abgr.t and Adbi. These groups will be related to the environmental data later in the analysis.

ccaoak$cor 
                 RS1         RS2
LatAppx    0.9867311 -0.09977934
LongAppx   0.3726389  0.33053940
Elev.m    -0.1505108 -0.45170572
Slope.deg  0.1235033  0.65338800
s.corcircle(ccaoak$cor)

Here we have the correlations between the CCA axes and the topographic variables. Elevation (corr = -0.15 on axis 1; -0.45 on axis 2) and slope (corr = 0.12; 0.65) point in opposite directions on the correlation circle, so they are strongly negatively correlated with each other. Longitude appears positively correlated with slope and negatively correlated with elevation.
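
These readings of the correlation circle can be cross-checked against the raw pairwise correlations of the explanatory table. A quick check, assuming the Topo table passed to cca() above:

round(cor(Topo), 2)  # pairwise correlations between the topographic variables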

s.match(ccaoak$ls, ccaoak$li)

plot(ccavegan,scaling=1) 

plot(ccavegan,scaling=2) 

The first plot, with the arrows, shows the observed abundances for each site (base of the arrow) and the abundances predicted by the multiple regression on the topographic variables (tip of the arrow). We can see some large differences between observation and prediction. Most sites form a central group, but some sites stand apart; this is the case for sites n°7, 40, 41 and 42. Another group comprises sites n°43, 44, 45 and 46.

The second plot, with scaling = 1, preserves the Euclidean distances between stations.
The third plot, with scaling = 2, preserves the correlations between species.
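
The scores underlying these two plots can also be extracted directly with vegan's scores() function. A sketch, assuming the ccavegan object above:

head(scores(ccavegan, display = "sites", scaling = 1))    # site scores, distance-preserving
head(scores(ccavegan, display = "species", scaling = 2))  # species scores, correlation-preserving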

To conclude on the CCA, we have seen that the topographic variables explain 13% of the variance in plant abundances. The first axis represents about 49% of the constrained inertia (eigenvalue 0.34) and is explained by latitude (corr = 0.99); two species have high coordinates on this axis (Adbi and Abgr.t).
The second axis (about 25% of the constrained inertia, eigenvalue 0.18) is explained by elevation (corr = -0.45), slope (corr = 0.65) and longitude (corr = 0.33).

Finally, sites n°43, 44, 45 and 46 have small latitude values and a high abundance of species such as Abgr.t.

PART 5: RLQ

Introduction

The RLQ analysis is a generalized co-inertia analysis of three tables: environmental variables for each site (R), species abundances for each site (L) and trait values for each species (Q). We first perform a CA on table L and weighted analyses (MCA or PCA) on tables R and Q. The rlq() function then performs the co-inertia analyses between R and L and between L and Q, and finally combines the results of these co-inertias.

Interpretation

RLQ analysis is important in ecology because it integrates species traits with environmental variables. So here, unlike in the RDA part where we had 2 tables (environment and species), we have 3 tables:

  • environmental variables by sites (R)

  • abundance of species by sites (L)

  • trait values by species (Q)

To perform the RLQ, we decompose it into three types of analyses already covered in this chapter:

  • we will use a CA on the species abundance table;

  • we will use an MCA on the environmental table, weighted by the site (row) weights from the CA;

  • we will use an MCA on the trait table, weighted by the species (column) weights from the CA.

Here we use MCA for R and Q because our variables are factors, but you can perform a PCA if your variables are quantitative. Warning: for R and Q, you must use the weights derived from the CA of table L (see below).
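
This is exactly the pipeline applied step by step below. As a preview, a minimal sketch with three generic tables R (sites x environment, factors), L (sites x species abundances) and Q (species x traits, factors):

library(ade4)
coaL <- dudi.coa(L, scannf = FALSE)                   # CA on the species table
acmR <- dudi.acm(R, row.w = coaL$lw, scannf = FALSE)  # MCA on R, weighted by the CA row weights
acmQ <- dudi.acm(Q, row.w = coaL$cw, scannf = FALSE)  # MCA on Q, weighted by the CA column weights
res  <- rlq(acmR, coaL, acmQ, scannf = FALSE)         # RLQ combining the three analyses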

You can load the R packages using the code

library(tidyverse)
library(ade4)
library(vegan)
library(ggplot2)
library(factoextra)
library(corrplot)
corrplot 0.92 loaded
library(RVAideMemoire)
*** Package RVAideMemoire v 0.9-83-7 ***
library(PerformanceAnalytics)
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################

Attaching package: 'xts'
The following objects are masked from 'package:dplyr':

    first, last

Attaching package: 'PerformanceAnalytics'
The following object is masked from 'package:graphics':

    legend

Here we work with a dataset from the ade4 package. The data come from @dray2008testing and @legendre1997relating.

#import the dataset
data(aviurba)

#create the three tables
summary(aviurba$mil)    #(R)
 farms    small.bui high.bui industry fields   grassland scrubby  deciduous
 yes:22   yes:13    yes: 8   yes:12   yes:22   yes:12    yes:30   yes:36   
 no :29   no :38    no :43   no :39   no :29   no :39    no :21   no :15   
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
 conifer  noisy      veg.cover
 yes: 8   yes:10   R37    :9  
 no :43   no :41   R5     :8  
                   R98    :7  
                   R93    :7  
                   R22    :6  
                   R87    :5  
                   (Other):9  
R<-aviurba$mil

summary(aviurba$fau)    #(L)
      Sp1               Sp2              Sp3               Sp4        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
 Mean   :0.05882   Mean   :0.1373   Mean   :0.07843   Mean   :0.7647  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :2.0000   Max.   :1.00000   Max.   :4.0000  
      Sp5              Sp6              Sp7              Sp8       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :3.000  
 Mean   :0.1176   Mean   :0.2353   Mean   :0.1569   Mean   :2.078  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:3.000  
 Max.   :2.0000   Max.   :1.0000   Max.   :1.0000   Max.   :4.000  
      Sp9              Sp10             Sp11             Sp12        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.5882   Mean   :0.6275   Mean   :0.3922   Mean   :0.03922  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :3.0000   Max.   :4.0000   Max.   :4.0000   Max.   :1.00000  
      Sp13              Sp14              Sp15             Sp16       
 Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000  
 Mean   :0.05882   Mean   :0.03922   Mean   :0.1961   Mean   :0.3725  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.00000   Max.   :2.0000   Max.   :1.0000  
      Sp17              Sp18             Sp19             Sp20       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :1.0000   Median :0.0000  
 Mean   :0.07843   Mean   :0.4118   Mean   :0.7059   Mean   :0.2157  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.00000   Max.   :2.0000   Max.   :2.0000   Max.   :2.0000  
      Sp21              Sp22             Sp23              Sp24        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.03922   Mean   :0.3725   Mean   :0.07843   Mean   :0.09804  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :2.0000   Max.   :1.00000   Max.   :1.00000  
      Sp25              Sp26             Sp27             Sp28        
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.00000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.03922   Mean   :0.1961   Mean   :0.2353   Mean   :0.03922  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :2.0000   Max.   :1.00000  
      Sp29           Sp30             Sp31             Sp32       
 Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :3.00   Median :0.0000   Median :0.0000   Median :1.0000  
 Mean   :2.49   Mean   :0.6667   Mean   :0.2353   Mean   :0.8627  
 3rd Qu.:3.00   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :4.00   Max.   :4.0000   Max.   :2.0000   Max.   :3.0000  
      Sp33             Sp34              Sp35             Sp36      
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.00000   Median :1.0000   Median :0.000  
 Mean   :0.2941   Mean   :0.07843   Mean   :0.5686   Mean   :0.451  
 3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :1.0000   Max.   :2.00000   Max.   :3.0000   Max.   :3.000  
      Sp37              Sp38             Sp39              Sp40       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
 Mean   :0.05882   Mean   :0.1765   Mean   :0.03922   Mean   :0.4118  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :2.0000   Max.   :1.00000   Max.   :3.0000  
L<-aviurba$fau

summary(aviurba$traits) #(Q)
   feed.hab    feed.strat     breeding     migratory 
 insect:19   ground :27   ground  : 6   resident:23  
 grani :12   aerial : 3   building:14   migrant :17  
 omni  : 9   foliage:10   scrub   :12                
                          foliage : 8                
Q<-aviurba$traits

and explore the tables

head(R)  
   farms small.bui high.bui industry fields grassland scrubby deciduous conifer
R1   yes        no       no       no    yes        no      no       yes      no
R2   yes        no       no       no     no       yes     yes        no      no
R3    no        no       no       no    yes        no     yes       yes      no
R4    no        no       no       no    yes        no     yes        no      no
R5    no        no       no      yes     no        no     yes        no      no
R6    no        no       no      yes     no       yes     yes        no      no
   noisy veg.cover
R1    no       R98
R2    no       R87
R3    no      R100
R4    no      R100
R5    no        R5
R6   yes        R5
head(L)
   Sp1 Sp2 Sp3 Sp4 Sp5 Sp6 Sp7 Sp8 Sp9 Sp10 Sp11 Sp12 Sp13 Sp14 Sp15 Sp16 Sp17
R1   0   0   0   0   0   0   1   0   0    0    0    0    0    0    0    1    0
R2   0   0   0   0   0   0   1   0   0    0    0    0    1    0    0    0    0
R3   0   2   0   0   0   0   1   0   1    0    0    0    0    1    0    0    0
R4   0   0   0   0   0   0   1   0   1    0    0    0    0    0    1    0    0
R5   0   0   0   0   0   0   0   0   2    0    0    0    0    0    1    1    0
R6   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    1    0
   Sp18 Sp19 Sp20 Sp21 Sp22 Sp23 Sp24 Sp25 Sp26 Sp27 Sp28 Sp29 Sp30 Sp31 Sp32
R1    1    1    0    0    1    0    0    0    1    0    0    4    0    0    1
R2    2    1    0    0    1    0    1    0    1    0    0    4    4    1    1
R3    2    0    0    0    1    0    0    0    1    1    0    1    0    1    0
R4    2    0    0    0    1    0    1    0    0    2    0    1    0    1    1
R5    1    0    0    0    1    0    0    0    1    1    0    1    1    1    0
R6    0    0    0    0    0    0    0    0    0    0    0    4    1    0    1
   Sp33 Sp34 Sp35 Sp36 Sp37 Sp38 Sp39 Sp40
R1    1    0    1    0    0    0    0    1
R2    1    0    0    0    0    0    0    0
R3    1    0    0    0    0    0    0    1
R4    1    0    0    0    0    1    0    0
R5    0    0    0    0    0    0    0    0
R6    1    2    1    2    0    0    0    0
head(Q)
    feed.hab feed.strat breeding migratory
Sp1     omni     ground  foliage   migrant
Sp2     omni     ground building   migrant
Sp3     omni     ground   ground   migrant
Sp4    grani     ground building  resident
Sp5    grani     ground    scrub  resident
Sp6    grani     ground  foliage  resident

The first step is to perform the CA on the species table:

afcL <- dudi.coa(log(L+1), scannf = FALSE) 
afcL  
Duality diagramm
class: coa dudi
$call: dudi.coa(df = log(L + 1), scannf = FALSE)

$nf: 2 axis-components saved
$rank: 39
eigen values: 0.3987 0.2736 0.1917 0.1624 0.1421 ...
  vector length mode    content       
1 $cw    40     numeric column weights
2 $lw    51     numeric row weights   
3 $eig   39     numeric eigen values  

  data.frame nrow ncol content             
1 $tab       51   40   modified array      
2 $li        51   2    row coordinates     
3 $l1        51   2    row normed scores   
4 $co        40   2    column coordinates  
5 $c1        40   2    column normed scores
other elements: N 

The first CA is done. We use a log transformation because species abundances span a large range, and we add 1 to avoid log(0) for absent species. Whether to transform or not must be adapted to your data.
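
A quick way to judge whether the transformation is needed is to compare the spread of the raw and transformed values. A sketch, assuming the L table above:

range(L)           # range of the raw abundances
range(log(L + 1))  # compressed range after the log(x + 1) transformation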

Now we can perform the two MCAs, on the environmental table and on the trait table:

acmR <- dudi.acm(R, row.w = afcL$lw, scannf = FALSE,nf = 4)
scatter(acmR)

acmQ <- dudi.acm(Q, row.w = afcL$cw, scannf = FALSE,nf = 4)
scatter(acmQ)

The scatterplots show the ordination of each table and the distribution of the factor levels on the axes of each MCA.

Now we can run the RLQ analysis, which builds the two co-inertias (R-L and L-Q), then assembles and compares them. We use the rlq() function for that:

rlq <- rlq(acmR, afcL, acmQ, scannf = FALSE)
rlq
RLQ analysis
call: rlq(dudiR = acmR, dudiL = afcL, dudiQ = acmQ, scannf = FALSE)
class: rlq dudi 

$rank (rank)     : 8
$nf (axis saved) : 2

eigen values: 0.01068 0.003128 0.001525 0.0004536 0.0001298 ...

  vector length mode    content                    
1 $eig   8      numeric Eigenvalues                
2 $lw    28     numeric Row weigths (for acmR cols)
3 $cw    12     numeric Col weigths (for acmQ cols)

   data.frame nrow ncol content                                      
1  $tab       28   12   Crossed Table (CT): cols(acmR) x cols(acmQ)  
2  $li        28   2    CT row scores (cols of acmR)                 
3  $l1        28   2    Principal components (loadings for acmR cols)
4  $co        12   2    CT col scores (cols of acmQ)                 
5  $c1        12   2    Principal axes (loadings for acmQ cols)      
6  $lR        51   2    Row scores (rows of acmR)                    
7  $mR        51   2    Normed row scores (rows of acmR)             
8  $lQ        40   2    Row scores (rows of acmQ)                    
9  $mQ        40   2    Normed row scores (rows of acmQ)             
10 $aR        4    2    Corr acmR axes / rlq axes                    
11 $aQ        4    2    Corr afcL axes / coinertia axes              
axe=c(1:8)
print(paste("The contribution of axe n°",axe, "are", rlq$eig/(sum(rlq$eig))*100,"%"))
[1] "The contribution of axe n° 1 are 66.6748744763001 %"  
[2] "The contribution of axe n° 2 are 19.5228230896623 %"  
[3] "The contribution of axe n° 3 are 9.51702542395437 %"  
[4] "The contribution of axe n° 4 are 2.83090411660832 %"  
[5] "The contribution of axe n° 5 are 0.810200444830495 %" 
[6] "The contribution of axe n° 6 are 0.421072849575587 %" 
[7] "The contribution of axe n° 7 are 0.124897113199049 %" 
[8] "The contribution of axe n° 8 are 0.0982024858697611 %"
#randtest(rlq)
#summary(rlq)
#plot(rlq)

The RLQ output is complex, but only a little of it matters at this point: we have 8 eigenvalues and their values. The various components of the output will be used later in plots and analyses.

Nevertheless, we can calculate the contribution of each axis of the RLQ by applying the formula below, where \(\lambda_i\) is the \(i\)-th eigenvalue (stored in rlq$eig): \[\frac{\lambda_i}{\sum_j{\lambda_j}}\times 100\]

After that, and before plotting and interpreting the result, it is important to test that the RLQ result is not merely due to a random combination of values, i.e. that there is a real link between the tables. To do this, we perform a permutation test with the function randtest.

randtest(rlq)
class: krandtest lightkrandtest 
Monte-Carlo tests
Call: randtest.rlq(xtest = rlq)

Number of tests:   2 

Adjustment method for multiple comparisons:   none 
Permutation number:   999 
     Test        Obs  Std.Obs   Alter Pvalue
1 Model 2 0.01602442 9.182067 greater  0.001
2 Model 4 0.01602442 2.359272 greater  0.026
plot(randtest(rlq))

The output above corresponds to the permutation test. The number of permutations of rows and columns was 999 (the default value).

The results of this test show that the permutation of sites (rows) gives the first result (p-value = 0.1%) and the permutation of species (columns) the second (p-value = 2.6%). Both results are significant at the 5% threshold, so we can conclude that our RLQ result is not due to a random effect.

Finally, the observed RLQ statistics are plotted against the null distributions obtained by permutation.

We can now analyse the results using different types of plots:

plot(rlq)

The plot above is the complete plot; it is difficult to read and will be broken down step by step below.

At the top, however, it shows the scores (positions) of the sites and of the species in the analysis. Below, in the centre, we have the correlations of the environmental variables with each other (“R Canonical weights”) and of the traits with each other (“Q Canonical weights”). The “R axes” and “Q axes” panels show how the axes of the separate analyses (CA or MCA here) project onto the RLQ axes.

Now we can decompose the results, starting with the environmental variables:

#analyse the environmental variables:
round(rlq$l1,dig=2)
                 RS1   RS2
farms.yes      -0.11  0.73
farms.no        0.10 -0.65
small.bui.yes  -2.27 -0.53
small.bui.no    0.71  0.16
high.bui.yes   -2.17 -0.48
high.bui.no     0.35  0.08
industry.yes    0.43  2.63
industry.no    -0.11 -0.67
fields.yes      1.34  0.52
fields.no      -1.34 -0.52
grassland.yes  -1.07  0.26
grassland.no    0.30 -0.07
scrubby.yes    -0.10 -0.91
scrubby.no      0.14  1.37
deciduous.yes  -0.56 -0.64
deciduous.no    1.36  1.55
conifer.yes    -0.87  0.91
conifer.no      0.17 -0.18
noisy.yes      -2.61  0.67
noisy.no        0.43 -0.11
veg.cover.R100  4.35 -4.95
veg.cover.R98   1.74 -1.64
veg.cover.R93   0.13  3.16
veg.cover.R87  -0.94 -0.98
veg.cover.R62  -1.86 -0.35
veg.cover.R37  -2.01 -0.10
veg.cover.R22  -2.91  0.05
veg.cover.R5    1.04  3.24
s.label(rlq$l1, boxes=FALSE)

#calculation of the absolute contributions
iner=inertia.dudi(rlq,col.inertia=T,row.inertia=T)

abscoiE=iner$row.abs
abscoiE
                     Axis1        Axis2
farms.yes       0.05223514  2.304444147
farms.no        0.04623764  2.039853800
small.bui.yes  11.21784230  0.599492102
small.bui.no    3.52200412  0.188219231
high.bui.yes    5.99404001  0.297237022
high.bui.no     0.97600864  0.048399060
industry.yes    0.33291147 12.714009779
industry.no     0.08422136  3.216444135
fields.yes      8.17323519  1.238389579
fields.no       8.20233957  1.242799407
grassland.yes   2.29313411  0.132425116
grassland.no    0.64653934  0.037336694
scrubby.yes     0.04972304  4.526463864
scrubby.no      0.07475839  6.805519570
deciduous.yes   2.01878003  2.632197876
deciduous.no    4.90713304  6.398193444
conifer.yes     1.13513324  1.240985532
conifer.no      0.22330026  0.244123224
noisy.yes       8.82492060  0.589761352
noisy.no        1.47048833  0.098271387
veg.cover.R100 16.95851693 21.972519979
veg.cover.R98   4.57563062  4.054631466
veg.cover.R93   0.02364930 14.452989048
veg.cover.R87   0.82853117  0.891676376
veg.cover.R62   3.46745570  0.119501611
veg.cover.R37   5.23067194  0.012183942
veg.cover.R22   7.43352189  0.002336584
veg.cover.R5    1.23703663 11.899594674

The first table gives the position of each modality of each environmental variable on the first two axes, and you can link it to the plot. For example, “veg.cover.R100” stands out from the other modalities: its position is 4.35 on the first axis and -4.95 on the second. Closer to the centre of the cloud, “veg.cover.R98” has positions of 1.74 on the first axis and -1.64 on the second. We can also look at the contributions of these two modalities. The second table gives the absolute contribution of each modality of each variable: the contribution of “veg.cover.R100” is 17% on the first axis and 22% on the second. These are high values if we take as threshold 1/N, with N the number of modalities (28 here, so about 3.6%). The contribution of “veg.cover.R98” is 4.6% on the first axis and 4.1% on the second.
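
The same threshold rule can be applied programmatically to flag the influential modalities. A minimal sketch, assuming the abscoiE table computed above (contributions are in %):

threshold <- 100 / nrow(abscoiE)  # 1/N threshold in %, with N = 28 modalities
abscoiE[abscoiE$Axis1 > threshold | abscoiE$Axis2 > threshold, ]  # modalities above threshold on either axis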

We can therefore conclude that these two modalities are correlated. This seems biologically coherent, since both belong to a vegetation-cover gradient and are its two highest levels.

We can also see that the modalities negatively correlated with the first axis are those describing urbanization (e.g. noisy.yes, small.bui.yes, high.bui.yes).

Now we can carry out the same analysis for the traits.

#analyse the traits:
round(rlq$c1,dig=2)
                     CS1   CS2
feed.hab.insect     0.72  0.73
feed.hab.grani     -0.69 -0.24
feed.hab.omni      -0.13 -1.64
feed.strat.ground  -0.11  0.01
feed.strat.aerial  -1.26  1.24
feed.strat.foliage  2.60 -1.97
breeding.ground     3.60  4.16
breeding.building  -1.16  0.35
breeding.scrub      1.71 -2.22
breeding.foliage   -0.15 -0.62
migratory.resident -0.24 -0.23
migratory.migrant   0.54  0.50
s.arrow(rlq$c1, boxes=FALSE)

#absolute contribution
abscoiV=iner$col.abs
abscoiV
                         Axis1        Axis2
feed.hab.insect     5.69187893  5.861738276
feed.hab.grani      5.10226145  0.632420595
feed.hab.omni       0.05897841  8.814467814
feed.strat.ground   0.22470905  0.003581384
feed.strat.aerial   7.09044249  6.818080636
feed.strat.foliage 19.80381068 11.362659415
breeding.ground    28.19988784 37.628640395
breeding.building  17.18536604  1.552603443
breeding.scrub     13.29440922 22.370738798
breeding.foliage    0.11612276  2.116775390
migratory.resident  1.00414733  0.881790779
migratory.migrant   2.22798580  1.956503077

As before, the first table gives the positions of the modalities on the first two RLQ axes, the plot represents them, and the second table gives the absolute contributions.

Here the differences between variables and modalities are larger. We will focus on a few examples, but you can take the time to examine each modality in detail. “feed.strat.foliage” (coord = 2.60; -1.97 on the first and second axes respectively) and “breeding.scrub” (coord = 1.71; -2.22) break away from the centre of the cloud, and their contributions are all greater than 10%. We can conclude that these modalities are correlated and contribute strongly to the axes.
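
The same flagging as for the environmental modalities applies here, with N = 12 modalities (threshold of about 8.3%). A sketch, assuming the abscoiV table above:

threshold_Q <- 100 / nrow(abscoiV)  # 1/N threshold in %, with N = 12 modalities
abscoiV[abscoiV$Axis1 > threshold_Q | abscoiV$Axis2 > threshold_Q, ]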

The last step is to look at the species coordinates.

#analyse the species:
round(rlq$lQ,dig=2)
     AxcQ1 AxcQ2
Sp1   0.04 -0.44
Sp2  -0.22 -0.19
Sp3   0.97  0.76
Sp4  -0.55 -0.03
Sp5   0.17 -0.67
Sp6  -0.30 -0.27
Sp7   0.71 -0.24
Sp8  -0.29  0.70
Sp9   0.99  1.17
Sp10 -0.29  0.70
Sp11 -0.29  0.70
Sp12  1.19  1.35
Sp13 -0.41 -0.38
Sp14  1.19  1.35
Sp15  1.19  1.35
Sp16 -0.20  0.22
Sp17  0.00  0.40
Sp18  0.71 -0.24
Sp19  0.31 -1.02
Sp20  1.39 -0.74
Sp21  1.39 -0.74
Sp22  1.20 -0.92
Sp23  1.39 -0.74
Sp24  1.39 -0.74
Sp25  0.48 -0.28
Sp26  0.48 -0.28
Sp27  0.64  0.93
Sp28  0.17 -0.67
Sp29 -0.55 -0.03
Sp30 -0.55 -0.03
Sp31 -0.30 -0.27
Sp32 -0.30 -0.27
Sp33  0.38 -0.76
Sp34  0.85 -1.16
Sp35 -0.30 -0.27
Sp36 -0.20  0.22
Sp37  1.18 -1.33
Sp38 -0.16 -0.62
Sp39 -0.41 -0.38
Sp40 -0.16 -0.62
s.label(rlq$lQ,boxes=FALSE)

s.label(rlq$lQ, label=aviurba$species.names.fr,boxes=FALSE)

What we see here is a gradient of response along the first axis (e.g. Loriot Jaune/Sp37 [coord = 1.18; -1.33] and Alouette des champs/Sp9 [coord = 0.99; 1.17]) and along the second axis (e.g. Loriot Jaune/Sp37 [coord = 1.18; -1.33] and Fauvette des jardins/Sp21 [coord = 1.39; -0.74]). We will need to link this to the other tables (traits and environment). For the species and the sites there are no contributions, because these entities are not used to construct the axes.

To conclude on this analysis, the first two axes explain 86.2% of the total inertia, with 66.7% and 19.5% respectively.

The correlations between the environmental variables and the RLQ axes show that the first axis is driven by veg.cover.R100 (contribution of 16.96%), small.bui.yes (11.22%), noisy.yes (8.82%), fields.yes (8.17%), fields.no (8.20%) and veg.cover.R22 (7.43%). The trait modalities contributing to the first RLQ axis are breeding.ground (28.20%), feed.strat.foliage (19.80%) and breeding.scrub (13.29%), and, negatively, breeding.building (17.19%) and feed.strat.aerial (7.09%). In other words, breeding.ground, feed.strat.foliage, breeding.scrub, feed.strat.aerial and breeding.building were the attributes with the highest explanatory power on this RLQ axis (read from the length of the vectors and the angles between the vectors and the axes).

To go further with Multivariate Analysis

Some other methods of two-table analysis were not presented here:

  • Co-inertia analysis is used when you have a table with your response variables \(Y\) and a second table of potentially explanatory variables \(X\), but no a priori hypothesis about the effect of \(X\) on \(Y\).

  • CCA is used with two tables where both \(Y\) and \(X\) are quantitative data; it works in the same way as RDA.

  • PCCA is like CCA but includes a spatial dependency… Spatial and temporal dependencies are a big deal in data analysis!

Even if this presentation of multivariate analysis was not exhaustive, you should now have a better view of what you can do when you have a data table to analyse. The method to choose depends on the number of tables and on the type of variables (quantitative or qualitative). To help you decide which of the analyses presented here to use, you can have a look at this decision tree!

Decision tree multivariate analysis (from A. Pannard)