Exploring data on penguins

Author

CHOULETTE Klervi, DENIS Manon, DUPORT Maureen, MOTTE Arthur, POYARD Antoine, SESIA Marion, TANGUY Céline

Introduction

Study context

This document is a report on the analysis of a dataset on penguins. The data collected contains informations on multiples variables. The dataset was found on Kaggle : https://www.kaggle.com/datasets/utkarshx27/penguin-size-clutch-and-blood-isotope-data?select=penguins_raw.csv. This dataset was assemble from various articles according by Gorman et al. (Gorman 2014) and Horst et al. (Horst, Hill, and Gorman 2022). It focused on three penguin species : Adelie (Pygoscelis adeliae), Chinstrap (Pygoscelis antarcticus) and Gentoo (Pygoscelis papua). The dataset contains 344 observations and 17 variables. The goal of this analysis is to explore the dataset and to identify the main characteristics of the penguins.

Show the code
library(factoextra)         # for data visualization
library(FactoMineR)         # for PCA
library(tidyverse)          # for data manipulation and visualization
library(dplyr)              # for data manipulation
library(ggplot2)            # for data visualization
library(corrplot)           # for correlation test
library(sf)                 # for spatial data
library(rnaturalearth)      # for earth maps
library(rnaturalearthdata)  # for earth dataset
library(ggspatial)          # for spatial plot
library(gridExtra)          # for arranging plot
library(kableExtra)         # for enhanced tables

Location

Data were collected in Antarctica on Biscoe, Dream and Torgersen Islands. These three blue points show the location of the data collect in Figure 1.

Show the code
# Download country data (including Antarctica)
world <- ne_countries(scale = "medium", returnclass = "sf")

# Filter to keep only Antarctica
antarctica <- world[world$continent == "Antarctica", ]

rectangle_coords <- st_as_sfc(st_bbox(c(xmin = -90, xmax = -55, ymin = -75, ymax = -60), crs = st_crs(4326)))

antarctic_proj <- "+proj=laea +lat_0=-90 +datum=WGS84"
# Transform the rectangle into the polar projection system
rectangle_proj <- st_transform(rectangle_coords, crs = antarctic_proj)

# Create the map
gg_antartique = ggplot(data = antarctica) +
    annotate("rect", xmin = -90, xmax = -55, ymin = -75, ymax = -60,
           fill = "blue") +
  geom_sf() +
  coord_sf(crs = "+proj=laea +lat_0=-90 +datum=WGS84") +
 annotate("rect", 
           xmin = st_bbox(rectangle_proj)["xmin"], xmax = st_bbox(rectangle_proj)["xmax"], 
           ymin = st_bbox(rectangle_proj)["ymin"], ymax = st_bbox(rectangle_proj)["ymax"], 
           alpha = 0.2, fill = "blue")+
  theme_minimal() +
  labs(title = "Antarctica map",
       caption = "Data: Natural Earth") +
  annotation_scale(location = "bl", width_hint = 0.5)

antarctic_proj <- "+proj=laea +lat_0=-90 +datum=WGS84"

# Point coordinates in longitude and latitude
point_coords <- data.frame(lon = c(-66.209547, -64, -65), lat = c(-66, -65, -64.8))

# Convert the point into an sf object (longitude/latitude coordinate system WGS84)
point_sf <- st_as_sf(point_coords, coords = c("lon", "lat"), crs = 4326)

# Transform the point into the projected coordinate system of the map
point_proj <- st_transform(point_sf, crs = antarctic_proj)

# Extract the projected coordinates for use in annotate
point_proj_coords <- st_coordinates(point_proj)

# Create a bounding box manually with the coordinates
# Convert the bounding box into an sf object
bbox_sf <- st_as_sfc(st_bbox(c(xmin = -90, xmax = -55, ymin = -75, ymax = -60)))

# Crop the map using the bounding box
antarctica_cropped <- st_crop(antarctica, bbox_sf)

# Create the map
gg_zone_interet = ggplot() +
  geom_sf(data = antarctica_cropped) +
  geom_sf(data = point_proj, color = "blue", size = 3, alpha = 0.8, shape = 19) +

  coord_sf(crs = "+proj=laea +lat_0=-90 +datum=WGS84") +
  theme_minimal() +
  labs(title = "Area of Interest",
       caption = "Data: Natural Earth") +
  annotation_scale(location = "bl", width_hint = 0.5)

grid.arrange(gg_antartique, gg_zone_interet, ncol = 2)

Figure 1: Maps of Antarctica with the collect points

Dataset presentation

Columns name :

  • studyName : sampling expedition from which data were collected, generated, etc.
  • Sample Number : an integer denoting the continuous numbering sequence for each sample
  • Species : a character string denoting the penguin species
  • Region : a character string denoting the region of Palmer LTER sampling grid
  • Island : a character string denoting the island near Palmer Station where samples were collected
  • Stage : a character string denoting the reproductive stage at the time of sampling
  • Individual ID : a character string denoting the unique ID for each individual in dataset
  • Clutch Completion : a character string denoting if the study nest was observed with a full clutch, i.e., 2 eggs
  • Date Egg : a date denoting the date study nest observed with 1 egg (sampled)
  • Culmen Length : a number denoting the length of the dorsal ridge of a bird’s bill (millimeters)
  • Culmen Depth : a number denoting the depth of the dorsal ridge of a bird’s bill (millimeters)
  • Flipper Length : an integer denoting the length penguin flipper (millimeters)
  • Body Mass : an integer denoting the penguin body mass (grams)
  • Sex : a character string denoting the sex of an animal
  • Delta 15 N : a number denoting the measure of the ratio of stable isotopes 15N:14N
  • Delta 13 C : a number denoting the measure of the ratio of stable isotopes 13C:12C
  • Comments : a character string with text providing additional relevant information for data

Loading dataset :

data <- read.table("penguins_raw.csv", sep = "," , header = TRUE, stringsAsFactors = T)
summary(data)
       X            studyName   Sample.Number   
 Min.   :  1.00   PAL0708:110   Min.   :  1.00  
 1st Qu.: 86.75   PAL0809:114   1st Qu.: 29.00  
 Median :172.50   PAL0910:120   Median : 58.00  
 Mean   :172.50                 Mean   : 63.15  
 3rd Qu.:258.25                 3rd Qu.: 95.25  
 Max.   :344.00                 Max.   :152.00  
                                                
                                      Species       Region          Island   
 Adelie Penguin (Pygoscelis adeliae)      :152   Anvers:344   Biscoe   :168  
 Chinstrap penguin (Pygoscelis antarctica): 68                Dream    :124  
 Gentoo penguin (Pygoscelis papua)        :124                Torgersen: 52  
                                                                             
                                                                             
                                                                             
                                                                             
                Stage     Individual.ID Clutch.Completion       Date.Egg  
 Adult, 1 Egg Stage:344   N13A1  :  3   No : 36           2007-11-27: 18  
                          N13A2  :  3   Yes:308           2007-11-16: 16  
                          N18A1  :  3                     2008-11-09: 16  
                          N18A2  :  3                     2009-11-18: 14  
                          N21A1  :  3                     2008-11-04: 12  
                          N21A2  :  3                     2008-11-06: 12  
                          (Other):326                     (Other)   :256  
 Culmen.Length..mm. Culmen.Depth..mm. Flipper.Length..mm. Body.Mass..g. 
 Min.   :32.10      Min.   :13.10     Min.   :172.0       Min.   :2700  
 1st Qu.:39.23      1st Qu.:15.60     1st Qu.:190.0       1st Qu.:3550  
 Median :44.45      Median :17.30     Median :197.0       Median :4050  
 Mean   :43.92      Mean   :17.15     Mean   :200.9       Mean   :4202  
 3rd Qu.:48.50      3rd Qu.:18.70     3rd Qu.:213.0       3rd Qu.:4750  
 Max.   :59.60      Max.   :21.50     Max.   :231.0       Max.   :6300  
 NA's   :2          NA's   :2         NA's   :2           NA's   :2     
     Sex      Delta.15.N..o.oo. Delta.13.C..o.oo.
 FEMALE:165   Min.   : 7.632    Min.   :-27.02   
 MALE  :168   1st Qu.: 8.300    1st Qu.:-26.32   
 NA's  : 11   Median : 8.652    Median :-25.83   
              Mean   : 8.733    Mean   :-25.69   
              3rd Qu.: 9.172    3rd Qu.:-25.06   
              Max.   :10.025    Max.   :-23.79   
              NA's   :14        NA's   :13       
                                  Comments  
 Nest never observed with full clutch.: 34  
 Not enough blood for isotopes.       :  7  
 Sexing primers did not amplify.      :  4  
 No blood sample obtained for sexing. :  2  
 No blood sample obtained.            :  2  
 (Other)                              :  5  
 NA's                                 :290  
# dim(data)

Names modification of columns and species :

Show the code
data <- data %>%
  select(-X) %>%                   # Remove X column 
  select(-Comments) %>%           # Remove Comments column
  rename(                          # we rename the columns of the dataset
    sample_number = Sample.Number,
    id = Individual.ID,
    island = Island,
    region = Region,
    stage = Stage,
    clutch_completion = Clutch.Completion,
    date_egg = Date.Egg,
    culmen_length = Culmen.Length..mm.,
    culmen_depth = Culmen.Depth..mm.,
    flipper_length = Flipper.Length..mm.,
    body_mass = Body.Mass..g.,
    sex = Sex,
    delta_15_N = Delta.15.N..o.oo.,
    delta_13_C = Delta.13.C..o.oo.,
    species = Species)

names(data)
 [1] "studyName"         "sample_number"     "species"          
 [4] "region"            "island"            "stage"            
 [7] "id"                "clutch_completion" "date_egg"         
[10] "culmen_length"     "culmen_depth"      "flipper_length"   
[13] "body_mass"         "sex"               "delta_15_N"       
[16] "delta_13_C"       
Show the code
# rename species
data = data %>%
  mutate(species = recode(species, "Adelie Penguin (Pygoscelis adeliae)" = "Adelie", "Chinstrap penguin (Pygoscelis antarctica)" = "Chinstrap", "Gentoo penguin (Pygoscelis papua)" = "Gentoo"))

These names are easier to understand.

Missing values

The Figure 2 have a red line that represents the mean of missing values (NA) in the dataset and the bars are showing the number of NA’s for each variables.

Show the code
# Verification of missing values
colSums(is.na(data)) 
        studyName     sample_number           species            region 
                0                 0                 0                 0 
           island             stage                id clutch_completion 
                0                 0                 0                 0 
         date_egg     culmen_length      culmen_depth    flipper_length 
                0                 2                 2                 2 
        body_mass               sex        delta_15_N        delta_13_C 
                2                11                14                13 
Show the code
data_NA <- data 

# Representation of missing values
NA_df <- (colSums(is.na(data_NA)) / nrow(data_NA)) * 100

color_pal <- colorRampPalette(c("limegreen", "yellow", "firebrick"), bias = 20)(10)
NA_col <- color_pal[as.integer(NA_df/length(NA_df)*9)+1]

# Graphique
barplot(NA_df,ylab = "NA", col = NA_col, las=2, cex.names=0.6)

NA_mean<-mean(NA_df)
# 0.78% 

abline(h=NA_mean, col = "red", lwd = 2)

Figure 2: Number of missing value as a function of each variables

Exploratory analysis

In this dataset, we would like to compare the different qualitative variables :
- Culmen Length
- Culmen Depth
- Flipper Length
- Body Mass
- Delta 13C
- Delta 15N

Correlation

To compare these qualitative variables, we begin by examining the correlations among them.

The representation of this matrix is not optimal.

sub_data_num = data%>%
  select(culmen_length, culmen_depth,flipper_length,body_mass,delta_13_C,delta_15_N)%>%
  na.omit() # we delete the few individuals with NA

matcor = cor(sub_data_num,method = "pearson")
matcor
               culmen_length culmen_depth flipper_length  body_mass delta_13_C
culmen_length     1.00000000   -0.2259394      0.6491779  0.5934113  0.1807339
culmen_depth     -0.22593935    1.0000000     -0.5810728 -0.4677456  0.4258196
flipper_length    0.64917789   -0.5810728      1.0000000  0.8748008 -0.3820590
body_mass         0.59341135   -0.4677456      0.8748008  1.0000000 -0.3755823
delta_13_C        0.18073392    0.4258196     -0.3820590 -0.3755823  1.0000000
delta_15_N       -0.05975909    0.6058744     -0.5077872 -0.5378882  0.5706148
                delta_15_N
culmen_length  -0.05975909
culmen_depth    0.60587439
flipper_length -0.50778717
body_mass      -0.53788824
delta_13_C      0.57061480
delta_15_N      1.00000000

We can use the library corrplot to make some visual correlation matrix.

There is a correlation that we have to remove in a multivariate analysis between the body_mass variable and the flipper_length as we can see in Figure 3 and Figure 4.

Show the code
corrplot(matcor, method = "number", type = "upper",order = "hclust", tl.col="black", tl.srt=45)

Figure 3: Correlation matrix with numbers
Show the code
corrplot(matcor, type = "upper",order = "hclust", tl.col="black", tl.srt=45)

Figure 4: Correlation matrix with forms

Now, we want to observe whether certain qualitative variables have an effect on body mass. We will therefore try to create a graph for each factor (qualitative variables) according to body mass:
- Species
- Island
- Sex

We have noticed some trends in Figure 5 that we will have to confirm with statistical tests.
Box1: Gentoo penguins appear to be heavier than the other two species.
Box2: Penguins from Biscoe Island appear to be heavier than the other two islands.
Box3: Male penguins appear to be heavier than the other two categories.

Show the code
box1 <- ggplot(data = data)+
  geom_boxplot(aes(x=species,y=body_mass, fill = species))

box2 <- ggplot(data = data)+
  geom_boxplot(aes(x=island,y=body_mass, fill = island))

box3 <- ggplot(data = data)+
  geom_boxplot(aes(x=sex,y=body_mass, fill = sex))
global_box <- grid.arrange(box1,box2,box3)

Violin plot

Here is another way to visualize in Figure 6 the distribution between two variables.

Show the code
# Violin plot for Culmen Length (beak length)
v1 <- ggplot(data, aes(x = species, y = culmen_length, fill = species)) +
  geom_violin(trim = FALSE) +
  geom_jitter(shape = 16, position = position_jitter(0.2), color = "black", alpha = 0.3) +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 3, fill = "red", color = "red") +
  labs(title = "Distribution of Culmen Length by Species",
       x = "Species",
       y = "Culmen Length (mm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 8),  # Taille de l'axe x
        axis.title.y = element_text(size = 8))

# Violin plot for Flipper Length
v2 <- ggplot(data, aes(x = species, y = flipper_length, fill = species)) +
  geom_violin(trim = FALSE) +
  geom_jitter(shape = 16, position = position_jitter(0.2), color = "black", alpha = 0.3) +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 3, fill = "red", color = "red") +
  labs(title = "Distribution of Flipper Length by Species",
       x = "Species",
       y = "Flipper Length (mm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 8),  # Taille de l'axe x
        axis.title.y = element_text(size = 8))

# Violin plot for Body Mass
v3 <- ggplot(data, aes(x = species, y = body_mass, fill = species)) +
  geom_violin(trim = FALSE) +
  geom_jitter(shape = 16, position = position_jitter(0.2), color = "black", alpha = 0.3) +
    stat_summary(fun = mean, geom = "point", shape = 20, size = 3, fill = "red", color = "red") +
  labs(title = "Distribution of Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(size = 10, face = "bold"),
        axis.title.x = element_text(size = 8),  # Taille de l'axe x
        axis.title.y = element_text(size = 8))

grid.arrange(v1, v2, v3,  #names of graphs  
             ncol = 3, nrow = 1)

Figure 6: Violin plot of body mass as function of many variables

ANOVA, U-Mann-Whitney tests, Student test

We run ANOVA and mean comparison tests to verify these differences.
There is a significant difference between the body mass of these 3 species, almost 2 of them, with a p-value of 2e-16.

# homoscedasticity
bartlett.test(data$body_mass~data$species) # p = 0.05005 > 0.05 homoscedasticity conditions are satisfied

    Bartlett test of homogeneity of variances

data:  data$body_mass by data$species
Bartlett's K-squared = 5.9895, df = 2, p-value = 0.05005
anova1 = aov(data$body_mass~data$species)

#normality of residuals
shapiro.test(residuals(anova1)) #p = 0.05118 > 0.05, the normality of residuals is accepted

    Shapiro-Wilk normality test

data:  residuals(anova1)
W = 0.99166, p-value = 0.05118
summary(anova1)
              Df    Sum Sq  Mean Sq F value Pr(>F)    
data$species   2 146864214 73432107   343.6 <2e-16 ***
Residuals    339  72443483   213698                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness

We perform a Tukey’s post-hoc test to see where the differences lie. There is a significant difference between the body mass of Gentoo and Adelie (p = 0) and Gentoo and Chinstrap (p = 0). There is no significant difference between Adelie and Chinstrap.

TukeyHSD(anova1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = data$body_mass ~ data$species)

$`data$species`
                       diff       lwr       upr     p adj
Chinstrap-Adelie   32.42598 -126.5002  191.3522 0.8806666
Gentoo-Adelie    1375.35401 1243.1786 1507.5294 0.0000000
Gentoo-Chinstrap 1342.92802 1178.4810 1507.3750 0.0000000

We cannot perform an ANOVA between body mass and islands because the homoscedasticity conditions are not satisfied, so we will perform a Kruskall-Wallis test. There is a significant difference between the body mass of these 3 islands, almost 2 of them, with a p-value of 2e-16.

# homoscedasticity
bartlett.test(data$body_mass~data$island) # p = 8.978e-14 < 0.05 homoscedasticity conditions are not satisfied

    Bartlett test of homogeneity of variances

data:  data$body_mass by data$island
Bartlett's K-squared = 60.083, df = 2, p-value = 8.978e-14
# Kruskall Wallis test
kruskal.test(data$body_mass~data$island)

    Kruskal-Wallis rank sum test

data:  data$body_mass by data$island
Kruskal-Wallis chi-squared = 130.07, df = 2, p-value < 2.2e-16

In order to observe the differences between the islands, we are going to carry out tests to compare the average between each island.

# normality
tapply(data$body_mass, data$island, shapiro.test)
$Biscoe

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.97247, p-value = 0.00209


$Dream

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.98889, p-value = 0.4161


$Torgersen

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.97185, p-value = 0.2636
# Biscoe: p = 0.00209 < 0.05, normality is not respected
# Dream: p = 0.4161 > 0.05, normality is respected
# Torgersen: p =  0.2636 > 0.05, normality is respected

Data normality is respected for Torgensen Island and Dream Island. We will therefore carry out a homoscedasticity test between these islands to determine the appropriate mean comparison test.

# Filter the data between the islands
dream_data <- subset(data, island == "Dream")$body_mass
torgersen_data <- subset(data, island == "Torgersen")$body_mass
biscoe_data <- subset(data, island == "Biscoe")$body_mass

# Perform a variance test between the 'Dream' and 'Torgersen' groups
var.test(dream_data, torgersen_data) # p = 0.5526 > 0.05, homoscedasticity conditions are satisfied

    F test to compare two variances

data:  dream_data and torgersen_data
F = 0.87619, num df = 123, denom df = 50, p-value = 0.5526
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5354096 1.3682404
sample estimates:
ratio of variances 
         0.8761931 

We will therefore carry out two U-Mann-Whitney tests between the Biscoe and Torgersen Islands and the Biscoe and Dream Islands, and a Student t-test between the Dream and Torgersen Islands.

# U-Mann Withney test between the Biscoe and Torgersen
wilcox.test(biscoe_data, torgersen_data) # p = 1.396e-14 < 0.05

    Wilcoxon rank sum test with continuity correction

data:  biscoe_data and torgersen_data
W = 7293, p-value = 1.396e-14
alternative hypothesis: true location shift is not equal to 0
# U-Mann Withney test between the Biscoe and Dream
wilcox.test(biscoe_data, dream_data) # p = 2.2e-16 < 0.05

    Wilcoxon rank sum test with continuity correction

data:  biscoe_data and dream_data
W = 17741, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Student t-test between the Dream and Torgersen
t.test(dream_data, torgersen_data, var.equal = T) # p = 0.9265 > 0.05

    Two Sample t-test

data:  dream_data and torgersen_data
t = 0.092359, df = 173, p-value = 0.9265
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -133.0344  146.0958
sample estimates:
mean of x mean of y 
 3712.903  3706.373 

There is a significant difference in body mass between Dream and Biscoe Islands (p = 2.2e-16) and between Torgersen and Biscoe Islands (p = 1.396e-14). There was no significant difference between Torgersen and Dream Islands.

As there are only two modalities for gender, instead of an ANOVA we will perform a U-Mann-Whitney mean comparison test. There is a significant difference between the body mass and the sex with a p-value of 1.813e-15.

# normality
tapply(data$body_mass, data$sex, shapiro.test)
$FEMALE

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.91931, p-value = 6.155e-08


$MALE

    Shapiro-Wilk normality test

data:  X[[i]]
W = 0.92504, p-value = 1.227e-07
# Female: p = 6.155e-08 < 0.05, normality is not respected
# Male: p = 1.227e-07 < 0.05, normality is not respected

# U-Mann Withney test
female_data = subset(data, sex == "FEMALE")$body_mass
male_data = subset(data, sex == "MALE")$body_mass
wilcox.test(female_data, male_data) # p = 1.813e-15 < 0.05

    Wilcoxon rank sum test with continuity correction

data:  female_data and male_data
W = 6874.5, p-value = 1.813e-15
alternative hypothesis: true location shift is not equal to 0

Distribution of species between the islands in Table 1 : we can see from this table that the Adelie species is present on all three islands, while the Chrinstrap and Gentoo species are only present on 2 islands, Dream and Biscoe respectively.

Show the code
species_islands <- table(data$island, data$species)
kable(species_islands, "html", escape = F) %>%
  kable_styling(bootstrap_options = c("striped"),
                full_width = F,
                position = "center")
Table 1: Species distribution between the islands
Adelie Chinstrap Gentoo
Biscoe 44 0 124
Dream 56 68 0
Torgersen 52 0 0

Graphical representations and interpretations

In Figure 7, it would appear that there is a positive linear trend between the body_mass variable and flipper_length for each species.

Show the code
ggplot(data, aes(x = flipper_length, y = body_mass, color = species))+
  geom_point()+
  scale_color_manual(values = c("#B2DF8A","#FDBF6F","#CAB2D6"))+
  geom_smooth(method = "lm", col = "red", se = TRUE)+
  theme_bw()+
  theme(legend.position = "none")+
  facet_grid(~species)

Figure 7: Flipper length as a function of body mass

Check with linear regression :

rl1 <- lm(body_mass ~ flipper_length + species + species:flipper_length, data = data)
summary(rl1)

Call:
lm(formula = body_mass ~ flipper_length + species + species:flipper_length, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-911.18 -251.93  -31.77  197.82 1144.81 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -2535.837    879.468  -2.883  0.00419 ** 
flipper_length                     32.832      4.627   7.095 7.69e-12 ***
speciesChinstrap                 -501.359   1523.459  -0.329  0.74229    
speciesGentoo                   -4251.444   1427.332  -2.979  0.00311 ** 
flipper_length:speciesChinstrap     1.742      7.856   0.222  0.82467    
flipper_length:speciesGentoo       21.791      6.941   3.139  0.00184 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 370.6 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7896,    Adjusted R-squared:  0.7864 
F-statistic: 252.2 on 5 and 336 DF,  p-value: < 2.2e-16

These results show a highly significant effect of fin length for all species, with a greater effect for the Gentoo species. There was also a significant difference in body mass between species except between Adelie and Chinstrap.

In conclusion, we can say that fin length influences body mass and that this effect varies according to species.

It would appear that there is a positive linear trend between the body_mass variable and flipper_length for each species.

In the Figure 8, it would appear that there is a positive linear trend between the culmen_depth variable and culmen_length for each species.

Show the code
ggplot(data, aes(x = culmen_length, y = culmen_depth, color = species))+
  geom_point()+
  scale_color_manual(values = c("#B2DF8A","#FDBF6F","#CAB2D6"))+
  geom_smooth(method = "lm", col = "red", se = TRUE)+
  theme_bw()+
  theme(legend.position = "none")+
  facet_grid(~species)

Figure 8: Culmen length as a function of culmen depth

Check with linear regression :

rl2 <- lm(culmen_depth ~ culmen_length + species + species:culmen_length, data = data)
summary(rl2)

Call:
lm(formula = culmen_depth ~ culmen_length + species + species:culmen_length, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6574 -0.6675 -0.0524  0.5383  3.5032 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    11.40912    1.13812  10.025  < 2e-16 ***
culmen_length                   0.17883    0.02927   6.110 2.76e-09 ***
speciesChinstrap               -3.83998    2.05398  -1.870 0.062419 .  
speciesGentoo                  -6.15812    1.75451  -3.510 0.000509 ***
culmen_length:speciesChinstrap  0.04338    0.04558   0.952 0.341895    
culmen_length:speciesGentoo     0.02601    0.04054   0.642 0.521590    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9548 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7662 
F-statistic: 224.5 on 5 and 336 DF,  p-value: < 2.2e-16

These results show that beak length is significant and positive in Adelie. In the species Chinstrap and Gentoo, the beak depth is smaller and is significant in Gentoo. At the interaction level, it is not significant, which means that the effect of beak length on depth is relatively the same for 3 species.

We can conclude that beak depth depends on beak length but this does not vary between species. Gentoo penguins have a shallower beak than the others.

In Figure 9, it seems to be different trends between species.

Show the code
ggplot(data, aes(x = delta_15_N, y = delta_13_C, color = species))+
  geom_point()+
  scale_color_manual(values = c("#B2DF8A","#FDBF6F","#CAB2D6"))+
  geom_smooth(method = "lm", col = "red", se = TRUE)+
  theme_bw()+
  theme(legend.position = "none")+
  facet_grid(~species)

Figure 9: Delta_15_N as a function of delta_13_C

We check this using a linear model :

rl3 <- lm(delta_13_C ~ delta_15_N + species + species:delta_15_N, data = data)
summary(rl3)

Call:
lm(formula = delta_13_C ~ delta_15_N + species + species:delta_15_N, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.00888 -0.26656 -0.04769  0.20862  1.82777 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -32.01429    0.77063 -41.543  < 2e-16 ***
delta_15_N                    0.70093    0.08688   8.068 1.41e-14 ***
speciesChinstrap              8.28567    1.57149   5.273 2.46e-07 ***
speciesGentoo                15.64751    1.46204  10.702  < 2e-16 ***
delta_15_N:speciesChinstrap  -0.78957    0.17013  -4.641 5.04e-06 ***
delta_15_N:speciesGentoo     -1.89173    0.17387 -10.880  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4381 on 324 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.6957,    Adjusted R-squared:  0.691 
F-statistic: 148.2 on 5 and 324 DF,  p-value: < 2.2e-16

These results show that the relationship between nitrogen 15 and carbon 13 is positive and significant in Adelie, this relationship is reduced in Chinstrap and even more so in Gentoo. Chinstrap and Gentoo have significantly higher carbon-13 values than Adelie. The interactions between nitrogen 15 and the species are significant, showing that the effect of nitrogen 15 on carbon 13 varies according to the species.

In conclusion, this model shows that the effect of nitrogen 15 on carbon 13 is highly species-dependent, with differences between species.

Principal Component Analysis (PCA) of species characteristics

# Selection of quantitative data
data_numeric <- select_if(data, is.numeric)

# Remove sample_number 
# The body_mass et the flipper_length are more than 80% correlated, we're deleting flipper_length because it's the data that's most correlated with all the other data.
# names(data_numeric)
data_numeric <- data_numeric %>% select(-c("sample_number","flipper_length"))

# Normalizing data
data_scales <- scale(data_numeric)

# PCA realization
acp_data= PCA(data_scales, graph = F)
summary(acp_data)

Call:
PCA(X = data_scales, graph = F) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5
Variance               2.572   1.351   0.526   0.365   0.185
% of var.             51.442  27.029  10.525   7.299   3.705
Cumulative % of var.  51.442  78.471  88.996  96.295 100.000

Individuals (the 10 first)
                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
1             |  1.314 |  0.890  0.090  0.459 | -0.829  0.148  0.397 |  0.320
2             |  1.649 |  1.245  0.175  0.571 | -0.054  0.001  0.001 | -0.732
3             |  1.648 |  0.818  0.076  0.246 | -0.828  0.147  0.252 | -0.479
4             |  0.000 |  0.000  0.000  0.613 |  0.000  0.000  4.990 |  0.000
5             |  2.017 |  1.562  0.276  0.600 | -1.031  0.229  0.261 |  0.156
6             |  2.131 |  1.565  0.277  0.539 | -0.580  0.072  0.074 |  0.759
7             |  1.598 |  1.435  0.233  0.806 | -0.429  0.040  0.072 | -0.315
8             |  2.345 |  1.636  0.302  0.487 |  0.427  0.039  0.033 |  0.787
9             |  2.080 |  1.134  0.145  0.297 | -1.659  0.592  0.636 | -0.105
10            |  1.914 |  1.520  0.261  0.630 |  0.383  0.032  0.040 |  0.891
                 ctr   cos2  
1              0.057  0.059 |
2              0.296  0.197 |
3              0.126  0.084 |
4              0.000  4.933 |
5              0.013  0.006 |
6              0.318  0.127 |
7              0.055  0.039 |
8              0.342  0.113 |
9              0.006  0.003 |
10             0.439  0.217 |

Variables
                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
culmen_length | -0.382  5.660  0.146 |  0.885 57.930  0.783 |  0.037  0.260
culmen_depth  |  0.795 24.599  0.633 |  0.055  0.227  0.003 |  0.555 58.510
body_mass     | -0.818 26.034  0.670 |  0.392 11.384  0.154 |  0.304 17.588
delta_15_N    |  0.830 26.756  0.688 |  0.282  5.887  0.080 |  0.061  0.719
delta_13_C    |  0.660 16.950  0.436 |  0.576 24.572  0.332 | -0.347 22.924
                cos2  
culmen_length  0.001 |
culmen_depth   0.308 |
body_mass      0.093 |
delta_15_N     0.004 |
delta_13_C     0.121 |

Figure 10 is an histogram of the variance of the factorial axes with the percentage of inertia of each axis. Here, we keep 2 dimensions and 78.4% inertia.

Show the code
#Variance for each component
va_propre = fviz_eig(acp_data, main ="", choice = "variance",
         label_size =50, addlabels = T,
         barfill  = "grey",barcolor = "grey")+
   ylim(c(0,60))
va_propre

Figure 10: Barplot of eigen values

Figure 11 assess the absolut contribution of variables to each axis. We can see that three variables make a strong contribution to axis 1 (delta_15_N, body_mass and culmen_depth) and 2 to axis 2 (culmen_depth and delta_13_C).

Show the code
#Contribution of axis variables axis 1 and 2
c1 = fviz_contrib(acp_data,choice = "var",axes =1,
                  fill = "grey",color = "grey")+ labs(title = "Axe 1") #For axe 1
c2 = fviz_contrib(acp_data,choice = "var",axes =2,
                  fill = "grey",color = "grey")+ labs(title = "Axe 2") #For axe 2
grid.arrange(c1,c2, ncol = 2)

Figure 11: Barplots of absolut contributions of axe 1 and 2

First graph (Figure 12) show the result of PCA on axes 1 and 2. A priori clustering by the Species has been performed. Gentoo penguins have a greater mass and a longer culmen. Adelie are lighter and have a depth culmen. Chinstrap penguins are the lightest and have a long, depth culmen. Adelie and Chinstrap penguins both have a high concentration of carbon 13 and nitrogen 15.

Show the code
#Projection of variables and individuals and differentiation according to species
bi_plot = fviz_pca_biplot(acp_data, label="var", 
                habillage=data$species, 
                addEllipses = T,
                ellipse.alpha = F, 
                ylim=c(-3,3.5), xlim = c(-3.5,3.5))+
  scale_color_manual(values = c("#B2DF8A","#FDBF6F","#CAB2D6"))+
  theme_minimal()+
  theme(legend.position = "bottom")
grid.arrange(bi_plot, va_propre, c1, c2,  #names of graphs
             ncol = 2, nrow = 3, #How many row and col
             widths = c(2.2,1), #Here, the first col will be bigger than the second
             layout_matrix = cbind(c(1,1,1), #On the first col only graph 1 for the 3 rows
                                   c(2,3,4))) #On the second col, the 3 other graphs

Figure 12: Projection of variables and individuals according to species

Second graph (Figure 13) show the result of PCA on axes 1 and 2. A priori clustering by the Island has been performed. We can see that Gentoo penguins live only on Biscoe Island, while Chinstrap penguins live mainly on Dream Island. Adelie penguins live on these two islands and also on Togersen Island.

The common habitats of Chinstrap and Adelie penguins seem to explain the presence of isotopes in these two species, as they must find a similar diet.

Show the code
#Projection of variables and individuals and differentiation according to species 
bi_plot = fviz_pca_biplot(acp_data, label="var",  habillage=data$island,  addEllipses = T,  ellipse.alpha = F,   ylim=c(-3,3.5),xlim = c(-3.5,3.5))+  
  scale_color_manual(values = c("#B2DF8A","#FDBF6F","#CAB2D6"))+  
  theme_minimal()+  theme(legend.position = "bottom") 
grid.arrange(bi_plot, va_propre, c1, c2, #names of graphs  
             ncol = 2, nrow = 3, #How many row and col  
             widths = c(2.2,1), #Here, the first col will be bigger than the second  
             layout_matrix = cbind(c(1,1,1), #On the first col only graph 1 for the 3 rows  
                                   c(2,3,4))) #On the second col, the 3 other graphs

Figure 13: Projection of variables and individuals and dfferentiation according to the islands

There is a significant difference between the body mass of these 3 species, almost 2 of them.

anova2 = aov(body_mass~island, data = data)

summary(anova2)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
island        2  86314512 43157256     110 <2e-16 ***
Residuals   339 132993186   392310                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness

There is a significant difference between the body mass of these 3 islands, almost 2 of them.

Body mass model: an example of ANCOVA

ANCOVA (i.e. analysis of variance-covariance) is a statistic method that enable to model the relation between a dependent quantitative variable and several dependent explanatory variable that can be quantitative AND qualitative.

Here we aim to determine:

  1. Which variable influence the penguins’ body mass of the penguins

  2. What is the equation of the body mass model

We are selecting different variable of the data set that we think can influence the body mass: species, island, clutch_completion, culmen_length, culmen_depth, flipper_length, body_mass, sex, delta_15_N, delta_13_C. They are the explanatory variables. It is important to check the type (numeric or factor) of each variable and change if necessary.

We remove the individuals for which at least one measure is missing.

#selection of the interesting variables
dataLM<-data %>% 
  select(species,island,clutch_completion,culmen_length,culmen_depth,flipper_length,body_mass,sex,delta_15_N,delta_13_C)

#set the variable type (factors or numerics)
dataLM$species<-as.factor(dataLM$species)
dataLM$island<-as.factor(dataLM$island)
dataLM$clutch_completion<-as.factor(dataLM$clutch_completion)
dataLM$sex<-as.factor(dataLM$sex)
dataLM$flipper_length<-as.numeric(dataLM$flipper_length)
dataLM$body_mass<-as.numeric(dataLM$body_mass)

str(dataLM)

#check if there is missing values 
colSums(is.na(dataLM))

#delete individual with missing values
dataLM<-dataLM %>% 
  drop_na(culmen_length,culmen_depth,flipper_length,body_mass,sex,delta_15_N,delta_13_C)

Data exploration

Before any modelisation, we have to explore the data to avoid some mistake. These are the different step to follow prior to execute a model:

  1. Check for aberrant values and distribution shape of the response variable which is here body_mass.
  2. If the explanatory variable is quantitative, check for aberrant values and distribution shape, and if the explanatory variable is qualitative, analyse the number mode and the number of individual in each one.
  3. Analyse the potential relation between the response variable and the different explanatory variables.
  4. Look for potential interaction between explanatory variables.
  5. Check for potential colinearity between explanatory variables.

Response variable exploration

Show the code
par(mfrow=c(1,3))
# Cleveland plot
dotchart(dataLM$body_mass,pch=16,col='dodgerblue3',xlab='Body mass')
#Histogram
hist(dataLM$body_mass,col='dodgerblue3',xlab="body mass",main="")
# Quantile-Quantile plot
qqnorm(dataLM$body_mass,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$body_mass,col='red',lwd=2)

Figure 14: Distribution of the values of the variable body mass

We can see on the first graphic of the Figure 14 that there is no aberrant values for the variable body mass. The two other graphic of the figure show that there is a deviation of normality. However, normality of the response variable is not a requirement for ANCOVA.

Explanatory variables exploration

First we invastigate the distribution of the quantitative explanatory variables.

Show the code
par(mfrow=c(5,3))

# Culmen Length
# Cleveland plot
dotchart(dataLM$culmen_length,pch=16,col='dodgerblue3',xlab='culmen_length')
# Histogram
hist(dataLM$culmen_length,col='dodgerblue3',xlab="culmen_length",main="")
# Quantile-Quantile plot
qqnorm(dataLM$culmen_length,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$culmen_length,col='red',lwd=2)

# Culmen Depth
# Cleveland plot
dotchart(dataLM$culmen_depth,pch=16,col='dodgerblue3',xlab='culmen_depth')
# Histogram
hist(dataLM$culmen_depth,col='dodgerblue3',xlab="culmen_depth",main="")
# Quantile-Quantile plot
qqnorm(dataLM$culmen_depth,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$culmen_depth,col='red',lwd=2)

# Flipper length
# Cleveland plot
dotchart(dataLM$flipper_length,pch=16,col='dodgerblue3',xlab='flipper length')
# Histogram
hist(dataLM$flipper_length,col='dodgerblue3',xlab="flipper length",main="")
# Quantile-Quantile plot
qqnorm(dataLM$flipper_length,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$flipper_length,col='red',lwd=2)

# Delta_15_N
# Cleveland plot
dotchart(dataLM$delta_15_N,pch=16,col='dodgerblue3',xlab='delta_15_N')
# Histogram
hist(dataLM$delta_15_N,col='dodgerblue3',xlab="delta_15_N",main="")
# Quantile-Quantile plot
qqnorm(dataLM$delta_15_N,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$delta_15_N,col='red',lwd=2)

# Delta_13_C
# Cleveland plot
dotchart(dataLM$delta_13_C,pch=16,col='dodgerblue3',xlab='delta_13_C')
# Histogram
hist(dataLM$delta_13_C,col='dodgerblue3',xlab="delta_13_C",main="")
# Quantile-Quantile plot
qqnorm(dataLM$delta_13_C,pch=16,col='dodgerblue3',xlab='')
qqline(dataLM$delta_13_C,col='red',lwd=2)

Figure 15: Distribution of the values of the explanatory variables

We can see on the first graphic of each row of the Figure 15 that there is no abberant values for the variable culmen_length, culmen_depth, flipper_length, delta_15_N and delat_13_C. The two other graphic of each row of the figure show that there is a deviation of normality for all explanatory variable. However, normality of the explanatory variable is not a requirement for ANCOVA.

Secondly we analyse the qualitative explanatory variables. We look at the distribution of the values in the different mode.

# Factor Sex
summary(dataLM$sex)
FEMALE   MALE 
   163    161 
# Factor Island
summary(dataLM$island)
   Biscoe     Dream Torgersen 
      162       119        43 
# Factor Species
summary(dataLM$species)
   Adelie Chinstrap    Gentoo 
      139        67       118 
# Factor Clutch completion
summary(dataLM$clutch_completion)
 No Yes 
 34 290 

We observe that there is not the same number of observation in each mode so the plan is non orthogonal. This is not mandatory but it is best when the plan is orthogonal.

Secondly, we explore the dynamics between body mass (response variable) and the explanatory variable, to identify if there any potential trend.

Show the code
par(mfrow=c(5,2))
# Culmen length
plot(dataLM$body_mass~dataLM$culmen_length,pch=16,col='dodgerblue3',xlab='culmen length',ylab='body mass')
# Culmen depth
plot(dataLM$body_mass~dataLM$culmen_depth,pch=16,col='dodgerblue3',xlab='culmen depth',ylab='body mass')
# Flipper length
plot(dataLM$body_mass~dataLM$flipper_length,pch=16,col='dodgerblue3',xlab='flipper length',ylab='body mass')
# delta 15 N
plot(dataLM$body_mass~dataLM$delta_15_N,pch=16,col='dodgerblue3',xlab='delta 15N',ylab='body mass')
# delta 13 C
plot(dataLM$body_mass~dataLM$delta_13_C,pch=16,col='dodgerblue3',xlab='delta 13C',ylab='body mass')
#Species
boxplot(dataLM$body_mass~dataLM$species, varwidth = TRUE, ylab = "body mass", xlab = "Species", col='darkgrey', main = "")
#Island
boxplot(dataLM$body_mass~dataLM$island, varwidth = TRUE, ylab = "body mass", xlab = "Island", col='brown', main = "")
#Clutch completion
boxplot(dataLM$body_mass~dataLM$clutch_completion, varwidth = TRUE, ylab = "body mass", xlab = "Clutch completion", col='brown', main = "")
#Sex
boxplot(dataLM$body_mass~dataLM$sex, varwidth = TRUE, ylab = "body mass", xlab = "Sex", col='brown', main = "")

Figure 16: Effect of the explanatories variables on the body mass

We can see on the Figure 16 that the variables culmen_length, culmen_depth, flipper_length, delta_15_N, species and island seems to influence the body mass. The body mass seems to increase when culmen_length, culmen_depth and flipper_length increase. On the opposite, body mass seems to decrease when delta_15_N increase. The species Gentoo and the individuals of the Biscoe Island seem to have a higher body mass.

Here we are considering a simple model with no interaction but in some case it can usefull to examine the potential interaction between the different variables.

In order to avoid colinearity in modelling, we analyse how the explanatories variables are related. To do so we calculate the correlation between the quantitative explanatories variables and evaluate if the qualitative explanatory variables impact the quantitative one using boxplot.

We chose a threshold of 0.7, to ensure there is no colinearity detected between the covariables.

First between covariable: the quantitative explanatories variables.

Show the code
corrplot.mixed(cor(dataLM %>% select_if(is.numeric)%>%select(-body_mass)),upper="square",lower.col="black", tl.col="black",cl.cex = 0.8,tl.cex = 0.7,number.cex =0.8)

Figure 17: Corelation matrice between the covariables

The threshold above which we cosider there is colinearity is usually choose at +/- 0.7. We can thus see in the Figure 17 that there is no colinearity detected between the covariables.

Then between cofactors: the qualitative explanatory variables.

Show the code
# Checking collinearity between cofactors and covariables
par(mfrow=c(5,4))
#Culmen length and species
boxplot(dataLM$culmen_length~dataLM$species, varwidth = TRUE, ylab = "Culmen length", xlab = "Species", col='grey', main = "")
#Culmen length and island
boxplot(dataLM$culmen_length~dataLM$island, varwidth = TRUE, ylab = "Culmen length", xlab = "Island", col='grey', main = "")
#Culmen length and sex
boxplot(dataLM$culmen_length~dataLM$sex, varwidth = TRUE, ylab = "Culmen length", xlab = "Sex", col='grey', main = "")
#Culmen length and clutch completion
boxplot(dataLM$culmen_length~dataLM$clutch_completion, varwidth = TRUE, ylab = "Culmen length", xlab = "Clutch completion", col='grey', main = "")

#Culmen depth and species
boxplot(dataLM$culmen_depth~dataLM$species, varwidth = TRUE, ylab = "Culmen depth", xlab = "Species", col='grey', main = "")
#Culmen depth and island
boxplot(dataLM$culmen_depth~dataLM$island, varwidth = TRUE, ylab = "Culmen depth", xlab = "Island", col='grey', main = "")
#Culmen depth and sex
boxplot(dataLM$culmen_depth~dataLM$sex, varwidth = TRUE, ylab = "Culmen depth", xlab = "Sex", col='grey', main = "")
#Culmen depth and clutch completion
boxplot(dataLM$culmen_depth~dataLM$clutch_completion, varwidth = TRUE, ylab = "Culmen depth", xlab = "Clutch completion", col='grey', main = "")

#Flipper length and species
boxplot(dataLM$flipper_length~dataLM$species, varwidth = TRUE, ylab = "Flipper length", xlab = "Species", col='grey', main = "")
#Flipper length and island
boxplot(dataLM$flipper_length~dataLM$island, varwidth = TRUE, ylab = "Flipper length", xlab = "Island", col='grey', main = "")
#Flipper length and sex
boxplot(dataLM$flipper_length~dataLM$sex, varwidth = TRUE, ylab = "Flipper length", xlab = "Sex", col='grey', main = "")
#Flipper length and clutch completion
boxplot(dataLM$flipper_length~dataLM$clutch_completion, varwidth = TRUE, ylab = "Flipper length", xlab = "Clutch completion", col='grey', main = "")

#delta 15N and species
boxplot(dataLM$delta_15_N~dataLM$species, varwidth = TRUE, ylab = "delta 15N", xlab = "Species", col='grey', main = "")
#delta 15N and island
boxplot(dataLM$delta_15_N~dataLM$island, varwidth = TRUE, ylab = "delta 15N", xlab = "Island", col='grey', main = "")
#delta 15N and sex
boxplot(dataLM$delta_15_N~dataLM$sex, varwidth = TRUE, ylab = "delta 15N", xlab = "Sex", col='grey', main = "")
#delta 15N and clutch completion
boxplot(dataLM$delta_15_N~dataLM$clutch_completion, varwidth = TRUE, ylab = "delta 15N", xlab = "Clutch completion", col='grey', main = "")

#delta 13C and species
boxplot(dataLM$delta_13_C~dataLM$species, varwidth = TRUE, ylab = "delta 13C", xlab = "Species", col='grey', main = "")
#delta 13C and island
boxplot(dataLM$delta_13_C~dataLM$island, varwidth = TRUE, ylab = "delta 13C", xlab = "Island", col='grey', main = "")
#delta 13C and sex
boxplot(dataLM$delta_13_C~dataLM$sex, varwidth = TRUE, ylab = "delta 13C", xlab = "Sex", col='grey', main = "")
#delta 13C and clutch completion
boxplot(dataLM$delta_13_C~dataLM$clutch_completion, varwidth = TRUE, ylab = "delta 13C", xlab = "Clutch completion", col='grey', main = "")

Figure 18: Exploration of potential colinearities between cofactors and covariable

According to the Figure 18 there is no colinearity between cofactors and covariables.

We keep all our explanatory variables to do the modelisation.

Model selection

For the statistic modelisation, we analyse first the full model, which means the model including all explanatories variables.

In order to identify the candidate model from the full model, we are going to use the backward selection model. This method is based on the significance of the model terms. From the full model, we identify the significance of each term of the model and proceed to a skimming of the non-sgnificant terms. The withdraw on the non significant term have to follow two step:
First step : remove successively the non significant interaction
Second step : remove successively the explanatory variable that are non significant and not include in interaction terms

First we build the full model:

modlm<-lm(body_mass ~ species + island + clutch_completion + culmen_length + culmen_depth + flipper_length + sex + delta_15_N + delta_13_C,data=dataLM) #full model

Using the function drop1 we analyse the significance of each variables:

drop1(modlm,test="F") #test significativté des variables avec test fischer
Single term deletions

Model:
body_mass ~ species + island + clutch_completion + culmen_length + 
    culmen_depth + flipper_length + sex + delta_15_N + delta_13_C
                  Df Sum of Sq      RSS    AIC F value    Pr(>F)    
<none>                         24182707 3659.4                      
species            2   5084613 29267320 3717.2 32.8003 1.178e-13 ***
island             2     69965 24252672 3656.3  0.4513 0.6371895    
clutch_completion  1    191844 24374551 3660.0  2.4751 0.1166736    
culmen_length      1    474327 24657033 3663.7  6.1197 0.0139006 *  
culmen_depth       1    776489 24959196 3667.7 10.0181 0.0017031 ** 
flipper_length     1   2679918 26862625 3691.5 34.5757 1.054e-08 ***
sex                1   5851484 30034190 3727.6 75.4946 < 2.2e-16 ***
delta_15_N         1   1027298 25210005 3670.9 13.2540 0.0003183 ***
delta_13_C         1    353797 24536504 3662.1  4.5646 0.0334167 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We exclude the variable island of the model, because it is the non-significant variable with the highest p-value. Then, we create a new model (without the variable island) and test for significance.

modlm2<-lm(body_mass ~ species + clutch_completion + culmen_length + 
    culmen_depth + flipper_length + sex + delta_15_N + delta_13_C, data=dataLM)
drop1(modlm2,test="F")
Single term deletions

Model:
body_mass ~ species + clutch_completion + culmen_length + culmen_depth + 
    flipper_length + sex + delta_15_N + delta_13_C
                  Df Sum of Sq      RSS    AIC F value    Pr(>F)    
<none>                         24252672 3656.3                      
species            2   6638740 30891412 3730.7 42.9760 < 2.2e-16 ***
clutch_completion  1    235068 24487740 3657.5  3.0434 0.0820427 .  
culmen_length      1    481954 24734626 3660.7  6.2399 0.0130018 *  
culmen_depth       1    784713 25037386 3664.7 10.1597 0.0015800 ** 
flipper_length     1   2630322 26882994 3687.7 34.0548 1.334e-08 ***
sex                1   5935926 30188599 3725.3 76.8526 < 2.2e-16 ***
delta_15_N         1    996020 25248693 3667.4 12.8955 0.0003821 ***
delta_13_C         1    333909 24586581 3658.8  4.3231 0.0384091 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As previously, we exclude the variable clutch_completion of the model, because it is non-significant. We build a new model and test again for variables’ significance.

modlm3<-lm(body_mass ~ species + culmen_length + 
    culmen_depth + flipper_length + sex + delta_15_N + delta_13_C, data=dataLM)
drop1(modlm3,test="F")
Single term deletions

Model:
body_mass ~ species + culmen_length + culmen_depth + flipper_length + 
    sex + delta_15_N + delta_13_C
               Df Sum of Sq      RSS    AIC F value    Pr(>F)    
<none>                      24487740 3657.5                      
species         2   6586068 31073809 3730.6 42.3602 < 2.2e-16 ***
culmen_length   1    462092 24949833 3661.5  5.9442 0.0153187 *  
culmen_depth    1    767080 25254821 3665.5  9.8674 0.0018420 ** 
flipper_length  1   2774571 27262312 3690.2 35.6909 6.246e-09 ***
sex             1   5977734 30465475 3726.2 76.8951 < 2.2e-16 ***
delta_15_N      1   1086210 25573950 3669.5 13.9725 0.0002203 ***
delta_13_C      1    335608 24823349 3659.9  4.3171 0.0385403 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the variable in the model are significant, so we keep all of them in our final model. The best model is thus: \(body\:mass\) ~ \(species + culmen_{length} + culmen_{depth} + flipper_{length} + sex + delta\:15\:N + delta\:13\:C\)

Using the function summary we can acces the coefficient for all variable.

summary(modlm3)

Call:
lm(formula = body_mass ~ species + culmen_length + culmen_depth + 
    flipper_length + sex + delta_15_N + delta_13_C, data = dataLM)

Residuals:
    Min      1Q  Median      3Q     Max 
-681.10 -184.79   -3.52  186.14  822.43 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1467.105   1051.184   1.396  0.16380    
speciesChinstrap -245.090     89.050  -2.752  0.00626 ** 
speciesGentoo     882.101    131.653   6.700 9.59e-11 ***
culmen_length      17.063      6.998   2.438  0.01532 *  
culmen_depth       61.555     19.596   3.141  0.00184 ** 
flipper_length     17.757      2.972   5.974 6.25e-09 ***
sexMALE           417.889     47.655   8.769  < 2e-16 ***
delta_15_N       -164.321     43.960  -3.738  0.00022 ***
delta_13_C         65.171     31.366   2.078  0.03854 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 278.8 on 315 degrees of freedom
Multiple R-squared:  0.8842,    Adjusted R-squared:  0.8813 
F-statistic: 300.8 on 8 and 315 DF,  p-value: < 2.2e-16

The model can thus be written as:

\[ Body\:Mass = 1467.1\:-\:164.3^{***}. Delta\:15N\:+\:65.2^{*}. Delta\:13C\] \[+\:(Species_{Adeliae} = 0 \:;\:Species_{Pygoscelis antarctica} = -245.1^{**}\:;\:Species_{Pygoscelis papua} = 882.1^{***})\] \[ +\:17.1^{*NS*}. Culmen\:Length\:+\:61.6^{**}. Culmen\:Depth\:+\:17.8^{***}. Flipper\:Length \] \[ + (Sex_{Female}= 0\:;\: Sex_{Male}=417.9^{***})\]

From the summary of the model we can also access the variance proportion that is explained by the model. It correspond to the variable of Adjusted R-squared which is here 0.8813. The model is thus explaining 88% of the body mass variance.

Model validation

The hypotheses for ANCOVA are independence, residuals normality and variance homogeneity. We also need to check for influent statistics unities, which means an observation with a high contribution to model construction.

Here, we assume an independence of variables as we don’t have the details of the data collection.

We check the normality of residuals.

Show the code
par(mfrow=c(1,2))
# Histogram
hist(modlm3$residuals,col='dodgerblue3',xlab="residuals",main="Check Normality")
# Quantile-Quantile plot
qqnorm(modlm3$residuals,pch=16,col='dodgerblue3',xlab='')
qqline(modlm3$residuals,col='red',lwd=2)

Figure 19: Verification of residuals normality

We can see on the first plot of Figure 19 that the histogram seems to have a normal distribution. This is confirmed by the second plot of the figure where we can see that the residuals are align with the red line, which means that the residuals have a normal distribution.

We check the homoscedasticity of residuals.

Show the code
# residuals vs fitted
plot(residuals(modlm3)~fitted(modlm3)
      , col='dodgerblue3'
      , pch=16)
abline(h = 0)

Figure 20: verification of residuals homoscedasticity

We can see on the Figure 20 that there is no tendency on the point repartition. We can thus conclude that the variance is homogeneous.

We also verify that there is not individual that is particularly influential.

Show the code
par(mfrow = c(1, 1))
plot(cooks.distance(modlm3), type = "h", ylim = c(0, 1))
abline(h = 1, col = 2,lwd = 3)

Figure 21: Verification of influential individual

As all the bar are lower that 1 on the Figure 21, we can conclude that there is no influential individuals.

Finally, we can conclude that the linear model is valid.

References

Gorman, Tony D. AND Fraser, Kristen B. AND Williams. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” PLOS ONE 9 (3): 1–14. https://doi.org/10.1371/journal.pone.0090081.
Horst, Allison, Alison Hill, and Kristen B. Gorman. 2022. “Palmer Archipelago Penguins Data in the Palmerpenguins r Package - an Alternative to Anderson’s Irises.” The R Journal 14 (June): 244–54. https://doi.org/10.32614/RJ-2022-020.