Analysis of the Factors Favorable to Nest Success of Different Bird Species in Two USA States, Wisconsin and Illinois

Author

Rudy, Tanguy, Manon, Garan, Anaëlle, Oceane, Victor

Do not forget to set your working directory in the right place Menu : Session/Set Working Directory/Choose directory And indicate Git/2024_MODE_OCR

Data presentation

The dataset (available following this link https://zenodo.org/records/8251495 ) used for this work contains nest records from different studies conducted in North America and more specifically in Wisconsin and Illinois. It was conducted in 13 different counties : Dane, Grant, Green, Iowa, Lafayette, Monroe, Rock, Lee, Ogle, Will, Grundy, Carroll, and Jo Daviess counties. The dataset contains the records of 3257 nests and for each of them different observations.

These observations are the following:

State: Illinois or Wisconsin - qualitative explanatory variable
County: Dane (Wisconsin), Grant (Wisconsin), Green (Wisconsin), Iowa (Wisconsin), Lafayette (Wisconsin), Monroe (Wisconsin), Rock (Wisconsin), Lee (Illinois), Ogle (Illinois), Will (Illinois), Grundy (Illinois), Carroll (Illinois), and Jo Daviess (Illinois) - qualitative explanatory variable
Habitat: Cool-season Grassland, Pasture, Prairie, Warm-season Grassland - qualitative explanatory variable
Species: Species of birds - Bobolink, Dickcissel, Eastern Meadowlark, Grasshopper Sparrow, Henslow’s Sparrow, Savannah Sparrow, Vesper Sparrow, and Western Meadowlark - qualitative explanatory variable
Years: from 1991 to 2011 - qualitative explanatory variable
Nest Fate: the fate of the chicks -> Successly fledged at least one young (1) or Failed (0)- quantitative explanatory variable
Initiation Date: Date of the first laided egg - quantitative explanatory variable
Number fledged: Number of chicks that survived and fledged the nest - quantitative explanatory variable
Nest End Date: the date when the fate of it was recorded - quantitative explanatory variable

All theses observations are presented in a table with the following collums :

Nest_ID = unique nest identifier
State = State in which nest was located (Illinois or Wisconsin)
County = County in the state in which nest was located
Habitat = Grassy habitat in which nest was located (pasture, cool-season grassland, warm-season grassland, or prairie)
Species = English common name
Year = Year nest was found
Nest_Fate = Successly fledged at least one young (1) or Failed (0).
Initiation_Date = Coded date when the first egg was laid. April 1 = 1; May 1 = 31; June 1 = 62; July 1 = 92; August 1 = 123; September 1 = 154
Number_Fledged = For a successful nest, number of young in nest on nest end date; For a failed nest, Number_Fledged is 0
Nest_End_Date = Coded date when the nest failed or was successful. April 1 = 1; May 1 = 31; June 1 = 62; July 1 = 92; August 1 = 123; September 1 = 154

What kind of questions can we rise ?

Import dataset

Charge the packages Download the spDataLarge package, unavailable on CRAN

if (!require(spDataLarge)) {
  stop("The package 'spDataLarge' must be installed from the following depot : https://nowosad.github.io/drat/ \n
        Use the following command in order to install : \n
        install.packages('spDataLarge', repos = 'https://nowosad.github.io/drat/', type = 'source')")

}

Loading required package: spDataLarge

The rest of the packages

First look at the data :

You can visualize the first rows of the dataset by running the following cell :

# View the firsts lines of the dataset
head(data)

  Nest_ID    State County Habitat  Species Year Nest_Fate Initiation_Date
1 745JAAP Illinois   Will Pasture Bobolink 1995         0              58
2 748JAAP Illinois   Will Pasture Bobolink 1995         0              52
3 753JAAP Illinois   Will Pasture Bobolink 1995         1              51
4 754JAAP Illinois   Will Pasture Bobolink 1995         0              60
5 755JAAP Illinois   Will Pasture Bobolink 1995         0              53
6 758JAAP Illinois   Will Pasture Bobolink 1995         0              65
  Number_Fledged Nest_End_Date
1              0            78
2              0            74
3              6            75
4              0            75
5              0            69
6              0            78

In order to picture the data in a more visual way, we will represent our dataset on a map. The first step in our map creation is the separation of the dataset by the counties.

data_list_county <- split(data, data$County)

data_will <- data_list_county[["Will"]]
data_jo_daviess <- data_list_county[["Jo_Daviess"]]
data_lafayette <- data_list_county[["Lafayette"]]
data_iowa <- data_list_county[["Iowa"]]
data_dane <- data_list_county[["Dane"]]
data_green <- data_list_county[["Green"]]
data_grant <- data_list_county[["Grant"]]
data_grundy <- data_list_county[["Grundy"]]
data_lee_ogle <- data_list_county[["Lee/Ogle"]]
data_lee <- data_list_county[["Lee"]]
data_rock <- data_list_county[["Rock"]]
data_monroe <- data_list_county[["Monroe"]]
data_carroll <- data_list_county[["Carroll"]]
data_carroll_whiteside <- data_list_county[["Carroll/Whiteside"]]

We are now adding columns for the coordinates of each county, the coordinates were obtained by searching for the counties on openstreetmap.

data_will <- data_will %>%
  mutate(longitude_column = -88.0817,
         latitude_column = 41.5250)

data_jo_daviess <- data_jo_daviess %>%
  mutate(longitude_column = -90.4290,
         latitude_column = 42.4167)

data_grundy <- data_grundy %>%
  mutate(longitude_column = -88.4210,
         latitude_column = 41.3570)

data_lee <- data_lee %>%
  mutate(longitude_column = -89.4850,
         latitude_column = 41.8430)

data_carroll <- data_carroll %>%
  mutate(longitude_column = -89.9826,
         latitude_column = 42.0956)

data_lafayette <- data_lafayette %>%
  mutate(longitude_column = -90.1160,
         latitude_column = 42.6825)

data_iowa <- data_iowa %>%
  mutate(longitude_column = -90.1160,
         latitude_column = 42.6825)

data_dane <- data_dane %>%
  mutate(longitude_column = -89.4012,
         latitude_column = 43.0731)

data_green <- data_green %>%
  mutate(longitude_column = -89.6387,
         latitude_column = 42.6017)

data_grant <- data_grant %>%
  mutate(longitude_column = -90.7082,
         latitude_column = 42.8475)

data_rock <- data_rock %>%
  mutate(longitude_column = -89.0187,
         latitude_column = 42.6828)

data_monroe <- data_monroe %>%
  mutate(longitude_column = -90.8129,
         latitude_column = 43.9447)

data_lee_ogle <- data_lee_ogle %>%
  mutate(longitude_column = -89.4038,
         latitude_column = 41.9293)

data_carroll_whiteside <- data_carroll_whiteside %>%
  mutate(longitude_column = -89.9742,
         latitude_column = 41.9527)

Now, for each county dataset, we subdivide the said dataset for each year when at least one nest has been observed in this county.

county_names <- c("data_will", "data_jo_daviess", "data_lafayette", "data_iowa", 
                  "data_dane", "data_green", "data_grant", "data_grundy", 
                  "data_lee_ogle", "data_lee", "data_rock", "data_monroe", 
                  "data_carroll", "data_carroll_whiteside")

county_dataframes <- mget(county_names)
tables <- c()

for(i in seq_along(county_dataframes)) {
    county <- county_dataframes[[i]]
    county_name <- county_names[i]
    
    if (!is.data.frame(county)) {
        print(paste0("Error: ", county_name, " is not a dataframe"))
        next
    }
    
    if (!"Year" %in% colnames(county)) {
        print(paste0("Error: ", county_name, " does not contain a 'Year' column"))
        next
    }
    
    data_county_year <- split(county, county$Year)

    for(year in min(county$Year):max(county$Year)){
        tableau <- data_county_year[[as.character(year)]]
        
        if (!is.null(tableau)) {
            variable_name <- paste0(county_name, "_", year)
            tables <- c(tables, variable_name)
            assign(variable_name, tableau, envir = .GlobalEnv)
        }
    }
}

We went from a full dataset with everything mixed-up to multiple datasets for each year and each county when an observation is available.

Here are, for example, the first rows of the dataframe related to the county of Will in 1995

head(data_will_1995)

  Nest_ID    State County Habitat  Species Year Nest_Fate Initiation_Date
1 745JAAP Illinois   Will Pasture Bobolink 1995         0              58
2 748JAAP Illinois   Will Pasture Bobolink 1995         0              52
3 753JAAP Illinois   Will Pasture Bobolink 1995         1              51
4 754JAAP Illinois   Will Pasture Bobolink 1995         0              60
5 755JAAP Illinois   Will Pasture Bobolink 1995         0              53
6 758JAAP Illinois   Will Pasture Bobolink 1995         0              65
  Number_Fledged Nest_End_Date longitude_column latitude_column
1              0            78         -88.0817          41.525
2              0            74         -88.0817          41.525
3              6            75         -88.0817          41.525
4              0            75         -88.0817          41.525
5              0            69         -88.0817          41.525
6              0            78         -88.0817          41.525

The main information that we want to present on the map are the mean number of fledged birds by nest and the amount of data available for each county and each year.

Here is the creation of a summary dataframe that contains all these informations for each dataframe.

annee <- c()
taux_de_reussite <- c()
nb_donnees <- c()
lat <- c()
lon<- c()

# Creation of a summary for each year
for(table_name in tables){
  tablecreated <- get(table_name)
  annee <- c(annee, tablecreated$Year[1])
  taux_de_reussite <- c(taux_de_reussite, sum(tablecreated$Number_Fledged)/nrow(tablecreated))
  nb_donnees <- c(nb_donnees, nrow(tablecreated))
  lat <- c(lat, tablecreated$latitude_column[1])
  lon <- c(lon, tablecreated$longitude_column[1])
}

# Creating a dataframe of thoose summaries
tableau <- data.frame(
  annee = annee, 
  Mean_number_fledged_by_nest = taux_de_reussite, 
  nb_donnees = nb_donnees, 
  lat = lat, 
  lon = lon
)

# Seting the dataframe as a spatial class in order to create a map
tableau_sf <- st_as_sf(tableau, coords = c("lon", "lat"), crs = 4326)
tableau_sf <- st_transform(tableau_sf, crs = st_crs(us_states))

As we now have all the data of interest in one single organized table, we can actually create the map.

tmap_mode("view")

tmap mode set to interactive viewing

bird_anim <- tm_shape(us_states) + 
  tm_polygons() + 
  tm_shape(tableau_sf) + 
  tm_symbols(size = "Mean_number_fledged_by_nest", col = "nb_donnees", 
             style = "quantile", title.size = "Mean number fledged by nest", title.col = "Number of nest observed") + 
  tm_layout(title = "Bird Observations Over Time", 
            legend.position = c("left", "bottom"),
            frame = FALSE, 
            legend.show = T) +
  tm_facets(by = "annee", nrow = 1, ncol = 1, free.coords = FALSE)

And now, we just have to run the following cell to display the map.

tmap_animation(bird_anim)

Creating frames

========

===

====

====

====

====

===

====

====

====

====

====

===

====

====

====

====

===

====

====


Creating animation

Please, check out the newly created map on the “Viewer” panel of your Rstudio application.

I believe that the first thing we can say about the data, is that it is unevenly distributed. Depending on the year and the county, the number of observed nests changes ; and sometimes not even a single observation is available.

1. Data exploration

Repartition of the data through time

Statistical tests

In this part, several statistical tests are performed in order to answer questions.

First question : Does the number of fledged birds depend on the species ?

H0 : there is no differences between species

First, we observe the data with a boxplot. There is no significant differences through the boxplots.

ggplot(data = data, 
       aes(x = Habitat, 
           y = Nest_End_Date)
       ) +
  geom_boxplot(aes(fill = Habitat)) +
  ggtitle("Nest_End_Date depending on the Habitat")

  theme(
    axis.text.x = element_blank(),
    plot.title = element_text(face = "bold")
    )

List of 2
 $ axis.text.x: list()
  ..- attr(*, "class")= chr [1:2] "element_blank" "element"
 $ plot.title :List of 11
  ..$ family       : NULL
  ..$ face         : chr "bold"
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE

Then, we test the normality of the data. All the p-values are ranged in a dataframe.

data |>
  # Selection of only two columns of interest
  dplyr::select(  
    Species, 
    Number_Fledged
  ) |>
  # Calculation of the p-value of the shapiro test for each species
  dplyr::group_by(Species) |>
  dplyr::summarise(
    shap = shapiro.test(Number_Fledged)$p
  )

# A tibble: 8 × 2
  Species                 shap
  <fct>                  <dbl>
1 Bobolink            2.64e-23
2 Dickcissel          5.49e-29
3 Eastern_Meadowlark  5.26e-36
4 Grasshopper_Sparrow 1.80e-32
5 Henslow's_Sparrow   5.38e-11
6 Savannah_Sparrow    1.65e-35
7 Vesper_Sparrow      2.70e-18
8 Western_Meadowlark  4.43e-10

Conclusion : each subset does not follow the normality. We can perform a U test (non parametric test) :

# Construction of an empty 8x8 dataframe. It will contains each p-value of the U test
cor_df <- as.data.frame(
  matrix(nrow = 8, ncol = 8, NA)
  )
  
# Retrive the species names
species <- as.character(unique(data$Species))

# Rename the matrix
colnames(cor_df) <- species
rownames(cor_df) <- species

data_species <- data |> 
  dplyr::select(
    Species, 
    Number_Fledged
    ) |>
  dplyr::arrange(Species)


for (sp1 in (1:(length(species)))){
  for (sp2 in (1:length(species))){

    ## Retrieve the data of the species 1
    species_1 <- data_species |>
      dplyr::filter(Species == species[sp1]) |>
      dplyr::select(Number_Fledged)

    ## Retrieve the data of the species 2
    species_2 <- data_species |>
      dplyr::filter(Species == species[sp2]) |>
      dplyr::select(Number_Fledged)
    
    ## Formating of the vector containing the number of fledged birds. 
    species_1 <- as.vector(species_1)$Number_Fledged
    species_2 <- as.vector(species_2)$Number_Fledged
    
    ## To put the p.values on the correlation matrix
    cor_df[sp1,sp2] <- wilcox.test(species_1, species_2)$p.value
  }
}

Now, we can observe the p-values in the following matrix :

ggcorrplot(
  corr = cor_df,
  lab = TRUE,
  lab_size = 3,
  show.diag = TRUE,
  type = "upper"
  ) +
  theme(
    axis.text.x = element_text(size = 8,
                               angle = 45),
    axis.text.y = element_text(size = 8)
    ) +
  scale_fill_gradient2(
    low = "tomato2", 
    high = "springgreen3", 
    midpoint = 0.2) +
  ggtitle("P-values of the U-tests with the number of fledged birds as variable")

Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Observation : we see that there is globally no differences in the mean of the groups. So we keep H0.

Second question : we want to summarise the information depending on different factor.

In the following chunks, a list named data_summariser is created and it will contains several dataframes. These dataframes summarise the different quantitative variables (i.e. Nest_Fate, Initiation_Date, Number_Fledged and Nest_End_Date) by the means and for each factor (i.e. County, State, Species and Habitat). There is one dataframe per factor. The goal is to better visualise the repartition of the species as a function of a factor AND to see the means.

Here, we summarise with the factor : Species.

## Calculation of the mean of each variables depending on the Species
data_sp <- data |>
  dplyr::select(
    Species,Nest_Fate,Initiation_Date, Number_Fledged, Nest_End_Date
  )|>
  dplyr::group_by(
    Species
    ) |>
  dplyr::summarise(
    count = n(),
    prop_Nest_Fate = sum(Nest_Fate) / length(Nest_Fate),
    mean_Init_Date = mean(Initiation_Date),
    mean_Number_Fledged = mean(Number_Fledged),
    mean_Nest_End_Date = mean(Nest_End_Date),
  )

## Calculation of the mean of each variables depending on the habitat
data_hab <- data |>
  dplyr::select(
    Habitat,Nest_Fate,Initiation_Date, Number_Fledged, Nest_End_Date
  )|>
  dplyr::group_by(
    Habitat
    ) |>
  dplyr::summarise(
    count = n(),
    prop_Nest_Fate = sum(Nest_Fate) / length(Nest_Fate),
    mean_Init_Date = mean(Initiation_Date),
    mean_Number_Fledged = mean(Number_Fledged),
    mean_Nest_End_Date = mean(Nest_End_Date),
  )

## Calculation of the mean of each variables depending on the state
data_state <- data |>
  dplyr::select(
    State,Nest_Fate,Initiation_Date, Number_Fledged, Nest_End_Date
  )|>
  dplyr::group_by(
    State
    ) |>
  dplyr::summarise(
    count = n(),
    prop_Nest_Fate = sum(Nest_Fate) / length(Nest_Fate),
    mean_Init_Date = mean(Initiation_Date),
    mean_Number_Fledged = mean(Number_Fledged),
    mean_Nest_End_Date = mean(Nest_End_Date),
  )

## Calculation of the mean of each variables depending on the county
data_county <- data |>
  dplyr::select(
    State, County, Nest_Fate,Initiation_Date, Number_Fledged, Nest_End_Date
  )|>
  dplyr::group_by(
    State, County
    ) |>
  dplyr::summarise(
    count = n(),
    prop_Nest_Fate = sum(Nest_Fate) / length(Nest_Fate),
    mean_Init_Date = mean(Initiation_Date),
    mean_Number_Fledged = mean(Number_Fledged),
    mean_Nest_End_Date = mean(Nest_End_Date),
  )

`summarise()` has grouped output by 'State'. You can override using the
`.groups` argument.

data_sp

# A tibble: 8 × 6
  Species             count prop_Nest_Fate mean_Init_Date mean_Number_Fledged
  <fct>               <int>          <dbl>          <dbl>               <dbl>
1 Bobolink              390          0.472           59.5               1.72 
2 Dickcissel            470          0.347           85.9               1.05 
3 Eastern_Meadowlark    726          0.321           61.4               1.01 
4 Grasshopper_Sparrow   693          0.404           73.6               1.39 
5 Henslow's_Sparrow      72          0.389           63.2               1.18 
6 Savannah_Sparrow      664          0.295           66.5               0.887
7 Vesper_Sparrow        174          0.345           69.4               0.908
8 Western_Meadowlark     68          0.426           68.9               1.49 
# ℹ 1 more variable: mean_Nest_End_Date <dbl>

We create a list called data_summariser that will contains each previously created tables.

data_summariser <-list()  # Creation of the due list data_summariser
data_summariser$Species <- data_sp # Add the Species table
data_summariser$Habitat <- data_hab 
data_summariser$State <- data_state
data_summariser$County <- data_county

data_summariser

$Species
# A tibble: 8 × 6
  Species             count prop_Nest_Fate mean_Init_Date mean_Number_Fledged
  <fct>               <int>          <dbl>          <dbl>               <dbl>
1 Bobolink              390          0.472           59.5               1.72 
2 Dickcissel            470          0.347           85.9               1.05 
3 Eastern_Meadowlark    726          0.321           61.4               1.01 
4 Grasshopper_Sparrow   693          0.404           73.6               1.39 
5 Henslow's_Sparrow      72          0.389           63.2               1.18 
6 Savannah_Sparrow      664          0.295           66.5               0.887
7 Vesper_Sparrow        174          0.345           69.4               0.908
8 Western_Meadowlark     68          0.426           68.9               1.49 
# ℹ 1 more variable: mean_Nest_End_Date <dbl>

$Habitat
# A tibble: 4 × 6
  Habitat               count prop_Nest_Fate mean_Init_Date mean_Number_Fledged
  <fct>                 <int>          <dbl>          <dbl>               <dbl>
1 Cool-season_Grassland   677          0.403           65.3                1.39
2 Pasture                1683          0.319           69.2                1.01
3 Prairie                 496          0.363           70.4                1.12
4 Warm-season_Grassland   401          0.456           72.6                1.48
# ℹ 1 more variable: mean_Nest_End_Date <dbl>

$State
# A tibble: 2 × 6
  State     count prop_Nest_Fate mean_Init_Date mean_Number_Fledged
  <fct>     <int>          <dbl>          <dbl>               <dbl>
1 Illinois   1073          0.314           74.6                1.02
2 Wisconsin  2184          0.383           66.2                1.23
# ℹ 1 more variable: mean_Nest_End_Date <dbl>

$County
# A tibble: 14 × 7
# Groups:   State [2]
   State     County      count prop_Nest_Fate mean_Init_Date mean_Number_Fledged
   <fct>     <fct>       <int>          <dbl>          <dbl>               <dbl>
 1 Illinois  Carroll        15          0.333           74.7               0.733
 2 Illinois  Carroll/Wh…     5          0.2             73.4               0.8  
 3 Illinois  Grundy         53          0.208           72.3               0.642
 4 Illinois  Jo_Daviess    292          0.435           75.8               1.43 
 5 Illinois  Lee            20          0.3             84.3               1.1  
 6 Illinois  Lee/Ogle      297          0.286           78.9               0.892
 7 Illinois  Will          391          0.261           70.3               0.877
 8 Wisconsin Dane          441          0.508           64.3               1.81 
 9 Wisconsin Grant         101          0.198           65.4               0.554
10 Wisconsin Green         181          0.343           64.9               1.10 
11 Wisconsin Iowa          744          0.349           63.9               1.15 
12 Wisconsin Lafayette     481          0.343           69.7               0.983
13 Wisconsin Monroe        233          0.442           70.8               1.33 
14 Wisconsin Rock            3          0.667           94.7               0.667
# ℹ 1 more variable: mean_Nest_End_Date <dbl>

Convert dates to months to make a MCA:

### Creation of a function to replace the number by a month 

num_to_month <- function(num){
  if (num %in% (1:30)){
    res <- "april"
  } else if (num %in% (31:61)){
    res <- "may"
  } else if (num %in% (62:91)){
    res <- "june"
  } else if (num %in% (92:122)){
    res <- "july"
  } else if (num %in% (123:153)){
    res <- "august"
  } else {
    res <- "september"
  }
  return (res)
}

Now, we trasform all the quantitative variable into factors.

## Conversion into months 
data_ACM <- data
data_ACM$Initiation_Date <- lapply(X = data_ACM$Initiation_Date, 
                                FUN = num_to_month
                                )
data_ACM$Nest_End_Date <- lapply(X = data_ACM$Nest_End_Date, 
                              FUN = num_to_month
                              )
## Conversion into factors
data_ACM <- data_ACM |>
  dplyr::mutate(Nest_Fate = as.factor(Nest_Fate),
                Initiation_Date = as.factor(unlist(Initiation_Date)),
                Number_Fledged = as.factor(Number_Fledged),
                Nest_End_Date = as.factor(unlist(Nest_End_Date)),
                Year = as.factor(Year)
                )

MCA analysis (data exploring)

Before doing the MCA analysis, we must verify that all the modalities represent at least 3% of the data

# Proportions de chaques modalités dans chaques variables:
prop.table(table(data_ACM$State))*100


 Illinois Wisconsin 
 32.94443  67.05557

prop.table(table(data_ACM$County))*100


          Carroll Carroll/Whiteside              Dane             Grant 
        0.4605465         0.1535155        13.5400675         3.1010132 
            Green            Grundy              Iowa        Jo_Daviess 
        5.5572613         1.6272644        22.8431072         8.9653055 
        Lafayette               Lee          Lee/Ogle            Monroe 
       14.7681916         0.6140620         9.1188210         7.1538225 
             Rock              Will 
        0.0921093        12.0049125

prop.table(table(data_ACM$Habitat))*100


Cool-season_Grassland               Pasture               Prairie 
             20.78600              51.67332              15.22874 
Warm-season_Grassland 
             12.31194

prop.table(table(data_ACM$Species))*100


           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
          11.974209           14.430457           22.290451           21.277249 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
           2.210623           20.386859            5.342340            2.087811

prop.table(table(data_ACM$Year))*100


      1991       1992       1993       1994       1995       1996       1997 
 2.0878109  0.8289837  0.7368744  0.7982806  6.3248388  8.0135094 10.1320233 
      1998       1999       2000       2001       2002       2003       2004 
 9.8556954  8.3512435 13.7549893  8.7503838  5.5572613  1.7193737  3.1317163 
      2005       2006       2007       2008       2009       2010       2011 
 1.6886706  2.7939822  4.2677310  5.5265582  2.0571078  2.2413264  1.3816395

prop.table(table(data_ACM$Nest_Fate))*100


       0        1 
63.98526 36.01474

prop.table(table(data_ACM$Number_Fledged))*100


         0          1          2          3          4          5          6 
63.9852625  3.8378876  6.1099171  9.7635861 10.8688978  5.0046055  0.4298434

prop.table(table(data_ACM$Initiation_Date))*100


     april     august       july       june        may 
 1.4737488  0.0307031 14.2462389 47.8354314 36.4138778

prop.table(table(data_ACM$Nest_End_Date))*100


     april     august       july       june        may 
 0.0307031  2.8860915 39.5455941 47.0371508 10.5004605

Some of the modalities are under-represented. We must delete them. The modalities under-represented are: Countries Carroll, Carroll/Whiteside, Grundy, Lee and Rock, Species Henslow’s Sparrow and Western Meadowlark, half of the years modalities, 6 fledged chicks, and months April and August for both nest initiation date and nest end date.

prop.table(table(data_ACM$County))*100 < 3


          Carroll Carroll/Whiteside              Dane             Grant 
             TRUE              TRUE             FALSE             FALSE 
            Green            Grundy              Iowa        Jo_Daviess 
            FALSE              TRUE             FALSE             FALSE 
        Lafayette               Lee          Lee/Ogle            Monroe 
            FALSE              TRUE             FALSE             FALSE 
             Rock              Will 
             TRUE             FALSE

# We must delete the data of counties Carroll, Carroll/Whiteside, Grundy, Lee and Rock
data_ACM <- data_ACM[-which(data_ACM$County == "Carroll" | 
                              data_ACM$County == "Carroll/Whiteside" | 
                              data_ACM$County == "Grundy" | 
                              data_ACM$County == "Lee" | 
                              data_ACM$County == "Rock"),]
data_ACM$County <- droplevels(data_ACM$County)

prop.table(table(data_ACM$Species))*100 < 3


           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
              FALSE               FALSE               FALSE               FALSE 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
               TRUE               FALSE               FALSE                TRUE

# We must delete the data of species Henslow's_Sparrow and Western_Meadowlark
data_ACM <- data_ACM[-which(data_ACM$Species == "Henslow's_Sparrow" | 
                              data_ACM$Species == "Western_Meadowlark"),]
data_ACM$Species <- droplevels(data_ACM$Species)

prop.table(table(data_ACM$Year))*100 < 3


 1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003 
 TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
 2004  2005  2006  2007  2008  2009  2010  2011 
FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

summary(prop.table(table(data_ACM$Year))*100 < 3)

   Mode   FALSE    TRUE 
logical      11      10

# As half of the modalities are under-represented, we will group the years by 3.

data_ACM <- data_ACM %>%
  mutate(Year = factor(case_when(
    Year %in% c("1991", "1992", "1993") ~ "1991-1993",
    Year %in% c("1994", "1995", "1996") ~ "1994-1996",
    Year %in% c("1997", "1998", "1999") ~ "1997-1999",
    Year %in% c("2000", "2001", "2002") ~ "2000-2002",
    Year %in% c("2003", "2004", "2005") ~ "2003-2005",
    Year %in% c("2006", "2007", "2008") ~ "2006-2008",
    Year %in% c("2009", "2010", "2011") ~ "2009-2011"
  ), levels = c("1991-1993", "1994-1996", "1997-1999", "2000-2002", "2003-2005", "2006-2008", "2009-2011")))

prop.table(table(data_ACM$Year))*100 < 3


1991-1993 1994-1996 1997-1999 2000-2002 2003-2005 2006-2008 2009-2011 
    FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE

# None of the years are still under-represented.

prop.table(table(data_ACM$Number_Fledged))*100 < 3


    0     1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

# We will group the nest with 5 or 6 fledged chicks with a modality labelled ">5"
levels(data_ACM$Number_Fledged) <- c(levels(data_ACM$Number_Fledged), ">5")
data_ACM[which(data_ACM$Number_Fledged == 5 | data_ACM$Number_Fledged == 6),]$Number_Fledged <- as.factor(">5")
data_ACM$Number_Fledged <- droplevels(data_ACM$Number_Fledged)

prop.table(table(data_ACM$Initiation_Date))*100 < 3


 april august   july   june    may 
  TRUE   TRUE  FALSE  FALSE  FALSE

# We must delete the data with an initiation date in August and April.
data_ACM <- data_ACM[-which(data_ACM$Initiation_Date == "august" | 
                              data_ACM$Initiation_Date == "april"),]
data_ACM$Initiation_Date <- droplevels(data_ACM$Initiation_Date)

prop.table(table(data_ACM$Nest_End_Date))*100 < 3


 april august   july   june    may 
  TRUE   TRUE  FALSE  FALSE  FALSE

# We must delete the data with an end date in August and April.
data_ACM <- data_ACM[-which(data_ACM$Nest_End_Date == "august" | 
                              data_ACM$Nest_End_Date == "april"),]
data_ACM$Nest_End_Date <- droplevels(data_ACM$Nest_End_Date)

Now, we can check again that no modalities are under-represented (threshold 3%).

# Proportions de chaques modalités dans chaques variables:
prop.table(table(data_ACM$State))*100


 Illinois Wisconsin 
 30.26634  69.73366

prop.table(table(data_ACM$County))*100


      Dane      Grant      Green       Iowa Jo_Daviess  Lafayette   Lee/Ogle 
 12.832930   3.424421   5.984089  23.555863   7.990315  15.980630   9.477689 
    Monroe       Will 
  7.955725  12.798340

prop.table(table(data_ACM$Habitat))*100


Cool-season_Grassland               Pasture               Prairie 
             19.43964              53.58008              15.25424 
Warm-season_Grassland 
             11.72605

prop.table(table(data_ACM$Species))*100


           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
          13.386371           13.870633           22.103079           22.206849 
   Savannah_Sparrow      Vesper_Sparrow 
          22.552750            5.880318

prop.table(table(data_ACM$Year))*100


1991-1993 1994-1996 1997-1999 2000-2002 2003-2005 2006-2008 2009-2011 
 3.562781 14.285714 27.429955 30.854376  6.675891 11.587686  5.603597

prop.table(table(data_ACM$Nest_Fate))*100


       0        1 
64.99481 35.00519

prop.table(table(data_ACM$Number_Fledged))*100


        0         1         2         3         4        >5 
64.994811  3.493601  6.226219  9.408509 10.446212  5.430647

prop.table(table(data_ACM$Initiation_Date))*100


    july     june      may 
11.58769 50.22484 38.18748

prop.table(table(data_ACM$Nest_End_Date))*100


     july      june       may 
41.093047 49.394673  9.512279

All the modalities are enough represented. We can now start the MCA analysis.

# MCA without the Nest IDs.
ACM_bird <- MCA(data_ACM[,-1])

fviz_eig(ACM_bird, main="histogram of % inertia", addlabels = T)

The inertia holded by the first axis is very low (9.3%)

Let’s see wich modalities are significantly contributing to the analysis, for the axis 1 and 2

fviz_contrib(ACM_bird, choice='var', axes=1)

fviz_contrib(ACM_bird, choice='var', axes=2)

Let’s print only the modalities above the thresold of 1/n, n is the number of modalities, and the modalities above this thresold are significantly contributing to the MCA.

# We set the thresold for significant modalities
threshold <- (1/nrow(ACM_bird$var$contrib))*100

# We extract significant contributions for dim 1
var_contrib <- get_mca_var(ACM_bird)$contrib
significant_vars_dim1 <- rownames(var_contrib)[var_contrib[, 1] > threshold]
length(significant_vars_dim1)

[1] 12

# We extract significant contributions for dim 2
significant_vars_dim2 <- rownames(var_contrib)[var_contrib[, 2] > threshold]
length(significant_vars_dim2)

[1] 15

significant_vars_tot <- unique(c(significant_vars_dim1, significant_vars_dim2))
length(significant_vars_tot)

[1] 24

# There is 24 significant modalities on dimension 1 and 2. Let's print them on the graph

fviz_mca_var(ACM_bird, col.var='blue', repel = T, alpha.var = "contrib")

fviz_mca_var(ACM_bird, col.var='blue', repel = T, select.var = list(contrib = 24))

# Representation of the individuals between axis 1 and 2, coloring for each variable
plotellipses(ACM_bird, means = F)

We can see that the nests are clearly separated by the state they come from. We can see a light shift on axis 2 with the nest fate values. The other variables are forming less distinct groups.

On axis 1, the most contributing variable is the state Illinois. It is associated with the specie Dickcissel, with the county Will and Lee/Ogle, years 1994-1996, and the warm season on Grassland. On the other side of the axis, the specie Savannah Sparrow is higly associated with years 2000-2002, and also with the county Lafayette and Iowa and the state Wisconsin.

On axis 2, the most contributing variable is the county Dane, and associated with years 2006-2008 and cool season Grassland. It is also associated with the specie Bobolink, the nest fate equal 1, and the number fledged superior or equal to 5. To say otherwise, the nesting birds have a better success in the county Dane, especially between 2006 and 2008 and during the cool season in grasslands. The specie Bobolink is also highly successive in nesting a lot of chicks. On the other side of the axis, the variable significantly contributing are years 1997-1999, state Illinois, nest end in July, habitat pasture and county Jo Daviess, with nest fate 0 and number fledged 0. Therefore, it is most likely that these conditions are decreasing the probabilities of success in nesting chicks.

II. Analysing the relationship between nest fate and county, habitat, species, year, initiation date using a generalized linear model (GLM) with a binomial distribution

A. Data Exploration

In this part we choose to explain the nest fate with different variables. We chose to focus on the state of Illinois because adding the state might lead to bad modelling as there might be not enough data for each of them.

Variables are :

Nest_Fate : Response variable, binary
County : qualitative explanatory variable
Habitat : qualitative explanatory variable
Species : qualitative explanatory variable
Year : qualitative explanatory variable
Initiation_Date : Quantitative explanatory variable

Aim of the study : We want to model the fate of the nest as a response of the County, the Habitat, the Species, the Year, the Initiation date. We are looking for a model that could explain the distribution of Nest fate.

Method : Here our response variable is either a sucess or a failure. This response follows a binomial distribution with the probability density function \[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

The term \[ \binom{n}{k} \] is the binomial coefficient, which represents the number of ways to choose ( k ) successes from ( n ) trials.

\[ p \] is the probability of success on a single trial.

\[1 - p\] is the probability of failure on a single trial.

Every statistical analysis starts with the exploration of the dataset.

Outlier in \(Y\)

Here we are checking the values of Y (the values of Nest Fate)

Show the code

# Number of 0 and 1 in Y
table(data_Illinois$Nest_Fate)


  0   1 
736 337

According to the results table, it appears that overall, there are roughly twice as many failures as successes.

For quantitative Xs : outlier and distribution of \(X\) values

Here we have one quantitative explanatory variable \(X\). We check its distribution and if there is any outlier value.

Show the code

par(mfrow=c(1,3))

# Length
# Cleveland plot
dotchart(data_Illinois$Initiation_Date,pch=16,col='dodgerblue3',xlab='Initiation Date')
# Histogram
hist(data_Illinois$Initiation_Date,col='dodgerblue3',xlab="Initiation Date",main="")
# Quantile-Quantile plot
qqnorm(data_Illinois$Initiation_Date,pch=16,col='dodgerblue3',xlab='')
qqline(data_Illinois$Initiation_Date,col='red',lwd=2)

We do not observe any outliers and it seems that the initiation date of the nesting period follows a normal distribution.

For each qualitative explanatory variable \(X\) : Modalities and number of individuals for each of them

We need to analyse the modalities and the number of individuals for each modality.

#| echo: true
#| code-fold: true
#| code-summary: "Show the code"

# Factor County
summary(data_Illinois$County)

          Carroll Carroll/Whiteside              Dane             Grant 
               15                 5                 0                 0 
            Green            Grundy              Iowa        Jo_Daviess 
                0                53                 0               292 
        Lafayette               Lee          Lee/Ogle            Monroe 
                0                20               297                 0 
             Rock              Will 
                0               391

# Factor Habitat
summary(data_Illinois$Habitat)

Cool-season_Grassland               Pasture               Prairie 
                  242                   537                   137 
Warm-season_Grassland 
                  157

# Factor Species
summary(data_Illinois$Species)

           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
                 62                 314                 229                 350 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
                  5                  13                  54                  46

# Factor Year
summary(data_Illinois$Year)

1995 1996 1997 1998 2007 2008 2009 
 166  261  330  166   52   60   38

The results tables provided by our script show different things. First of all for the County variable we can observe that some counties have very little records. For example Carrol/Whiteside has 5 records of nest while counties like Jo Daviess have 292 records. This could lead to mis-modelling so attention on the model results is necessary. For Habitat variable it seems that one habitat in particular (Pasture) has more records than the other. For the Species variable, similarly to county variable, we observe that some species have more records than others. We observe the same for the Year variable.

Relationships between Y and Xs ?

We can then analyse graphically if we might have relationships between \(Y\) and \(Xs\).

Show the code

par(mfrow=c(1,3), mar=c(5, 4, 4, 2) + 0.1) 

# Initiation_Date
plot(data_Illinois$Nest_Fate ~ data_Illinois$Initiation_Date, 
     pch=16, 
     col='dodgerblue3',
     xlab='Initiation_Date', 
     ylab='Nest_Fate',
     cex.lab=1.2,  
     cex.axis=1.2)

# Define the color palette
color <- c('dodgerblue3', 'darkred', 'forestgreen', 'gold', 
           'purple', 'orange', 'cyan', 'magenta', 
           'salmon', 'lightseagreen', 'plum', 
           'yellowgreen', 'sandybrown', 'steelblue')

# County
mosaicplot(data_Illinois$Nest_Fate ~ data_Illinois$County, 
           col = color, 
           main="", 
           cex.axis=0.8,  
           las=2)        
title("Relationship Nest Fate & County", cex.main=1.2)  

# Habitat
mosaicplot(data_Illinois$Nest_Fate ~ data_Illinois$Habitat, 
           col = color[1:4],  
           main="",  
           cex.axis=0.8,
           las=2)
title("Relationship Nest Fate & Habitat", cex.main=1.2)

Show the code

# Species
mosaicplot(data_Illinois$Nest_Fate ~ data_Illinois$Species, 
           col = color[1:8],  
           main="", 
           cex.axis=0.8,
           las=2)
title("Relationship Nest Fate & Species", cex.main=1.2)  

# Year
mosaicplot(data_Illinois$Nest_Fate ~ data_Illinois$Year, 
           col = color[1:6],  
           main="", 
           cex.axis=0.8,
           las=2)
title("Relationship Nest Fate & Year", cex.main=1.2)

Looking at those graphs it doesn’t seems like the Initiation_date, the County or the Habitat has any effects on the Nest_fate. Maybe a little bit for County but the statistical analysis afterward will enable us to have more information on which variable has an effect. For the Year and the Species we might observe a little difference, but still really light.

Analysing interaction between Xs

In any modelling approach there is a need to check for interactions between variables. In our study we judge that there is none.

Checking collinearity between \(Xs\)

To avoid multicollinearity in modeling, we need to analyze how the explanatory variables \(X_s\) are related. It is important to check for collinearity. Indeed if some variables are correlated they measure the same thing and therefore keeping both of them might lead to bad prediction.

Show the code

# Checking collinearity between qualitative explanatory variable

# Contingency table between County and Habitat
table(data_Illinois$County, data_Illinois$Habitat)

                   
                    Cool-season_Grassland Pasture Prairie Warm-season_Grassland
  Carroll                               0       0      15                     0
  Carroll/Whiteside                     0       0       5                     0
  Dane                                  0       0       0                     0
  Grant                                 0       0       0                     0
  Green                                 0       0       0                     0
  Grundy                                0       6      26                    21
  Iowa                                  0       0       0                     0
  Jo_Daviess                            7     264       7                    14
  Lafayette                             0       0       0                     0
  Lee                                   4      12       4                     0
  Lee/Ogle                            104      39      58                    96
  Monroe                                0       0       0                     0
  Rock                                  0       0       0                     0
  Will                                127     216      22                    26

Show the code

# Contingency table between Species and Year
table(data_Illinois$Species, data_Illinois$Year)

                     
                      1995 1996 1997 1998 2007 2008 2009
  Bobolink              14   22   20    1    2    1    2
  Dickcissel            50   60   83   27   35   31   28
  Eastern_Meadowlark    46  119   41   13    5    1    4
  Grasshopper_Sparrow   41   50  128   93   10   25    3
  Henslow's_Sparrow      4    1    0    0    0    0    0
  Savannah_Sparrow       7    3    3    0    0    0    0
  Vesper_Sparrow         4    6   28   13    0    2    1
  Western_Meadowlark     0    0   27   19    0    0    0

Show the code

# Contingency table between Habitat and Species
table(data_Illinois$Habitat, data_Illinois$Species)

                       
                        Bobolink Dickcissel Eastern_Meadowlark
  Cool-season_Grassland       18        162                 42
  Pasture                     37         60                130
  Prairie                      7         37                 39
  Warm-season_Grassland        0         55                 18
                       
                        Grasshopper_Sparrow Henslow's_Sparrow Savannah_Sparrow
  Cool-season_Grassland                  13                 4                2
  Pasture                               220                 0                7
  Prairie                                49                 1                0
  Warm-season_Grassland                  68                 0                4
                       
                        Vesper_Sparrow Western_Meadowlark
  Cool-season_Grassland              1                  0
  Pasture                           37                 46
  Prairie                            4                  0
  Warm-season_Grassland             12                  0

Show the code

# Boxplots to check for relationship between Initiation_Date and the qualitative explanatory variables

# Initiation_Date and County
boxplot(data_Illinois$Initiation_Date ~ data_Illinois$County, 
        xlab="County", 
        ylab="Initiation Date", 
        cex.axis=0.8)

Show the code

# Initiation_Date and Habitat
boxplot(data_Illinois$Initiation_Date ~ data_Illinois$Habitat, 
        xlab="Habitat", 
        ylab="Initiation Date", 
        cex.axis=0.8)

Show the code

# Initiation_Date and Species
boxplot(data_Illinois$Initiation_Date ~ data_Illinois$Species, 
        xlab="Species", 
        ylab="Initiation Date", 
        cex.axis=0.4)

Show the code

# Initiation_Date and Year
boxplot(data_Illinois$Initiation_Date ~ data_Illinois$Year, 
        xlab="Year", 
        ylab="Initiation Date", 
        cex.axis=0.8)

Here it seems that there is a strong collinearity between County and Habitat as well as between Species and Year and also between Species and Habitat. Therefore, we are not going to keep County, Habitat and Year as explanatory variables in our model.

B. Statistical Analysis

Model building up

Because our response variable is a binary variable that follows a binomial distribution we have a link function : the logit.

Show the code

# Model formulation
mod1<-glm(Nest_Fate~Species
        + Initiation_Date
        ,data=data_Illinois
        ,family=binomial(link="logit"))
# Then we check for significance
drop1(mod1,test="Chi")

Single term deletions

Model:
Nest_Fate ~ Species + Initiation_Date
                Df Deviance    AIC    LRT  Pr(>Chi)    
<none>               1296.3 1314.3                     
Species          7   1334.0 1338.0 37.701 3.453e-06 ***
Initiation_Date  1   1296.4 1312.4  0.029    0.8656    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results table shows that Initiation_Date is not significant. However the AIC with Initiation_Date is better which means we should keep this variable despite its lack of significance.

Model coefficient and its equation

Show the code

# Coefficients of the model
summary(mod1)


Call:
glm(formula = Nest_Fate ~ Species + Initiation_Date, family = binomial(link = "logit"), 
    data = data_Illinois)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)   
(Intercept)                -1.097853   0.381114  -2.881  0.00397 **
SpeciesDickcissel           0.189410   0.329713   0.574  0.56565   
SpeciesEastern_Meadowlark  -0.382551   0.335541  -1.140  0.25424   
SpeciesGrasshopper_Sparrow  0.664586   0.315297   2.108  0.03505 * 
SpeciesHenslow's_Sparrow   -0.327407   1.155216  -0.283  0.77686   
SpeciesSavannah_Sparrow    -0.655641   0.822713  -0.797  0.42549   
SpeciesVesper_Sparrow       0.180147   0.420980   0.428  0.66871   
SpeciesWestern_Meadowlark   0.961508   0.416342   2.309  0.02092 * 
Initiation_Date             0.000686   0.004052   0.169  0.86557   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1335.5  on 1072  degrees of freedom
Residual deviance: 1296.3  on 1064  degrees of freedom
AIC: 1314.3

Number of Fisher Scoring iterations: 4

The following equation is the one of the model : \[\log\left(\frac{p}{1-p}\right) = -1.097853 + 0.189410 \cdot \text{Species}_{\text{Dickcissel}} - 0.382551 \cdot \text{Species}_{\text{Eastern Meadowlark}} + 0.664586 \cdot \text{Species}_{\text{Grasshopper Sparrow}} - 0.327407 \cdot \text{Species}_{\text{Henslow's Sparrow}} - 0.655641 \cdot \text{Species}_{\text{Savannah Sparrow}} + 0.180147 \cdot \text{Species}_{\text{Vesper Sparrow}} + 0.961508 \cdot \text{Species}_{\text{Western Meadowlark}} + 0.000686 \cdot \text{Initiation\_Date}\]

where \(p\) is the probability that the event is true.

Intercept (-1.097853): The value of the log-odds when all explanatory variables are zero.
Species Dickcissel (0.189410): A positive coefficient, but not significant ((p = 0.56565)), indicating that the effect of this species on the chances of nest success is not statistically significant.
Species Eastern Meadowlark (-0.382551): A negative coefficient, not significant ((p = 0.25424)).
Species Grasshopper Sparrow (0.664586): A positive and significant coefficient ((p = 0.03505)), suggesting that this species increases the chances of nest success.
Species Henslow’s Sparrow (-0.327407): A negative coefficient, not significant ((p = 0.77686)).
Species Savannah Sparrow (-0.655641): A negative coefficient, not significant ((p = 0.42549)).
Species Vesper Sparrow (0.180147): A positive coefficient, not significant ((p = 0.66871)).
Species Western Meadowlark (0.961508): A positive and significant coefficient ((p = 0.02092)).
Initiation Date (0.000686): A positive coefficient, but not significant ((p = 0.86557)), indicating that it does not have a statistically significant effect on nest success.

Explanation provided by the model

One way to evaluate the goodness of fit for our models is to assess the difference between our model and the null model (which summarizes the data with a single parameter: the mean of𝑌) Y). This null model essentially explains nothing. To do so we calculate a parameter, \[{pseudo}-R^2\] by examining the distance between the deviance of the null model and the residual deviance of our model.

Show the code

# Estimate of deviance explained
(mod1$null.deviance-mod1$deviance)/mod1$null.deviance

[1] 0.0293313

Show the code

# Some others estimates of deviance explained - package 'rcompanion'
nagelkerke(mod1)

$Models
                                                                                              
Model: "glm, Nest_Fate ~ Species + Initiation_Date, binomial(link = \"logit\"), data_Illinois"
Null:  "glm, Nest_Fate ~ 1, binomial(link = \"logit\"), data_Illinois"                        

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0293313
Cox and Snell (ML)                  0.0358486
Nagelkerke (Cragg and Uhler)        0.0503523

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -8     -19.586 39.172 4.5671e-06

$Number.of.observations
           
Model: 1073
Null:  1073

$Messages
[1] "Note: For models fit with REML, these statistics are based on refitting with ML"

$Warnings
[1] "None"

According to this table, the estimate of the explained deviance is 2.9% which is very low.

C. Model validation

Checking for influential individuals

We make sure that there is no influential individuals.

Show the code

par(mfrow = c(1, 1))
plot(cooks.distance(mod1), type = "h", ylim = c(0, 1))
abline(h = 1, col = 2,lwd = 3)

The graph does not show any influential individuals.

Prediction with different data

Here we want to assess if our model is able to well-classify nest based on our data.

Show the code

# We could do a simulation
N    <- nrow(data_Illinois)
Pi   <- fitted(mod1)
data_Illinois$Ysim <- rbinom(N, size = 1, Pi)
# Classification table
Z <- table(data_Illinois$Nest_Fate, data_Illinois$Ysim) / N
rownames(Z) <- c("Observed 0", "Observed 1")
colnames(Z) <- c("Predicted 0", "Predicted 1")
Z

            
             Predicted 0 Predicted 1
  Observed 0   0.4846226   0.2013048
  Observed 1   0.2013048   0.1127679

Show the code

#Correctly classified:
sum(diag(Z))

[1] 0.5973905

Show the code

# And repeat this 1000 times, store the results and calculate an average classification table

Pi   <- fitted(mod1)
N    <- nrow(data_Illinois)                 
NSim <- 1000                           
diagZ<- matrix(nrow = N, ncol = NSim)
for (i in 1:NSim) {
  Ysim <- rbinom(N, size = 1, Pi)
  Z<- table(data_Illinois$Nest_Fate, Ysim) / N
  diagZ[,i]<-sum(diag(Z))
  }
#Average rate of individuals well-classified by the model
boxplot(diagZ[2,], col='dodgerblue3',ylab='#Rate of nest fate well-classified')

Show the code

mean(diagZ[2,])

[1] 0.5850457

Here we have 58% of good predictions, which is somewhat below the common threshold of 70% for a model to be considered good. This suggests that the model may need improvement.

This result is not surprising given our initial data. Because they are from different studies some variables barely contains some species or years, etc… This could explain why our model is not suitable.

III. Analysing the relationship between nest fate and county, habitat, species, year, initiation date using a generalized linear model (GLM) with a Poisson distribution

Variables presentation:

data$Nest_Fate <- as.factor(data$Nest_Fate)
data$County <- as.factor(data$County)
data$Habitat <- as.factor(data$Habitat)
data$Year <- as.factor(data$Year)

County : the county in the state where the state has been found. 14 levels factor.
Habitat : the type of habitat in which the nest has been found. 4 levels factor.
Species : the species of the bird that has build the nest. 8 levels factor.
Year : the year of the nest building. 21 levels factor.
Number_Fledged : the number of birds that hatch on the nest when they hatch. Quantitative variable.

We want to study the impact of four qualitative variables, the habitat, the county, the species and the year over the number of young hatched in the nest “Number_Fledged”. It is a counting variable, which imply that it follows a Poisson distribution. A negative binomial distribution could be used but is preferred when the count values are high (which is not really the case here since the highest value is of 6)

A Poisson law is used to describe rare, discrete and independent events. Hence the variable \(Y\) that follows a Poisson distribution of parameter \(\lambda\) is written: \[Pr(Y=y)=\frac{e^{-\lambda}.\lambda^y}{y!}\] where \(y\) represent the value of the count (here \(y\) ∈ [0-6]) and \(\lambda\) is the mean value and variance of the Poisson distribution (because in a Poisson law E(\(y\))= Var(\(y\))=\(\lambda\)).

A. Data exploration

Before any analysis, it is important to have a look at the data.

Outlayer values in the response variable \(Y\) (the number of young in the nest, Number_Fledged) and values distribution:

par(mfrow=c(2,2))
# Boxplot
boxplot(data$Number_Fledged,col='dodgerblue3',ylab='Number of young in nest')
# Cleveland plot
dotchart(data$Number_Fledged,pch=16,col='dodgerblue3',xlab='Number of young in nest')
# Histogram
hist(data$Number_Fledged,col='dodgerblue3',xlab="Number of young in nest",main="")
# Quantile-Quantile plot
qqnorm(data$Number_Fledged,pch=16,col='dodgerblue3',xlab='')
qqline(data$Number_Fledged,col='red',lwd=2)

There aren’t any outlayer values. There are a lot of value at 0. Maybe we can study only the number of young when there are some. It could be a normal distribution.

# Get rid of the lines where the Number of young individuals in the nest is equal to 0
data_filtered <- subset(data, Number_Fledged != 0)

par(mfrow=c(2,2))
# Boxplot
boxplot(data_filtered$Number_Fledged,col='dodgerblue3',ylab='Number of young in nest')
# Cleveland plot
dotchart(data_filtered$Number_Fledged,pch=16,col='dodgerblue3',xlab='Number of young in nest')
# Histogram
hist(data_filtered$Number_Fledged,col='dodgerblue3',xlab="Number of young in nest",main="")
# Quantile-Quantile plot
qqnorm(data_filtered$Number_Fledged,pch=16,col='dodgerblue3',xlab='')
qqline(data_filtered$Number_Fledged,col='red',lwd=2)

Now we have a better view of the data.

Outlayer values in the explanatory variable \(Xs\) and values distribution:

# Factor County
summary(data_filtered$County)

          Carroll Carroll/Whiteside              Dane             Grant 
                5                 1               224                20 
            Green            Grundy              Iowa        Jo_Daviess 
               62                11               260               127 
        Lafayette               Lee          Lee/Ogle            Monroe 
              165                 6                85               103 
             Rock              Will 
                2               102

# Factor Habitat
summary(data_filtered$Habitat)

Cool-season_Grassland               Pasture               Prairie 
                  273                   537                   180 
Warm-season_Grassland 
                  183

# Factor Species
summary(data_filtered$Species)

           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
                184                 163                 233                 280 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
                 28                 196                  60                  29

# Factor Year
summary(data_filtered$Year)

1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 
  35    4   13   13   73   45  120  137   98  131   66   75   20   44   27   57 
2007 2008 2009 2010 2011 
  46   73   24   50   22

Problem: there aren’t the same number of observations in each level of each factors. Suggestion: normalize by the number of observation for each level.

Analyze the relation between Y and Xs

par(mfrow=c(2,2))

# County
boxplot(Number_Fledged~County, varwidth = TRUE, ylab = "Number of young in nest", xlab = "County", col=terrain.colors(14), main = "", data=data_filtered)

# Habitat
boxplot(Number_Fledged~Habitat, varwidth = TRUE, ylab = "Number of young in nest", xlab = "Habitat", col=terrain.colors(4), main = "",data=data_filtered)

# Species
boxplot(Number_Fledged~Species, varwidth = TRUE, ylab = "Number of young in nest", xlab = "Species", col=terrain.colors(8), main = "",data=data_filtered)

# Year
boxplot(Number_Fledged~Year, varwidth = TRUE, ylab = "Number of young in nest", xlab = "Year", col=terrain.colors(21), main = "",data=data_filtered)

It’s not easy to see if the variables have an impact because some of them have a lot of levels, but it is possible to see small differences between them. The analysis will tell us if they really got an impact over the number of young in the nest.

Analyze the interactions between Xs

There are a lot of levels so it’s hard to plot the interactions. We will see if there are some in the analysis.

par(mfrow=c(1,3))
# Interactions between County & Habitat
plot(Number_Fledged~County,type='n',ylab = "Number of young in nest",xlab="County", data=data_filtered)
points(Number_Fledged[Habitat=="Pasture"]~County[Habitat=="Pasture"],pch=16,cex=2,col='#000099', data=data_filtered)
points(Number_Fledged[Habitat=="Cool-season Grassland"]~County[Habitat=="Cool-season Grassland"],pch=16,cex=2,col='green', data=data_filtered)
points(Number_Fledged[Habitat=="Prairie"]~County[Habitat=="Prairie"],pch=16,cex=2,col='red', data=data_filtered)
points(Number_Fledged[Habitat=="Warm-season Grassland"]~County[Habitat=="Warm-season Grassland"],pch=16,cex=2,col='yellow', data=data_filtered)

Here the example between county and habitat doesn’t show a clear pattern. It’s hard to conclude on the interaction aspect.

Checking collinearity between categorical independent variables Xs:

The following code only presents the contingency table between the variables Habitat and Species for convenience, but the same table has to be done with every couple of 4 variables.

contingency_table<-table(data_filtered$Habitat,data_filtered$Species)
print(contingency_table)

                       
                        Bobolink Dickcissel Eastern_Meadowlark
  Cool-season_Grassland       89         51                 95
  Pasture                     53         26                 86
  Prairie                      2         21                 32
  Warm-season_Grassland       40         65                 20
                       
                        Grasshopper_Sparrow Henslow's_Sparrow Savannah_Sparrow
  Cool-season_Grassland                  13                21                4
  Pasture                               141                 0              189
  Prairie                                80                 1                0
  Warm-season_Grassland                  46                 6                3
                       
                        Vesper_Sparrow Western_Meadowlark
  Cool-season_Grassland              0                  0
  Pasture                           13                 29
  Prairie                           44                  0
  Warm-season_Grassland              3                  0

chi2_test<-chisq.test(contingency_table)

Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect

print(chi2_test)


    Pearson's Chi-squared test

data:  contingency_table
X-squared = 748.71, df = 21, p-value < 2.2e-16

print(chi2_test$expected)

                       
                        Bobolink Dickcissel Eastern_Meadowlark
  Cool-season_Grassland 42.82353   37.93606           54.22762
  Pasture               84.23529   74.62148          106.66752
  Prairie               28.23529   25.01279           35.75448
  Warm-season_Grassland 28.70588   25.42967           36.35038
                       
                        Grasshopper_Sparrow Henslow's_Sparrow Savannah_Sparrow
  Cool-season_Grassland            65.16624          6.516624         45.61637
  Pasture                         128.18414         12.818414         89.72890
  Prairie                          42.96675          4.296675         30.07673
  Warm-season_Grassland            43.68286          4.368286         30.57801
                       
                        Vesper_Sparrow Western_Meadowlark
  Cool-season_Grassland      13.964194           6.749361
  Pasture                    27.468031          13.276215
  Prairie                     9.207161           4.450128
  Warm-season_Grassland       9.360614           4.524297

Problem : for most of the variables the chi² test can’t be done because of the low number of values in each group. The only one that fulfill the conditions and therefore can be made is between Habitat and Species but the result shows that the variables are correlated.

B. Statistical Analysis

Now that we have a better idea of the data implied in this study, we can begin the statistical analysis.

Building of the model

Because our response variable follows a poisson distribution we use a link function : the log function.

# Model formulation
mod1<-glm(Number_Fledged~ County
        + Species
        + Habitat
        + Year
        ,data=data_filtered
        ,family=poisson(link="log"))
# Variable type : poisson. Link fonction : log
# Then we check for significance
drop1(mod1,test="Chi")

Single term deletions

Model:
Number_Fledged ~ County + Species + Habitat + Year
        Df Deviance    AIC     LRT Pr(>Chi)   
<none>       541.13 4125.3                    
County  13   557.02 4115.2 15.8822 0.255551   
Species  7   560.94 4131.1 19.8043 0.006008 **
Habitat  3   541.44 4119.6  0.3096 0.958219   
Year    20   551.54 4095.7 10.4077 0.960163   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We decide to remove the less significative variable : here it’s Year.

# Model formulation
mod1<-glm(Number_Fledged~ County
        + Species
        + Habitat
        ,data=data_filtered
        ,family=poisson(link="log"))
#type de la variable : poisson et fonction de lien de type log
# Then we check for significance
drop1(mod1,test="Chi")

Single term deletions

Model:
Number_Fledged ~ County + Species + Habitat
        Df Deviance    AIC     LRT Pr(>Chi)   
<none>       551.54 4095.7                    
County  13   569.19 4087.3 17.6441  0.17151   
Species  7   574.48 4104.6 22.9344  0.00175 **
Habitat  3   551.76 4089.9  0.2194  0.97439   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now the less significative is Habitat

# Model formulation
mod1<-glm(Number_Fledged~ County
        + Species
        ,data=data_filtered
        ,family=poisson(link="log"))
#type de la variable : poisson et fonction de lien de type log
# Then we check for significance
drop1(mod1,test="Chi")

Single term deletions

Model:
Number_Fledged ~ County + Species
        Df Deviance    AIC    LRT Pr(>Chi)   
<none>       551.76 4089.9                   
County  13   572.82 4085.0 21.056 0.071837 . 
Species  7   574.68 4098.8 22.914 0.001764 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

County is still not significative so we have to remove it.

# Model formulation
mod1<-glm(Number_Fledged~ Species
        ,data=data_filtered
        ,family=poisson(link="log"))
#type de la variable : poisson et fonction de lien de type log
# Then we check for significance
drop1(mod1,test="Chi")

Single term deletions

Model:
Number_Fledged ~ Species
        Df Deviance    AIC    LRT Pr(>Chi)    
<none>       572.82 4085.0                    
Species  7   599.76 4097.9 26.938 0.000342 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(mod1)


Call:
glm(formula = Number_Fledged ~ Species, family = poisson(link = "log"), 
    data = data_filtered)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 1.29234    0.03863  33.452  < 2e-16 ***
SpeciesDickcissel          -0.18558    0.05934  -3.128 0.001762 ** 
SpeciesEastern_Meadowlark  -0.14760    0.05347  -2.761 0.005769 ** 
SpeciesGrasshopper_Sparrow -0.05604    0.05030  -1.114 0.265207    
SpeciesHenslow's_Sparrow   -0.18190    0.11514  -1.580 0.114157    
SpeciesSavannah_Sparrow    -0.19203    0.05648  -3.400 0.000674 ***
SpeciesVesper_Sparrow      -0.32409    0.08844  -3.665 0.000248 ***
SpeciesWestern_Meadowlark  -0.04452    0.10674  -0.417 0.676633    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 599.76  on 1172  degrees of freedom
Residual deviance: 572.82  on 1165  degrees of freedom
AIC: 4085

Number of Fisher Scoring iterations: 4

We have the final model with only the Species.

\(Species{Bobolink }\) = 0 (intercept)
\(Species{Dickcissel }\) = \(-0.18^{**}\)
\(Species{Eastern Meadowlark }\) = \(-0.14^{**}\)
\(Species{Grasshopper Sparrow }\) = \(-0.05\)
\(Species{Henslow's Sparrow }\) = \(-0.18\)
\(Species{Savannah Sparrow }\) = \(-0.19^{***}\)
\(Species{Vesper Sparrow }\) = \(-0.32^{***}\)
\(Species{Western Meadowlark }\) = \(-0.04\)

The equation of the model is the following:

\[ Number_Fledged = 1.29^{***} + (Species{Bobolink} = 0 ;\:Species{Dickcissel} = -0.18^{**} ; \:Species{Eastern Meadowlark} = -0.14^{**} ; \:Species{Grasshopper Sparrow} = -0.05 ; \:Species{Henslow's Sparrow} = -0.18 ; \:Species{Savannah Sparrow = -0.19^{***} ; \:Species{Vesper Sparrow = -0.32^{***} ; \:Species{Western Meadowlark} = -0.04) \]

According to the model’s summary, the null deviance is of 599.76 and residual deviance is of 572.82. In order to estimate the explained deviance, one can use the following formula:

\[Pseudo\:R^2=100\:.\:\frac{Null\:Deviance- Residual\:Deviance}{Null\:Deviance}\]

# Estimate of deviance explained
(mod1$null.deviance-mod1$deviance)/mod1$null.deviance

[1] 0.04491502

This is very very bad… The variable species doesn’t seem to explain a lot of the variation in the number of young individuals in the nest.

C. Model validation:

The model that we have done is a generalized linear model so there aren’t residual normality or homogeneity. However, it is important to check the independence of residuals.

First of all, in situations where a counting model is used, such as when following a Poisson distribution, the overdispersion has to be checked. It can be done with the following code. If the result is over 1, there are overdispersion and the model should be reconsidered.

# Scale parameter calculation
E1 <- resid(mod1, type = "pearson") # (Y - mu) / sqrt(mu)
N  <- nrow(data_filtered)
p  <- length(coef(mod1))
sum(E1^2) / (N - p)

[1] 0.4439871

The value is of 0.44 so it’s okay, the residuals aren’t overdispersed.

We can also check for the most influential individuals:

par(mfrow = c(1, 1))
plot(cooks.distance(mod1), type = "h", ylim = c(0, 1))
abline(h = 1, col = 2,lwd = 3)

There aren’t very influential individuals which can be logical because of the high number of them and the low impact of species over the number of young individuals in the nest.

Conclusions:

The species of the birds has an impact over the number of young individuals in the nest, but it is clearly not the only variable that explain the variation of the response (only 4% are due to the species).

IV. ANOVA

Analysis of variance (ANOVA) is one of the most widely used statistical methods in various disciplines. It allows to model the relationship between a quantitative dependent variable \(Y\) according to the different modalities of one or more qualitative explanatory variables \(Xs\).

We will perform an ANOVA to see the effect of qualitatives variables (state, habitat, species, year) on the quantitative variable Initiation_Date in the Bird_dataset.csv dataset.

We need to check the type of the variables :

Show the code

# Dataset : variables as factors

data$State<-as.factor(data$State)
data$Habitat<-as.factor(data$Habitat)
data$Species<-as.factor(data$Species)
data$Year<-as.factor(data$Year)
str(data)

'data.frame':   3257 obs. of  10 variables:
 $ Nest_ID        : Factor w/ 3256 levels "1000JAAP","1005JAAP",..: 974 977 979 980 981 982 983 984 985 986 ...
 $ State          : Factor w/ 2 levels "Illinois","Wisconsin": 1 1 1 1 1 1 1 1 1 1 ...
 $ County         : Factor w/ 14 levels "Carroll","Carroll/Whiteside",..: 14 14 14 14 14 14 14 14 14 14 ...
 $ Habitat        : Factor w/ 4 levels "Cool-season_Grassland",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Species        : Factor w/ 8 levels "Bobolink","Dickcissel",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year           : Factor w/ 21 levels "1991","1992",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Nest_Fate      : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 2 2 1 1 ...
 $ Initiation_Date: int  58 52 51 60 53 65 52 47 50 65 ...
 $ Number_Fledged : int  0 0 6 0 0 0 5 4 0 0 ...
 $ Nest_End_Date  : int  78 74 75 75 69 78 81 72 63 81 ...

Show the code

# Checking for missing data

colSums(is.na(data))

        Nest_ID           State          County         Habitat         Species 
              0               0               0               0               0 
           Year       Nest_Fate Initiation_Date  Number_Fledged   Nest_End_Date 
              0               0               0               0               0

There is no missing data here.

Before any modeling, we must explore the data in order to avoid possible errors. Below we present the list of steps to follow before proceeding to the modeling:

1: Check for the presence of outliers in \(Y\) and the distribution of the values of \(Y\)
2: If \(X\) is a quantitative explanatory variable, check for the presence of outliers and the distribution of the values of \(X\) OR/AND If \(X\) is a qualitative explanatory variable, analyze the number of modalities and the number of individuals per modality
3: Analyze the potential relationship between \(Y\) and the \(X_{s}\)
4: Check for the presence of possible interactions between the \(X_{s}\)

Outliers in \(Y\) and distribution of values of \(Y\)

Show the code

par(mfrow=c(2,2))
# Boxplot
boxplot(data$Initiation_Date,col='dodgerblue3',ylab='Initiation date')
# Cleveland plot
dotchart(data$Initiation_Date,pch=16,col='dodgerblue3',xlab='Initiation date')
# Histogram
hist(data$Initiation_Date,col='dodgerblue3',xlab="Initiation date",main="")
# Quantile-Quantile plot
qqnorm(data$Initiation_Date,pch=16,col='dodgerblue3',xlab='')
qqline(data$Initiation_Date,col='red',lwd=2)

The different graphs show us no outliers for Initiation_Date. In addition, our Y variable seems to follow a Normal Distribution.

Outliers and distribution of the values of \(X\)

The four \(Xs\) are factors. We can analyze the number of modalities and the number of individuals per modality.

Show the code

# Factor State
summary(data$State)

 Illinois Wisconsin 
     1073      2184

Show the code

# Factor Habitat
summary(data$Habitat)

Cool-season_Grassland               Pasture               Prairie 
                  677                  1683                   496 
Warm-season_Grassland 
                  401

Show the code

# Factor Species
summary(data$Species)

           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
                390                 470                 726                 693 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
                 72                 664                 174                  68

Show the code

# Factor Year
summary(data$Year)

1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 
  68   27   24   26  206  261  330  321  272  448  285  181   56  102   55   91 
2007 2008 2009 2010 2011 
 139  180   67   73   45

Factor State: The experimental design is not balanced (1071 and 2150 data), but the analysis can be performed because there are enough data in each modality (> 10).

Factor Habitat: The experimental design is not balanced (from 392 to 1683 data), but the analysis can be performed because there are enough data in each modality (> 10).

Factor Species: The experimental design is not balanced. Several species have less than 10 data per modality. It will therefore be necessary to remove these data before continuing the analysis

Factor Year: The experimental design is not balanced (from 24 to 448 data), but the analysis can be performed because there are enough data in each modality (> 10 : threshold from which we consider that carrying out statistical analyses makes sense. You can choose your own threshold).

Removal of poorly represented modalities (< 10 data)

Show the code

# Count the number of occurrences for each modality in data$Species
species_count <- table(data$Species)

# Keep only terms with 10 or more occurrences
data <- data %>% filter(Species %in% names(species_count[species_count >= 10]))

# Show filtered data
head(data)

  Nest_ID    State County Habitat  Species Year Nest_Fate Initiation_Date
1 745JAAP Illinois   Will Pasture Bobolink 1995         0              58
2 748JAAP Illinois   Will Pasture Bobolink 1995         0              52
3 753JAAP Illinois   Will Pasture Bobolink 1995         1              51
4 754JAAP Illinois   Will Pasture Bobolink 1995         0              60
5 755JAAP Illinois   Will Pasture Bobolink 1995         0              53
6 758JAAP Illinois   Will Pasture Bobolink 1995         0              65
  Number_Fledged Nest_End_Date
1              0            78
2              0            74
3              6            75
4              0            75
5              0            69
6              0            78

Show the code

summary(data$Species)

           Bobolink          Dickcissel  Eastern_Meadowlark Grasshopper_Sparrow 
                390                 470                 726                 693 
  Henslow's_Sparrow    Savannah_Sparrow      Vesper_Sparrow  Western_Meadowlark 
                 72                 664                 174                  68

Analyzing the potential relationship between Y vs Xs

We can graphically analyze the possible relationships between Y and X. This graphical analysis of the relationships between Y and X does not allow us to predict the significance of the relationship. Statistical modeling remains the only available means to identify whether a relationship exists or not.

Show the code

# Boxplot 
par(mfrow=c(1,2))
boxplot(data$Initiation_Date~data$State, varwidth = TRUE, ylab = "Initiation date", xlab = "State", col='darkgrey', main = "")
boxplot(data$Initiation_Date~data$Habitat, varwidth = TRUE, ylab = "Initiation date", xlab = "Habitat", col='darkgrey', main = "")

Show the code

boxplot(data$Initiation_Date~data$Species, varwidth = TRUE, ylab = "Initiation date", xlab = "Species", col='darkgrey', main = "")
boxplot(data$Initiation_Date~data$Year, varwidth = TRUE, ylab = "Initiation date", xlab = "Year", col='darkgrey', main = "")

From the first visualizations, we could make the hypothesis that Year and Species influence Initiation_Date.

Analysis of possible interactions between the four Xs

The interaction between two factors can only be tested if the factors are crossed (all the modalities of one factor are represented in all the modalities of the other factor and vice versa = a complete factorial design). To estimate the presence of interactive effects, we will develop a graphical approach.

Show the code

# Interaction table
table(data$State,data$Habitat)

           
            Cool-season_Grassland Pasture Prairie Warm-season_Grassland
  Illinois                    242     537     137                   157
  Wisconsin                   435    1146     359                   244

Show the code

table(data$State,data$Species)

           
            Bobolink Dickcissel Eastern_Meadowlark Grasshopper_Sparrow
  Illinois        62        314                229                 350
  Wisconsin      328        156                497                 343
           
            Henslow's_Sparrow Savannah_Sparrow Vesper_Sparrow
  Illinois                  5               13             54
  Wisconsin                67              651            120
           
            Western_Meadowlark
  Illinois                  46
  Wisconsin                 22

Show the code

table(data$State,data$Year)

           
            1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
  Illinois     0    0    0    0  166  261  330  166    0    0    0    0    0
  Wisconsin   68   27   24   26   40    0    0  155  272  448  285  181   56
           
            2004 2005 2006 2007 2008 2009 2010 2011
  Illinois     0    0    0   52   60   38    0    0
  Wisconsin  102   55   91   87  120   29   73   45

Show the code

table(data$Habitat,data$Species)

                       
                        Bobolink Dickcissel Eastern_Meadowlark
  Cool-season_Grassland      157        163                260
  Pasture                    166         85                322
  Prairie                      9         69                 95
  Warm-season_Grassland       58        153                 49
                       
                        Grasshopper_Sparrow Henslow's_Sparrow Savannah_Sparrow
  Cool-season_Grassland                  32                53               10
  Pasture                               360                 0              644
  Prairie                               200                 1                2
  Warm-season_Grassland                 101                18                8
                       
                        Vesper_Sparrow Western_Meadowlark
  Cool-season_Grassland              2                  0
  Pasture                           40                 66
  Prairie                          120                  0
  Warm-season_Grassland             12                  2

Show the code

table(data$Habitat,data$Year)

                       
                        1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
  Cool-season_Grassland    1    0    0    0   40   79   75    7    0    0   34
  Pasture                  0    0   15   26   92  106  226  272  272  411  158
  Prairie                 11    2    0    0   41   29    9   28    0   32   92
  Warm-season_Grassland   56   25    9    0   33   47   20   14    0    5    1
                       
                        2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  Cool-season_Grassland   17   18   25   48   88  109  132    4    0    0
  Pasture                 11   19   33    0    2   12   17    7    1    3
  Prairie                151   19   44    0    1   11   12   11    0    3
  Warm-season_Grassland    2    0    0    7    0    7   19   45   72   39

Show the code

table(data$Species,data$Year)

                     
                      1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
  Bobolink               0    0    0    5   20   22   20   29   37   34   20
  Dickcissel            68   27   24    0   50   60   83   27    0    7    5
  Eastern_Meadowlark     0    0    0    3   51  119   41   35   47   60   70
  Grasshopper_Sparrow    0    0    0    0   45   50  128  114   21   66   59
  Henslow's_Sparrow      0    0    0    0    4    1    0    0    0    0    4
  Savannah_Sparrow       0    0    0   14   27    3    3   81  161  262   91
  Vesper_Sparrow         0    0    0    2    4    6   28   13    0   16   36
  Western_Meadowlark     0    0    0    2    5    0   27   22    6    3    0
                     
                      2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  Bobolink               4   10    6   24   28   32   46    9   29   15
  Dickcissel            11    0    0    0    3   36   33   29    1    6
  Eastern_Meadowlark    36   21   36   18   43   49   62    5   21    9
  Grasshopper_Sparrow   74   15   38    5    3   11   28   15   10   11
  Henslow's_Sparrow      2    1    7    6   11   11    7    4   10    4
  Savannah_Sparrow       0    7    4    2    3    0    2    2    2    0
  Vesper_Sparrow        54    2   10    0    0    0    2    1    0    0
  Western_Meadowlark     0    0    1    0    0    0    0    2    0    0

Subsequently, we will not analyze the interactions between the factors since there is no interaction between the Xs variables.

Search for candidate model

Show the code

# Model formulation
mod1<-lm(Initiation_Date~State+Habitat+Species+Year,data=data)

# Then we check for significance
drop1(mod1,test="F")

Single term deletions

Model:
Initiation_Date ~ State + Habitat + Species + Year
        Df Sum of Sq     RSS   AIC F value    Pr(>F)    
<none>                976530 18639                      
State    1      2745  979275 18646  9.0652  0.002625 ** 
Habitat  3       707  977238 18636  0.7788  0.505682    
Species  7    103831 1080362 18954 48.9863 < 2.2e-16 ***
Year    20     38882 1015412 18726  6.4204 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the Habitat factor is not significant, we can remove it in our search for the candidate model.

Show the code

# Model formulation
mod2<-lm(Initiation_Date~State+Species+Year,data=data)

# Then we check for significance
drop1(mod2,test="F")

Single term deletions

Model:
Initiation_Date ~ State + Species + Year
        Df Sum of Sq     RSS   AIC F value    Pr(>F)    
<none>                977238 18636                      
State    1      2704  979942 18643  8.9334  0.002821 ** 
Species  7    106384 1083622 18958 50.2010 < 2.2e-16 ***
Year    20     50931 1028169 18761  8.4118 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All factors are significant. So we have our candidate model.

To understand how factors influence Initiation_Date, we need to analyze the coefficients of the model.

Analysis of the coefficients of the candidate model

Show the code

# Coefficients of the model
summary(mod2)


Call:
lm(formula = Initiation_Date ~ State + Species + Year, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.632 -12.483  -0.898  12.648  47.293 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 75.2202     3.1666  23.754  < 2e-16 ***
StateWisconsin              -4.1385     1.3846  -2.989 0.002821 ** 
SpeciesDickcissel           22.1389     1.4608  15.155  < 2e-16 ***
SpeciesEastern_Meadowlark    0.5651     1.1230   0.503 0.614845    
SpeciesGrasshopper_Sparrow  10.9373     1.2108   9.033  < 2e-16 ***
SpeciesHenslow's_Sparrow     7.1495     2.2608   3.162 0.001580 ** 
SpeciesSavannah_Sparrow      2.0384     1.2402   1.644 0.100344    
SpeciesVesper_Sparrow        6.5509     1.7251   3.797 0.000149 ***
SpeciesWestern_Meadowlark    4.8149     2.3779   2.025 0.042969 *  
Year1992                    -8.1836     3.9578  -2.068 0.038750 *  
Year1993                    -9.3039     4.1311  -2.252 0.024379 *  
Year1994                   -10.9649     4.2064  -2.607 0.009184 ** 
Year1995                   -11.7072     2.9317  -3.993 6.66e-05 ***
Year1996                   -10.8870     3.0252  -3.599 0.000324 ***
Year1997                   -10.0784     2.9379  -3.430 0.000610 ***
Year1998                    -9.2642     2.7433  -3.377 0.000742 ***
Year1999                    -6.2924     2.6793  -2.348 0.018910 *  
Year2000                    -4.6532     2.5812  -1.803 0.071518 .  
Year2001                    -6.1430     2.6142  -2.350 0.018840 *  
Year2002                   -10.5584     2.7096  -3.897 9.95e-05 ***
Year2003                   -12.7861     3.3400  -3.828 0.000132 ***
Year2004                   -18.2139     2.9473  -6.180 7.22e-10 ***
Year2005                   -14.2968     3.3841  -4.225 2.46e-05 ***
Year2006                   -18.2607     3.0289  -6.029 1.84e-09 ***
Year2007                   -14.3252     2.8354  -5.052 4.61e-07 ***
Year2008                   -17.6839     2.7534  -6.422 1.54e-10 ***
Year2009                   -19.0075     3.2178  -5.907 3.85e-09 ***
Year2010                   -16.1358     3.1693  -5.091 3.76e-07 ***
Year2011                   -13.8112     3.4968  -3.950 7.99e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 3228 degrees of freedom
Multiple R-squared:  0.2269,    Adjusted R-squared:  0.2202 
F-statistic: 33.84 on 28 and 3228 DF,  p-value: < 2.2e-16

Show the code

# From this listing, you read the coefficients table hereafter :

# Coefficients:
#                            Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                 75.7832     3.1611  23.974  < 2e-16 ***
# StateWisconsin              -4.4838     1.3892  -3.228 0.001261 ** 
# SpeciesDickcissel           21.9212     1.4579  15.036  < 2e-16 ***
# SpeciesEastern Meadowlark    0.5278     1.1198   0.471 0.637432    
# SpeciesGrasshopper Sparrow  10.6652     1.2095   8.818  < 2e-16 ***
# SpeciesSavannah Sparrow      1.9590     1.2370   1.584 0.113382    
# SpeciesVesper Sparrow        6.2687     1.7225   3.639 0.000278 ***
# SpeciesWestern Meadowlark    4.5978     2.3695   1.940 0.052424 .  
# Year1992                    -8.1836     3.9411  -2.076 0.037930 *  
# Year1993                    -9.3039     4.1136  -2.262 0.023780 *  
# Year1994                   -11.0971     4.1890  -2.649 0.008111 ** 
# Year1995                   -11.7709     2.9221  -4.028 5.75e-05 ***
# Year1996                   -11.2748     3.0178  -3.736 0.000190 ***
# Year1997                   -10.4340     2.9302  -3.561 0.000375 ***
# Year1998                    -9.4950     2.7335  -3.474 0.000520 ***
# Year1999                    -6.4308     2.6686  -2.410 0.016018 *  
# Year2000                    -4.7644     2.5708  -1.853 0.063934 .  
# Year2001                    -6.3518     2.6071  -2.436 0.014893 *  
# Year2002                   -10.3377     2.7027  -3.825 0.000133 ***
# Year2003                   -12.4141     3.3412  -3.716 0.000206 ***
# Year2004                   -17.6615     2.9707  -5.945 3.06e-09 ***
# Year2005                   -14.4370     3.4662  -4.165 3.20e-05 ***
# Year2006                   -20.7535     3.0848  -6.728 2.04e-11 ***
# Year2007                   -15.3033     2.8578  -5.355 9.17e-08 ***
# Year2008                   -17.9192     2.7567  -6.500 9.28e-11 ***
# Year2009                   -19.0311     3.2531  -5.850 5.41e-09 ***
# Year2010                   -15.0863     3.2508  -4.641 3.61e-06 ***
# Year2011                   -13.2651     3.5681  -3.718 0.000205 ***

Explanatory power of the model: R²

We can determine the portion of variance of \(Y\) explained by our model.

Show the code

# R² of the model
summary(mod2)$adj.r.squared

[1] 0.2201967

We obtain the adjusted R² = 0.22. This means that approximately 22% of the variance of Initiation_Date is explained by our modeling.

Validation of modeling

The assumptions of ANOVA are the same as for any general linear model, namely independence, normality of residuals and homogeneity of variances. In addition, one must check the presence of influential statistical units (i.e. observations having a significant contribution to the construction of the model).

Normality of residuals

Show the code

par(mfrow=c(1,2))
# Histogram
hist(mod2$residuals,col='dodgerblue3',xlab="residuals",main="Check Normality")
# Quantile-Quantile plot
qqnorm(mod2$residuals,pch=16,col='dodgerblue3',xlab='')
qqline(mod2$residuals,col='red',lwd=2)

The residuals seem to follow a Normal Distribution.

Homogeneity of variance

The assumption of homogeneity of variance is the fact that the variation of the residuals is approximately equal over the entire range of explanatory variables, should be verified by plotting the residuals at the estimated values and the residuals at the values of the significant variables. The residuals should show no trend.

Show the code

par(mfrow=c(2,2))
# residuals vs fitted
plot(residuals(mod1)~fitted(mod1)
     , col='dodgerblue3'
     , pch=16)
# residuals against state factor
boxplot(residuals(mod2)~ data$State, 
        varwidth = TRUE,
        col='darkgrey',
        ylab = "Residuals",
        xlab = "States",
        main = "")
# residuals against species factor
boxplot(residuals(mod2)~ data$Species, 
        varwidth = TRUE,
        col='brown',
        ylab = "Residuals",
        xlab = "Species",
        main = "")
# residuals against year factor
boxplot(residuals(mod2)~ data$Year, 
        varwidth = TRUE,
        col='brown',
        ylab = "Residuals",
        xlab = "Years",
        main = "")

We do not observe any trend. Therefore the hypothesis of homogeneity of variance is verified.

Independence

ANOVA (like any other general linear model) assumes that all statistical units (and therefore residuals) are independent of each other. This issue must be taken into account at the research design stage. Given the approach here, the statistical individuals are independent. This assumption is verified.

Verification of influential individuals

Show the code

par(mfrow = c(1, 1))
plot(cooks.distance(mod2), type = "h", ylim = c(0, 1))
abline(h = 1, col = 2,lwd = 3)

There is no influential individual.

All conditions being verified, our modeling can be validated.

Conclusion of our ANOVA

The explanatory variables retained in our model are “State”, “Species” and “Year”.

The equation of our model is:

\[ \text{Y} = 75.7832 - 4.4838 \cdot \text{State}_{\text{Wisconsin}} + 21.9212 \cdot \text{Species}_{\text{Dickcissel}} + 0.5278 \cdot \text{Species}_{\text{Eastern Meadowlark}} + 10.6652 \cdot \text{Species}_{\text{Grasshopper Sparrow}} + 1.9590 \cdot \text{Species}_{\text{Savannah Sparrow}} + 6.2687 \cdot \text{Species}_{\text{Vesper Sparrow}} + 4.5978 \cdot \text{Species}_{\text{Western Meadowlark}} - 8.1836 \cdot \text{Year}_{1992} - 9.3039 \cdot \text{Year}_{1993} - 11.0971 \cdot \text{Year}_{1994} - 11.7709 \cdot \text{Year}_{1995} - 11.2748 \cdot \text{Year}_{1996} - 10.4340 \cdot \text{Year}_{1997} - 9.4950 \cdot \text{Year}_{1998} - 6.4308 \cdot \text{Year}_{1999} - 4.7644 \cdot \text{Year}_{2000} - 6.3518 \cdot \text{Year}_{2001} - 10.3377 \cdot \text{Year}_{2002} - 12.4141 \cdot \text{Year}_{2003} - 17.6615 \cdot \text{Year}_{2004} - 14.4370 \cdot \text{Year}_{2005} - 20.7535 \cdot \text{Year}_{2006} - 15.3033 \cdot \text{Year}_{2007} - 17.9192 \cdot \text{Year}_{2008} - 19.0311 \cdot \text{Year}_{2009} - 15.0863 \cdot \text{Year}_{2010} - 13.2651 \cdot \text{Year}_{2011} \]

In this equation each variable takes the value 1 if the corresponding condition is true and 0 otherwise.

But despite all the conditions of the model being validated, this model only explains 22% (R² = 0.22) of our variable Y (Initiation_Date).

V. FAMD

Generical exploration

Here, we perform an FAMD on the initial dataframe. The FAMD is used for tables with qualitative and quantitative variables. The ID are removed because of their uselessness.

Show the code

data_famd <- data
### Conversion of the Year into factor
data_famd$Year <- as.factor(data_famd$Year)

### Removing of ID
data_famd <- data_famd |>
  dplyr::select(-Nest_ID)

We can perform the famd (see (fig?)-)

Show the code

famd <- FAMD(base = data_famd, graph = FALSE)

First, let’s see the summary :

Show the code

summary(famd)


Call:
FAMD(base = data_famd, graph = FALSE) 


Eigenvalues
                      Dim.1  Dim.2  Dim.3  Dim.4  Dim.5
Variance              3.416  2.831  2.644  2.327  2.148
% of var.             7.116  5.897  5.509  4.849  4.475
Cumulative % of var.  7.116 13.013 18.522 23.371 27.846

Individuals (the 10 first)
                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
1                 |  5.833 |  1.039  0.010  0.032 | -0.144  0.000  0.001 |
2                 |  5.890 |  0.893  0.007  0.023 | -0.069  0.000  0.000 |
3                 |  6.588 |  0.807  0.006  0.015 |  1.015  0.011  0.024 |
4                 |  5.839 |  1.024  0.009  0.031 | -0.144  0.000  0.001 |
5                 |  5.919 |  0.833  0.006  0.020 | -0.049  0.000  0.000 |
6                 |  5.810 |  1.141  0.012  0.039 | -0.204  0.000  0.001 |
7                 |  6.330 |  0.921  0.008  0.021 |  0.856  0.008  0.018 |
8                 |  6.231 |  0.724  0.005  0.014 |  0.834  0.008  0.018 |
9                 |  5.998 |  0.700  0.004  0.014 |  0.012  0.000  0.000 |
10                |  5.799 |  1.186  0.013  0.042 | -0.222  0.001  0.001 |
                   Dim.3    ctr   cos2  
1                 -1.498  0.026  0.066 |
2                 -1.538  0.027  0.068 |
3                 -1.114  0.014  0.029 |
4                 -1.505  0.026  0.066 |
5                 -1.557  0.028  0.069 |
6                 -1.473  0.025  0.064 |
7                 -1.124  0.015  0.032 |
8                 -1.224  0.017  0.039 |
9                 -1.596  0.030  0.071 |
10                -1.459  0.025  0.063 |

Continuous variables
                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
Initiation_Date   |  0.531  8.252  0.282 | -0.287  2.912  0.082 |  0.117  0.517
Number_Fledged    | -0.031  0.029  0.001 |  0.336  3.996  0.113 |  0.115  0.496
Nest_End_Date     |  0.532  8.273  0.283 | -0.187  1.233  0.035 |  0.143  0.776
                    cos2  
Initiation_Date    0.014 |
Number_Fledged     0.013 |
Nest_End_Date      0.021 |

Categories (the 10 first)
                      Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
Illinois          |   2.246  14.243   0.816  48.602 |  -0.380   0.594   0.023
Wisconsin         |  -1.103   6.998   0.816 -48.602 |   0.187   0.292   0.023
Carroll           |   1.468   0.085   0.009   3.084 |  -1.222   0.086   0.006
Carroll/Whiteside |   1.543   0.031   0.004   1.867 |  -1.536   0.045   0.004
Dane              |  -0.291   0.098   0.007  -3.552 |   2.883  14.046   0.684
Grant             |  -1.704   0.772   0.082  -9.410 |  -1.430   0.791   0.058
Green             |  -1.276   0.775   0.080  -9.555 |  -0.816   0.462   0.033
Grundy            |   2.311   0.745   0.075   9.176 |  -0.037   0.000   0.000
Iowa              |  -1.331   3.470   0.373 -22.364 |   0.213   0.129   0.010
Jo_Daviess        |   2.026   3.155   0.213  19.633 |  -1.770   3.506   0.163
                   v.test     Dim.3     ctr    cos2  v.test  
Illinois           -9.037 |  -0.508   1.214   0.042 -12.487 |
Wisconsin           9.037 |   0.249   0.597   0.042  12.487 |
Carroll            -2.819 |   1.454   0.139   0.009   3.471 |
Carroll/Whiteside  -2.043 |   1.652   0.060   0.004   2.273 |
Dane               38.695 |  -0.083   0.013   0.001  -1.154 |
Grant              -8.674 |  -0.774   0.266   0.017  -4.861 |
Green              -6.714 |   0.274   0.059   0.004   2.328 |
Grundy             -0.161 |   0.161   0.006   0.000   0.725 |
Iowa                3.925 |  -0.264   0.228   0.015  -5.046 |
Jo_Daviess        -18.840 |  -0.330   0.140   0.006  -3.634 |

Realisation of the FAMD

We can couple the summary with a plot of the inertia per axes :

Show the code

fviz_eig(X = famd,
         addlabels = TRUE)

Figure 1: Percentage of explained variances depending on the dimensions

Conclusion : We can see that keeping three axes is necessary to cover as much inertia as possible while being parcimonious. However, we will simplify the analysis by keeping only two axes in the interpretation parts. Now, we can see the contribution of the variables to the factorial axes :

Show the code

par(mfrow = c(1,3))
fviz_contrib(X = famd,
             choice = "var",
             axes = 1
             )
### County / State / Year / Species
fviz_contrib(X = famd,
             choice = "var",
             axes = 2
             )
### Year / Habitat / County / Species

fviz_contrib(X = famd,
             choice = "var",
             axes = 3
             )
famd$var$contrib[,1:3] > 1/9*100

                Dim.1 Dim.2 Dim.3
Initiation_Date FALSE FALSE FALSE
Number_Fledged  FALSE FALSE FALSE
Nest_End_Date   FALSE FALSE FALSE
State            TRUE FALSE FALSE
County           TRUE  TRUE  TRUE
Habitat         FALSE  TRUE  TRUE
Species          TRUE  TRUE  TRUE
Year             TRUE  TRUE  TRUE
Nest_Fate       FALSE FALSE FALSE

Show the code

### County / Habitat / Year / Species

Figure 2: Contributions of the variables to the axes

Figure 3: Contributions of the variables to the axes

Figure 4: Contributions of the variables to the axes

We can see that the only variables that contributed to the axes are the qualitative variables such as County, State, Year, Habitat and Species. The quantitative variables have low contributions.

And now, let’s observe the projections of the variables and of the observations :

Show the code

### Axes 1 and 2
fviz_famd_var(X = famd, 
              repel = TRUE,
              axes = c(1,2),
              col.var = "red",
              alpha.var = 2,
              "quali.var"
              )

Show the code

fviz_famd_var(X = famd, 
              repel = TRUE,
              axes = c(1,2),
              col.var = "red",
              "quanti.var"
              )

Show the code

fviz_famd_ind(X = famd, 
              repel = FALSE,
              axes = c(1,2),
              col.ind = "orange",
              col.quali.var = "red"
              )

Realisation of a Hierarchical clustering

We can use the function fviz_cluster to visualise the optimal number of clusters to make. The k-means method is used.

Show the code

famd_g <- as.data.frame(famd$ind$coord[,1:2])
class <- HCPC(famd_g,
              nb.clust = 3
              )

fviz_nbclust(x = famd_g,
             method = "silhouette",
             FUNcluster = kmeans,
             linecolor = "green3"
             )

Three clusters is the best choice according to the k-means method.

We can define classes for a Hierarchical Clustering thanks to the function fviz_cluster.

Show the code

fviz_cluster(class, axes = c(1,2), ellipse.type = "convex")

The fviz_cluster function shows three clusters. The projection of the variables shows an homogeneous scatterplots. The projection of the observations suggests two groups of individuals, separated merely by the axe of the ordinates. To determine the kinds of the groups, we have to gather variables.

The blue cluster is characterised by :

Year = 1991, 1992, 1993, 1995, 1996 et 1997, 1998
County = Jo_Daviess, Lee, Lee/Ogle, Grundy, Rock, Will (Illinois)
State : Illinois
Species : Dickcissel, Western_Meadowlark,
Habitat : Prairie, Warm-season Grassland

The red cluster is characterised by :

Year : 1994, 1999, 2000, 2001, 2002
County : Green, Grant , Monroe, Lafayette
State : Wisconsin
Species : Savanah Sparrow
Habitat : Pasture

The green side is characterised by :

Year : 2003, 2004, 2005, 2006
County : Dane
State : Wisconsin
Species : Bobolink, Henslow’s Sparrow, Eastern Meadowlark
Habitat : Cool-season Grassland

It is necessary to note that several species are near the border between the Illinois and the Wisconsin. Thus they can be seen in both states, which can be counter-intuitive by seeing the FAMD. However, the only factor that can explain this distribution could be years of sampling. There is a clear separation of the years depending on the part of the graphics.

(byers2017?; ellison2013?; herkert2003?; renfrew2005?; ribic?; ribic2012?; vos2013?)

Data presentation

What kind of questions can we rise ?

First look at the data :

1. Data exploration

Repartition of the data through time

Statistical tests

MCA analysis (data exploring)

II. Analysing the relationship between nest fate and county, habitat, species, year, initiation date using a generalized linear model (GLM) with a binomial distribution

A. Data Exploration

Outlier in \(Y\)

For quantitative Xs : outlier and distribution of \(X\) values

For each qualitative explanatory variable \(X\) : Modalities and number of individuals for each of them

Relationships between Y and Xs ?

Analysing interaction between Xs

Checking collinearity between \(Xs\)

B. Statistical Analysis

Model building up

Model coefficient and its equation

Explanation provided by the model

C. Model validation

Checking for influential individuals

Prediction with different data

III. Analysing the relationship between nest fate and county, habitat, species, year, initiation date using a generalized linear model (GLM) with a Poisson distribution

Variables presentation:

A. Data exploration

Outlayer values in the response variable \(Y\) (the number of young in the nest, Number_Fledged) and values distribution:

Outlayer values in the explanatory variable \(Xs\) and values distribution:

Analyze the relation between Y and Xs

Analyze the interactions between Xs

Checking collinearity between categorical independent variables Xs:

B. Statistical Analysis

Building of the model

C. Model validation:

Conclusions:

IV. ANOVA

Outliers in \(Y\) and distribution of values of \(Y\)

Outliers and distribution of the values of \(X\)

Removal of poorly represented modalities (< 10 data)

Analyzing the potential relationship between Y vs Xs

Analysis of possible interactions between the four Xs

Search for candidate model

Analysis of the coefficients of the candidate model

Explanatory power of the model: R²

Validation of modeling

Normality of residuals

Homogeneity of variance

Independence

Verification of influential individuals

Conclusion of our ANOVA

V. FAMD

Generical exploration

Realisation of the FAMD

Realisation of a Hierarchical clustering

References