Statistics: The Art & Science of Learning from Data

1 Introduction

1.1 Statistics

  • Statistics is the science of uncertainty & variability
  • Statistics turns data into information
  • Data -> Information -> Knowledge -> Wisdom
  • Data Driven Decisions (3Ds)
  • Statistics is the interpretation of Science
  • Statistics is the Art & Science of learning from data

 

1.2 Variable

  • Characteristic that may vary from individual to individual
  • Height, Weight, CGPA etc

 

1.3 Measurement

  • Process of assigning numbers or labels to objects or states in accordance with logically accepted rules
Measurement

Figure 1.1: Measurement

 

1.4 Measurement Scales

  • Nominal Scale: Obersvations may be classified into mutually exclusive & exhaustive classes or categories
  • Ordinal Scale: Obersvations may be ranked
  • Interval Scale: Difference between obersvations is meaningful
  • Ratio Scale: Ratio between obersvations is meaningful & true zero point
Measurement Scales

Figure 1.2: Measurement Scales

 

2 Exploring Data with Graphs & Numerical Summaries

2.1 Graphs

2.2 Numerical Summaries

2.2.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

library(readxl)
library(tidyverse)
Data <-
  read_excel(path = "NHANES1.xlsx") |>
  mutate(
    Gender  = fct_relevel(Gender, "Male", "Female")
  , BMI.Cat = fct_relevel(BMI.Cat, "Under Weight", "Normal Weight", "Over Weight", "Obese")
    )
Data
# A tibble: 8,819 × 6
      ID Gender   Age Race                 BMI BMI.Cat      
   <dbl> <fct>  <dbl> <chr>              <dbl> <fct>        
 1 31128 Female    11 Non-Hispanic Black  17.4 Under Weight 
 2 31129 Male      15 Non-Hispanic Black  26.5 Over Weight  
 3 31131 Female    44 Non-Hispanic Black  30.9 Obese        
 4 31132 Male      70 Non-Hispanic White  24.7 Normal Weight
 5 31133 Female    16 Non-Hispanic Black  16.8 Under Weight 
 6 31134 Male      73 Non-Hispanic White  30.6 Obese        
 7 31137 Female    14 Non-Hispanic Black  27.8 Over Weight  
 8 31138 Male       3 Mexican American    16.1 Under Weight 
 9 31139 Female    18 Other Hispanic      29.4 Over Weight  
10 31140 Female    11 Non-Hispanic White  18.2 Under Weight 
# … with 8,809 more rows
library(janitor)
Out1 <-
  Data |>
  tabyl(Gender) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)

Out1
 Gender    n percent
   Male 4305  48.82%
 Female 4514  51.18%
  Total 8819 100.00%
library(scales)
Plot1 <-
  ggplot(data = Data, mapping = aes(x = Gender)) +
  geom_bar(stat = "count", width = 0.4) +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 4700)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Gender"
  , y = "Frequency"
  ) +
  theme_classic() +
  theme(aspect.ratio = 2/1)

Plot1

Out2 <-
  Data |>
  tabyl(Gender)
Out2
 Gender    n   percent
   Male 4305 0.4881506
 Female 4514 0.5118494
library(ggchicklet)
Plot2 <-
  ggplot(data = Out2, mapping = aes(x = Gender, y = n)) +
  geom_chicklet(radius = grid::unit(3, "mm"), width = 0.5) +
  geom_text(mapping = aes(label = scales::comma(n)), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 4700)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Gender"
  , y = "Frequency"
  ) +
  theme_classic() +
  theme(aspect.ratio = 2/1)

Plot2

Out3 <-
  Data |>
  tabyl(Race) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)
Out3
               Race    n percent
   Mexican American 2360  26.76%
 Non-Hispanic Black 2392  27.12%
 Non-Hispanic White 3350  37.99%
     Other Hispanic  291   3.30%
        Other/Mixed  426   4.83%
              Total 8819 100.00%
Plot3 <-
  ggplot(data = Data, mapping = aes(x = Race)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 3500)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Race"
  , y = "Frequency"
  ) +
  theme_classic()

Plot3

Out4 <-
  Data |>
  tabyl(BMI.Cat) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)

Out4
       BMI.Cat    n percent
  Under Weight 1921  21.78%
 Normal Weight 2770  31.41%
   Over Weight 2059  23.35%
         Obese 2069  23.46%
         Total 8819 100.00%
Plot4 <-
  ggplot(data = Data, mapping = aes(x = BMI.Cat)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 3000)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "BMI"
  , y = "Frequency"
  ) +
  theme_classic()

Plot4

Out5 <-
  Data |>
  tabyl(Gender, BMI.Cat) |>
  adorn_totals(c("row", "col")) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 2) |>
  adorn_ns(position = "front") |>
  adorn_title("combined")

Out5
 Gender/BMI.Cat  Under Weight Normal Weight   Over Weight         Obese
           Male  985 (22.88%) 1299 (30.17%) 1115 (25.90%)  906 (21.05%)
         Female  936 (20.74%) 1471 (32.59%)  944 (20.91%) 1163 (25.76%)
          Total 1921 (21.78%) 2770 (31.41%) 2059 (23.35%) 2069 (23.46%)
          Total
 4305 (100.00%)
 4514 (100.00%)
 8819 (100.00%)
Plot5 <-
  ggplot(data = Data, mapping = aes(x = BMI.Cat, fill = Gender)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  facet_grid(cols = vars(Gender)) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 1600)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "BMI"
  , y = "Frequency"
  ) +
  theme_classic()

Plot5

3 Descriptive Statistics

  • Number of Observations
  • Measures of Central Tendency
  • Measures of Central Dispersion
  • Measures of Skewness
  • Measures of Kurtosis

 

3.0.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

Plot6 <-
  ggplot(data = Data, mapping = aes(y = BMI)) +
  geom_boxplot(stat = "boxplot", outlier.colour = "red") +
  scale_x_discrete() +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, NA)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = ""
  , y = "BMI"
  ) +
  theme_classic()

Plot6

Plot7 <-
  ggplot(data = Data, mapping = aes(y = BMI)) +
  geom_boxplot(stat = "boxplot", outlier.colour = "red") +
  facet_grid(cols = vars(Gender)) +
  scale_x_discrete() +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, NA)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = ""
  , y = "BMI"
  ) +
  theme_classic()

Plot7

library(moments)
Out6 <-
  Data |>
  summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )

Out6
# A tibble: 1 × 6
      n  Mean Variance    SD    SK     K
  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1  8819  25.1     57.6  7.59  1.20  8.51
Out7 <-
  Data |>
  filter(BMI < 70) |>
  summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )
Out7
# A tibble: 1 × 6
      n  Mean Variance    SD    SK     K
  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1  8816  25.1     55.8  7.47 0.885  4.19
Out8 <-
  Data |>
  group_by(Gender) |>
    summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )
Out8
# A tibble: 2 × 7
  Gender     n  Mean Variance    SD    SK     K
  <fct>  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1 Male    4305  24.7     50.9  7.14 1.46  14.8 
2 Female  4514  25.4     63.6  7.98 0.994  4.55
Out9 <-
  Data |>
  filter(BMI < 70) |>
  group_by(Gender) |>
    summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )

Out9
# A tibble: 2 × 7
  Gender     n  Mean Variance    SD    SK     K
  <fct>  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1 Male    4304  24.7     48.3  6.95 0.773  4.11
2 Female  4512  25.4     62.6  7.91 0.923  4.07

4 Correlation Analysis

4.0.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

Out10 <-
  Data |>
  select(BMI, Age) |>
  var()

Out10
         BMI       Age
BMI 57.55196  81.74215
Age 81.74215 504.62626
Out11 <-
  Data |>
  select(BMI, Age) |>
  cor()
Out11
          BMI       Age
BMI 1.0000000 0.4796573
Age 0.4796573 1.0000000
Out12 <-
  Data |>
  filter(BMI < 70) |>
  select(BMI, Age) |>
  var()
Out12
         BMI      Age
BMI 55.76963  81.6055
Age 81.60550 504.7595
Out13 <-
  Data |>
  filter(BMI < 70) |>
  select(BMI, Age) |>
  cor()
Out13
          BMI       Age
BMI 1.0000000 0.4863829
Age 0.4863829 1.0000000
Data |>
  filter(Gender == "Male") |>
  select(BMI, Age) |>
  var()
         BMI       Age
BMI 50.91307  82.28081
Age 82.28081 531.44024
Data |>
  filter(Gender == "Male") |>
  select(BMI, Age) |>
  cor()
         BMI      Age
BMI 1.000000 0.500215
Age 0.500215 1.000000
Data |>
  filter(Gender == "Female") |>
  select(BMI, Age) |>
  var()
         BMI      Age
BMI 63.63634  81.5858
Age 81.58580 478.7229
Data |>
  filter(Gender == "Female") |>
  select(BMI, Age) |>
  cor()
          BMI       Age
BMI 1.0000000 0.4674336
Age 0.4674336 1.0000000

 

5 An Introduction to Linear Models

5.1 Regression Analysis

  • Quantifying dependency of a normal response on quantitative explanatory variable(s)

 

Population Regression Function

Figure 5.1: Population Regression Function

 

5.1.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm1Plot1 <-
  ggplot(data = Data, mapping = aes(x = Age, y = BMI)) +
  geom_point() +
  theme_classic()

fm1Plot1

fm1 <- lm(formula = BMI ~ Age, data = Data)
summary(fm1)$coef
              Estimate  Std. Error  t value Pr(>|t|)
(Intercept) 20.3072818 0.116956424 173.6312        0
Age          0.1619855 0.003155804  51.3294        0
anova(fm1)
Analysis of Variance Table

Response: BMI
            Df Sum Sq Mean Sq F value    Pr(>F)    
Age          1 116760  116760  2634.7 < 2.2e-16 ***
Residuals 8817 390734      44                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(emmeans)
fm1Plot2 <-
  emmip(
      object     = fm1
    , formula    =  ~ Age
    , cov.reduce = range
    ) +
  scale_x_continuous(breaks = pretty_breaks(12)) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Age", y = "BMI") +
  theme_classic()
fm1Plot2

library(ggpmisc)
fm1Plot3 <-
  ggplot(data = Data, mapping = aes(x = Age, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  stat_poly_eq(
      formula = y ~ x
    , eq.with.lhs = FALSE
    , mapping = aes(
                   label = paste("widehat(italic(BMI))", "~`=`~", gsub("italic(x)","Age",..eq.label..,fixed = TRUE),  "~~~", ..rr.label..,  "~~~",  ..adj.rr.label.., "~~~",  ..p.value.label.. ,  sep = "")
                  )
    , size  = 4.5
    , parse = TRUE
    ) +
  theme_classic()

fm1Plot3

5.2 Analysis of Variance (ANOVA)

  • Comparing means of Normal dependent variable for levels of different factor(s)

 

Analysis of Variance

Figure 5.2: Analysis of Variance

 

5.2.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm2 <- lm(formula = BMI ~ Gender, data = Data)
anova(fm2)
Analysis of Variance Table

Response: BMI
            Df Sum Sq Mean Sq F value    Pr(>F)    
Gender       1   1173 1172.53  20.418 6.304e-06 ***
Residuals 8817 506321   57.43                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fm2Plot1 <-
  emmip(
      object     = fm2
    , formula    =  ~ Gender
    ) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Gender", y = "BMI") +
  theme_classic()
fm2Plot1

5.3 Analysis of Covariance (ANCOVA)

  • Quantifying dependency of a normal response on quantitative explanatory variable(s)
  • Comparing means of Normal dependent variable for levels of different factor(s)

 

Analysis of Covariance

Figure 5.3: Analysis of Covariance

 

5.3.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm3 <- lm(formula = BMI ~ Age*Gender, data = Data)
anova(fm3)
Analysis of Variance Table

Response: BMI
             Df Sum Sq Mean Sq   F value    Pr(>F)    
Age           1 116760  116760 2647.6079 < 2.2e-16 ***
Gender        1   1722    1722   39.0430 4.338e-10 ***
Age:Gender    1    270     270    6.1294   0.01331 *  
Residuals  8815 388742      44                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fm3)$coef
                    Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)      20.06946217 0.165999451 120.900775  0.000000e+00
Age               0.15482609 0.004390925  35.260475 6.652534e-255
GenderFemale      0.42441011 0.233467368   1.817856  6.912004e-02
Age:GenderFemale  0.01559778 0.006300190   2.475763  1.331382e-02
fm3Plot1 <-
  emmip(
      object     = fm3
    , formula    = Gender ~ Age
    , cov.reduce = range
    ) +
  scale_x_continuous(breaks = pretty_breaks(12)) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Age", y = "BMI") +
  theme_classic()
fm3Plot1

6 An Introduction to R

R is a free, open-source programming language and software environment for statistical computing, bioinformatics, visualization and general computing. R provides a wide variety of statistical and graphical techniques, and is highly extensible. The latest version of R can be obtained from https://cran.r-project.org/bin/windows/base/.

 

  R R for Windows (32/64 bit)  

 

  RStudio RStudio  

 

  Manual An Introduction to R  

 

  Ref Card R Reference Card  

 

  New York Times New York Times  

 

  Nature Article Nature Article  

comments powered by Disqus