Statistics: The Art & Science of Learning from Data

Blogs

20 May 2022

Data Science, Statistics, Biostatistics, Bioinformatics, Research, R

1 Introduction
2 Exploring Data with Graphs & Numerical Summaries
- 2.1 Graphs
- 2.2 Numerical Summaries
3 Descriptive Statistics
4 Correlation Analysis
5 An Introduction to Linear Models
6 An Introduction to R

1 Introduction

1.1 Statistics

Statistics is the science of uncertainty & variability

Statistics turns data into information

Data -> Information -> Knowledge -> Wisdom

Data Driven Decisions (3Ds)

Statistics is the interpretation of Science

Statistics is the Art & Science of learning from data

1.2 Variable

Characteristic that may vary from individual to individual

Height, Weight, CGPA etc

1.3 Measurement

Process of assigning numbers or labels to objects or states in accordance with logically accepted rules

Figure 1.1: Measurement

1.4 Measurement Scales

Nominal Scale: Obersvations may be classified into mutually exclusive & exhaustive classes or categories

Ordinal Scale: Obersvations may be ranked

Interval Scale: Difference between obersvations is meaningful

Ratio Scale: Ratio between obersvations is meaningful & true zero point

Figure 1.2: Measurement Scales

2 Exploring Data with Graphs & Numerical Summaries

2.1 Graphs

2.2 Numerical Summaries

2.2.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

library(readxl)
library(tidyverse)
Data <-
  read_excel(path = "NHANES1.xlsx") |>
  mutate(
    Gender  = fct_relevel(Gender, "Male", "Female")
  , BMI.Cat = fct_relevel(BMI.Cat, "Under Weight", "Normal Weight", "Over Weight", "Obese")
    )
Data

# A tibble: 8,819 × 6
      ID Gender   Age Race                 BMI BMI.Cat      
   <dbl> <fct>  <dbl> <chr>              <dbl> <fct>        
 1 31128 Female    11 Non-Hispanic Black  17.4 Under Weight 
 2 31129 Male      15 Non-Hispanic Black  26.5 Over Weight  
 3 31131 Female    44 Non-Hispanic Black  30.9 Obese        
 4 31132 Male      70 Non-Hispanic White  24.7 Normal Weight
 5 31133 Female    16 Non-Hispanic Black  16.8 Under Weight 
 6 31134 Male      73 Non-Hispanic White  30.6 Obese        
 7 31137 Female    14 Non-Hispanic Black  27.8 Over Weight  
 8 31138 Male       3 Mexican American    16.1 Under Weight 
 9 31139 Female    18 Other Hispanic      29.4 Over Weight  
10 31140 Female    11 Non-Hispanic White  18.2 Under Weight 
# … with 8,809 more rows

library(janitor)
Out1 <-
  Data |>
  tabyl(Gender) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)

Out1

 Gender    n percent
   Male 4305  48.82%
 Female 4514  51.18%
  Total 8819 100.00%

library(scales)
Plot1 <-
  ggplot(data = Data, mapping = aes(x = Gender)) +
  geom_bar(stat = "count", width = 0.4) +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 4700)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Gender"
  , y = "Frequency"
  ) +
  theme_classic() +
  theme(aspect.ratio = 2/1)

Plot1

Out2 <-
  Data |>
  tabyl(Gender)
Out2

 Gender    n   percent
   Male 4305 0.4881506
 Female 4514 0.5118494

library(ggchicklet)
Plot2 <-
  ggplot(data = Out2, mapping = aes(x = Gender, y = n)) +
  geom_chicklet(radius = grid::unit(3, "mm"), width = 0.5) +
  geom_text(mapping = aes(label = scales::comma(n)), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 4700)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Gender"
  , y = "Frequency"
  ) +
  theme_classic() +
  theme(aspect.ratio = 2/1)

Plot2

Out3 <-
  Data |>
  tabyl(Race) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)
Out3

               Race    n percent
   Mexican American 2360  26.76%
 Non-Hispanic Black 2392  27.12%
 Non-Hispanic White 3350  37.99%
     Other Hispanic  291   3.30%
        Other/Mixed  426   4.83%
              Total 8819 100.00%

Plot3 <-
  ggplot(data = Data, mapping = aes(x = Race)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 3500)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "Race"
  , y = "Frequency"
  ) +
  theme_classic()

Plot3

Out4 <-
  Data |>
  tabyl(BMI.Cat) |>
  adorn_totals(c("row")) |>
  adorn_pct_formatting(digits = 2)

Out4

       BMI.Cat    n percent
  Under Weight 1921  21.78%
 Normal Weight 2770  31.41%
   Over Weight 2059  23.35%
         Obese 2069  23.46%
         Total 8819 100.00%

Plot4 <-
  ggplot(data = Data, mapping = aes(x = BMI.Cat)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 3000)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "BMI"
  , y = "Frequency"
  ) +
  theme_classic()

Plot4

Out5 <-
  Data |>
  tabyl(Gender, BMI.Cat) |>
  adorn_totals(c("row", "col")) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 2) |>
  adorn_ns(position = "front") |>
  adorn_title("combined")

Out5

 Gender/BMI.Cat  Under Weight Normal Weight   Over Weight         Obese
           Male  985 (22.88%) 1299 (30.17%) 1115 (25.90%)  906 (21.05%)
         Female  936 (20.74%) 1471 (32.59%)  944 (20.91%) 1163 (25.76%)
          Total 1921 (21.78%) 2770 (31.41%) 2059 (23.35%) 2069 (23.46%)
          Total
 4305 (100.00%)
 4514 (100.00%)
 8819 (100.00%)

Plot5 <-
  ggplot(data = Data, mapping = aes(x = BMI.Cat, fill = Gender)) +
  geom_bar(stat = "count") +
  geom_text(stat = "count", mapping = aes(label = comma(stat(count))), vjust = -0.2) +
  facet_grid(cols = vars(Gender)) +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, 1600)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = "BMI"
  , y = "Frequency"
  ) +
  theme_classic()

Plot5

3 Descriptive Statistics

Number of Observations

Measures of Central Tendency

Measures of Central Dispersion

Measures of Skewness

Measures of Kurtosis

3.0.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

Plot6 <-
  ggplot(data = Data, mapping = aes(y = BMI)) +
  geom_boxplot(stat = "boxplot", outlier.colour = "red") +
  scale_x_discrete() +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, NA)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = ""
  , y = "BMI"
  ) +
  theme_classic()

Plot6

Plot7 <-
  ggplot(data = Data, mapping = aes(y = BMI)) +
  geom_boxplot(stat = "boxplot", outlier.colour = "red") +
  facet_grid(cols = vars(Gender)) +
  scale_x_discrete() +
  scale_y_continuous(
      expand = c(0, NA)
    , limits = c(0, NA)
    , labels = scales::comma
    , breaks = pretty_breaks(12)
    ) +
  labs(
    x = ""
  , y = "BMI"
  ) +
  theme_classic()

Plot7

library(moments)
Out6 <-
  Data |>
  summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )

Out6

# A tibble: 1 × 6
      n  Mean Variance    SD    SK     K
  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1  8819  25.1     57.6  7.59  1.20  8.51

Out7 <-
  Data |>
  filter(BMI < 70) |>
  summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )
Out7

# A tibble: 1 × 6
      n  Mean Variance    SD    SK     K
  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1  8816  25.1     55.8  7.47 0.885  4.19

Out8 <-
  Data |>
  group_by(Gender) |>
    summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )
Out8

# A tibble: 2 × 7
  Gender     n  Mean Variance    SD    SK     K
  <fct>  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1 Male    4305  24.7     50.9  7.14 1.46  14.8 
2 Female  4514  25.4     63.6  7.98 0.994  4.55

Out9 <-
  Data |>
  filter(BMI < 70) |>
  group_by(Gender) |>
    summarise(
    n        = length(BMI)
  , Mean     = mean(BMI)
  , Variance = var(BMI)
  , SD       = sd(BMI)
  , SK       = skewness(BMI)
  , K        = kurtosis(BMI)
  )

Out9

# A tibble: 2 × 7
  Gender     n  Mean Variance    SD    SK     K
  <fct>  <int> <dbl>    <dbl> <dbl> <dbl> <dbl>
1 Male    4304  24.7     48.3  6.95 0.773  4.11
2 Female  4512  25.4     62.6  7.91 0.923  4.07

4 Correlation Analysis

4.0.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

Out10 <-
  Data |>
  select(BMI, Age) |>
  var()

Out10

         BMI       Age
BMI 57.55196  81.74215
Age 81.74215 504.62626

Out11 <-
  Data |>
  select(BMI, Age) |>
  cor()
Out11

          BMI       Age
BMI 1.0000000 0.4796573
Age 0.4796573 1.0000000

Out12 <-
  Data |>
  filter(BMI < 70) |>
  select(BMI, Age) |>
  var()
Out12

         BMI      Age
BMI 55.76963  81.6055
Age 81.60550 504.7595

Out13 <-
  Data |>
  filter(BMI < 70) |>
  select(BMI, Age) |>
  cor()
Out13

          BMI       Age
BMI 1.0000000 0.4863829
Age 0.4863829 1.0000000

Data |>
  filter(Gender == "Male") |>
  select(BMI, Age) |>
  var()

         BMI       Age
BMI 50.91307  82.28081
Age 82.28081 531.44024

Data |>
  filter(Gender == "Male") |>
  select(BMI, Age) |>
  cor()

         BMI      Age
BMI 1.000000 0.500215
Age 0.500215 1.000000

Data |>
  filter(Gender == "Female") |>
  select(BMI, Age) |>
  var()

         BMI      Age
BMI 63.63634  81.5858
Age 81.58580 478.7229

Data |>
  filter(Gender == "Female") |>
  select(BMI, Age) |>
  cor()

          BMI       Age
BMI 1.0000000 0.4674336
Age 0.4674336 1.0000000

5 An Introduction to Linear Models

5.1 Regression Analysis

Quantifying dependency of a normal response on quantitative explanatory variable(s)

Figure 5.1: Population Regression Function

5.1.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm1Plot1 <-
  ggplot(data = Data, mapping = aes(x = Age, y = BMI)) +
  geom_point() +
  theme_classic()

fm1Plot1

fm1 <- lm(formula = BMI ~ Age, data = Data)
summary(fm1)$coef

              Estimate  Std. Error  t value Pr(>|t|)
(Intercept) 20.3072818 0.116956424 173.6312        0
Age          0.1619855 0.003155804  51.3294        0

anova(fm1)

Analysis of Variance Table

Response: BMI
            Df Sum Sq Mean Sq F value    Pr(>F)    
Age          1 116760  116760  2634.7 < 2.2e-16 ***
Residuals 8817 390734      44                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(emmeans)
fm1Plot2 <-
  emmip(
      object     = fm1
    , formula    =  ~ Age
    , cov.reduce = range
    ) +
  scale_x_continuous(breaks = pretty_breaks(12)) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Age", y = "BMI") +
  theme_classic()
fm1Plot2

library(ggpmisc)
fm1Plot3 <-
  ggplot(data = Data, mapping = aes(x = Age, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  stat_poly_eq(
      formula = y ~ x
    , eq.with.lhs = FALSE
    , mapping = aes(
                   label = paste("widehat(italic(BMI))", "~`=`~", gsub("italic(x)","Age",..eq.label..,fixed = TRUE),  "~~~", ..rr.label..,  "~~~",  ..adj.rr.label.., "~~~",  ..p.value.label.. ,  sep = "")
                  )
    , size  = 4.5
    , parse = TRUE
    ) +
  theme_classic()

fm1Plot3

5.2 Analysis of Variance (ANOVA)

Comparing means of Normal dependent variable for levels of different factor(s)

Figure 5.2: Analysis of Variance

5.2.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm2 <- lm(formula = BMI ~ Gender, data = Data)
anova(fm2)

Analysis of Variance Table

Response: BMI
            Df Sum Sq Mean Sq F value    Pr(>F)    
Gender       1   1173 1172.53  20.418 6.304e-06 ***
Residuals 8817 506321   57.43                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fm2Plot1 <-
  emmip(
      object     = fm2
    , formula    =  ~ Gender
    ) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Gender", y = "BMI") +
  theme_classic()
fm2Plot1

5.3 Analysis of Covariance (ANCOVA)

Quantifying dependency of a normal response on quantitative explanatory variable(s)

Comparing means of Normal dependent variable for levels of different factor(s)

Figure 5.3: Analysis of Covariance

5.3.1 Example

The following data shows the ID, Gender (Male, Female), Age, Race (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other/Mixed), BMI, and BMI.Cat (Under Weight, Normal Weight, Over Weight, Obese) from the National Health and Nutrition Examination Survey (NHANES). The survey is conducted by the National Center for Health Statistics (NCHS), and data are publicly available at: https://www.cdc.gov/nchs/nhanes.htm . NHANES data are reported in well over one thousand peer-reviewed journal publications every year.

fm3 <- lm(formula = BMI ~ Age*Gender, data = Data)
anova(fm3)

Analysis of Variance Table

Response: BMI
             Df Sum Sq Mean Sq   F value    Pr(>F)    
Age           1 116760  116760 2647.6079 < 2.2e-16 ***
Gender        1   1722    1722   39.0430 4.338e-10 ***
Age:Gender    1    270     270    6.1294   0.01331 *  
Residuals  8815 388742      44                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fm3)$coef

                    Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)      20.06946217 0.165999451 120.900775  0.000000e+00
Age               0.15482609 0.004390925  35.260475 6.652534e-255
GenderFemale      0.42441011 0.233467368   1.817856  6.912004e-02
Age:GenderFemale  0.01559778 0.006300190   2.475763  1.331382e-02

fm3Plot1 <-
  emmip(
      object     = fm3
    , formula    = Gender ~ Age
    , cov.reduce = range
    ) +
  scale_x_continuous(breaks = pretty_breaks(12)) +
  scale_y_continuous(breaks = pretty_breaks(12)) +
  labs(x = "Age", y = "BMI") +
  theme_classic()
fm3Plot1

6 An Introduction to R

R is a free, open-source programming language and software environment for statistical computing, bioinformatics, visualization and general computing. R provides a wide variety of statistical and graphical techniques, and is highly extensible. The latest version of R can be obtained from https://cran.r-project.org/bin/windows/base/.

Statistics: The Art & Science of Learning from Data

Blogs

1 Introduction

1.1 Statistics

1.2 Variable

1.3 Measurement

1.4 Measurement Scales

2 Exploring Data with Graphs & Numerical Summaries

2.1 Graphs

2.2 Numerical Summaries

2.2.1 Example

3 Descriptive Statistics

3.0.1 Example

4 Correlation Analysis

4.0.1 Example

5 An Introduction to Linear Models

5.1 Regression Analysis

5.1.1 Example

5.2 Analysis of Variance (ANOVA)

5.2.1 Example

5.3 Analysis of Covariance (ANCOVA)

5.3.1 Example

6 An Introduction to R

R R for Windows (32/64 bit)

RStudio RStudio

Manual An Introduction to R

Ref Card R Reference Card

New York Times New York Times

Nature Article Nature Article