# Statistics

• Statistics is the science of uncertainty & variability
• Statistics turns data into information
• Data -> Information -> Knowledge -> Wisdom
• Data Driven Decisions (3Ds)
• Statistics is the interpretation of Science
• Statistics is the Art & Science of learning from data

# Variable

• A characteristic that may vary from individual to individual
• Examples: height, weight, CGPA, etc.

# Measurement

• Process of assigning numbers or labels to objects or states in accordance with logically accepted rules

# Measurement Scales

• Nominal Scale: Observations may be classified into mutually exclusive & exhaustive classes or categories
• Ordinal Scale: Observations may be ranked
• Interval Scale: Difference between observations is meaningful
• Ratio Scale: Ratio between observations is meaningful & there is a true zero point
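
In R, the measurement scale of a variable is usually reflected in how it is stored. The sketch below is only an illustration and not part of the original notes: the object names (`gender`, `grade`, `temp_c`, `height_cm`) and their values are made up for the example.

```r
# Nominal: unordered categories -> factor
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal: ranked categories -> ordered factor (levels listed from lowest to highest)
grade <- factor(c("B", "A", "F", "C"),
                levels = c("F", "D", "C", "B", "A"), ordered = TRUE)

# Interval: differences are meaningful, ratios are not (0 degrees Celsius is not "no temperature")
temp_c <- c(20, 25, 40)

# Ratio: a true zero exists, so ratios are meaningful
height_cm <- c(150, 165, 180)

is.ordered(gender)            # FALSE
is.ordered(grade)             # TRUE
grade[1] > grade[4]           # TRUE: "B" ranks above "C"
diff(temp_c)                  # 5 15
height_cm[3] / height_cm[1]   # 1.2
```

Comparison operators such as `>` work on `grade` only because it is an ordered factor; applying them to the unordered `gender` factor returns `NA` with a warning.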

## Nominal Data

### Example

The following data shows the gender of a sample of twenty students from the University of Agriculture, Faisalabad.

Student Gender
1 Male
2 Male
3 Female
4 Female
5 Female
6 Female
7 Male
8 Male
9 Male
10 Male
11 Female
12 Female
13 Male
14 Male
15 Female
16 Female
17 Male
18 Male
19 Male
20 Male

if (!require("tidyverse")) install.packages("tidyverse")
# Load the tidyverse so that tibble(), %>% and ggplot() can be used without a namespace prefix
library(tidyverse)
df1 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
)
df1
# A tibble: 20 x 2
Student Gender
<dbl> <chr>
1       1 Male
2       2 Male
3       3 Female
4       4 Female
5       5 Female
6       6 Female
7       7 Male
8       8 Male
9       9 Male
10      10 Male
11      11 Female
12      12 Female
13      13 Male
14      14 Male
15      15 Female
16      16 Female
17      17 Male
18      18 Male
19      19 Male
20      20 Male  

#### Frequency Table

df1Freq <-
df1 %>%
dplyr::count(Gender) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
)
df1Freq
# A tibble: 2 x 4
Gender     f    rf    pf
<chr>  <int> <dbl> <dbl>
1 Female     8   0.4    40
2 Male      12   0.6    60

#### Simple Bar Chart

ggplot(
data = df1
,  mapping = aes(x = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

### Example

The following data present the nucleotide counts (A, C, G, T) of a gene sequence. The example uses the Zyxin gene, which plays an important role in cell adhesion (Golub et al., 1999). The accession number (X94991.1) of one of its variants can be found in a database such as NCBI (UniGene). The data are used to illustrate the construction of a pie chart from the frequency table of the four nucleotides.

A C G T
410 789 573 394

Data from GenBank can also be imported directly with the following code.

# install.packages(pkgs = "ape", repos = "http://cran.r-project.org", dependencies = TRUE)
# library(ape)
# df2

#### Frequency Table

df3 <- tibble(
G.Sequence   = c("A", "C", "G", "T")
, Nucleotides  = c(410, 789, 573, 394)
)
df3
# A tibble: 4 x 2
G.Sequence Nucleotides
<chr>            <dbl>
1 A                  410
2 C                  789
3 G                  573
4 T                  394

#### Simple Bar Chart

ggplot(
data = df3
, mapping = aes(x = G.Sequence, y = Nucleotides)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gene Sequence", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
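
The example text mentions a pie chart, while the code above draws a bar chart. A minimal base R sketch of the pie chart could look as follows; the object names `nucleotides`, `pct` and `slice_labels` are introduced here for illustration only.

```r
# Nucleotide counts of the Zyxin gene, copied from the table above
nucleotides <- c(A = 410, C = 789, G = 573, T = 394)

# Percentage share of each nucleotide, rounded to one decimal place
pct <- round(100 * nucleotides / sum(nucleotides), 1)

# Slice labels such as "A (18.9%)"
slice_labels <- paste0(names(nucleotides), " (", pct, "%)")

# Base R pie chart of the four nucleotides
pie(x = nucleotides, labels = slice_labels, main = "Pie Chart of Nucleotides")
```

Showing the percentages in the labels makes the slices easier to compare, since angles are harder to judge by eye than bar heights.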

## Ordinal Data

### Example

The following data shows the grades of a sample of twenty students from the University of Agriculture, Faisalabad.

Student Grade
1 A
2 B
3 B
4 C
5 A
6 D
7 F
8 C
9 B
10 D
11 F
12 A
13 B
14 B
15 C
16 D
17 C
18 B
19 C
20 D

df4 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Grade  = c("A", "B", "B", "C", "A", "D", "F", "C", "B", "D", "F", "A", "B", "B", "C", "D", "C",  "B", "C", "D")
)
df4
# A tibble: 20 x 2
Student Grade
<dbl> <chr>
1       1 A
2       2 B
3       3 B
4       4 C
5       5 A
6       6 D
7       7 F
8       8 C
9       9 B
10      10 D
11      11 F
12      12 A
13      13 B
14      14 B
15      15 C
16      16 D
17      17 C
18      18 B
19      19 C
20      20 D    

#### Frequency Table

df4Freq <-
df4 %>%
dplyr::count(Grade) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
, cf = cumsum(f)
)
df4Freq
# A tibble: 5 x 5
Grade     f    rf    pf    cf
<chr> <int> <dbl> <dbl> <int>
1 A         3  0.15    15     3
2 B         6  0.3     30     9
3 C         5  0.25    25    14
4 D         4  0.2     20    18
5 F         2  0.1     10    20
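
The columns of the frequency table follow the usual definitions. Writing $f_1, \dots, f_k$ for the frequencies of the $k$ categories (notation introduced here, not in the original notes), the relative, percentage and cumulative frequencies are:

$$
rf_i = \frac{f_i}{\sum_{j=1}^{k} f_j}, \qquad
pf_i = 100 \times rf_i, \qquad
cf_i = \sum_{j=1}^{i} f_j
$$

For grade B, for instance, $rf = 6/20 = 0.30$, $pf = 30$ and $cf = 3 + 6 = 9$. The cumulative frequency is meaningful here because the grades are ordinal and the alphabetical order of the categories (A to F) coincides with their ranking.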

#### Simple Bar Chart

ggplot(
data = df4
,  mapping = aes(x = Grade)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Grades", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

## Two Way Contingency Table

### Example

The following data shows the gender and residential status of a sample of twenty students from the University of Agriculture, Faisalabad.

Student Gender RS
1 Male Boarding
2 Male Non-Boarding
3 Female Non-Boarding
4 Female Boarding
5 Female Boarding
6 Female Non-Boarding
7 Male Non-Boarding
8 Male Boarding
9 Male Non-Boarding
10 Male Non-Boarding
11 Female Non-Boarding
12 Female Boarding
13 Male Non-Boarding
14 Male Non-Boarding
15 Female Non-Boarding
16 Female Non-Boarding
17 Male Non-Boarding
18 Male Non-Boarding
19 Male Non-Boarding
20 Male Boarding

df5 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
, RS      = c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB", "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")
)
df5
# A tibble: 20 x 3
Student Gender RS
<dbl> <chr>  <chr>
1       1 Male   B
2       2 Male   NB
3       3 Female NB
4       4 Female B
5       5 Female B
6       6 Female NB
7       7 Male   NB
8       8 Male   B
9       9 Male   NB
10      10 Male   NB
11      11 Female NB
12      12 Female B
13      13 Male   NB
14      14 Male   NB
15      15 Female NB
16      16 Female NB
17      17 Male   NB
18      18 Male   NB
19      19 Male   NB
20      20 Male   B    

#### Cross Tables

df5 %>%
dplyr::count(Gender) %>%
dplyr::rename(f = n)
# A tibble: 2 x 2
Gender     f
<chr>  <int>
1 Female     8
2 Male      12
df5 %>%
dplyr::count(RS) %>%
dplyr::rename(f = n)
# A tibble: 2 x 2
RS        f
<chr> <int>
1 B         6
2 NB       14
df5 %>%
dplyr::count(Gender, RS) %>%
dplyr::rename(f = n)
# A tibble: 4 x 3
Gender RS        f
<chr>  <chr> <int>
1 Female B         3
2 Female NB        5
3 Male   B         3
4 Male   NB        9
if (!require("janitor")) install.packages("janitor")
# library(janitor)
df5CrossTab <-
df5 %>%
janitor::tabyl(dat = ., var1 = Gender, var2 = RS) %>%
janitor::adorn_totals(dat = ., where = c("row", "col"))
df5CrossTab
 Gender B NB Total
Female 3  5     8
Male 3  9    12
Total 6 14    20
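
Counts in a two-way table are often easier to read as proportions. The following base R sketch rebuilds the same cross table with `table()` and converts it to row and column proportions with `prop.table()`; the vectors `gender` and `rs` simply repeat the values stored in `df5`, and the names `tab`, `row_prop` and `col_prop` are chosen for this example.

```r
# Gender and residential status (B = Boarding, NB = Non-Boarding), as in df5
gender <- rep(c("Male", "Female", "Male", "Female", "Male", "Female", "Male"),
              times = c(2, 4, 4, 2, 2, 2, 4))
rs <- c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB",
        "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")

# Two-way table of counts
tab <- table(Gender = gender, RS = rs)

# Row proportions: distribution of residential status within each gender
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 3)

# Column proportions: distribution of gender within each residential status
col_prop <- prop.table(tab, margin = 2)
round(col_prop, 3)
```

With these data, 3 of the 8 female students (37.5%) and 3 of the 12 male students (25%) are boarders.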

#### Simple Bar Charts

ggplot(
data = df5
,  mapping = aes(x = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

ggplot(
data = df5
,  mapping = aes(x = RS)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

#### Multiple Bar Charts

ggplot(
data = df5
,  mapping = aes(x = Gender, fill = RS)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

ggplot(
data = df5
,  mapping = aes(x = RS, fill = Gender)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

#### Component Bar Charts

ggplot(
data = df5
,  mapping = aes(x = Gender, fill = RS)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

ggplot(
data = df5
,  mapping = aes(x = RS, fill = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

### Example

Source: OMB Statistical Policy Working Paper 22. https://www.hhs.gov/sites/default/files/spwp22.pdf

The following data set contains information on delinquent children. The recorded variables are the number of delinquent children by county and the education level of the household head. In the code below, the four county labels (Alpha, Beta, Gamma, Delta) are stored in the column `Delinquent`.

#### Cross Tables

df6 <- tibble(
Delinquent = gl(n = 4, k = 4, length = 16, labels = c("Alpha", "Beta", "Gamma", "Delta"))
, EduLevel   = gl(n = 4, k = 1, length = 16, labels = c("Low", "Medium", "High", "Very High"))
, Freq       = c(15, 0, 5, 0, 20, 10, 10, 15, 5, 10, 10, 0, 10, 15, 5, 5)
)

df6
# A tibble: 16 x 3
Delinquent EduLevel   Freq
<fct>      <fct>     <dbl>
1 Alpha      Low          15
2 Alpha      Medium        0
3 Alpha      High          5
4 Alpha      Very High     0
5 Beta       Low          20
6 Beta       Medium       10
7 Beta       High         10
8 Beta       Very High    15
9 Gamma      Low           5
10 Gamma      Medium       10
11 Gamma      High         10
12 Gamma      Very High     0
13 Delta      Low          10
14 Delta      Medium       15
15 Delta      High          5
16 Delta      Very High     5
df6 %>%
xtabs(data = ., Freq ~ Delinquent)
Delinquent
Alpha  Beta Gamma Delta
20    55    25    35 
df6 %>%
xtabs(data = ., Freq ~ EduLevel)
EduLevel
Low    Medium      High Very High
50        35        30        20 
df6 %>%
xtabs(data = ., Freq ~ Delinquent + EduLevel)
          EduLevel
Delinquent Low Medium High Very High
Alpha  15      0    5         0
Beta   20     10   10        15
Gamma   5     10   10         0
Delta  10     15    5         5

#### Multiple Bar Charts

ggplot(
data = df6
,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

ggplot(
data = df6
,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

#### Component Bar Charts

ggplot(
data = df6
,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

ggplot(
data = df6
,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

## Count Data

### Example

The following data shows the number of notebooks kept by each student in a sample of twenty students.

Student Notebook
1 3
2 1
3 0
4 2
5 2
6 4
7 5
8 1
9 1
10 2
11 3
12 4
13 2
14 5
15 1
16 5
17 4
18 2
19 2
20 3

df7 <- tibble::tibble(
Student  = seq(from = 1, to = 20, by = 1)
, Notebook = c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)
)
df7
# A tibble: 20 x 2
Student Notebook
<dbl>    <dbl>
1       1        3
2       2        1
3       3        0
4       4        2
5       5        2
6       6        4
7       7        5
8       8        1
9       9        1
10      10        2
11      11        3
12      12        4
13      13        2
14      14        5
15      15        1
16      16        5
17      17        4
18      18        2
19      19        2
20      20        3

#### Frequency Table

df7Freq <-
df7 %>%
dplyr::count(Notebook) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
, cf = cumsum(f)
)
df7Freq
# A tibble: 6 x 5
Notebook     f    rf    pf    cf
<dbl> <int> <dbl> <dbl> <int>
1        0     1  0.05     5     1
2        1     4  0.2     20     5
3        2     6  0.3     30    11
4        3     3  0.15    15    14
5        4     3  0.15    15    17
6        5     3  0.15    15    20

#### Simple Bar Chart

ggplot(
data = df7
,  mapping = aes(x = Notebook)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Notebooks", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))

## Continuous Data

### Example

The following data are the final plant heights (cm) of thirty wheat plants.

87 91 89 88 89 91 87 92 90 98 95 97 96 100 101
96 98 99 98 100 102 99 101 105 103 107 105 106 107 112

df8 <- tibble::tibble(
PlantHeight = c(
87,   91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100,    101,
96,   98, 99, 98, 100,    102,    99, 101,    105,    103,    107,    105,    106,    107,    112
)
)
df8
# A tibble: 30 x 1
PlantHeight
<dbl>
1          87
2          91
3          89
4          88
5          89
6          91
7          87
8          92
9          90
10          98
# ... with 20 more rows

#### Frequency Distribution

df9 <- df8 %>%
summarize(
R = max(PlantHeight) - min(PlantHeight)
, k = floor(1 + 3.3*log10(length(PlantHeight)))
, h = R/k
)
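
The `summarize()` call above computes the range $R$, the number of classes $k$ and the class width $h$. The expression used for $k$ is the one commonly presented as Sturges' rule; in symbols (with $n$ the number of observations):

$$
R = x_{\max} - x_{\min}, \qquad
k = \left\lfloor 1 + 3.3 \log_{10} n \right\rfloor, \qquad
h = \frac{R}{k}
$$

For the thirty plant heights, $R = 112 - 87 = 25$, $1 + 3.3 \log_{10} 30 \approx 5.87$ so $k = 5$, and $h = 25/5 = 5$.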

df8Freq <- df8 %>%
mutate(
Classes = cut(
x              = PlantHeight
, breaks         = df9$k
, include.lowest = TRUE
, right          = FALSE
)
) %>%
count(Classes) %>%
tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
rename(f = n) %>%
mutate(
LB       = readr::parse_number(x = LB)
, UB       = readr::parse_number(x = UB)
, rf       = f/sum(f)
, pf       = f/sum(f)*100
, cf       = cumsum(f)
, MidPoint = (LB + UB)/2
)
df8Freq
# A tibble: 5 x 8
Classes      LB    UB     f    rf    pf    cf MidPoint
<fct>     <dbl> <dbl> <int> <dbl> <dbl> <int>    <dbl>
1 [87,92)      87    92     8 0.267  26.7     8     89.5
2 [92,97)      92    97     4 0.133  13.3    12     94.5
3 [97,102)     97   102    10 0.333  33.3    22     99.5
4 [102,107)   102   107     5 0.167  16.7    27    104.
5 [107,112]   107   112     3 0.1    10      30    110.

#### Histogram

ggplot(
data = df8
, mapping = aes(x = PlantHeight)) +
geom_histogram() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Histogram for Plant Height", x = "Plant Height", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = f)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Frequency for Plant Height", x = "Mid Point", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = cf)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") +
theme(plot.title = element_text(hjust = 0.5))

#### Stem and Leaf Plot

stem(x = df8$PlantHeight, scale = 1, width = 80, atom = 1e-08)

The decimal point is 1 digit(s) to the right of the |

8 | 77899
9 | 0112
9 | 566788899
10 | 001123
10 | 55677
11 | 2

#### Box Plot

ggplot(data = df8 , aes( y = PlantHeight)) +
geom_boxplot()+
theme_bw()

### Example

The golub table contains gene expression values for 3051 genes taken from 38 leukemia patients. Twenty-seven patients were diagnosed with acute lymphoblastic leukemia (ALL) and eleven with acute myeloid leukemia (AML). The golub.gnames table contains information on each gene, including gene index, manufacturing ID, and biological name. The following table presents gene expression values by tumor type.

tumortype genevalue tumortype genevalue
ALL 2.10892 ALL 1.78352
ALL 1.52405 ALL 0.45827
ALL 1.96403 ALL 2.18119
ALL 2.33597 ALL 2.31428
ALL 1.85111 ALL 1.99927
ALL 1.99391 ALL 1.36844
ALL 2.06597 ALL 2.37351
ALL 1.81649 ALL 1.83485
ALL 2.17622 AML 0.88941
ALL 1.80861 AML 1.45014
ALL 2.44562 AML 0.42904
ALL 1.90496 AML 0.82667
ALL 2.76610 AML 0.63637
ALL 1.32551 AML 1.02250
ALL 2.59385 AML 0.12758
ALL 1.92776 AML -0.74333
ALL 1.10546 AML 0.73784
ALL 1.27645 AML 0.49470
ALL 1.83051 AML 1.12058
df10 <- tibble(
genevalue = c(
2.10892, 1.52405, 1.96403, 2.33597, 1.85111, 1.99391
, 2.06597, 1.81649, 2.17622, 1.80861, 2.44562, 1.90496
, 2.76610, 1.32551, 2.59385, 1.92776, 1.10546, 1.27645
, 1.83051, 1.78352, 0.45827, 2.18119, 2.31428, 1.99927
, 1.36844, 2.37351, 1.83485, 0.88941, 1.45014, 0.42904
, 0.82667, 0.63637, 1.02250, 0.12758, -0.74333, 0.73784
, 0.49470, 1.12058
)
, tumortype = rep(c("ALL","AML"), c(27, 11))
)

df10
# A tibble: 38 x 2
genevalue tumortype
<dbl> <chr>
1      2.11 ALL
2      1.52 ALL
3      1.96 ALL
4      2.34 ALL
5      1.85 ALL
6      1.99 ALL
7      2.07 ALL
8      1.82 ALL
9      2.18 ALL
10      1.81 ALL
# ... with 28 more rows

#### Frequency Distribution

df11 <- df10 %>%
summarize(
R = max(genevalue) - min(genevalue)
, k = floor(1 + 3.3*log10(length(genevalue)))
, h = R/k
)

df10Freq <- df10 %>%
mutate(
Classes = cut(
x              = genevalue
, breaks         = df11$k
, include.lowest = TRUE
, right          = FALSE
)
) %>%
count(Classes) %>%
tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
rename(f = n) %>%
mutate(
LB       = readr::parse_number(x = LB)
, UB       = readr::parse_number(x = UB)
, rf       = f/sum(f)
, pf       = f/sum(f)*100
, cf       = cumsum(f)
, MidPoint = (LB + UB)/2
)
df10Freq
# A tibble: 6 x 8
Classes             LB     UB     f     rf    pf    cf MidPoint
<fct>            <dbl>  <dbl> <int>  <dbl> <dbl> <int>    <dbl>
1 [-0.747,-0.158) -0.747 -0.158     1 0.0263  2.63     1   -0.452
2 [-0.158,0.426)  -0.158  0.426     1 0.0263  2.63     2    0.134
3 [0.426,1.01)     0.426  1.01      7 0.184  18.4      9    0.718
4 [1.01,1.6)       1.01   1.6       8 0.211  21.1     17    1.31
5 [1.6,2.18)       1.6    2.18     15 0.395  39.5     32    1.89
6 [2.18,2.77]      2.18   2.77      6 0.158  15.8     38    2.48

#### Histogram

ggplot(
data = df10
, mapping = aes(x = genevalue)) +
geom_histogram() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Histogram for Gene Value", x = "Gene Value", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = f)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Frequency for Gene Value", x = "Mid Point", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = cf)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw() +
labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") +
theme(plot.title = element_text(hjust = 0.5))

#### Stem and Leaf Plot

stem(x = df10$genevalue, scale = 1, width = 80, atom = 1e-08)

The decimal point is at the |

-0 | 7
-0 |
0 | 14
0 | 556789
1 | 011334
1 | 5588888999
2 | 00011223344
2 | 68

#### Box Plot

ggplot(data = df10 , aes(x = tumortype, y = genevalue)) +
geom_boxplot()+
theme_bw()

# Measures of Central Tendency

df7 %>%
summarize(
n       = length(Notebook)
, Mean    = mean(Notebook)
, Median  = median(Notebook)
, Minimum = min(Notebook)
, Maximum = max(Notebook)
, Q1      = quantile(x = Notebook, probs = 0.25)
, Q2      = quantile(x = Notebook, probs = 0.50)
, Q3      = quantile(x = Notebook, probs = 0.75)
)
# A tibble: 1 x 8
n  Mean Median Minimum Maximum    Q1    Q2    Q3
<int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    20   2.6      2       0       5  1.75     2     4
df8 %>%
summarize(
n       = length(PlantHeight)
, Mean    = mean(PlantHeight)
, Median  = median(PlantHeight)
, Minimum = min(PlantHeight)
, Maximum = max(PlantHeight)
, Q1      = quantile(x = PlantHeight, probs = 0.25)
, Q2      = quantile(x = PlantHeight, probs = 0.50)
, Q3      = quantile(x = PlantHeight, probs = 0.75)
)
# A tibble: 1 x 8
n  Mean Median Minimum Maximum    Q1    Q2    Q3
<int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    30  97.6     98      87     112  91.2    98  102.
df10 %>%
group_by(tumortype) %>%
summarize(
n       = length(genevalue)
, Mean    = mean(genevalue)
, Median  = median(genevalue)
, Minimum = min(genevalue)
, Maximum = max(genevalue)
, Q1      = quantile(x = genevalue, probs = 0.25)
, Q2      = quantile(x = genevalue, probs = 0.50)
, Q3      = quantile(x = genevalue, probs = 0.75)
)
# A tibble: 2 x 9
tumortype     n  Mean Median Minimum Maximum    Q1    Q2    Q3
<chr>     <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1 ALL          27 1.89   1.93    0.458    2.77 1.80  1.93  2.18
2 AML          11 0.636  0.738  -0.743    1.45 0.462 0.738 0.956

# Measures of Dispersion

df7 %>%
summarize(
IQR      = IQR(Notebook)
, Variance = var(Notebook)
,  SD      = sd(Notebook)
)
# A tibble: 1 x 3
IQR Variance    SD
<dbl>    <dbl> <dbl>
1  2.25     2.25  1.50
df8 %>%
summarize(
IQR      = IQR(PlantHeight)
, Variance = var(PlantHeight)
,  SD      = sd(PlantHeight)
)
# A tibble: 1 x 3
IQR Variance    SD
<dbl>    <dbl> <dbl>
1  10.5     45.0  6.71
df10 %>%
group_by(tumortype) %>%
summarize(
IQR      = IQR(genevalue)
, Variance = var(genevalue)
,  SD      = sd(genevalue)
)
# A tibble: 2 x 4
tumortype   IQR Variance    SD
<chr>     <dbl>    <dbl> <dbl>
1 ALL       0.383    0.241 0.491
2 AML       0.494    0.338 0.582

# Measures of Skewness

df7 %>%
summarize(
SK = sum((Notebook - mean(Notebook))^3)/(n()*(sd(Notebook))^3)
)
# A tibble: 1 x 1
SK
<dbl>
1 0.217
df8 %>%
summarize(
SK = sum((PlantHeight - mean(PlantHeight))^3)/(n()*(sd(PlantHeight))^3)
)
# A tibble: 1 x 1
SK
<dbl>
1 0.0574
df10 %>%
group_by(tumortype) %>%
summarize(
SK = sum((genevalue - mean(genevalue))^3)/(n()*(sd(genevalue))^3)
)
# A tibble: 2 x 2
tumortype     SK
<chr>      <dbl>
1 ALL       -0.802
2 AML       -0.937

# Measures of Kurtosis

df7 %>%
summarize(
K = sum((Notebook - mean(Notebook))^4)/(n()*(sd(Notebook))^4) - 3
)
# A tibble: 1 x 1
K
<dbl>
1 -1.19
df8 %>%
summarize(
K = sum((PlantHeight - mean(PlantHeight))^4)/(n()*(sd(PlantHeight))^4) - 3
)
# A tibble: 1 x 1
K
<dbl>
1 -0.960
df10 %>%
group_by(tumortype) %>%
summarize(
K = sum((genevalue - mean(genevalue))^4)/(n()*(sd(genevalue))^4) - 3
)
# A tibble: 2 x 2
tumortype     K
<chr>     <dbl>
1 ALL       0.855
2 AML       0.345
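
The same skewness and kurtosis expressions are repeated for each data set, so they could be wrapped in small functions. The sketch below is one possible way to do this; `skewness_sd()` and `kurtosis_sd()` are hypothetical helper names, not functions from any package, and they mirror the expressions used above (including the use of `sd()` with its $n - 1$ denominator).

```r
# Hypothetical helpers mirroring the expressions used in the summarize() calls
skewness_sd <- function(x) {
  sum((x - mean(x))^3) / (length(x) * sd(x)^3)
}

kurtosis_sd <- function(x) {
  sum((x - mean(x))^4) / (length(x) * sd(x)^4) - 3   # minus 3 gives excess kurtosis
}

# Number of notebooks kept by the twenty students (same values as df7$Notebook)
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

round(skewness_sd(notebook), 3)   # 0.217
round(kurtosis_sd(notebook), 2)   # -1.19
```

Inside a pipeline the helpers can be called like any other summary function, for example `summarize(SK = skewness_sd(genevalue), K = kurtosis_sd(genevalue))` after `group_by(tumortype)`.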