Bioinformatics with R
Statistics
- Statistics is the science of uncertainty & variability
- Statistics turns data into information
- Data -> Information -> Knowledge -> Wisdom
- Data Driven Decisions (3Ds)
- Statistics is the interpretation of Science
- Statistics is the Art & Science of learning from data
Variable
- Characteristic that may vary from individual to individual
- Height, Weight, CGPA etc
Measurement
- Process of assigning numbers or labels to objects or states in accordance with logically accepted rules
Measurement Scales
- Nominal Scale: Observations may be classified into mutually exclusive & exhaustive classes or categories
- Ordinal Scale: Observations may be ranked
- Interval Scale: Difference between observations is meaningful
- Ratio Scale: Ratio between observations is meaningful & there is a true zero point
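As a rough illustration of how the four scales might be represented in R, the sketch below uses base R only. All values are hypothetical, and the temperature example is an assumption added for illustration; it is not taken from the data sets used later.

```r
# Nominal: unordered categories -> plain factor
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal: ranked categories -> ordered factor (F lowest, A highest)
grade <- factor(c("B", "A", "F", "C"),
                levels = c("F", "D", "C", "B", "A"), ordered = TRUE)

# Interval: differences are meaningful, zero is arbitrary (temperature in Celsius)
temp_c <- c(10, 20, 30)

# Ratio: true zero point, so ratios are meaningful (height in cm)
height_cm <- c(150, 160, 180)

is.ordered(gender)            # FALSE: categories cannot be ranked
is.ordered(grade)             # TRUE: categories can be ranked
grade[1] > grade[4]           # TRUE: B ranks above C
diff(temp_c)                  # 10 10: equal differences
height_cm[3] / height_cm[1]   # 1.2: the third height is 1.2 times the first
```

Comparison operators such as `>` only work on ordered factors, which is one practical reason to declare ordinal variables explicitly.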
Nominal Data
Example
The following data shows the gender of a sample of twenty students from the University of Agriculture, Faisalabad.
Student | Gender |
---|---|
1 | Male |
2 | Male |
3 | Female |
4 | Female |
5 | Female |
6 | Female |
7 | Male |
8 | Male |
9 | Male |
10 | Male |
11 | Female |
12 | Female |
13 | Male |
14 | Male |
15 | Female |
16 | Female |
17 | Male |
18 | Male |
19 | Male |
20 | Male |
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)  # load after a fresh install so %>%, tibble() and ggplot() are available
}
df1 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Gender = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
)
df1
# A tibble: 20 x 2
Student Gender
<dbl> <chr>
1 1 Male
2 2 Male
3 3 Female
4 4 Female
5 5 Female
6 6 Female
7 7 Male
8 8 Male
9 9 Male
10 10 Male
11 11 Female
12 12 Female
13 13 Male
14 14 Male
15 15 Female
16 16 Female
17 17 Male
18 18 Male
19 19 Male
20 20 Male
Frequency Table
df1Freq <-
df1 %>%
dplyr::count(Gender) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
)
df1Freq
# A tibble: 2 x 4
Gender f rf pf
<chr> <int> <dbl> <dbl>
1 Female 8 0.4 40
2 Male 12 0.6 60
Simple Bar Chart
ggplot(
data = df1
, mapping = aes(x = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Example
The following data presents the number of each nucleotide (A, C, G, T) in a gene sequence. The example is the Zyxin gene, which plays an important role in cell adhesion (Golub et al., 1999). The accession number (X94991.1) of one of its variants can be found in a database such as NCBI (UniGene). The data is used to illustrate the construction of a bar chart from the frequency table of the four nucleotides.
A | C | G | T |
---|---|---|---|
410 | 789 | 573 | 394 |
Data from GenBank can also be imported directly with the following code.
# install.packages(pkgs = "ape", repos = "https://cran.r-project.org", dependencies = TRUE)
# library(ape)
# df2 <- table(read.GenBank(c("X94991.1"),as.character=TRUE))
# df2
Frequency Table
df3 <- tibble(
G.Sequence = c("A", "C", "G", "T")
, Nucleotides = c(410, 789, 573, 394)
)
df3
# A tibble: 4 x 2
G.Sequence Nucleotides
<chr> <dbl>
1 A 410
2 C 789
3 G 573
4 T 394
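The counts above can be turned into relative and percentage frequencies in the same way as for the gender data. The following is a minimal base R sketch; the object names (`nucleotides`, `rf`, `pf`) are chosen here for illustration only.

```r
# Nucleotide counts copied from the table above
nucleotides <- c(A = 410, C = 789, G = 573, T = 394)

total <- sum(nucleotides)   # total number of nucleotides: 2166
rf <- nucleotides / total   # relative frequency of each base
pf <- round(100 * rf, 1)    # percentage frequency, one decimal place

total
pf
names(which.max(pf))        # the most frequent base
```

Cytosine (C) has the largest share of the sequence, which is also what the tallest bar in the chart shows.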
Simple Bar Chart
ggplot(
data = df3
, mapping = aes(x = G.Sequence, y = Nucleotides)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gene Sequence", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Ordinal Data
Example
The following data shows the grades of a sample of twenty students from the University of Agriculture, Faisalabad.
Student | Grade |
---|---|
1 | A |
2 | B |
3 | B |
4 | C |
5 | A |
6 | D |
7 | F |
8 | C |
9 | B |
10 | D |
11 | F |
12 | A |
13 | B |
14 | B |
15 | C |
16 | D |
17 | C |
18 | B |
19 | C |
20 | D |
df4 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Grade = c("A", "B", "B", "C", "A", "D", "F", "C", "B", "D", "F", "A", "B", "B", "C", "D", "C", "B", "C", "D")
)
df4
# A tibble: 20 x 2
Student Grade
<dbl> <chr>
1 1 A
2 2 B
3 3 B
4 4 C
5 5 A
6 6 D
7 7 F
8 8 C
9 9 B
10 10 D
11 11 F
12 12 A
13 13 B
14 14 B
15 15 C
16 16 D
17 17 C
18 18 B
19 19 C
20 20 D
Frequency Table
df4Freq <-
df4 %>%
dplyr::count(Grade) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
, cf = cumsum(f)
)
df4Freq
# A tibble: 5 x 5
Grade f rf pf cf
<chr> <int> <dbl> <dbl> <int>
1 A 3 0.15 15 3
2 B 6 0.3 30 9
3 C 5 0.25 25 14
4 D 4 0.2 20 18
5 F 2 0.1 10 20
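The cumulative frequency column is only meaningful because grades can be ranked. Here the alphabetical order produced by `count()` happens to coincide with the grade order; for other ordinal variables that need not hold. A hedged base R sketch that states the order explicitly (the name `grade_ord` is introduced here for illustration):

```r
grade <- c("A", "B", "B", "C", "A", "D", "F", "C", "B", "D",
           "F", "A", "B", "B", "C", "D", "C", "B", "C", "D")

# Declare the ranking explicitly instead of relying on alphabetical order
grade_ord <- factor(grade, levels = c("A", "B", "C", "D", "F"), ordered = TRUE)

f  <- table(grade_ord)   # frequency of each grade, in the declared order
cf <- cumsum(f)          # cumulative frequency
cf

# First grade whose cumulative frequency reaches half of the sample
median_grade <- names(cf)[which(cf >= length(grade) / 2)[1]]
median_grade
```

With twenty students, the tenth and eleventh ranked observations both fall in grade C, so C is the median grade under this ordering.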
Simple Bar Chart
ggplot(
data = df4
, mapping = aes(x = Grade)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Grades", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Two Way Contingency Table
Example
The following data shows the gender and residential status of a sample of twenty students from the University of Agriculture, Faisalabad.
Student | Gender | RS |
---|---|---|
1 | Male | Boarding |
2 | Male | Non-Boarding |
3 | Female | Non-Boarding |
4 | Female | Boarding |
5 | Female | Boarding |
6 | Female | Non-Boarding |
7 | Male | Non-Boarding |
8 | Male | Boarding |
9 | Male | Non-Boarding |
10 | Male | Non-Boarding |
11 | Female | Non-Boarding |
12 | Female | Boarding |
13 | Male | Non-Boarding |
14 | Male | Non-Boarding |
15 | Female | Non-Boarding |
16 | Female | Non-Boarding |
17 | Male | Non-Boarding |
18 | Male | Non-Boarding |
19 | Male | Non-Boarding |
20 | Male | Boarding |
df5 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Gender = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
, RS = c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB", "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")
)
df5
# A tibble: 20 x 3
Student Gender RS
<dbl> <chr> <chr>
1 1 Male B
2 2 Male NB
3 3 Female NB
4 4 Female B
5 5 Female B
6 6 Female NB
7 7 Male NB
8 8 Male B
9 9 Male NB
10 10 Male NB
11 11 Female NB
12 12 Female B
13 13 Male NB
14 14 Male NB
15 15 Female NB
16 16 Female NB
17 17 Male NB
18 18 Male NB
19 19 Male NB
20 20 Male B
Cross Tables
df5 %>%
dplyr::count(Gender) %>%
dplyr::rename(f = n)
# A tibble: 2 x 2
Gender f
<chr> <int>
1 Female 8
2 Male 12
df5 %>%
dplyr::count(RS) %>%
dplyr::rename(f = n)
# A tibble: 2 x 2
RS f
<chr> <int>
1 B 6
2 NB 14
df5 %>%
dplyr::count(Gender, RS) %>%
dplyr::rename(f = n)
# A tibble: 4 x 3
Gender RS f
<chr> <chr> <int>
1 Female B 3
2 Female NB 5
3 Male B 3
4 Male NB 9
if (!require("janitor")) install.packages("janitor")
# library(janitor)
df5CrossTab <-
df5 %>%
janitor::tabyl(dat = ., var1 = Gender, var2 = RS) %>%
janitor::adorn_totals(dat = ., where = c("row", "col"))
df5CrossTab
Gender B NB Total
Female 3 5 8
Male 3 9 12
Total 6 14 20
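Raw counts are hard to compare when the groups differ in size (8 female and 12 male students). One option is to convert each row to proportions. The sketch below uses base R `table()` and `prop.table()` rather than janitor, purely so that it runs without extra packages; the vectors repeat the data above.

```r
gender <- rep(c("Male", "Female", "Male", "Female", "Male", "Female", "Male"),
              times = c(2, 4, 4, 2, 2, 2, 4))
rs <- c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB",
        "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")

tab <- table(Gender = gender, RS = rs)     # two-way table of counts

# margin = 1 divides each cell by its row total, so every row sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 3)
```

Read row-wise, 37.5% of the female students and 25% of the male students in this sample are boarders, although both groups contain three boarders.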
Simple Bar Charts
ggplot(
data = df5
, mapping = aes(x = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(
data = df5
, mapping = aes(x = RS)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Multiple Bar Charts
ggplot(
data = df5
, mapping = aes(x = Gender, fill = RS)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(
data = df5
, mapping = aes(x = RS, fill = Gender)) +
geom_bar(position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Component Bar Charts
ggplot(
data = df5
, mapping = aes(x = Gender, fill = RS)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(
data = df5
, mapping = aes(x = RS, fill = Gender)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Example
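Source: OMB Statistical Policy Working Paper 22 (https://www.hhs.gov/sites/default/files/spwp22.pdf).
The following data set contains information on delinquent children. The recorded variables are the number of delinquent children by county and by education level of the household head. In the code, the column Delinquent holds the county (Alpha, Beta, Gamma, Delta) and Freq holds the number of delinquent children.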
Cross Tables
df6 <- tibble(
Delinquent = gl(n = 4, k = 4, length = 16, labels = c("Alpha", "Beta", "Gamma", "Delta"))
, EduLevel = gl(n = 4, k = 1, length = 16, labels = c("Low", "Medium", "High", "Very High"))
, Freq = c(15, 0, 5, 0, 20, 10, 10, 15, 5, 10, 10, 0, 10, 15, 5, 5)
)
df6
# A tibble: 16 x 3
Delinquent EduLevel Freq
<fct> <fct> <dbl>
1 Alpha Low 15
2 Alpha Medium 0
3 Alpha High 5
4 Alpha Very High 0
5 Beta Low 20
6 Beta Medium 10
7 Beta High 10
8 Beta Very High 15
9 Gamma Low 5
10 Gamma Medium 10
11 Gamma High 10
12 Gamma Very High 0
13 Delta Low 10
14 Delta Medium 15
15 Delta High 5
16 Delta Very High 5
df6 %>%
xtabs(data = ., Freq ~ Delinquent)
Delinquent
Alpha Beta Gamma Delta
20 55 25 35
df6 %>%
xtabs(data = ., Freq ~ EduLevel)
EduLevel
Low Medium High Very High
50 35 30 20
df6 %>%
xtabs(data = ., Freq ~ Delinquent + EduLevel)
EduLevel
Delinquent Low Medium High Very High
Alpha 15 0 5 0
Beta 20 10 10 15
Gamma 5 10 10 0
Delta 10 15 5 5
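The three `xtabs()` calls above can be combined into a single table with margins. This is a small sketch using base R `addmargins()`; a plain `data.frame` replaces the tibble only to keep the example free of package dependencies.

```r
df6 <- data.frame(
  Delinquent = gl(n = 4, k = 4, length = 16, labels = c("Alpha", "Beta", "Gamma", "Delta")),
  EduLevel   = gl(n = 4, k = 1, length = 16, labels = c("Low", "Medium", "High", "Very High")),
  Freq       = c(15, 0, 5, 0, 20, 10, 10, 15, 5, 10, 10, 0, 10, 15, 5, 5)
)

# Two-way table of summed frequencies
tab6 <- xtabs(Freq ~ Delinquent + EduLevel, data = df6)

# Append a "Sum" row and a "Sum" column holding the marginal totals
tab6_total <- addmargins(tab6)
tab6_total
```

The `Sum` column reproduces the one-way table by `Delinquent`, the `Sum` row reproduces the one-way table by `EduLevel`, and the bottom-right cell gives the grand total of 135.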
Multiple Bar Charts
ggplot(
data = df6
, mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(
data = df6
, mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Multiple Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Component Bar Charts
ggplot(
data = df6
, mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(
data = df6
, mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
geom_bar(stat = "identity") +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Component Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Count Data
Example
The following data shows the number of notebooks kept by each student in a sample of twenty students.
Student | Notebook |
---|---|
1 | 3 |
2 | 1 |
3 | 0 |
4 | 2 |
5 | 2 |
6 | 4 |
7 | 5 |
8 | 1 |
9 | 1 |
10 | 2 |
11 | 3 |
12 | 4 |
13 | 2 |
14 | 5 |
15 | 1 |
16 | 5 |
17 | 4 |
18 | 2 |
19 | 2 |
20 | 3 |
df7 <- tibble::tibble(
Student = seq(from = 1, to = 20, by = 1)
, Notebook = c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)
)
df7
# A tibble: 20 x 2
Student Notebook
<dbl> <dbl>
1 1 3
2 2 1
3 3 0
4 4 2
5 5 2
6 6 4
7 7 5
8 8 1
9 9 1
10 10 2
11 11 3
12 12 4
13 13 2
14 14 5
15 15 1
16 16 5
17 17 4
18 18 2
19 19 2
20 20 3
Frequency Table
df7Freq <-
df7 %>%
dplyr::count(Notebook) %>%
dplyr::rename(f = n) %>%
dplyr::mutate(
rf = f/sum(f)
, pf = rf*100
, cf = cumsum(f)
)
df7Freq
# A tibble: 6 x 5
Notebook f rf pf cf
<dbl> <int> <dbl> <dbl> <int>
1 0 1 0.05 5 1
2 1 4 0.2 20 5
3 2 6 0.3 30 11
4 3 3 0.15 15 14
5 4 3 0.15 15 17
6 5 3 0.15 15 20
Simple Bar Chart
ggplot(
data = df7
, mapping = aes(x = Notebook)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0)) +
labs(title = "Simple Bar Chart", x = "Notebooks", y = "Frequency") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
Continuous Data
Example
The following data gives the final plant height (cm) of thirty wheat plants.
87 91 89 88 89 91 87 92 90 98 95 97 96 100 101 96 98 99 98 100 102 99 101 105 103 107 105 106 107 112
df8 <- tibble::tibble(
PlantHeight = c(
87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112
)
)
df8
# A tibble: 30 x 1
PlantHeight
<dbl>
1 87
2 91
3 89
4 88
5 89
6 91
7 87
8 92
9 90
10 98
# ... with 20 more rows
Frequency Distribution
df9 <- df8 %>%
summarize(
R = max(PlantHeight) - min(PlantHeight)
, k = floor(1 + 3.3*log10(length(PlantHeight)))
, h = R/k
)
df8Freq <- df8 %>%
mutate(
Classes = cut(
x = PlantHeight
, breaks = df9$k
, include.lowest = TRUE
, right = FALSE
)
) %>%
count(Classes) %>%
tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
rename(f = n) %>%
mutate(
LB = readr::parse_number(x = LB)
, UB = readr::parse_number(x = UB)
, rf = f/sum(f)
, pf = f/sum(f)*100
, cf = cumsum(f)
, MidPoint = (LB + UB)/2
)
df8Freq
# A tibble: 5 x 8
Classes LB UB f rf pf cf MidPoint
<fct> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl>
1 [87,92) 87 92 8 0.267 26.7 8 89.5
2 [92,97) 92 97 4 0.133 13.3 12 94.5
3 [97,102) 97 102 10 0.333 33.3 22 99.5
4 [102,107) 102 107 5 0.167 16.7 27 104.
5 [107,112] 107 112 3 0.1 10 30 110.
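The quantities stored in `df9` follow the usual rule-of-thumb for choosing the number of classes (Sturges' rule, here with the coefficient 3.3 used in the code). Worked by hand for the thirty plant heights:

```latex
\[
R = x_{\max} - x_{\min} = 112 - 87 = 25, \qquad
k = \lfloor 1 + 3.3 \log_{10} n \rfloor
  = \lfloor 1 + 3.3 \log_{10} 30 \rfloor
  = \lfloor 5.87 \rfloor = 5, \qquad
h = \frac{R}{k} = \frac{25}{5} = 5
\]
```

Note that the code passes only `k` to `cut()`, which then builds five equal-width intervals itself; `h` is computed for reference and agrees with the class width of 5 seen in the table.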
Histogram
ggplot(
data = df8
, mapping = aes(x = PlantHeight)) +
geom_histogram() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Histogram for Plant Height", x = "Plant Height", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
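`geom_histogram()` uses 30 bins unless told otherwise, so the plot above does not use the same classes as the frequency table. As a sketch of how the two could be aligned, the base R call below bins the data with explicit class boundaries and without drawing anything; the `breaks` vector is an assumption derived from the class limits 87, 92, ..., 112 in the table.

```r
plant_height <- c(
  87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
  96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112
)

# Class boundaries matching the frequency table: 87, 92, 97, 102, 107, 112
breaks <- seq(from = 87, to = 112, by = 5)

# right = FALSE gives intervals closed on the left, [a, b);
# include.lowest = TRUE keeps the maximum (112) in the last class
h <- hist(plant_height, breaks = breaks, right = FALSE,
          include.lowest = TRUE, plot = FALSE)

h$counts   # class frequencies
h$mids     # class mid points
```

In ggplot2 the same boundaries could be supplied as `geom_histogram(breaks = breaks, closed = "left")`, which should reproduce the frequencies 8, 4, 10, 5 and 3 from the table.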
ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = f)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Frequency for Plant Height", x = "Mid Point", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = cf)) +
geom_point()+
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") +
theme(plot.title = element_text(hjust = 0.5))
Stem and Leaf Plot
stem(x = df8$PlantHeight, scale = 1, width = 80, atom = 1e-08)
The decimal point is 1 digit(s) to the right of the |
8 | 77899
9 | 0112
9 | 566788899
10 | 001123
10 | 55677
11 | 2
Box Plot
ggplot(data = df8 , aes( y = PlantHeight)) +
geom_boxplot()+
theme_bw()
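A box plot draws the quartiles as the box and flags points lying more than 1.5 times the interquartile range beyond the box as outliers. The sketch below recomputes those limits for the plant heights in base R; the names `lower_fence` and `upper_fence` are illustrative and not part of any package.

```r
plant_height <- c(
  87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
  96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112
)

# Quartiles with R's default quantile definition
q <- quantile(plant_height, probs = c(0.25, 0.50, 0.75))
iqr <- unname(q[3] - q[1])

# Observations outside these fences would be drawn as separate points
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

outliers <- plant_height[plant_height < lower_fence | plant_height > upper_fence]

c(lower = lower_fence, upper = upper_fence)
outliers
```

All thirty heights lie between the fences (75.5 and 117.5), so the box plot shows no outlying points and the whiskers end at the minimum and maximum.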
Example
The golub table contains gene expression values for 3051 genes taken from 38 leukemia patients. Twenty-seven patients are diagnosed with acute lymphoblastic leukemia (ALL) and eleven with acute myeloid leukemia (AML). The golub.gnames table contains information on each gene, including gene index, manufacturing ID, and biological name. The following table presents gene expression values by tumor type.
tumortype | genevalue | tumortype | genevalue |
---|---|---|---|
ALL | 2.10892 | ALL | 1.78352 |
ALL | 1.52405 | ALL | 0.45827 |
ALL | 1.96403 | ALL | 2.18119 |
ALL | 2.33597 | ALL | 2.31428 |
ALL | 1.85111 | ALL | 1.99927 |
ALL | 1.99391 | ALL | 1.36844 |
ALL | 2.06597 | ALL | 2.37351 |
ALL | 1.81649 | ALL | 1.83485 |
ALL | 2.17622 | AML | 0.88941 |
ALL | 1.80861 | AML | 1.45014 |
ALL | 2.44562 | AML | 0.42904 |
ALL | 1.90496 | AML | 0.82667 |
ALL | 2.76610 | AML | 0.63637 |
ALL | 1.32551 | AML | 1.02250 |
ALL | 2.59385 | AML | 0.12758 |
ALL | 1.92776 | AML | -0.74333 |
ALL | 1.10546 | AML | 0.73784 |
ALL | 1.27645 | AML | 0.49470 |
ALL | 1.83051 | AML | 1.12058 |
df10 <- tibble(
genevalue = c(
2.10892, 1.52405, 1.96403, 2.33597, 1.85111, 1.99391
, 2.06597, 1.81649, 2.17622, 1.80861, 2.44562, 1.90496
, 2.76610, 1.32551, 2.59385, 1.92776, 1.10546, 1.27645
, 1.83051, 1.78352, 0.45827, 2.18119, 2.31428, 1.99927
, 1.36844, 2.37351, 1.83485, 0.88941, 1.45014, 0.42904
, 0.82667, 0.63637, 1.02250, 0.12758, -0.74333, 0.73784
, 0.49470, 1.12058
)
, tumortype = rep(c("ALL","AML"), c(27, 11))
)
df10
# A tibble: 38 x 2
genevalue tumortype
<dbl> <chr>
1 2.11 ALL
2 1.52 ALL
3 1.96 ALL
4 2.34 ALL
5 1.85 ALL
6 1.99 ALL
7 2.07 ALL
8 1.82 ALL
9 2.18 ALL
10 1.81 ALL
# ... with 28 more rows
Frequency Distribution
df11 <- df10 %>%
summarize(
R = max(genevalue) - min(genevalue)
, k = floor(1 + 3.3*log10(length(genevalue)))
, h = R/k
)
df10Freq <- df10 %>%
mutate(
Classes = cut(
x = genevalue
, breaks = df11$k
, include.lowest = TRUE
, right = FALSE
)
) %>%
count(Classes) %>%
tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
rename(f = n) %>%
mutate(
LB = readr::parse_number(x = LB)
, UB = readr::parse_number(x = UB)
, rf = f/sum(f)
, pf = f/sum(f)*100
, cf = cumsum(f)
, MidPoint = (LB + UB)/2
)
df10Freq
# A tibble: 6 x 8
Classes LB UB f rf pf cf MidPoint
<fct> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl>
1 [-0.747,-0.158) -0.747 -0.158 1 0.0263 2.63 1 -0.452
2 [-0.158,0.426) -0.158 0.426 1 0.0263 2.63 2 0.134
3 [0.426,1.01) 0.426 1.01 7 0.184 18.4 9 0.718
4 [1.01,1.6) 1.01 1.6 8 0.211 21.1 17 1.31
5 [1.6,2.18) 1.6 2.18 15 0.395 39.5 32 1.89
6 [2.18,2.77] 2.18 2.77 6 0.158 15.8 38 2.48
Histogram
ggplot(
data = df10
, mapping = aes(x = genevalue)) +
geom_histogram() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Histogram for Gene Value", x = "Gene Value", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = f)) +
geom_point() +
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Frequency for Gene Value", x = "Mid Point", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = cf)) +
geom_point()+
geom_line() +
scale_y_continuous(expand = c(0, 0)) +
theme_bw()+
labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") +
theme(plot.title = element_text(hjust = 0.5))
Stem and Leaf Plot
stem(x = df10$genevalue, scale = 1, width = 80, atom = 1e-08)
The decimal point is at the |
-0 | 7
-0 |
0 | 14
0 | 556789
1 | 011334
1 | 5588888999
2 | 00011223344
2 | 68
Box Plot
ggplot(data = df10 , aes(x = tumortype, y = genevalue)) +
geom_boxplot()+
theme_bw()
Measures of Central Tendency
df7 %>%
summarize(
n = length(Notebook)
, Mean = mean(Notebook)
, Median = median(Notebook)
, Minimum = min(Notebook)
, Maximum = max(Notebook)
, Q1 = quantile(x = Notebook, probs = 0.25)
, Q2 = quantile(x = Notebook, probs = 0.50)
, Q3 = quantile(x = Notebook, probs = 0.75)
)
# A tibble: 1 x 8
n Mean Median Minimum Maximum Q1 Q2 Q3
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20 2.6 2 0 5 1.75 2 4
df8 %>%
summarize(
n = length(PlantHeight)
, Mean = mean(PlantHeight)
, Median = median(PlantHeight)
, Minimum = min(PlantHeight)
, Maximum = max(PlantHeight)
, Q1 = quantile(x = PlantHeight, probs = 0.25)
, Q2 = quantile(x = PlantHeight, probs = 0.50)
, Q3 = quantile(x = PlantHeight, probs = 0.75)
)
# A tibble: 1 x 8
n Mean Median Minimum Maximum Q1 Q2 Q3
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30 97.6 98 87 112 91.2 98 102.
df10 %>%
group_by(tumortype) %>%
summarize(
n = length(genevalue)
, Mean = mean(genevalue)
, Median = median(genevalue)
, Minimum = min(genevalue)
, Maximum = max(genevalue)
, Q1 = quantile(x = genevalue, probs = 0.25)
, Q2 = quantile(x = genevalue, probs = 0.50)
, Q3 = quantile(x = genevalue, probs = 0.75)
)
# A tibble: 2 x 9
tumortype n Mean Median Minimum Maximum Q1 Q2 Q3
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALL 27 1.89 1.93 0.458 2.77 1.80 1.93 2.18
2 AML 11 0.636 0.738 -0.743 1.45 0.462 0.738 0.956
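The summaries above cover the mean and the median. Base R has no built-in function for the statistical mode, so the sketch below defines a small helper; the name `stat_mode` is hypothetical and introduced here only for illustration.

```r
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

# Hypothetical helper: returns the most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(notebook)       # arithmetic mean
median(notebook)     # middle value of the sorted data
stat_mode(notebook)  # most frequent value
```

For the notebook data the mode and the median are both 2, while the mean (2.6) is pulled upwards by the students who keep four or five notebooks.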
Measures of Dispersion
df7 %>%
summarize(
IQR = IQR(Notebook)
, Variance = var(Notebook)
, SD = sd(Notebook)
)
# A tibble: 1 x 3
IQR Variance SD
<dbl> <dbl> <dbl>
1 2.25 2.25 1.50
df8 %>%
summarize(
IQR = IQR(PlantHeight)
, Variance = var(PlantHeight)
, SD = sd(PlantHeight)
)
# A tibble: 1 x 3
IQR Variance SD
<dbl> <dbl> <dbl>
1 10.5 45.0 6.71
df10 %>%
group_by(tumortype) %>%
summarize(
IQR = IQR(genevalue)
, Variance = var(genevalue)
, SD = sd(genevalue)
)
# A tibble: 2 x 4
tumortype IQR Variance SD
<chr> <dbl> <dbl> <dbl>
1 ALL 0.383 0.241 0.491
2 AML 0.494 0.338 0.582
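For reference, `var()` and `sd()` in R use the sample formulas with divisor n - 1. Worked by hand for the notebook data (n = 20, mean 2.6, Q1 = 1.75, Q3 = 4):

```latex
\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{42.8}{19} \approx 2.25,
\qquad
s = \sqrt{s^2} \approx 1.50,
\qquad
\mathrm{IQR} = Q_3 - Q_1 = 4 - 1.75 = 2.25
\]
```

That the variance and the IQR both round to 2.25 for this data set is a coincidence; the two quantities measure spread in different ways and in different units.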
Measures of Skewness
df7 %>%
summarize(
SK = sum((Notebook - mean(Notebook))^3)/(n()*(sd(Notebook))^3)
)
# A tibble: 1 x 1
SK
<dbl>
1 0.217
df8 %>%
summarize(
SK = sum((PlantHeight - mean(PlantHeight))^3)/(n()*(sd(PlantHeight))^3)
)
# A tibble: 1 x 1
SK
<dbl>
1 0.0574
df10 %>%
group_by(tumortype) %>%
summarize(
SK = sum((genevalue - mean(genevalue))^3)/(n()*(sd(genevalue))^3)
)
# A tibble: 2 x 2
tumortype SK
<chr> <dbl>
1 ALL -0.802
2 AML -0.937
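The expression used in the code corresponds to the following moment-based coefficient, where s is the sample standard deviation returned by `sd()`. Several slightly different conventions for sample skewness exist, so values from other software may differ a little. Worked by hand for the notebook data:

```latex
\[
SK = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n\, s^3}
   = \frac{14.64}{20 \times 1.5009^3}
   \approx 0.217
\]
```

A positive value indicates a longer right tail and a negative value a longer left tail, as seen for both tumor types in the gene expression data.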
Measures of Kurtosis
df7 %>%
summarize(
K = sum((Notebook - mean(Notebook))^4)/(n()*(sd(Notebook))^4) - 3
)
# A tibble: 1 x 1
K
<dbl>
1 -1.19
df8 %>%
summarize(
K = sum((PlantHeight - mean(PlantHeight))^4)/(n()*(sd(PlantHeight))^4) - 3
)
# A tibble: 1 x 1
K
<dbl>
1 -0.960
df10 %>%
group_by(tumortype) %>%
summarize(
K = sum((genevalue - mean(genevalue))^4)/(n()*(sd(genevalue))^4) - 3
)
# A tibble: 2 x 2
tumortype K
<chr> <dbl>
1 ALL 0.855
2 AML 0.345
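The code computes the fourth standardized moment minus 3, so that a normal distribution would score approximately zero. As with skewness, this is one of several conventions. Worked by hand for the notebook data, with the sample variance of about 2.2526:

```latex
\[
K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3
  = \frac{183.824}{20 \times 2.2526^2} - 3
  \approx -1.19
\]
```

Negative values indicate a flatter distribution with lighter tails than the normal, and positive values a more peaked distribution with heavier tails.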