Bioinformatics with R

20 September 2018

Statistics

Statistics is the science of uncertainty & variability

Statistics turns data into information

Data -> Information -> Knowledge -> Wisdom

Data Driven Decisions (3Ds)

Statistics is the interpretation of Science

Statistics is the Art & Science of learning from data

Variable

Characteristic that may vary from individual to individual

Height, Weight, CGPA etc

Measurement

Process of assigning numbers or labels to objects or states in accordance with logically accepted rules

Measurement Scales

Nominal Scale: Observations may be classified into mutually exclusive & exhaustive classes or categories

Ordinal Scale: Observations may be ranked

Interval Scale: Difference between observations is meaningful

Ratio Scale: Ratio between observations is meaningful & there is a true zero point
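
As a rough illustration of how the four scales might be represented in R: a plain factor for nominal data, an ordered factor for ordinal data, and numeric vectors for interval and ratio data. The vectors below are made-up values chosen for this sketch, not data from this post.

```r
# Nominal: categories with no inherent order
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal: categories that can be ranked (here F < D < C < B < A)
grade <- factor(
    x       = c("B", "A", "C", "B")
  , levels  = c("F", "D", "C", "B", "A")
  , ordered = TRUE
  )

# Interval: differences are meaningful, zero is arbitrary (temperature in Celsius)
temp_c <- c(10, 20, 30)

# Ratio: a true zero exists, so ratios are meaningful (height in cm)
height_cm <- c(150, 180)

is.ordered(gender)            # FALSE
is.ordered(grade)             # TRUE
grade[1] < grade[2]           # TRUE: B ranks below A
diff(temp_c)                  # 10 10
height_cm[2] / height_cm[1]   # 1.2
```

Comparisons such as `<` are only defined for the ordered factor, which is one practical way the nominal/ordinal distinction shows up in R.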

Nominal Data

Example

The following data shows the gender of a sample of twenty students from the University of Agriculture, Faisalabad.

Student	Gender
1	Male
2	Male
3	Female
4	Female
5	Female
6	Female
7	Male
8	Male
9	Male
10	Male
11	Female
12	Female
13	Male
14	Male
15	Female
16	Female
17	Male
18	Male
19	Male
20	Male

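# Install the tidyverse if it is missing, then attach it:
# the code below calls tibble(), %>% and ggplot() without a package prefix.
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
library(tidyverse)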
df1 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
  )
df1

# A tibble: 20 x 2
   Student Gender
     <dbl> <chr> 
 1       1 Male  
 2       2 Male  
 3       3 Female
 4       4 Female
 5       5 Female
 6       6 Female
 7       7 Male  
 8       8 Male  
 9       9 Male  
10      10 Male  
11      11 Female
12      12 Female
13      13 Male  
14      14 Male  
15      15 Female
16      16 Female
17      17 Male  
18      18 Male  
19      19 Male  
20      20 Male

Frequency Table

df1Freq <- 
      df1 %>%
      dplyr::count(Gender) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100  
      )
df1Freq

# A tibble: 2 x 4
  Gender     f    rf    pf
  <chr>  <int> <dbl> <dbl>
1 Female     8   0.4    40
2 Male      12   0.6    60

Simple Bar Chart

ggplot(
       data = df1
    ,  mapping = aes(x = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Example

The following data present the counts of the four nucleotides (A, C, G, T) in a gene sequence. The example is the Zyxin gene, which plays an important role in cell adhesion (Golub et al., 1999). The accession number (X94991.1) of one of its variants can be found in a database such as NCBI (UniGene). The data are used to illustrate the construction of a bar chart from the frequency table of the four nucleotides.

A	C	G	T
410	789	573	394

The data can also be imported directly from GenBank with the following code.

# install.packages(pkgs = "ape", repos = "https://cran.r-project.org", dependencies = TRUE)
# library(ape)
# df2 <- table(read.GenBank(c("X94991.1"),as.character=TRUE))
# df2

Frequency Table

df3 <- tibble(
    G.Sequence   = c("A", "C", "G", "T")
  , Nucleotides  = c(410, 789, 573, 394)
  )
df3

# A tibble: 4 x 2
  G.Sequence Nucleotides
  <chr>            <dbl>
1 A                  410
2 C                  789
3 G                  573
4 T                  394

Simple Bar Chart

ggplot(
        data = df3
      , mapping = aes(x = G.Sequence, y = Nucleotides)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gene Sequence", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))
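
The same frequency table can also be shown as a pie chart. The sketch below is one possible way to do it with ggplot2, which has no dedicated pie geom: a single stacked bar is drawn in polar coordinates. The column pf (percentage share) and the object name p are introduced here for illustration only.

```r
library(ggplot2)

# Nucleotide counts from the frequency table
df3 <- data.frame(
    G.Sequence  = c("A", "C", "G", "T")
  , Nucleotides = c(410, 789, 573, 394)
  )

# Percentage share of each nucleotide, used for the slice labels
df3$pf <- 100 * df3$Nucleotides / sum(df3$Nucleotides)

# A pie chart is a single stacked bar drawn in polar coordinates
p <- ggplot(
        data = df3
      , mapping = aes(x = "", y = Nucleotides, fill = G.Sequence)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(
      mapping  = aes(label = paste0(round(pf, 1), "%"))
    , position = position_stack(vjust = 0.5)
    ) +
  labs(title = "Pie Chart", fill = "Gene Sequence") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))
p
```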

Ordinal Data

Example

The following data shows the grades of a sample of twenty students from the University of Agriculture, Faisalabad.

Student	Grade
1	A
2	B
3	B
4	C
5	A
6	D
7	F
8	C
9	B
10	D
11	F
12	A
13	B
14	B
15	C
16	D
17	C
18	B
19	C
20	D

df4 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Grade  = c("A", "B", "B", "C", "A", "D", "F", "C", "B", "D", "F", "A", "B", "B", "C", "D", "C",  "B", "C", "D")
  )
df4

# A tibble: 20 x 2
   Student Grade
     <dbl> <chr>
 1       1 A    
 2       2 B    
 3       3 B    
 4       4 C    
 5       5 A    
 6       6 D    
 7       7 F    
 8       8 C    
 9       9 B    
10      10 D    
11      11 F    
12      12 A    
13      13 B    
14      14 B    
15      15 C    
16      16 D    
17      17 C    
18      18 B    
19      19 C    
20      20 D

Frequency Table

df4Freq <- 
      df4 %>%
      dplyr::count(Grade) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100
      , cf = cumsum(f)
      )
df4Freq

# A tibble: 5 x 5
  Grade     f    rf    pf    cf
  <chr> <int> <dbl> <dbl> <int>
1 A         3  0.15    15     3
2 B         6  0.3     30     9
3 C         5  0.25    25    14
4 D         4  0.2     20    18
5 F         2  0.1     10    20

Simple Bar Chart

ggplot(
       data = df4
    ,  mapping = aes(x = Grade)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Grades", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Two Way Contingency Table

Example

The following data shows the gender and residential status (RS) of a sample of twenty students from the University of Agriculture, Faisalabad. In the R code, Boarding and Non-Boarding are coded as B and NB.

Student	Gender	RS
1	Male	Boarding
2	Male	Non-Boarding
3	Female	Non-Boarding
4	Female	Boarding
5	Female	Boarding
6	Female	Non-Boarding
7	Male	Non-Boarding
8	Male	Boarding
9	Male	Non-Boarding
10	Male	Non-Boarding
11	Female	Non-Boarding
12	Female	Boarding
13	Male	Non-Boarding
14	Male	Non-Boarding
15	Female	Non-Boarding
16	Female	Non-Boarding
17	Male	Non-Boarding
18	Male	Non-Boarding
19	Male	Non-Boarding
20	Male	Boarding

df5 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
  , RS      = c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB", "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")
  )
df5

# A tibble: 20 x 3
   Student Gender RS   
     <dbl> <chr>  <chr>
 1       1 Male   B    
 2       2 Male   NB   
 3       3 Female NB   
 4       4 Female B    
 5       5 Female B    
 6       6 Female NB   
 7       7 Male   NB   
 8       8 Male   B    
 9       9 Male   NB   
10      10 Male   NB   
11      11 Female NB   
12      12 Female B    
13      13 Male   NB   
14      14 Male   NB   
15      15 Female NB   
16      16 Female NB   
17      17 Male   NB   
18      18 Male   NB   
19      19 Male   NB   
20      20 Male   B

Cross Tables

df5 %>%
  dplyr::count(Gender) %>%
  dplyr::rename(f = n)

# A tibble: 2 x 2
  Gender     f
  <chr>  <int>
1 Female     8
2 Male      12

df5 %>%
  dplyr::count(RS) %>%
  dplyr::rename(f = n)

# A tibble: 2 x 2
  RS        f
  <chr> <int>
1 B         6
2 NB       14

df5 %>%
  dplyr::count(Gender, RS) %>%
  dplyr::rename(f = n)

# A tibble: 4 x 3
  Gender RS        f
  <chr>  <chr> <int>
1 Female B         3
2 Female NB        5
3 Male   B         3
4 Male   NB        9

if (!require("janitor")) install.packages("janitor")
# library(janitor)
df5CrossTab <- 
      df5 %>%
      janitor::tabyl(dat = ., var1 = Gender, var2 = RS) %>%
      janitor::adorn_totals(dat = ., where = c("row", "col"))
df5CrossTab

 Gender B NB Total
 Female 3  5     8
   Male 3  9    12
  Total 6 14    20
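
Conditional percentages are often easier to compare than raw counts. Below is a minimal base R sketch of row percentages for the same twenty students; the vectors are re-typed so the block runs on its own, and the names tab and row_pct are illustrative.

```r
gender <- rep(
    x     = c("Male", "Female", "Male", "Female", "Male", "Female", "Male")
  , times = c(2, 4, 4, 2, 2, 2, 4)
  )
rs <- c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB",
        "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")

# Two-way table of counts, with row and column totals added for display
tab <- table(Gender = gender, RS = rs)
addmargins(tab)

# Row percentages: distribution of residential status within each gender
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
row_pct
```

Read by row, three of the eight female students (37.5%) and three of the twelve male students (25%) are boarders. Using margin = 2 instead would give the gender split within each residential status.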

Simple Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Multiple Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender, fill = RS)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS, fill = Gender)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Component Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender, fill = RS)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS, fill = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Example

Source: OMB Statistical Policy Working Paper 22. https://www.hhs.gov/sites/default/files/spwp22.pdf

The following data set concerns delinquent children. It records the number of delinquent children by county (Alpha, Beta, Gamma, Delta) and by education level of the household head. In the code, the county is stored in the column named Delinquent and the count in Freq.

Cross Tables

df6 <- tibble(
  Delinquent = gl(n = 4, k = 4, length = 16, labels = c("Alpha", "Beta", "Gamma", "Delta"))
, EduLevel   = gl(n = 4, k = 1, length = 16, labels = c("Low", "Medium", "High", "Very High"))
, Freq       = c(15, 0, 5, 0, 20, 10, 10, 15, 5, 10, 10, 0, 10, 15, 5, 5)
)

df6

# A tibble: 16 x 3
   Delinquent EduLevel   Freq
   <fct>      <fct>     <dbl>
 1 Alpha      Low          15
 2 Alpha      Medium        0
 3 Alpha      High          5
 4 Alpha      Very High     0
 5 Beta       Low          20
 6 Beta       Medium       10
 7 Beta       High         10
 8 Beta       Very High    15
 9 Gamma      Low           5
10 Gamma      Medium       10
11 Gamma      High         10
12 Gamma      Very High     0
13 Delta      Low          10
14 Delta      Medium       15
15 Delta      High          5
16 Delta      Very High     5

df6 %>%
  xtabs(data = ., Freq ~ Delinquent)

Delinquent
Alpha  Beta Gamma Delta 
   20    55    25    35

df6 %>%
  xtabs(data = ., Freq ~ EduLevel)

EduLevel
      Low    Medium      High Very High 
       50        35        30        20

df6 %>%
  xtabs(data = ., Freq ~ Delinquent + EduLevel)

          EduLevel
Delinquent Low Medium High Very High
     Alpha  15      0    5         0
     Beta   20     10   10        15
     Gamma   5     10   10         0
     Delta  10     15    5         5

Multiple Bar Charts

ggplot(
       data = df6
    ,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df6
    ,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Component Bar Charts

ggplot(
       data = df6
    ,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df6
    ,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Count Data

Example

The following data shows the number of notebooks kept by each student in a sample of twenty students.

Student	Notebook
1	3
2	1
3	0
4	2
5	2
6	4
7	5
8	1
9	1
10	2
11	3
12	4
13	2
14	5
15	1
16	5
17	4
18	2
19	2
20	3

df7 <- tibble::tibble(
    Student  = seq(from = 1, to = 20, by = 1)
  , Notebook = c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)
  )
df7

# A tibble: 20 x 2
   Student Notebook
     <dbl>    <dbl>
 1       1        3
 2       2        1
 3       3        0
 4       4        2
 5       5        2
 6       6        4
 7       7        5
 8       8        1
 9       9        1
10      10        2
11      11        3
12      12        4
13      13        2
14      14        5
15      15        1
16      16        5
17      17        4
18      18        2
19      19        2
20      20        3

Frequency Table

df7Freq <- 
      df7 %>%
      dplyr::count(Notebook) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100
      , cf = cumsum(f)
      )
df7Freq

# A tibble: 6 x 5
  Notebook     f    rf    pf    cf
     <dbl> <int> <dbl> <dbl> <int>
1        0     1  0.05     5     1
2        1     4  0.2     20     5
3        2     6  0.3     30    11
4        3     3  0.15    15    14
5        4     3  0.15    15    17
6        5     3  0.15    15    20

Simple Bar Chart

ggplot(
       data = df7
    ,  mapping = aes(x = Notebook)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Notebooks", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Continuous Data

Example

The following data are the final plant heights (cm) of thirty wheat plants:

87 91 89 88 89 91 87 92 90 98 95 97 96 100 101 96 98 99 98 100 102 99 101 105 103 107 105 106 107 112

df8 <- tibble::tibble(
  PlantHeight = c(
  87,   91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100,    101,
  96,   98, 99, 98, 100,    102,    99, 101,    105,    103,    107,    105,    106,    107,    112
  )
)
df8

# A tibble: 30 x 1
   PlantHeight
         <dbl>
 1          87
 2          91
 3          89
 4          88
 5          89
 6          91
 7          87
 8          92
 9          90
10          98
# ... with 20 more rows

Frequency Distribution

df9 <- df8 %>% 
  summarize(
    R = max(PlantHeight) - min(PlantHeight)
  , k = floor(1 + 3.3*log10(length(PlantHeight)))
  , h = R/k
    )

df8Freq <- df8 %>% 
  mutate(
          Classes = cut(
                         x              = PlantHeight
                       , breaks         = df9$k
                       , include.lowest = TRUE
                       , right          = FALSE
                       )
          ) %>%
  count(Classes) %>% 
  tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
  rename(f = n) %>%
  mutate(
    LB = readr::parse_number(x = LB)
  , UB = readr::parse_number(x = UB)
  , rf = f/sum(f)
  , pf = f/sum(f)*100
  , cf = cumsum(f)
  , MidPoint = (LB + UB)/2
  ) 
  
df8Freq

# A tibble: 5 x 8
  Classes      LB    UB     f    rf    pf    cf MidPoint
  <fct>     <dbl> <dbl> <int> <dbl> <dbl> <int>    <dbl>
1 [87,92)      87    92     8 0.267  26.7     8     89.5
2 [92,97)      92    97     4 0.133  13.3    12     94.5
3 [97,102)     97   102    10 0.333  33.3    22     99.5
4 [102,107)   102   107     5 0.167  16.7    27    104. 
5 [107,112]   107   112     3 0.1    10      30    110.
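
In the code above, cut() receives only the number of classes and chooses the break points itself. The sketch below walks through the same arithmetic with the break points written out explicitly, assuming the first class starts at the smallest observation. The formula for k is the usual approximation to Sturges' rule; the names x, n, breaks and classes are illustrative.

```r
x <- c(
  87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
  96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112
  )

n <- length(x)                    # 30 observations
R <- max(x) - min(x)              # range: 112 - 87 = 25
k <- floor(1 + 3.3 * log10(n))    # number of classes: floor(5.87) = 5
h <- R / k                        # class width: 25 / 5 = 5

# k classes need k + 1 break points: 87, 92, 97, 102, 107, 112
breaks  <- seq(from = min(x), by = h, length.out = k + 1)

# Left-closed classes; include.lowest = TRUE closes the last class on the right
classes <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)
table(classes)
```

The resulting counts (8, 4, 10, 5, 3) agree with the f column of the frequency distribution.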

Histogram

ggplot(
       data = df8
     , mapping = aes(x = PlantHeight)) + 
  geom_histogram() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Histogram for Plant Height", x = "Plant Height", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = f))  + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Frequency for Plant Height", x = "Mid Point", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = cf)) + 
  geom_point()+ 
  geom_line() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

Stem and Leaf Plot

stem(x = df8$PlantHeight, scale = 1, width = 80, atom = 1e-08)


  The decimal point is 1 digit(s) to the right of the |

   8 | 77899
   9 | 0112
   9 | 566788899
  10 | 001123
  10 | 55677
  11 | 2

Box Plot

ggplot(data = df8 , aes( y = PlantHeight)) + 
  geom_boxplot()+
  theme_bw()
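
By default, geom_boxplot() draws the box from the first to the third quartile and extends each whisker to the most extreme observation lying within 1.5 times the IQR of the box; anything beyond is plotted as an individual point. The sketch below reproduces those numbers for the plant heights with base R. The names lower_fence, upper_fence and outliers are illustrative.

```r
x <- c(
  87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
  96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112
  )

# Quartiles with R's default quantile definition
q   <- quantile(x, probs = c(0.25, 0.50, 0.75))   # 91.25, 98, 101.75
iqr <- unname(q[3] - q[1])                        # 10.5

# Observations outside these fences would be drawn as separate points
lower_fence <- unname(q[1]) - 1.5 * iqr           # 75.5
upper_fence <- unname(q[3]) + 1.5 * iqr           # 117.5

outliers <- x[x < lower_fence | x > upper_fence]
length(outliers)                                  # 0
```

Since no height falls outside the fences, the whiskers of the box plot reach the minimum (87) and the maximum (112).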

Example

The golub table contains gene expression values for 3051 genes taken from 38 leukemia patients. Twenty-seven patients were diagnosed with acute lymphoblastic leukemia (ALL) and eleven with acute myeloid leukemia (AML). The golub.gnames table contains information on each gene, including the gene index, manufacturing ID, and biological name. The following table presents the expression values of one gene for the 38 patients, by tumor type.

tumortype	genevalue	tumortype	genevalue
ALL	2.10892	ALL	1.78352
ALL	1.52405	ALL	0.45827
ALL	1.96403	ALL	2.18119
ALL	2.33597	ALL	2.31428
ALL	1.85111	ALL	1.99927
ALL	1.99391	ALL	1.36844
ALL	2.06597	ALL	2.37351
ALL	1.81649	ALL	1.83485
ALL	2.17622	AML	0.88941
ALL	1.80861	AML	1.45014
ALL	2.44562	AML	0.42904
ALL	1.90496	AML	0.82667
ALL	2.76610	AML	0.63637
ALL	1.32551	AML	1.02250
ALL	2.59385	AML	0.12758
ALL	1.92776	AML	-0.74333
ALL	1.10546	AML	0.73784
ALL	1.27645	AML	0.49470
ALL	1.83051	AML	1.12058

df10 <- tibble(
   genevalue = c(
              2.10892, 1.52405, 1.96403, 2.33597, 1.85111, 1.99391
            , 2.06597, 1.81649, 2.17622, 1.80861, 2.44562, 1.90496
            , 2.76610, 1.32551, 2.59385, 1.92776, 1.10546, 1.27645
            , 1.83051, 1.78352, 0.45827, 2.18119, 2.31428, 1.99927
            , 1.36844, 2.37351, 1.83485, 0.88941, 1.45014, 0.42904
            , 0.82667, 0.63637, 1.02250, 0.12758, -0.74333, 0.73784
            , 0.49470, 1.12058
            )
  , tumortype = rep(c("ALL","AML"), c(27, 11))
)

df10

# A tibble: 38 x 2
   genevalue tumortype
       <dbl> <chr>    
 1      2.11 ALL      
 2      1.52 ALL      
 3      1.96 ALL      
 4      2.34 ALL      
 5      1.85 ALL      
 6      1.99 ALL      
 7      2.07 ALL      
 8      1.82 ALL      
 9      2.18 ALL      
10      1.81 ALL      
# ... with 28 more rows

Frequency Distribution

df11 <- df10 %>% 
  summarize(
    R = max(genevalue) - min(genevalue)
  , k = floor(1 + 3.3*log10(length(genevalue)))
  , h = R/k
    )

df10Freq <- df10 %>% 
  mutate(
          Classes = cut(
                         x              = genevalue
                       , breaks         = df11$k
                       , include.lowest = TRUE
                       , right          = FALSE
                       )
          ) %>%
  count(Classes) %>% 
  tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
  rename(f = n) %>%
  mutate(
    LB = readr::parse_number(x = LB)
  , UB = readr::parse_number(x = UB)
  , rf = f/sum(f)
  , pf = f/sum(f)*100
  , cf = cumsum(f)
  , MidPoint = (LB + UB)/2
  ) 
  
df10Freq

# A tibble: 6 x 8
  Classes             LB     UB     f     rf    pf    cf MidPoint
  <fct>            <dbl>  <dbl> <int>  <dbl> <dbl> <int>    <dbl>
1 [-0.747,-0.158) -0.747 -0.158     1 0.0263  2.63     1   -0.452
2 [-0.158,0.426)  -0.158  0.426     1 0.0263  2.63     2    0.134
3 [0.426,1.01)     0.426  1.01      7 0.184  18.4      9    0.718
4 [1.01,1.6)       1.01   1.6       8 0.211  21.1     17    1.31 
5 [1.6,2.18)       1.6    2.18     15 0.395  39.5     32    1.89 
6 [2.18,2.77]      2.18   2.77      6 0.158  15.8     38    2.48

Histogram

ggplot(
       data = df10
     , mapping = aes(x = genevalue)) + 
  geom_histogram() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Histogram for Gene Value", x = "Gene Value", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = f))  + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Frequency for Gene Value", x = "Mid Point", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = cf)) + 
  geom_point()+ 
  geom_line() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

Stem and Leaf Plot

stem(x = df10$genevalue, scale = 1, width = 80, atom = 1e-08)


  The decimal point is at the |

  -0 | 7
  -0 | 
   0 | 14
   0 | 556789
   1 | 011334
   1 | 5588888999
   2 | 00011223344
   2 | 68

Box Plot

ggplot(data = df10 , aes(x = tumortype, y = genevalue)) + 
  geom_boxplot()+
  theme_bw()

Measures of Central Tendency

df7 %>%
  summarize(
            n       = length(Notebook)
          , Mean    = mean(Notebook)
          , Median  = median(Notebook)
          , Minimum = min(Notebook) 
          , Maximum = max(Notebook) 
          , Q1      = quantile(x = Notebook, probs = 0.25)
          , Q2      = quantile(x = Notebook, probs = 0.50)
          , Q3      = quantile(x = Notebook, probs = 0.75)
  )

# A tibble: 1 x 8
      n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    20   2.6      2       0       5  1.75     2     4

df8 %>%
  summarize(
            n       = length(PlantHeight)
          , Mean    = mean(PlantHeight)
          , Median  = median(PlantHeight)
          , Minimum = min(PlantHeight) 
          , Maximum = max(PlantHeight) 
          , Q1      = quantile(x = PlantHeight, probs = 0.25)
          , Q2      = quantile(x = PlantHeight, probs = 0.50)
          , Q3      = quantile(x = PlantHeight, probs = 0.75)
  )

# A tibble: 1 x 8
      n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    30  97.6     98      87     112  91.2    98  102.

df10 %>%
  group_by(tumortype) %>%
  summarize(
            n       = length(genevalue)
          , Mean    = mean(genevalue)
          , Median  = median(genevalue)
          , Minimum = min(genevalue) 
          , Maximum = max(genevalue) 
          , Q1      = quantile(x = genevalue, probs = 0.25)
          , Q2      = quantile(x = genevalue, probs = 0.50)
          , Q3      = quantile(x = genevalue, probs = 0.75)
  )

# A tibble: 2 x 9
  tumortype     n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <chr>     <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1 ALL          27 1.89   1.93    0.458    2.77 1.80  1.93  2.18 
2 AML          11 0.636  0.738  -0.743    1.45 0.462 0.738 0.956
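
The summaries above report the mean and the median but not the mode. Base R has no built-in function for the statistical mode, so the sketch below defines a small helper; stat_mode() is a hypothetical name introduced here, and it returns every value that ties for the highest frequency.

```r
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

# stat_mode(): hypothetical helper, not part of base R or the tidyverse
stat_mode <- function(x) {
  freq  <- table(x)
  # keep every value that reaches the highest frequency (there may be ties)
  modes <- names(freq)[freq == max(freq)]
  as.numeric(modes)
}

stat_mode(notebook)   # 2
```

For the notebook data the mode is 2, which is the value with the largest f (6) in the frequency table for count data.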

Measures of Dispersion

df7 %>%
  summarize(
            IQR      = IQR(Notebook)
          , Variance = var(Notebook)
          ,  SD      = sd(Notebook)
          )

# A tibble: 1 x 3
    IQR Variance    SD
  <dbl>    <dbl> <dbl>
1  2.25     2.25  1.50

df8 %>%
  summarize(
            IQR      = IQR(PlantHeight)
          , Variance = var(PlantHeight)
          ,  SD      = sd(PlantHeight)
          )

# A tibble: 1 x 3
    IQR Variance    SD
  <dbl>    <dbl> <dbl>
1  10.5     45.0  6.71

df10 %>%
  group_by(tumortype) %>%
  summarize(
            IQR      = IQR(genevalue)
          , Variance = var(genevalue)
          ,  SD      = sd(genevalue)
          )

# A tibble: 2 x 4
  tumortype   IQR Variance    SD
  <chr>     <dbl>    <dbl> <dbl>
1 ALL       0.383    0.241 0.491
2 AML       0.494    0.338 0.582
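
To make explicit what var() and sd() compute, the sketch below works out the sample variance of the notebook data by hand, dividing the sum of squared deviations by n - 1. The names dev, variance and sd_hand are illustrative.

```r
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

n   <- length(notebook)
dev <- notebook - mean(notebook)      # deviations from the mean (2.6)

variance <- sum(dev^2) / (n - 1)      # 42.8 / 19
sd_hand  <- sqrt(variance)

all.equal(variance, var(notebook))    # TRUE
all.equal(sd_hand, sd(notebook))      # TRUE
round(c(variance, sd_hand), 2)        # 2.25 1.50
```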

Measures of Skewness

df7 %>% 
  summarize(
           SK = sum((Notebook - mean(Notebook))^3)/(n()*(sd(Notebook))^3)
          )

# A tibble: 1 x 1
     SK
  <dbl>
1 0.217

df8 %>% 
  summarize(
           SK = sum((PlantHeight - mean(PlantHeight))^3)/(n()*(sd(PlantHeight))^3)
          )

# A tibble: 1 x 1
      SK
   <dbl>
1 0.0574

df10 %>% 
  group_by(tumortype) %>%
  summarize(
           SK = sum((genevalue - mean(genevalue))^3)/(n()*(sd(genevalue))^3)
          )

# A tibble: 2 x 2
  tumortype     SK
  <chr>      <dbl>
1 ALL       -0.802
2 AML       -0.937

Measures of Kurtosis

df7 %>%
  summarize(
            K = sum((Notebook - mean(Notebook))^4)/(n()*(sd(Notebook))^4) - 3
          )

# A tibble: 1 x 1
      K
  <dbl>
1 -1.19

df8 %>%
  summarize(
            K = sum((PlantHeight - mean(PlantHeight))^4)/(n()*(sd(PlantHeight))^4) - 3
          )

# A tibble: 1 x 1
       K
   <dbl>
1 -0.960

df10 %>% 
  group_by(tumortype) %>%
  summarize(
            K = sum((genevalue - mean(genevalue))^4)/(n()*(sd(genevalue))^4) - 3
          )

# A tibble: 2 x 2
  tumortype     K
  <chr>     <dbl>
1 ALL       0.855
2 AML       0.345
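
The SK and K expressions are repeated for every data set. One way to avoid the repetition is to wrap them in a function, as sketched below. shape_measures() is a hypothetical helper written for this illustration; it uses the same formulas as the code above, that is, the third and fourth central moments divided by powers of the sample standard deviation from sd(), with 3 subtracted from the fourth-moment ratio to give excess kurtosis.

```r
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

# shape_measures(): hypothetical helper returning skewness (SK) and excess kurtosis (K)
shape_measures <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)                                # sample SD (divisor n - 1), as in the code above
  c(
      SK = sum((x - m)^3) / (n * s^3)
    , K  = sum((x - m)^4) / (n * s^4) - 3
    )
}

res <- shape_measures(notebook)
round(res, 2)
```

For the notebook data this should agree with the values reported above (SK about 0.217 and K about -1.19). Other packages may use slightly different divisors, so their results can differ a little for small samples.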
