Bioinformatics with R


Statistics

  • Statistics is the science of uncertainty & variability
  • Statistics turns data into information
  • Data -> Information -> Knowledge -> Wisdom
  • Data Driven Decisions (3Ds)
  • Statistics is the interpretation of Science
  • Statistics is the Art & Science of learning from data

 

Variable

  • Characteristic that may vary from individual to individual
  • Examples: height, weight, CGPA, etc.

 

Measurement

  • Process of assigning numbers or labels to objects or states in accordance with logically accepted rules

 

Measurement Scales

  • Nominal Scale: Observations may be classified into mutually exclusive & exhaustive classes or categories
  • Ordinal Scale: Observations may be ranked
  • Interval Scale: Difference between observations is meaningful
  • Ratio Scale: Ratio between observations is meaningful & there is a true zero point
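
In R, the nominal and ordinal scales map naturally onto unordered and ordered factors. The sketch below is illustrative only: the vectors gender and grade hold made-up values rather than data from this post, and the level order (F lowest, A highest) is an assumption.

```r
# Nominal scale: categories with no inherent order
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal scale: categories with a meaningful rank order (assumed: F lowest, A highest)
grade <- factor(
    c("B", "A", "F", "C")
  , levels  = c("F", "D", "C", "B", "A")
  , ordered = TRUE
  )

is.ordered(gender)   # FALSE: nominal categories cannot be ranked
is.ordered(grade)    # TRUE
grade < "B"          # comparisons are defined only for ordered factors
max(grade)           # highest grade observed
```

Interval and ratio measurements are stored as ordinary numeric vectors; R does not distinguish between the two, so that distinction has to be kept in mind by the analyst.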

 

Nominal Data

Example

The following data shows the gender of a sample of twenty students from the University of Agriculture, Faisalabad.

 

Student Gender
1 Male
2 Male
3 Female
4 Female
5 Female
6 Female
7 Male
8 Male
9 Male
10 Male
11 Female
12 Female
13 Male
14 Male
15 Female
16 Female
17 Male
18 Male
19 Male
20 Male

 

if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)  # attach the package so that %>%, tibble() and ggplot() are available below
df1 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
  )
df1
# A tibble: 20 x 2
   Student Gender
     <dbl> <chr> 
 1       1 Male  
 2       2 Male  
 3       3 Female
 4       4 Female
 5       5 Female
 6       6 Female
 7       7 Male  
 8       8 Male  
 9       9 Male  
10      10 Male  
11      11 Female
12      12 Female
13      13 Male  
14      14 Male  
15      15 Female
16      16 Female
17      17 Male  
18      18 Male  
19      19 Male  
20      20 Male  

Frequency Table

df1Freq <- 
      df1 %>%
      dplyr::count(Gender) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100  
      )
df1Freq
# A tibble: 2 x 4
  Gender     f    rf    pf
  <chr>  <int> <dbl> <dbl>
1 Female     8   0.4    40
2 Male      12   0.6    60
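
The relative frequency (rf) is each count divided by the total, and the percentage frequency (pf) is rf multiplied by 100. As a cross-check, the same table can be sketched in base R without the tidyverse; the object names gender, f, rf and pf below are chosen for illustration.

```r
# Same gender data as in df1, rebuilt as a plain character vector
gender <- rep(
    x     = c("Male", "Female", "Male", "Female", "Male", "Female", "Male")
  , times = c(2, 4, 4, 2, 2, 2, 4)
  )

f  <- table(gender)      # absolute frequencies
rf <- prop.table(f)      # relative frequencies: f / sum(f)
pf <- 100 * rf           # percentage frequencies

data.frame(
    Gender = names(f)
  , f      = as.vector(f)
  , rf     = as.vector(rf)
  , pf     = as.vector(pf)
  )
```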

Simple Bar Chart

ggplot(
       data = df1
    ,  mapping = aes(x = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

 

Example

The following data give the number of each nucleotide (A, C, G, T) in a gene sequence, using the Zyxin gene, which plays an important role in cell adhesion (Golub et al., 1999). The accession number (X94991.1) of one of its variants can be found in a database such as NCBI (UniGene). The data are used to illustrate the construction of a bar chart from the frequency table of the four nucleotides.

A C G T
410 789 573 394

Data from GenBank can also be imported directly with the following code (commented out because it requires an internet connection).

# install.packages(pkgs = "ape", repos = "https://cran.r-project.org", dependencies = TRUE)
# library(ape)
# df2 <- table(read.GenBank(c("X94991.1"),as.character=TRUE))
# df2

Frequency Table

df3 <- tibble(
    G.Sequence   = c("A", "C", "G", "T")
  , Nucleotides  = c(410, 789, 573, 394)
  )
df3
# A tibble: 4 x 2
  G.Sequence Nucleotides
  <chr>            <dbl>
1 A                  410
2 C                  789
3 G                  573
4 T                  394

Simple Bar Chart

ggplot(
        data = df3
      , mapping = aes(x = G.Sequence, y = Nucleotides)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gene Sequence", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))
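
The same nucleotide frequencies could also be drawn as a pie chart. ggplot2 has no dedicated pie geom, so one common approach, sketched here, is a single stacked bar drawn in polar coordinates. The data frame name nuc and the Percent column are introduced only for this illustration.

```r
library(ggplot2)

# Nucleotide counts for the Zyxin gene variant, as given in the table above
nuc <- data.frame(
    G.Sequence  = c("A", "C", "G", "T")
  , Nucleotides = c(410, 789, 573, 394)
  )
nuc$Percent <- 100 * nuc$Nucleotides / sum(nuc$Nucleotides)

# One stacked bar (x = "") wrapped around the circle gives a pie chart
p <- ggplot(
        data = nuc
      , mapping = aes(x = "", y = Nucleotides, fill = G.Sequence)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart", fill = "Gene Sequence") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))
p
```

Bar charts are usually easier to read than pie charts when the categories have similar frequencies, as A and T do here.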

Ordinal Data

Example

The following data shows the grades of a sample of twenty students from the University of Agriculture, Faisalabad.

 

Student Grade
1 A
2 B
3 B
4 C
5 A
6 D
7 F
8 C
9 B
10 D
11 F
12 A
13 B
14 B
15 C
16 D
17 C
18 B
19 C
20 D

 

df4 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Grade  = c("A", "B", "B", "C", "A", "D", "F", "C", "B", "D", "F", "A", "B", "B", "C", "D", "C",  "B", "C", "D")
  )
df4
# A tibble: 20 x 2
   Student Grade
     <dbl> <chr>
 1       1 A    
 2       2 B    
 3       3 B    
 4       4 C    
 5       5 A    
 6       6 D    
 7       7 F    
 8       8 C    
 9       9 B    
10      10 D    
11      11 F    
12      12 A    
13      13 B    
14      14 B    
15      15 C    
16      16 D    
17      17 C    
18      18 B    
19      19 C    
20      20 D    

Frequency Table

df4Freq <- 
      df4 %>%
      dplyr::count(Grade) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100
      , cf = cumsum(f)
      )
df4Freq
# A tibble: 5 x 5
  Grade     f    rf    pf    cf
  <chr> <int> <dbl> <dbl> <int>
1 A         3  0.15    15     3
2 B         6  0.3     30     9
3 C         5  0.25    25    14
4 D         4  0.2     20    18
5 F         2  0.1     10    20

Simple Bar Chart

ggplot(
       data = df4
    ,  mapping = aes(x = Grade)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Grades", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

 

Two Way Contingency Table

Example

The following data shows the gender and residential status (RS) of a sample of twenty students from the University of Agriculture, Faisalabad. In the code, Boarding is abbreviated to B and Non-Boarding to NB.

 

Student Gender RS
1 Male Boarding
2 Male Non-Boarding
3 Female Non-Boarding
4 Female Boarding
5 Female Boarding
6 Female Non-Boarding
7 Male Non-Boarding
8 Male Boarding
9 Male Non-Boarding
10 Male Non-Boarding
11 Female Non-Boarding
12 Female Boarding
13 Male Non-Boarding
14 Male Non-Boarding
15 Female Non-Boarding
16 Female Non-Boarding
17 Male Non-Boarding
18 Male Non-Boarding
19 Male Non-Boarding
20 Male Boarding

 

df5 <- tibble::tibble(
    Student = seq(from = 1, to = 20, by = 1)
  , Gender  = rep(x = c("Male", "Female", "Male", "Female", "Male", "Female", "Male"), c(2, 4, 4, 2, 2, 2, 4))
  , RS      = c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB", "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")
  )
df5
# A tibble: 20 x 3
   Student Gender RS   
     <dbl> <chr>  <chr>
 1       1 Male   B    
 2       2 Male   NB   
 3       3 Female NB   
 4       4 Female B    
 5       5 Female B    
 6       6 Female NB   
 7       7 Male   NB   
 8       8 Male   B    
 9       9 Male   NB   
10      10 Male   NB   
11      11 Female NB   
12      12 Female B    
13      13 Male   NB   
14      14 Male   NB   
15      15 Female NB   
16      16 Female NB   
17      17 Male   NB   
18      18 Male   NB   
19      19 Male   NB   
20      20 Male   B    

Cross Tables

df5 %>%
  dplyr::count(Gender) %>%
  dplyr::rename(f = n)
# A tibble: 2 x 2
  Gender     f
  <chr>  <int>
1 Female     8
2 Male      12
df5 %>%
  dplyr::count(RS) %>%
  dplyr::rename(f = n)
# A tibble: 2 x 2
  RS        f
  <chr> <int>
1 B         6
2 NB       14
df5 %>%
  dplyr::count(Gender, RS) %>%
  dplyr::rename(f = n)
# A tibble: 4 x 3
  Gender RS        f
  <chr>  <chr> <int>
1 Female B         3
2 Female NB        5
3 Male   B         3
4 Male   NB        9
if (!require("janitor")) install.packages("janitor")
# library(janitor)
df5CrossTab <- 
      df5 %>%
      janitor::tabyl(dat = ., var1 = Gender, var2 = RS) %>%
      janitor::adorn_totals(dat = ., where = c("row", "col"))
df5CrossTab
 Gender B NB Total
 Female 3  5     8
   Male 3  9    12
  Total 6 14    20
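
Counts in a two-way table are easier to compare across groups of different sizes when converted to row proportions. The base R sketch below rebuilds the same data as plain vectors (the names gender, rs, tab and row_prop are chosen for illustration) and conditions on Gender.

```r
# Same data as df5, as plain vectors
gender <- rep(
    x     = c("Male", "Female", "Male", "Female", "Male", "Female", "Male")
  , times = c(2, 4, 4, 2, 2, 2, 4)
  )
rs <- c("B", "NB", "NB", "B", "B", "NB", "NB", "B", "NB", "NB",
        "NB", "B", "NB", "NB", "NB", "NB", "NB", "NB", "NB", "B")

tab <- table(Gender = gender, RS = rs)      # two-way contingency table of counts

# margin = 1 divides each row by its row total, so every row sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 3)
```

Using margin = 2 instead would give column proportions, that is, the gender split within each residential status.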

Simple Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Multiple Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender, fill = RS)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS, fill = Gender)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Component Bar Charts

ggplot(
       data = df5
    ,  mapping = aes(x = Gender, fill = RS)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Gender", fill = "Residential Status", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df5
    ,  mapping = aes(x = RS, fill = Gender)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Residential Status", fill = "Gender", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Example

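Source: OMB Statistical Policy Working Paper 22. https://www.hhs.gov/sites/default/files/spwp22.pdf

The following data set contains information on delinquent children. The recorded variables are the number of delinquent children by county and the education level of the household head. In the code below, the column Delinquent appears to hold the county label (Alpha, Beta, Gamma, Delta) and Freq the number of delinquent children.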

Cross Tables

df6 <- tibble(
  Delinquent = gl(n = 4, k = 4, length = 16, labels = c("Alpha", "Beta", "Gamma", "Delta"))
, EduLevel   = gl(n = 4, k = 1, length = 16, labels = c("Low", "Medium", "High", "Very High"))
, Freq       = c(15, 0, 5, 0, 20, 10, 10, 15, 5, 10, 10, 0, 10, 15, 5, 5)
)

df6
# A tibble: 16 x 3
   Delinquent EduLevel   Freq
   <fct>      <fct>     <dbl>
 1 Alpha      Low          15
 2 Alpha      Medium        0
 3 Alpha      High          5
 4 Alpha      Very High     0
 5 Beta       Low          20
 6 Beta       Medium       10
 7 Beta       High         10
 8 Beta       Very High    15
 9 Gamma      Low           5
10 Gamma      Medium       10
11 Gamma      High         10
12 Gamma      Very High     0
13 Delta      Low          10
14 Delta      Medium       15
15 Delta      High          5
16 Delta      Very High     5
df6 %>%
  xtabs(data = ., Freq ~ Delinquent)
Delinquent
Alpha  Beta Gamma Delta 
   20    55    25    35 
df6 %>%
  xtabs(data = ., Freq ~ EduLevel)
EduLevel
      Low    Medium      High Very High 
       50        35        30        20 
df6 %>%
  xtabs(data = ., Freq ~ Delinquent + EduLevel)
          EduLevel
Delinquent Low Medium High Very High
     Alpha  15      0    5         0
     Beta   20     10   10        15
     Gamma   5     10   10         0
     Delta  10     15    5         5

Multiple Bar Charts

ggplot(
       data = df6
    ,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df6
    ,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Multiple Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Component Bar Charts

ggplot(
       data = df6
    ,  mapping = aes(x = Delinquent, y = Freq, fill = EduLevel)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Delinquent", fill = "Education Level", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(
       data = df6
    ,  mapping = aes(x = EduLevel, y = Freq, fill = Delinquent)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Component Bar Chart", x = "Education Level", fill = "Delinquent", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

Count Data

Example

The following data shows the number of notebooks kept by each student in a sample of twenty students.

 

Student Notebook
1 3
2 1
3 0
4 2
5 2
6 4
7 5
8 1
9 1
10 2
11 3
12 4
13 2
14 5
15 1
16 5
17 4
18 2
19 2
20 3

 

df7 <- tibble::tibble(
    Student  = seq(from = 1, to = 20, by = 1)
  , Notebook = c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)
  )
df7
# A tibble: 20 x 2
   Student Notebook
     <dbl>    <dbl>
 1       1        3
 2       2        1
 3       3        0
 4       4        2
 5       5        2
 6       6        4
 7       7        5
 8       8        1
 9       9        1
10      10        2
11      11        3
12      12        4
13      13        2
14      14        5
15      15        1
16      16        5
17      17        4
18      18        2
19      19        2
20      20        3

Frequency Table

df7Freq <- 
      df7 %>%
      dplyr::count(Notebook) %>%
      dplyr::rename(f = n) %>%
      dplyr::mutate(
        rf = f/sum(f)
      , pf = rf*100
      , cf = cumsum(f)
      )
df7Freq
# A tibble: 6 x 5
  Notebook     f    rf    pf    cf
     <dbl> <int> <dbl> <dbl> <int>
1        0     1  0.05     5     1
2        1     4  0.2     20     5
3        2     6  0.3     30    11
4        3     3  0.15    15    14
5        4     3  0.15    15    17
6        5     3  0.15    15    20

Simple Bar Chart

ggplot(
       data = df7
    ,  mapping = aes(x = Notebook)) +
  geom_bar() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Simple Bar Chart", x = "Notebooks", y = "Frequency") + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))

 

Continuous Data

Example

The following data are the final plant heights (cm) of thirty wheat plants.

87 91 89 88 89 91 87 92 90 98 95 97 96 100 101
96 98 99 98 100 102 99 101 105 103 107 105 106 107 112

df8 <- tibble::tibble(
  PlantHeight = c(
  87,   91, 89, 88, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100,    101,
  96,   98, 99, 98, 100,    102,    99, 101,    105,    103,    107,    105,    106,    107,    112
  )
)
df8
# A tibble: 30 x 1
   PlantHeight
         <dbl>
 1          87
 2          91
 3          89
 4          88
 5          89
 6          91
 7          87
 8          92
 9          90
10          98
# ... with 20 more rows

Frequency Distribution

df9 <- df8 %>% 
  summarize(
    R = max(PlantHeight) - min(PlantHeight)
  , k = floor(1 + 3.3*log10(length(PlantHeight)))
  , h = R/k
    )

df8Freq <- df8 %>% 
  mutate(
          Classes = cut(
                         x              = PlantHeight
                       , breaks         = df9$k
                       , include.lowest = TRUE
                       , right          = FALSE
                       )
          ) %>%
  count(Classes) %>% 
  tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
  rename(f = n) %>%
  mutate(
    LB = readr::parse_number(x = LB)
  , UB = readr::parse_number(x = UB)
  , rf = f/sum(f)
  , pf = f/sum(f)*100
  , cf = cumsum(f)
  , MidPoint = (LB + UB)/2
  ) 
  
df8Freq
# A tibble: 5 x 8
  Classes      LB    UB     f    rf    pf    cf MidPoint
  <fct>     <dbl> <dbl> <int> <dbl> <dbl> <int>    <dbl>
1 [87,92)      87    92     8 0.267  26.7     8     89.5
2 [92,97)      92    97     4 0.133  13.3    12     94.5
3 [97,102)     97   102    10 0.333  33.3    22     99.5
4 [102,107)   102   107     5 0.167  16.7    27    104. 
5 [107,112]   107   112     3 0.1    10      30    110. 
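
The code above chooses the number of classes with the rule k = floor(1 + 3.3 log10(n)) and the class width as h = R/k, where R is the range. For the plant heights, n = 30 gives k = floor(5.87) = 5, R = 112 - 87 = 25 and h = 5. The following base R sketch makes the class boundaries explicit instead of letting cut() derive them from the number of breaks. The object names heights, brks and classes are chosen for illustration.

```r
heights <- c(
   87,  91,  89,  88,  89,  91,  87,  92,  90,  98,  95,  97,  96, 100, 101,
   96,  98,  99,  98, 100, 102,  99, 101, 105, 103, 107, 105, 106, 107, 112
  )

n <- length(heights)
k <- floor(1 + 3.3 * log10(n))        # number of classes
R <- max(heights) - min(heights)      # range
h <- R / k                            # class width

# k + 1 equally spaced boundaries starting at the minimum
brks <- seq(from = min(heights), by = h, length.out = k + 1)

# Left-closed classes; include.lowest = TRUE closes the last class on the right
classes <- cut(heights, breaks = brks, include.lowest = TRUE, right = FALSE)
table(classes)
```

With explicit boundaries the classes are exactly 5 cm wide and start at the observed minimum, which makes the table easier to reproduce by hand.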

Histogram

ggplot(
       data = df8
     , mapping = aes(x = PlantHeight)) + 
  geom_histogram() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Histogram for Plant Height", x = "Plant Height", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))
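
Called without arguments, geom_histogram() uses 30 bins and prints a message asking for a better choice. A possible refinement, sketched below, is to reuse the class width from the frequency distribution. The values binwidth = 5 and boundary = 87 are taken from that table (h = 5, minimum 87), and the data frame name heights is chosen for illustration.

```r
library(ggplot2)

heights <- data.frame(
  PlantHeight = c(
     87,  91,  89,  88,  89,  91,  87,  92,  90,  98,  95,  97,  96, 100, 101,
     96,  98,  99,  98, 100, 102,  99, 101, 105, 103, 107, 105, 106, 107, 112
    )
  )

# Bins of width 5 aligned at 87 and closed on the left, mirroring the classes above
p <- ggplot(
        data = heights
      , mapping = aes(x = PlantHeight)) +
  geom_histogram(binwidth = 5, boundary = 87, closed = "left", colour = "white") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Histogram for Plant Height", x = "Plant Height", y = "Frequency") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p
```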

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = f))  + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Frequency for Plant Height", x = "Mid Point", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df8Freq, mapping = aes(x = MidPoint, y = cf)) + 
  geom_point()+ 
  geom_line() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

Stem and Leaf Plot

stem(x = df8$PlantHeight, scale = 1, width = 80, atom = 1e-08)

  The decimal point is 1 digit(s) to the right of the |

   8 | 77899
   9 | 0112
   9 | 566788899
  10 | 001123
  10 | 55677
  11 | 2

Box Plot

ggplot(data = df8 , aes( y = PlantHeight)) + 
  geom_boxplot()+
  theme_bw()
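
By default, geom_boxplot() draws whiskers up to 1.5 times the interquartile range beyond the quartiles and plots anything further out as individual points. The base R sketch below computes those fences for the plant heights to check whether any observation would be flagged. The object names are chosen for illustration, and quantile() is used with its default method.

```r
heights <- c(
   87,  91,  89,  88,  89,  91,  87,  92,  90,  98,  95,  97,  96, 100, 101,
   96,  98,  99,  98, 100, 102,  99, 101, 105, 103, 107, 105, 106, 107, 112
  )

q1  <- quantile(heights, probs = 0.25, names = FALSE)
q3  <- quantile(heights, probs = 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences would be drawn as separate points
outliers <- heights[heights < lower_fence | heights > upper_fence]

c(Q1 = q1, Q3 = q3, IQR = iqr, Lower = lower_fence, Upper = upper_fence)
outliers
```

All thirty heights fall between the fences (75.5 and 117.5), so the box plot shows no outlying points.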

 

Example

The golub table contains gene expression values for 3051 genes from 38 leukemia patients. Twenty-seven patients were diagnosed with acute lymphoblastic leukemia (ALL) and eleven with acute myeloid leukemia (AML). The golub.gnames table contains information on each gene, including gene index, manufacturing ID, and biological name. The following table presents the gene expression values by tumor type.

tumortype genevalue tumortype genevalue
ALL 2.10892 ALL 1.78352
ALL 1.52405 ALL 0.45827
ALL 1.96403 ALL 2.18119
ALL 2.33597 ALL 2.31428
ALL 1.85111 ALL 1.99927
ALL 1.99391 ALL 1.36844
ALL 2.06597 ALL 2.37351
ALL 1.81649 ALL 1.83485
ALL 2.17622 AML 0.88941
ALL 1.80861 AML 1.45014
ALL 2.44562 AML 0.42904
ALL 1.90496 AML 0.82667
ALL 2.76610 AML 0.63637
ALL 1.32551 AML 1.02250
ALL 2.59385 AML 0.12758
ALL 1.92776 AML -0.74333
ALL 1.10546 AML 0.73784
ALL 1.27645 AML 0.49470
ALL 1.83051 AML 1.12058
df10 <- tibble(
   genevalue = c(
              2.10892, 1.52405, 1.96403, 2.33597, 1.85111, 1.99391
            , 2.06597, 1.81649, 2.17622, 1.80861, 2.44562, 1.90496
            , 2.76610, 1.32551, 2.59385, 1.92776, 1.10546, 1.27645
            , 1.83051, 1.78352, 0.45827, 2.18119, 2.31428, 1.99927
            , 1.36844, 2.37351, 1.83485, 0.88941, 1.45014, 0.42904
            , 0.82667, 0.63637, 1.02250, 0.12758, -0.74333, 0.73784
            , 0.49470, 1.12058
            )
  , tumortype = rep(c("ALL","AML"), c(27, 11))
)

df10
# A tibble: 38 x 2
   genevalue tumortype
       <dbl> <chr>    
 1      2.11 ALL      
 2      1.52 ALL      
 3      1.96 ALL      
 4      2.34 ALL      
 5      1.85 ALL      
 6      1.99 ALL      
 7      2.07 ALL      
 8      1.82 ALL      
 9      2.18 ALL      
10      1.81 ALL      
# ... with 28 more rows

Frequency Distribution

df11 <- df10 %>% 
  summarize(
    R = max(genevalue) - min(genevalue)
  , k = floor(1 + 3.3*log10(length(genevalue)))
  , h = R/k
    )

df10Freq <- df10 %>% 
  mutate(
          Classes = cut(
                         x              = genevalue
                       , breaks         = df11$k
                       , include.lowest = TRUE
                       , right          = FALSE
                       )
          ) %>%
  count(Classes) %>% 
  tidyr::separate(col = Classes, into = c("LB", "UB"), sep = ",", remove = FALSE) %>%
  rename(f = n) %>%
  mutate(
    LB = readr::parse_number(x = LB)
  , UB = readr::parse_number(x = UB)
  , rf = f/sum(f)
  , pf = f/sum(f)*100
  , cf = cumsum(f)
  , MidPoint = (LB + UB)/2
  ) 
  
df10Freq
# A tibble: 6 x 8
  Classes             LB     UB     f     rf    pf    cf MidPoint
  <fct>            <dbl>  <dbl> <int>  <dbl> <dbl> <int>    <dbl>
1 [-0.747,-0.158) -0.747 -0.158     1 0.0263  2.63     1   -0.452
2 [-0.158,0.426)  -0.158  0.426     1 0.0263  2.63     2    0.134
3 [0.426,1.01)     0.426  1.01      7 0.184  18.4      9    0.718
4 [1.01,1.6)       1.01   1.6       8 0.211  21.1     17    1.31 
5 [1.6,2.18)       1.6    2.18     15 0.395  39.5     32    1.89 
6 [2.18,2.77]      2.18   2.77      6 0.158  15.8     38    2.48 

Histogram

ggplot(
       data = df10
     , mapping = aes(x = genevalue)) + 
  geom_histogram() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Histogram for Gene Value", x = "Gene Value", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = f))  + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Frequency for Gene Value", x = "Mid Point", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = df10Freq, mapping = aes(x = MidPoint, y = cf)) + 
  geom_point()+ 
  geom_line() +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()+
    labs(title = "Cumulative Frequency Polygon", x = "Mid Point", y = "Cumulative Frequency") + 
  theme(plot.title = element_text(hjust = 0.5))

Stem and Leaf Plot

stem(x = df10$genevalue, scale = 1, width = 80, atom = 1e-08)

  The decimal point is at the |

  -0 | 7
  -0 | 
   0 | 14
   0 | 556789
   1 | 011334
   1 | 5588888999
   2 | 00011223344
   2 | 68

Box Plot

ggplot(data = df10 , aes(x = tumortype, y = genevalue)) + 
  geom_boxplot()+
  theme_bw()

 

Measures of Central Tendency

df7 %>%
  summarize(
            n       = length(Notebook)
          , Mean    = mean(Notebook)
          , Median  = median(Notebook)
          , Minimum = min(Notebook) 
          , Maximum = max(Notebook) 
          , Q1      = quantile(x = Notebook, probs = 0.25)
          , Q2      = quantile(x = Notebook, probs = 0.50)
          , Q3      = quantile(x = Notebook, probs = 0.75)
  )
# A tibble: 1 x 8
      n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    20   2.6      2       0       5  1.75     2     4
df8 %>%
  summarize(
            n       = length(PlantHeight)
          , Mean    = mean(PlantHeight)
          , Median  = median(PlantHeight)
          , Minimum = min(PlantHeight) 
          , Maximum = max(PlantHeight) 
          , Q1      = quantile(x = PlantHeight, probs = 0.25)
          , Q2      = quantile(x = PlantHeight, probs = 0.50)
          , Q3      = quantile(x = PlantHeight, probs = 0.75)
  )
# A tibble: 1 x 8
      n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1    30  97.6     98      87     112  91.2    98  102.
df10 %>%
  group_by(tumortype) %>%
  summarize(
            n       = length(genevalue)
          , Mean    = mean(genevalue)
          , Median  = median(genevalue)
          , Minimum = min(genevalue) 
          , Maximum = max(genevalue) 
          , Q1      = quantile(x = genevalue, probs = 0.25)
          , Q2      = quantile(x = genevalue, probs = 0.50)
          , Q3      = quantile(x = genevalue, probs = 0.75)
  )
# A tibble: 2 x 9
  tumortype     n  Mean Median Minimum Maximum    Q1    Q2    Q3
  <chr>     <int> <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>
1 ALL          27 1.89   1.93    0.458    2.77 1.80  1.93  2.18 
2 AML          11 0.636  0.738  -0.743    1.45 0.462 0.738 0.956
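
The mode is another measure of central tendency and is particularly natural for count data such as the number of notebooks. Base R has no function for the statistical mode (mode() returns the storage type of an object), so the helper stat_mode() below is a hypothetical function written for this illustration.

```r
notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)

# Return the most frequent value(s); more than one value is returned in case of ties
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

stat_mode(notebook)
```

For the notebook data the mode is 2, which agrees with the largest frequency (f = 6) in the frequency table.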

 

Measures of Dispersion

df7 %>%
  summarize(
            IQR      = IQR(Notebook)
          , Variance = var(Notebook)
          ,  SD      = sd(Notebook)
          )
# A tibble: 1 x 3
    IQR Variance    SD
  <dbl>    <dbl> <dbl>
1  2.25     2.25  1.50
df8 %>%
  summarize(
            IQR      = IQR(PlantHeight)
          , Variance = var(PlantHeight)
          ,  SD      = sd(PlantHeight)
          )
# A tibble: 1 x 3
    IQR Variance    SD
  <dbl>    <dbl> <dbl>
1  10.5     45.0  6.71
df10 %>%
  group_by(tumortype) %>%
  summarize(
            IQR      = IQR(genevalue)
          , Variance = var(genevalue)
          ,  SD      = sd(genevalue)
          )
# A tibble: 2 x 4
  tumortype   IQR Variance    SD
  <chr>     <dbl>    <dbl> <dbl>
1 ALL       0.383    0.241 0.491
2 AML       0.494    0.338 0.582
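
For reference, var() and sd() in R use the sample formulas with n - 1 in the denominator. The worked arithmetic below for the notebook data (n = 20, sum of values 52, sum of squares 178) reproduces the values reported above.

```latex
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
    = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1},
\qquad
s = \sqrt{s^2}
\]
\[
\bar{x} = \frac{52}{20} = 2.6,
\qquad
s^2 = \frac{178 - 20\,(2.6)^2}{19} = \frac{42.8}{19} \approx 2.25,
\qquad
s \approx 1.50
\]
```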

 

Measures of Skewness

df7 %>% 
  summarize(
           SK = sum((Notebook - mean(Notebook))^3)/(n()*(sd(Notebook))^3)
          )
# A tibble: 1 x 1
     SK
  <dbl>
1 0.217
df8 %>% 
  summarize(
           SK = sum((PlantHeight - mean(PlantHeight))^3)/(n()*(sd(PlantHeight))^3)
          )
# A tibble: 1 x 1
      SK
   <dbl>
1 0.0574
df10 %>% 
  group_by(tumortype) %>%
  summarize(
           SK = sum((genevalue - mean(genevalue))^3)/(n()*(sd(genevalue))^3)
          )
# A tibble: 2 x 2
  tumortype     SK
  <chr>      <dbl>
1 ALL       -0.802
2 AML       -0.937
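
Written out, the statistic computed by the code above is the third central moment divided by the cube of the standard deviation. Because sd() uses n - 1 in its denominator while the moment is divided by n, the values can differ slightly from skewness functions in other software that use a different convention.

```latex
\[
SK = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{n\,s^3},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}
\]
\[
\text{Notebook data: } \quad
SK = \frac{14.64}{20 \times (1.5009)^3} \approx 0.217
\]
```

A positive value indicates a longer right tail and a negative value a longer left tail; values near zero indicate an approximately symmetric distribution.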

 

Measures of Kurtosis

df7 %>%
  summarize(
            K = sum((Notebook - mean(Notebook))^4)/(n()*(sd(Notebook))^4) - 3
          )
# A tibble: 1 x 1
      K
  <dbl>
1 -1.19
df8 %>%
  summarize(
            K = sum((PlantHeight - mean(PlantHeight))^4)/(n()*(sd(PlantHeight))^4) - 3
          )
# A tibble: 1 x 1
       K
   <dbl>
1 -0.960
df10 %>% 
  group_by(tumortype) %>%
  summarize(
            K = sum((genevalue - mean(genevalue))^4)/(n()*(sd(genevalue))^4) - 3
          )
# A tibble: 2 x 2
  tumortype     K
  <chr>     <dbl>
1 ALL       0.855
2 AML       0.345
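
The expression used above is an excess kurtosis: the fourth central moment divided by the fourth power of the standard deviation, minus 3, so that a normal distribution scores approximately zero. To avoid repeating the formula for every variable, it could be wrapped in a small helper. The function name excess_kurtosis() is hypothetical and simply mirrors the calculation used in this post.

```r
# Excess kurtosis as computed above: sum((x - mean)^4) / (n * sd^4) - 3
# Note: sd() uses n - 1 in its denominator, as in the code above
excess_kurtosis <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4) - 3
}

notebook <- c(3, 1, 0, 2, 2, 4, 5, 1, 1, 2, 3, 4, 2, 5, 1, 5, 4, 2, 2, 3)
round(excess_kurtosis(notebook), 2)
```

Negative values, as for the notebook and plant height data, indicate a flatter distribution with lighter tails than the normal; positive values indicate heavier tails.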