1 Building Empirical Models

Tip: Major Themes of Chapter 6
  • Empirical Models: Building mathematical relationships from data

  • Simple Linear Regression: Modeling relationships between two variables

  • Multiple Regression: Extending to multiple predictor variables

  • Model Adequacy: Checking assumptions and validating models

  • Advanced Topics: Polynomial models, categorical variables, and variable selection

Important: Learning Objectives

After careful study of this chapter, you should be able to do the following:

  1. Understand the fundamental concepts of empirical modeling and regression.

  2. Apply least squares estimation to fit simple linear regression models.

  3. Conduct hypothesis tests and construct confidence intervals for regression parameters.

  4. Make predictions with appropriate uncertainty quantification.

  5. Check model adequacy using residual analysis and diagnostic plots.

  6. Understand the relationship between correlation and regression.

  7. Fit and interpret multiple regression models.

  8. Work with polynomial models and categorical variables.

  9. Apply variable selection techniques for model building.

1.1 Introduction to Empirical Models

Note: What are Empirical Models?

Empirical models are mathematical relationships derived from observed data rather than theoretical principles. They are essential in engineering for:

  • Process Optimization: Understanding how input variables affect outputs

  • Quality Control: Relating product characteristics to process conditions

  • Design: Predicting performance based on design parameters

  • Troubleshooting: Identifying factors that influence system behavior

Types of Empirical Models:

  • Linear Models: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \epsilon

  • Polynomial Models: y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \epsilon

  • Interaction Models: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon

Key Assumptions:

  • Linear relationship between predictors and response

  • Independent observations

  • Constant variance (homoscedasticity)

  • Normally distributed errors
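In R, each of these model forms maps directly onto an `lm()` formula. A minimal sketch on simulated data (the variable names here are illustrative, not from the chapter's examples):

```r
# Illustrative sketch: the three model forms written as lm() formulas
set.seed(1)
d <- data.frame(x1 = runif(30), x2 = runif(30))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(30, sd = 0.2)

fit_linear      <- lm(y ~ x1 + x2, data = d)      # linear model
fit_polynomial  <- lm(y ~ x1 + I(x1^2), data = d) # polynomial terms need I()
fit_interaction <- lm(y ~ x1 * x2, data = d)      # expands to x1 + x2 + x1:x2

coef(fit_interaction) # intercept, x1, x2, and the beta_12 interaction term
```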

1.2 Simple Linear Regression

1.2.1 Least Squares Estimation

Note: Simple Linear Regression Model

The simple linear regression model relates a response variable y to a single predictor variable x:

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

where:

  • y_i = response variable for observation i

  • x_i = predictor variable for observation i

  • \beta_0 = intercept parameter

  • \beta_1 = slope parameter

  • \epsilon_i = random error term, \epsilon_i \sim N(0, \sigma^2)

Least Squares Estimators:

\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where:

  • S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})

  • S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2

  • S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2

1.2.2 Example: Tensile Strength vs. Hardness

Note: Example

A materials engineer wants to relate the tensile strength of steel specimens to their hardness. Data from 20 specimens:

[1] "Steel Strength vs Hardness Data Summary:"
       n Hardness_Mean Hardness_SD Hardness_Min Hardness_Max Strength_Mean
   <int>         <num>       <num>        <num>        <num>         <num>
1:    20      62.03234    12.53884     41.68238     78.27333      169.2171
   Strength_SD Strength_Min Strength_Max
         <num>        <num>        <num>
1:     29.0103     120.2922     220.4809
[1] "First 10 observations:"
    specimen hardness strength
       <int>    <num>    <num>
 1:        1 51.50310 153.5504
 2:        2 71.53221 196.7090
 3:        3 56.35908 159.1039
 4:        4 75.32070 204.1872
 5:        5 77.61869 204.6000
 6:        6 41.82226 133.8510
 7:        7 61.12422 171.7934
 8:        8 75.69676 188.5090
 9:        9 62.05740 175.7543
10:       10 58.26459 156.8791
library(collapse)
library(data.table)
library(ggplot2)
library(scales)
library(kableExtra)
library(gridExtra)
library(tidyr)
library(corrplot)

# Steel strength vs hardness data (simple linear regression)
set.seed(123)
steel_data <- data.table(
  specimen = 1:20,
  hardness = runif(20, min = 40, max = 80)
)

# Create realistic relationship with some noise
steel_data[, strength := 15 + 2.5 * hardness + rnorm(20, mean = 0, sd = 8)]

# Summary statistics
steel_summary <- steel_data %>%
  fsummarise(
    n = fnobs(hardness),
    Hardness_Mean = fmean(hardness),
    Hardness_SD = fsd(hardness),
    Hardness_Min = fmin(hardness),
    Hardness_Max = fmax(hardness),
    Strength_Mean = fmean(strength),
    Strength_SD = fsd(strength),
    Strength_Min = fmin(strength),
    Strength_Max = fmax(strength)
  )

print("Steel Strength vs Hardness Data Summary:")
print(steel_summary)

# Detailed data view
print("First 10 observations:")
print(head(steel_data, 10))

# Create data directory if needed
if (!dir.exists("data")) dir.create("data")

# Write data to CSV
fwrite(steel_data, "data/steel_strength.csv")

Manual Calculation:

Given: n = 20, \bar{x} \approx 62.03 (mean hardness), \bar{y} \approx 169.22 (mean strength)

  1. Calculate Sums:

    • S_{xx} = \sum(x_i - \bar{x})^2
    • S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})
    • S_{yy} = \sum(y_i - \bar{y})^2
  2. Estimate Parameters: \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

  3. Fitted Model: \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

[1] "Manual Simple Linear Regression Results:"
        Parameter Manual_Estimate                               Interpretation
           <char>           <num>                                       <char>
1: β₀ (Intercept)         30.4493            Strength when hardness = 0: 30.45
2:     β₁ (Slope)          2.2370 Strength increases by 2.24 per unit hardness
3:             σ²         57.8584                        Error variance: 57.86
4:             R²          0.9349                93.5 % of variation explained
[1] "R's lm() verification:"

Call:
lm(formula = strength ~ hardness, data = steel_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2758  -5.0561  -0.6948   5.4929  15.1408 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.4493     8.7991   3.461  0.00279 ** 
hardness      2.2370     0.1392  16.074 4.03e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.606 on 18 degrees of freedom
Multiple R-squared:  0.9349,    Adjusted R-squared:  0.9313 
F-statistic: 258.4 on 1 and 18 DF,  p-value: 4.031e-12
[1] "Manual vs R Function Comparison:"
     Parameter    Manual R_Function   Difference
        <char>     <num>      <num>        <num>
1:   Intercept 30.449343  30.449343 7.815970e-14
2:       Slope  2.237023   2.237023 8.881784e-16
3:   R-squared  0.934870   0.934870 0.000000e+00
4: Residual SE  7.606471   7.606471 2.664535e-15
# Simple linear regression - manual calculations
n <- fnobs(steel_data$hardness)
x_bar <- fmean(steel_data$hardness)
y_bar <- fmean(steel_data$strength)

# Calculate sums of squares and cross-products
x <- steel_data$hardness
y <- steel_data$strength

# Manual calculation of sums
Sxx <- sum((x - x_bar)^2)
Syy <- sum((y - y_bar)^2)
Sxy <- sum((x - x_bar) * (y - y_bar))

# Parameter estimates
beta1_hat <- Sxy / Sxx
beta0_hat <- y_bar - beta1_hat * x_bar

# Sum of squares
SSR <- beta1_hat * Sxy # Regression sum of squares
SSE <- Syy - beta1_hat * Sxy # Error sum of squares
SST <- Syy # Total sum of squares

# Mean squares
MSR <- SSR / 1 # df = 1 for simple regression
MSE <- SSE / (n - 2) # df = n - 2
s_squared <- MSE

# R-squared
R_squared <- SSR / SST

# Manual calculations summary
manual_results <- data.table(
  Parameter = c("β₀ (Intercept)", "β₁ (Slope)", "σ²", "R²"),
  Manual_Estimate = c(
    round(beta0_hat, 4),
    round(beta1_hat, 4),
    round(s_squared, 4),
    round(R_squared, 4)
  ),
  Interpretation = c(
    paste("Strength when hardness = 0:", round(beta0_hat, 2)),
    paste("Strength increases by", round(beta1_hat, 2), "per unit hardness"),
    paste("Error variance:", round(s_squared, 2)),
    paste(round(R_squared * 100, 1), "% of variation explained")
  )
)

print("Manual Simple Linear Regression Results:")
print(manual_results)

# Verification with R's lm function
lm_model <- lm(strength ~ hardness, data = steel_data)
lm_summary <- summary(lm_model)

print("R's lm() verification:")
print(lm_summary)

# Compare manual vs R calculations
comparison <- data.table(
  Parameter = c("Intercept", "Slope", "R-squared", "Residual SE"),
  Manual = c(beta0_hat, beta1_hat, R_squared, sqrt(MSE)),
  R_Function = c(
    coef(lm_model)[1], coef(lm_model)[2],
    summary(lm_model)$r.squared, summary(lm_model)$sigma
  ),
  Difference = c(
    abs(beta0_hat - coef(lm_model)[1]),
    abs(beta1_hat - coef(lm_model)[2]),
    abs(R_squared - summary(lm_model)$r.squared),
    abs(sqrt(MSE) - summary(lm_model)$sigma)
  )
)

print("Manual vs R Function Comparison:")
print(comparison)
Figure 1: Simple linear regression: Tensile strength vs. hardness
# Simple linear regression visualization
p1 <- ggplot(data = steel_data, mapping = aes(x = hardness, y = strength)) +
  geom_point(size = 3, alpha = 0.7, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(
    x = "Hardness",
    y = "Tensile Strength",
    title = "Simple Linear Regression: Tensile Strength vs Hardness",
    subtitle = paste("ŷ =", round(beta0_hat, 2), "+", round(beta1_hat, 2), "x, R² =", round(R_squared, 3))
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# Residuals vs fitted
fitted_values <- beta0_hat + beta1_hat * steel_data$hardness
residuals <- steel_data$strength - fitted_values

residual_data <- data.table(
  fitted = fitted_values,
  residuals = residuals,
  hardness = steel_data$hardness
)

p2 <- ggplot(data = residual_data, mapping = aes(x = fitted, y = residuals)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(
    x = "Fitted Values",
    y = "Residuals",
    title = "Residuals vs Fitted Values"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine plots
grid.arrange(p1, p2, nrow = 1, ncol = 2)

1.2.3 Testing Hypotheses in Simple Linear Regression

Note: Hypothesis Tests for Regression Parameters

Test for Slope (Significance of Regression):

H₀: β₁ = 0 vs H₁: β₁ ≠ 0

Test Statistic:

t_0 = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{MS_E/S_{xx}}}

where MS_E = \frac{SS_E}{n-2} and SS_E = S_{yy} - \hat{\beta}_1 S_{xy}

Test for Intercept:

H₀: β₀ = 0 vs H₁: β₀ ≠ 0

Test Statistic:

t_0 = \frac{\hat{\beta}_0}{SE(\hat{\beta}_0)} = \frac{\hat{\beta}_0}{\sqrt{MS_E(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}})}}

ANOVA Table for Regression:

Source       SS                                    df     MS     F
Regression   SS_R = \hat{\beta}_1 S_{xy}           1      MS_R   MS_R/MS_E
Error        SS_E = S_{yy} - \hat{\beta}_1 S_{xy}  n-2    MS_E
Total        SS_T = S_{yy}                         n-1
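R's built-in `anova()` reproduces this table from a fitted model. A quick sketch, regenerating the same simulated steel data used in the example above:

```r
# Sketch: anova() on a simple linear regression yields the ANOVA table
set.seed(123)
hardness <- runif(20, min = 40, max = 80) # same simulation as the steel example
strength <- 15 + 2.5 * hardness + rnorm(20, mean = 0, sd = 8)
fit <- lm(strength ~ hardness)

anova(fit) # rows: hardness (regression, df = 1) and Residuals (error, df = n - 2)
```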

1.2.4 Example: Hypothesis Testing

[1] "Hypothesis Test Results:"
                   Test Null_Hypothesis Test_Statistic P_Value Critical_Value
                 <char>          <char>          <num>   <num>          <num>
1:         Slope β₁ = 0          β₁ = 0        16.0739  0.0000          2.101
2:     Intercept β₀ = 0          β₀ = 0         3.4605  0.0028          2.101
3: Overall Model F-test          β₁ = 0       258.3704  0.0000          4.414
    Decision                               Conclusion
      <char>                                   <char>
1: Reject H₀          Significant linear relationship
2: Reject H₀ Intercept significantly different from 0
3: Reject H₀                     Model is significant
[1] "ANOVA Table:"
       Source       SS    df       MS F_Statistic P_Value
       <char>    <num> <num>    <num>       <num>   <num>
1: Regression 14948.90     1 14948.90    258.3704       0
2:      Error  1041.45    18    57.86          NA      NA
3:      Total 15990.35    19       NA          NA      NA
# Hypothesis testing for regression parameters
alpha <- 0.05
df_error <- n - 2
t_critical <- qt(1 - alpha / 2, df_error)

# Standard errors
SE_beta1 <- sqrt(MSE / Sxx)
SE_beta0 <- sqrt(MSE * (1 / n + x_bar^2 / Sxx))

# Test for slope (β₁ = 0)
t_stat_beta1 <- beta1_hat / SE_beta1
p_value_beta1 <- 2 * (1 - pt(abs(t_stat_beta1), df_error))

# Test for intercept (β₀ = 0)
t_stat_beta0 <- beta0_hat / SE_beta0
p_value_beta0 <- 2 * (1 - pt(abs(t_stat_beta0), df_error))

# F-test for overall significance
F_stat <- MSR / MSE
p_value_F <- 1 - pf(F_stat, 1, df_error)

# Hypothesis test results
hypothesis_tests <- data.table(
  Test = c("Slope β₁ = 0", "Intercept β₀ = 0", "Overall Model F-test"),
  Null_Hypothesis = c("β₁ = 0", "β₀ = 0", "β₁ = 0"),
  Test_Statistic = c(round(t_stat_beta1, 4), round(t_stat_beta0, 4), round(F_stat, 4)),
  P_Value = c(round(p_value_beta1, 4), round(p_value_beta0, 4), round(p_value_F, 4)),
  Critical_Value = c(
    round(t_critical, 3), round(t_critical, 3),
    round(qf(1 - alpha, 1, df_error), 3)
  ),
  Decision = c(
    ifelse(p_value_beta1 < alpha, "Reject H₀", "Fail to reject H₀"),
    ifelse(p_value_beta0 < alpha, "Reject H₀", "Fail to reject H₀"),
    ifelse(p_value_F < alpha, "Reject H₀", "Fail to reject H₀")
  ),
  Conclusion = c(
    ifelse(p_value_beta1 < alpha, "Significant linear relationship", "No significant relationship"),
    ifelse(p_value_beta0 < alpha, "Intercept significantly different from 0", "Intercept not significant"),
    ifelse(p_value_F < alpha, "Model is significant", "Model is not significant")
  )
)

print("Hypothesis Test Results:")
print(hypothesis_tests)

# ANOVA table
anova_table <- data.table(
  Source = c("Regression", "Error", "Total"),
  SS = c(round(SSR, 2), round(SSE, 2), round(SST, 2)),
  df = c(1, df_error, n - 1),
  MS = c(round(MSR, 2), round(MSE, 2), NA),
  F_Statistic = c(round(F_stat, 4), NA, NA),
  P_Value = c(round(p_value_F, 4), NA, NA)
)

print("ANOVA Table:")
print(anova_table)

1.2.5 Confidence Intervals in Simple Linear Regression

Note: Confidence Intervals

For Slope β₁:

\hat{\beta}_1 \pm t_{\alpha/2, n-2} \sqrt{\frac{MS_E}{S_{xx}}}

For Intercept β₀:

\hat{\beta}_0 \pm t_{\alpha/2, n-2} \sqrt{MS_E\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}

For Mean Response at x₀:

\hat{y}_0 \pm t_{\alpha/2, n-2} \sqrt{MS_E\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}

Coefficient of Determination:

R^2 = \frac{SS_R}{SS_T} = \frac{\hat{\beta}_1 S_{xy}}{S_{yy}} = 1 - \frac{SS_E}{SS_T}

R^2 represents the proportion of total variation explained by the regression.
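As a cross-check, R's `confint()` computes these parameter intervals directly from a fitted model. A sketch, regenerating the steel example's simulated data:

```r
# Sketch: manual slope CI vs R's built-in confint()
set.seed(123)
hardness <- runif(20, min = 40, max = 80)
strength <- 15 + 2.5 * hardness + rnorm(20, mean = 0, sd = 8)
fit <- lm(strength ~ hardness)

n   <- length(hardness)
Sxx <- sum((hardness - mean(hardness))^2)
MSE <- sum(resid(fit)^2) / (n - 2)
b1  <- unname(coef(fit)["hardness"])

manual_ci  <- b1 + c(-1, 1) * qt(0.975, n - 2) * sqrt(MSE / Sxx)
builtin_ci <- confint(fit, "hardness", level = 0.95)

all.equal(manual_ci, as.numeric(builtin_ci)) # TRUE
```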

1.2.6 Prediction of a Future Observation

Note: Prediction Intervals

For predicting a single future observation at x₀:

\hat{y}_0 \pm t_{\alpha/2, n-2} \sqrt{MS_E\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}

Key Differences:

  • Confidence Interval: For the mean response (narrower)

  • Prediction Interval: For individual observation (wider due to additional uncertainty)

The extra “1” in the prediction interval accounts for the variability of individual observations around the mean response.
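R's `predict()` returns both kinds of interval; a sketch on the steel example's simulated data shows the prediction interval is wider than the confidence interval at the same x₀:

```r
# Sketch: built-in confidence vs prediction intervals at a new hardness value
set.seed(123)
steel <- data.frame(hardness = runif(20, min = 40, max = 80))
steel$strength <- 15 + 2.5 * steel$hardness + rnorm(20, mean = 0, sd = 8)
fit <- lm(strength ~ hardness, data = steel)

x0 <- data.frame(hardness = 60)
ci <- predict(fit, x0, interval = "confidence", level = 0.95)
pi <- predict(fit, x0, interval = "prediction", level = 0.95)

# Same point estimate; the PI carries the extra "1" term, so it is wider
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"]) # TRUE
```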

1.2.7 Example: Confidence and Prediction Intervals

[1] "95% Confidence Intervals for Mean Response:"
   Hardness Predicted CI_Lower CI_Upper CI_Width
      <num>     <num>    <num>    <num>    <num>
1: 41.68238    123.69   116.75   130.63    13.88
2: 50.83012    144.16   139.31   149.00     9.69
3: 59.97786    164.62   161.00   168.24     7.25
4: 69.12560    185.08   180.95   189.22     8.26
5: 78.27333    205.55   199.61   211.49    11.89
[1] "95% Prediction Intervals for Individual Observations:"
   Hardness Predicted PI_Lower PI_Upper PI_Width
      <num>     <num>    <num>    <num>    <num>
1: 41.68238    123.69   106.27   141.12    34.85
2: 50.83012    144.16   127.46   160.86    33.40
3: 59.97786    164.62   148.23   181.01    32.77
4: 69.12560    185.08   168.58   201.59    33.01
5: 78.27333    205.55   188.50   222.60    34.10
[1] "Interval Width Comparison:"
   Hardness CI_Width PI_Width PI_vs_CI_Ratio
      <num>    <num>    <num>          <num>
1: 41.68238    13.88    34.85           2.51
2: 50.83012     9.69    33.40           3.45
3: 59.97786     7.25    32.77           4.52
4: 69.12560     8.26    33.01           4.00
5: 78.27333    11.89    34.10           2.87
# Confidence and prediction intervals
# Choose some specific x values for demonstration
x_new <- seq(fmin(steel_data$hardness), fmax(steel_data$hardness), length.out = 5)

# Confidence intervals for mean response
CI_results <- data.table()
PI_results <- data.table()

for (x0 in x_new) {
  # Predicted value
  y_hat <- beta0_hat + beta1_hat * x0

  # Confidence interval for mean response
  SE_mean <- sqrt(MSE * (1 / n + (x0 - x_bar)^2 / Sxx))
  margin_CI <- t_critical * SE_mean
  CI_lower <- y_hat - margin_CI
  CI_upper <- y_hat + margin_CI

  # Prediction interval for individual observation
  SE_pred <- sqrt(MSE * (1 + 1 / n + (x0 - x_bar)^2 / Sxx))
  margin_PI <- t_critical * SE_pred
  PI_lower <- y_hat - margin_PI
  PI_upper <- y_hat + margin_PI

  CI_results <- rbind(CI_results, data.table(
    Hardness = x0,
    Predicted = round(y_hat, 2),
    CI_Lower = round(CI_lower, 2),
    CI_Upper = round(CI_upper, 2),
    CI_Width = round(CI_upper - CI_lower, 2)
  ))

  PI_results <- rbind(PI_results, data.table(
    Hardness = x0,
    Predicted = round(y_hat, 2),
    PI_Lower = round(PI_lower, 2),
    PI_Upper = round(PI_upper, 2),
    PI_Width = round(PI_upper - PI_lower, 2)
  ))
}

print("95% Confidence Intervals for Mean Response:")
print(CI_results)

print("95% Prediction Intervals for Individual Observations:")
print(PI_results)

# Compare CI and PI widths
interval_comparison <- data.table(
  Hardness = x_new,
  CI_Width = CI_results$CI_Width,
  PI_Width = PI_results$PI_Width,
  PI_vs_CI_Ratio = round(PI_results$PI_Width / CI_results$CI_Width, 2)
)

print("Interval Width Comparison:")
print(interval_comparison)
Figure 2: Confidence and prediction intervals
# Confidence and prediction intervals visualization
# Create fine grid for smooth curves
x_grid <- seq(fmin(steel_data$hardness), fmax(steel_data$hardness), length.out = 100)
interval_data <- data.table()

for (x0 in x_grid) {
  y_hat <- beta0_hat + beta1_hat * x0

  # Confidence interval
  SE_mean <- sqrt(MSE * (1 / n + (x0 - x_bar)^2 / Sxx))
  CI_lower <- y_hat - t_critical * SE_mean
  CI_upper <- y_hat + t_critical * SE_mean

  # Prediction interval
  SE_pred <- sqrt(MSE * (1 + 1 / n + (x0 - x_bar)^2 / Sxx))
  PI_lower <- y_hat - t_critical * SE_pred
  PI_upper <- y_hat + t_critical * SE_pred

  interval_data <- rbind(interval_data, data.table(
    hardness = x0,
    fitted = y_hat,
    CI_lower = CI_lower,
    CI_upper = CI_upper,
    PI_lower = PI_lower,
    PI_upper = PI_upper
  ))
}

ggplot() +
  # Prediction intervals (wider, lighter)
  geom_ribbon(
    data = interval_data, aes(x = hardness, ymin = PI_lower, ymax = PI_upper),
    alpha = 0.2, fill = "red"
  ) +
  # Confidence intervals (narrower, darker)
  geom_ribbon(
    data = interval_data, aes(x = hardness, ymin = CI_lower, ymax = CI_upper),
    alpha = 0.3, fill = "blue"
  ) +
  # Regression line
  geom_line(data = interval_data, aes(x = hardness, y = fitted), color = "black", linewidth = 1) +
  # Data points
  geom_point(data = steel_data, aes(x = hardness, y = strength), size = 3, alpha = 0.7) +
  labs(
    x = "Hardness",
    y = "Tensile Strength",
    title = "Confidence and Prediction Intervals",
    subtitle = "Blue = 95% CI for mean response, Red = 95% PI for individual observations"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

1.2.8 Checking Model Adequacy

Note: Residual Analysis

Residuals: e_i = y_i - \hat{y}_i

Standardized Residuals: d_i = \frac{e_i}{\sqrt{MS_E}}

Studentized Residuals: r_i = \frac{e_i}{\sqrt{MS_E(1 - h_{ii})}}, which adjust for each observation's leverage h_{ii} and are more reliable for outlier detection

Key Diagnostic Plots:

  1. Residuals vs. Fitted Values: Check for constant variance

  2. Normal Q-Q Plot: Check normality assumption

  3. Residuals vs. Order: Check for independence

  4. Residuals vs. Predictor: Check linearity assumption

What to Look For:

  • Random scatter (no patterns)

  • Constant spread across fitted values

  • Points following straight line in Q-Q plot

  • No obvious trends or cycles
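R computes leverage-adjusted residuals directly via `rstandard()` and `rstudent()`, and calling `plot()` on a fitted model draws the standard diagnostic panels. A sketch on the steel example's simulated data:

```r
# Sketch: built-in residual diagnostics for a fitted lm
set.seed(123)
steel <- data.frame(hardness = runif(20, min = 40, max = 80))
steel$strength <- 15 + 2.5 * steel$hardness + rnorm(20, mean = 0, sd = 8)
fit <- lm(strength ~ hardness, data = steel)

r_std  <- rstandard(fit) # internally studentized: e_i / sqrt(MSE * (1 - h_ii))
r_stud <- rstudent(fit)  # externally studentized: preferred for outlier detection

which(abs(r_stud) > 2)   # flag candidate outliers

op <- par(mfrow = c(2, 2))
plot(fit)                # residuals vs fitted, Q-Q, scale-location, leverage
par(op)
```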

1.2.9 Example: Model Diagnostics

[1] "Residual Analysis Summary:"
   Mean_Residuals SD_Residuals Min_Residuals Max_Residuals Mean_Std_Residuals
            <num>        <num>         <num>         <num>              <num>
1:   7.105427e-15     7.403595     -11.27576      15.14077       9.325873e-16
   SD_Std_Residuals
              <num>
1:        0.9733285
[1] "No obvious outliers detected (|std residual| > 2)"
[1] "Shapiro-Wilk normality test for residuals:"

    Shapiro-Wilk normality test

data:  residuals
W = 0.96347, p-value = 0.6152
[1] "Durbin-Watson statistic: 1.7519"
[1] "(Values near 2 indicate no autocorrelation)"
# Model adequacy checking - residual analysis
# Calculate residuals and standardized residuals
residuals_data <- data.table(
  fitted = fitted_values,
  residuals = residuals,
  standardized_residuals = residuals / sqrt(MSE),
  observation = 1:n,
  hardness = steel_data$hardness
)

# Summary statistics for residuals
residual_summary <- residuals_data %>%
  fsummarise(
    Mean_Residuals = fmean(residuals),
    SD_Residuals = fsd(residuals),
    Min_Residuals = fmin(residuals),
    Max_Residuals = fmax(residuals),
    Mean_Std_Residuals = fmean(standardized_residuals),
    SD_Std_Residuals = fsd(standardized_residuals)
  )

print("Residual Analysis Summary:")
print(residual_summary)

# Check for outliers (|standardized residual| > 2)
outliers <- residuals_data[abs(standardized_residuals) > 2]
if (nrow(outliers) > 0) {
  print("Potential outliers (|std residual| > 2):")
  print(outliers)
} else {
  print("No obvious outliers detected (|std residual| > 2)")
}

# Normality test for residuals
shapiro_test <- shapiro.test(residuals)
print("Shapiro-Wilk normality test for residuals:")
print(shapiro_test)

# Durbin-Watson test for independence (if data has natural order)
dw_stat <- sum(diff(residuals)^2) / sum(residuals^2)
print(paste("Durbin-Watson statistic:", round(dw_stat, 4)))
print("(Values near 2 indicate no autocorrelation)")
Figure 3: Residual analysis plots
# Residual analysis plots
# 1. Residuals vs Fitted
p1 <- ggplot(data = residuals_data, aes(x = fitted, y = residuals)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 2. Normal Q-Q plot
p2 <- ggplot(data = residuals_data, aes(sample = standardized_residuals)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(
    x = "Theoretical Quantiles", y = "Sample Quantiles",
    title = "Normal Q-Q Plot of Residuals"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 3. Scale-location plot
p3 <- ggplot(data = residuals_data, aes(x = fitted, y = sqrt(abs(standardized_residuals)))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    x = "Fitted Values", y = "√|Standardized Residuals|",
    title = "Scale-Location Plot"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 4. Residuals vs Order
p4 <- ggplot(data = residuals_data, aes(x = observation, y = residuals)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_line(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Observation Order", y = "Residuals", title = "Residuals vs Order") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine all diagnostic plots
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

1.2.10 Correlation and Regression

Note: Correlation vs. Regression

Correlation Coefficient:

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}

Relationship to Regression:

  • R^2 = r^2 in simple linear regression

  • Correlation measures strength of linear relationship

  • Regression provides prediction equation

Important Notes:

  • Correlation ≠ Causation

  • Both assume linear relationship

  • Sensitive to outliers

  • -1 \leq r \leq 1

Interpretation of |r|:

  • 0.0-0.3: Weak relationship

  • 0.3-0.7: Moderate relationship

  • 0.7-1.0: Strong relationship
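R's `cor.test()` bundles r, the t statistic (identical to the slope test in simple regression), and a confidence interval for ρ in one call. A sketch on the steel example's simulated data:

```r
# Sketch: correlation estimate, significance test, and CI for rho
set.seed(123)
hardness <- runif(20, min = 40, max = 80)
strength <- 15 + 2.5 * hardness + rnorm(20, mean = 0, sd = 8)

ct <- cor.test(hardness, strength)
ct$estimate  # r
ct$statistic # t = r * sqrt((n - 2) / (1 - r^2)), same as the slope's t statistic
ct$conf.int  # 95% CI for rho, via Fisher's z-transformation
```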

1.2.11 Example: Correlation Analysis

[1] "Correlation vs Regression Analysis:"
                       Measure  Value                          Interpretation
                        <char>  <num>                                  <char>
1: Correlation coefficient (r) 0.9669              Strong linear relationship
2:                          r² 0.9349                  93.5 % shared variance
3:          R² from regression 0.9349 93.5 % variance explained by regression
4:        Difference |r² - R²| 0.0000                       Perfect agreement
[1] "Correlation Significance Test:"
                       Test                             H0 t_statistic p_value
                     <char>                         <char>       <num>   <num>
1: Correlation significance ρ = 0 (no linear relationship)     16.0739       0
    decision              conclusion
      <char>                  <char>
1: Reject H₀ Significant correlation
# Correlation analysis
correlation_coef <- cor(steel_data$hardness, steel_data$strength)
correlation_squared <- correlation_coef^2

# Verify relationship between correlation and R-squared
correlation_results <- data.table(
  Measure = c("Correlation coefficient (r)", "r²", "R² from regression", "Difference |r² - R²|"),
  Value = c(
    round(correlation_coef, 4),
    round(correlation_squared, 4),
    round(R_squared, 4),
    round(abs(correlation_squared - R_squared), 6)
  ),
  Interpretation = c(
    ifelse(abs(correlation_coef) > 0.7, "Strong linear relationship",
      ifelse(abs(correlation_coef) > 0.3, "Moderate linear relationship", "Weak linear relationship")
    ),
    paste(round(correlation_squared * 100, 1), "% shared variance"),
    paste(round(R_squared * 100, 1), "% variance explained by regression"),
    ifelse(abs(correlation_squared - R_squared) < 0.001, "Perfect agreement", "Some difference")
  )
)

print("Correlation vs Regression Analysis:")
print(correlation_results)

# Test significance of correlation
t_stat_cor <- correlation_coef * sqrt((n - 2) / (1 - correlation_coef^2))
p_value_cor <- 2 * (1 - pt(abs(t_stat_cor), n - 2))

correlation_test <- data.table(
  Test = "Correlation significance",
  H0 = "ρ = 0 (no linear relationship)",
  t_statistic = round(t_stat_cor, 4),
  p_value = round(p_value_cor, 4),
  decision = ifelse(p_value_cor < alpha, "Reject H₀", "Fail to reject H₀"),
  conclusion = ifelse(p_value_cor < alpha, "Significant correlation", "No significant correlation")
)

print("Correlation Significance Test:")
print(correlation_test)

1.3 Multiple Regression

1.3.1 Estimation of Parameters in Multiple Regression

Note: Multiple Linear Regression Model

The multiple regression model with k predictors:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \epsilon_i

Matrix Form:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

where:

\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}

Least Squares Solution:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Properties:

  • Unbiased: E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}

  • Minimum variance among linear unbiased estimators

  • Normally distributed under normality assumption

1.3.2 Example: Chemical Process Yield

Note: Example

A chemical engineer studies how reaction temperature and catalyst concentration affect yield:

[1] "Chemical Process Data Summary:"
       n Temp_Mean  Temp_SD Conc_Mean  Conc_SD Yield_Mean Yield_SD
   <int>     <num>    <num>     <num>    <num>      <num>    <num>
1:    25  175.6184 13.62157  21.08764 6.724034   192.3572 14.55564
[1] "Correlation Matrix:"
              temperature concentration  yield
temperature        1.0000       -0.0674 0.7262
concentration     -0.0674        1.0000 0.5652
yield              0.7262        0.5652 1.0000
# Multiple regression example - Chemical process yield
set.seed(456)
chemical_data <- data.table(
  run = 1:25,
  temperature = runif(25, min = 150, max = 200),
  concentration = runif(25, min = 10, max = 30)
)

# Create realistic multiple regression relationship
chemical_data[, yield := 20 + 0.8 * temperature + 1.5 * concentration + rnorm(25, mean = 0, sd = 5)]

# Summary statistics
chemical_summary <- chemical_data %>%
  fsummarise(
    n = fnobs(run),
    Temp_Mean = fmean(temperature),
    Temp_SD = fsd(temperature),
    Conc_Mean = fmean(concentration),
    Conc_SD = fsd(concentration),
    Yield_Mean = fmean(yield),
    Yield_SD = fsd(yield)
  )

print("Chemical Process Data Summary:")
print(chemical_summary)

# Correlation matrix
cor_matrix <- cor(chemical_data[, .(temperature, concentration, yield)])
print("Correlation Matrix:")
print(round(cor_matrix, 4))

# Write data to CSV
fwrite(chemical_data, "data/chemical_yield.csv")

Multiple Regression Model:

Yield = \beta_0 + \beta_1 \times Temperature + \beta_2 \times Concentration + \epsilon

Manual Matrix Calculation:

  1. Design Matrix X: \mathbf{X} = \begin{bmatrix} 1 & x_{1,temp} & x_{1,conc} \\ 1 & x_{2,temp} & x_{2,conc} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,temp} & x_{n,conc} \end{bmatrix}

  2. Parameter Estimates: \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

[1] "Manual Multiple Regression Results:"
            Parameter Estimate Std_Error t_value p_value
               <char>    <num>     <num>   <num>   <num>
1:     β₀ (Intercept)  20.1051   12.8589  1.5635  0.1322
2:   β₁ (Temperature)   0.8205    0.0699 11.7400  0.0000
3: β₂ (Concentration)   1.3355    0.1416  9.4333  0.0000
[1] "Model Summary Statistics:"
     Statistic    Value
        <char>    <num>
1:          R²   0.9063
2: Adjusted R²   0.8978
3: Residual SE   4.6530
4: F-statistic 106.4279
[1] "R's lm() verification:"

Call:
lm(formula = yield ~ temperature + concentration, data = chemical_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6632 -1.8041 -0.8522  1.3582 10.2857 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.10507   12.85893   1.564    0.132    
temperature    0.82047    0.06989  11.740 6.05e-11 ***
concentration  1.33553    0.14158   9.433 3.45e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.653 on 22 degrees of freedom
Multiple R-squared:  0.9063,    Adjusted R-squared:  0.8978 
F-statistic: 106.4 on 2 and 22 DF,  p-value: 4.873e-12
# Multiple regression - manual matrix calculations
# Design matrix X
n_multi <- nrow(chemical_data)
X <- cbind(1, chemical_data$temperature, chemical_data$concentration)
y <- chemical_data$yield

# Manual matrix calculations
XtX <- t(X) %*% X
Xty <- t(X) %*% y
beta_hat <- solve(XtX) %*% Xty

# Extract coefficients
beta0_multi <- beta_hat[1, 1]
beta1_multi <- beta_hat[2, 1]
beta2_multi <- beta_hat[3, 1]

# Calculate fitted values and residuals
y_fitted <- X %*% beta_hat
residuals_multi <- y - y_fitted

# Sum of squares
SST_multi <- sum((y - mean(y))^2)
SSE_multi <- sum(residuals_multi^2)
SSR_multi <- SST_multi - SSE_multi

# Degrees of freedom
k <- 2 # number of predictors
df_regression <- k
df_error_multi <- n_multi - k - 1
df_total <- n_multi - 1

# Mean squares
MSR_multi <- SSR_multi / df_regression
MSE_multi <- SSE_multi / df_error_multi

# R-squared and adjusted R-squared
R_squared_multi <- SSR_multi / SST_multi
R_squared_adj <- 1 - (SSE_multi / df_error_multi) / (SST_multi / df_total)

# Standard errors of coefficients
var_covar_matrix <- MSE_multi * solve(XtX)
SE_beta0_multi <- sqrt(var_covar_matrix[1, 1])
SE_beta1_multi <- sqrt(var_covar_matrix[2, 2])
SE_beta2_multi <- sqrt(var_covar_matrix[3, 3])

# Manual multiple regression results
multiple_manual_results <- data.table(
  Parameter = c("β₀ (Intercept)", "β₁ (Temperature)", "β₂ (Concentration)"),
  Estimate = c(round(beta0_multi, 4), round(beta1_multi, 4), round(beta2_multi, 4)),
  Std_Error = c(round(SE_beta0_multi, 4), round(SE_beta1_multi, 4), round(SE_beta2_multi, 4)),
  t_value = c(
    round(beta0_multi / SE_beta0_multi, 4),
    round(beta1_multi / SE_beta1_multi, 4),
    round(beta2_multi / SE_beta2_multi, 4)
  ),
  p_value = c(
    round(2 * (1 - pt(abs(beta0_multi / SE_beta0_multi), df_error_multi)), 4),
    round(2 * (1 - pt(abs(beta1_multi / SE_beta1_multi), df_error_multi)), 4),
    round(2 * (1 - pt(abs(beta2_multi / SE_beta2_multi), df_error_multi)), 4)
  )
)

print("Manual Multiple Regression Results:")
print(multiple_manual_results)

# Model summary statistics
model_stats <- data.table(
  Statistic = c("R²", "Adjusted R²", "Residual SE", "F-statistic"),
  Value = c(
    round(R_squared_multi, 4),
    round(R_squared_adj, 4),
    round(sqrt(MSE_multi), 4),
    round(MSR_multi / MSE_multi, 4)
  )
)

print("Model Summary Statistics:")
print(model_stats)

# Verify with R's lm function
lm_multi <- lm(yield ~ temperature + concentration, data = chemical_data)
print("R's lm() verification:")
print(summary(lm_multi))
Figure 4: Multiple regression visualization
# Multiple regression visualization
# Create 3D-like visualization using different perspectives

# Partial regression plots
p1 <- ggplot(data = chemical_data, aes(x = temperature, y = yield)) +
  geom_point(size = 3, alpha = 0.7, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(x = "Temperature", y = "Yield", title = "Yield vs Temperature") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

p2 <- ggplot(data = chemical_data, aes(x = concentration, y = yield)) +
  geom_point(size = 3, alpha = 0.7, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(x = "Concentration", y = "Yield", title = "Yield vs Concentration") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Fitted vs actual
fitted_multi <- as.vector(y_fitted)
p3 <- ggplot(
  data = data.table(fitted = fitted_multi, actual = y),
  aes(x = fitted, y = actual)
) +
  geom_point(size = 3, alpha = 0.7, color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Fitted Values", y = "Actual Values", title = "Fitted vs Actual") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Residuals vs fitted
residuals_multi_vec <- as.vector(residuals_multi)
p4 <- ggplot(
  data = data.table(fitted = fitted_multi, residuals = residuals_multi_vec),
  aes(x = fitted, y = residuals)
) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

1.3.3 Inferences in Multiple Regression

NoteHypothesis Testing in Multiple Regression

Overall Significance Test:

H₀: β₁ = β₂ = … = βₖ = 0 vs H₁: At least one βⱼ ≠ 0

F-Test:

F_0 = \frac{MS_R}{MS_E} = \frac{SS_R/k}{SS_E/(n-k-1)}

Individual Parameter Tests:

H₀: βⱼ = 0 vs H₁: βⱼ ≠ 0

t-Test:

t_0 = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}

ANOVA Table:

Source       SS     df      MS     F
Regression   SS_R   k       MS_R   MS_R/MS_E
Error        SS_E   n-k-1   MS_E
Total        SS_T   n-1

Adjusted R²:

R_{adj}^2 = 1 - \frac{SS_E/(n-k-1)}{SS_T/(n-1)} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}
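The two forms of adjusted R² above are algebraically identical, which is easy to confirm numerically. A quick base-R check (the sums of squares and sample size below are illustrative values, not taken from the chapter's example):

```r
# Numerical check of the overall F-statistic and the two equivalent
# forms of adjusted R^2 (SS values here are made up for illustration)
SS_R <- 80; SS_E <- 20; SS_T <- SS_R + SS_E
n <- 25; k <- 2

F0 <- (SS_R / k) / (SS_E / (n - k - 1)) # overall F-statistic
R2 <- SS_R / SS_T

R2_adj_1 <- 1 - (SS_E / (n - k - 1)) / (SS_T / (n - 1))
R2_adj_2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)

all.equal(R2_adj_1, R2_adj_2) # TRUE: both forms give the same value
```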

1.3.4 Example: Multiple Regression Inference

[1] "Multiple Regression Hypothesis Test Results:"
            Parameter Test_Type Null_Hypothesis Test_Statistic P_Value
               <char>    <char>          <char>          <num>   <num>
1:      Overall Model    F-test     β₁ = β₂ = 0       106.4279  0.0000
2:     β₀ (Intercept)    t-test          β₀ = 0         1.5635  0.1322
3:   β₁ (Temperature)    t-test          β₁ = 0        11.7400  0.0000
4: β₂ (Concentration)    t-test          β₂ = 0         9.4333  0.0000
            Decision                   Interpretation
              <char>                           <char>
1:         Reject H₀             Model is significant
2: Fail to reject H₀        Intercept not significant
3:         Reject H₀   Temperature effect significant
4:         Reject H₀ Concentration effect significant
[1] "Multiple Regression ANOVA Table:"
       Source      SS    df      MS F_Statistic P_Value
       <char>   <num> <num>   <num>       <num>   <num>
1: Regression 4608.48     2 2304.24    106.4279       0
2:      Error  476.32    22   21.65          NA      NA
3:      Total 5084.80    24      NA          NA      NA
[1] "95% Confidence Intervals for Parameters:"
            Parameter Estimate Lower_95CI Upper_95CI
               <char>    <num>      <num>      <num>
1:     β₀ (Intercept)  20.1051    -6.5627    46.7729
2:   β₁ (Temperature)   0.8205     0.6755     0.9654
3: β₂ (Concentration)   1.3355     1.0419     1.6291
# Multiple regression inference
alpha_multi <- 0.05
t_critical_multi <- qt(1 - alpha_multi / 2, df_error_multi)
F_critical_multi <- qf(1 - alpha_multi, df_regression, df_error_multi)

# Overall F-test
F_stat_multi <- MSR_multi / MSE_multi
p_value_F_multi <- 1 - pf(F_stat_multi, df_regression, df_error_multi)

# Individual t-tests
t_stats <- c(
  beta0_multi / SE_beta0_multi,
  beta1_multi / SE_beta1_multi,
  beta2_multi / SE_beta2_multi
)

p_values_t <- 2 * (1 - pt(abs(t_stats), df_error_multi))

# Hypothesis test results
hypothesis_tests_multi <- data.table(
  Parameter = c("Overall Model", "β₀ (Intercept)", "β₁ (Temperature)", "β₂ (Concentration)"),
  Test_Type = c("F-test", "t-test", "t-test", "t-test"),
  Null_Hypothesis = c("β₁ = β₂ = 0", "β₀ = 0", "β₁ = 0", "β₂ = 0"),
  Test_Statistic = c(round(F_stat_multi, 4), round(t_stats, 4)),
  P_Value = c(round(p_value_F_multi, 4), round(p_values_t, 4)),
  Decision = c(
    ifelse(p_value_F_multi < alpha_multi, "Reject H₀", "Fail to reject H₀"),
    ifelse(p_values_t < alpha_multi, "Reject H₀", "Fail to reject H₀")
  ),
  Interpretation = c(
    ifelse(p_value_F_multi < alpha_multi, "Model is significant", "Model not significant"),
    ifelse(p_values_t[1] < alpha_multi, "Intercept significant", "Intercept not significant"),
    ifelse(p_values_t[2] < alpha_multi, "Temperature effect significant", "Temperature not significant"),
    ifelse(p_values_t[3] < alpha_multi, "Concentration effect significant", "Concentration not significant")
  )
)

print("Multiple Regression Hypothesis Test Results:")
print(hypothesis_tests_multi)

# ANOVA table for multiple regression
anova_table_multi <- data.table(
  Source = c("Regression", "Error", "Total"),
  SS = c(round(SSR_multi, 2), round(SSE_multi, 2), round(SST_multi, 2)),
  df = c(df_regression, df_error_multi, df_total),
  MS = c(round(MSR_multi, 2), round(MSE_multi, 2), NA),
  F_Statistic = c(round(F_stat_multi, 4), NA, NA),
  P_Value = c(round(p_value_F_multi, 4), NA, NA)
)

print("Multiple Regression ANOVA Table:")
print(anova_table_multi)

# Confidence intervals for parameters
CI_multi <- data.table(
  Parameter = c("β₀ (Intercept)", "β₁ (Temperature)", "β₂ (Concentration)"),
  Estimate = c(beta0_multi, beta1_multi, beta2_multi),
  Lower_95CI = c(
    beta0_multi - t_critical_multi * SE_beta0_multi,
    beta1_multi - t_critical_multi * SE_beta1_multi,
    beta2_multi - t_critical_multi * SE_beta2_multi
  ),
  Upper_95CI = c(
    beta0_multi + t_critical_multi * SE_beta0_multi,
    beta1_multi + t_critical_multi * SE_beta1_multi,
    beta2_multi + t_critical_multi * SE_beta2_multi
  )
)

print("95% Confidence Intervals for Parameters:")
num_cols <- c("Estimate", "Lower_95CI", "Upper_95CI")
CI_multi[, (num_cols) := lapply(.SD, round, 4), .SDcols = num_cols] # round() fails on the character column
print(CI_multi)

1.3.5 Checking Model Adequacy

NoteMultiple Regression Diagnostics

Key Diagnostic Plots:

  1. Residuals vs. Fitted: Check linearity and constant variance

  2. Normal Q-Q Plot: Check normality of residuals

  3. Scale-Location: Check homoscedasticity

  4. Residuals vs. Leverage: Identify influential points

Multicollinearity Detection:

  • Variance Inflation Factor (VIF):

VIF_j = \frac{1}{1-R_j^2}

where R_j^2 is the R² from regressing x_j on the other predictors

  • Rule of thumb: VIF > 10 indicates serious multicollinearity

Influential Observations:

  • Leverage: h_{ii} (diagonal elements of hat matrix)

  • Cook’s Distance: Measures influence on fitted values

  • Studentized Residuals: Better for outlier detection
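The VIF definition above extends to any number of predictors. A minimal base-R sketch (the helper name `vif_manual` is ours, for illustration; packages such as car provide a ready-made vif() function):

```r
# Compute a VIF for each predictor by regressing that column of the
# model matrix on all the remaining predictor columns
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE] # drop the intercept column
  sapply(seq_len(ncol(X)), function(j) {
    R2_j <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - R2_j)
  })
}

# Example with a built-in data set; VIFs near 1 indicate little collinearity
fit <- lm(mpg ~ wt + hp, data = mtcars)
vif_manual(fit)
```

With exactly two predictors, both VIFs equal 1/(1 - r²) for their pairwise correlation r, so they coincide.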

1.3.6 Example: Model Diagnostics for Multiple Regression

[1] "Multiple Regression Diagnostic Summary:"
                    Diagnostic  Value Threshold            Concern
                        <char>  <num>    <char>             <char>
1: Max |Standardized Residual| 2.2105       2-3 Potential outliers
2:  Max |Studentized Residual| 2.3367       2-3 Potential outliers
3:                Max Leverage 0.1960    > 0.24                 OK
4:         Max Cook's Distance 0.2256         1                 OK
5:             VIF Temperature 1.0046        10                 OK
6:           VIF Concentration 1.0046        10                 OK
[1] "Flagged observations:"
   observation standardized_resid studentized_resid leverage cooks_distance
         <int>              <num>             <num>    <num>          <num>
1:           8             -1.862            -1.989    0.124          0.163
2:          11              2.211             2.337    0.105          0.191
3:          20              1.895             2.066    0.159          0.226
   outlier high_leverage influential
    <lgcl>        <lgcl>      <lgcl>
1:   FALSE         FALSE        TRUE
2:    TRUE         FALSE        TRUE
3:   FALSE         FALSE        TRUE
# Multiple regression diagnostics
# Calculate leverages (diagonal elements of hat matrix)
H <- X %*% solve(XtX) %*% t(X)
leverages <- diag(H)

# Standardized and studentized residuals
standardized_resid_multi <- residuals_multi_vec / sqrt(MSE_multi)
studentized_resid_multi <- residuals_multi_vec / sqrt(MSE_multi * (1 - leverages))

# Cook's distance (based on the internally studentized residuals)
cooks_distance <- (studentized_resid_multi^2 / (k + 1)) * (leverages / (1 - leverages))

# VIF calculations (for multicollinearity)
# VIF for temperature (regress temp on concentration)
temp_on_conc <- lm(temperature ~ concentration, data = chemical_data)
R2_temp <- summary(temp_on_conc)$r.squared
VIF_temp <- 1 / (1 - R2_temp)

# VIF for concentration (regress conc on temperature)
conc_on_temp <- lm(concentration ~ temperature, data = chemical_data)
R2_conc <- summary(conc_on_temp)$r.squared
VIF_conc <- 1 / (1 - R2_conc)

# Diagnostic summary
diagnostic_summary <- data.table(
  Diagnostic = c(
    "Max |Standardized Residual|", "Max |Studentized Residual|",
    "Max Leverage", "Max Cook's Distance", "VIF Temperature", "VIF Concentration"
  ),
  Value = c(
    round(max(abs(standardized_resid_multi)), 4),
    round(max(abs(studentized_resid_multi)), 4),
    round(max(leverages), 4),
    round(max(cooks_distance), 4),
    round(VIF_temp, 4),
    round(VIF_conc, 4)
  ),
  Threshold = c("2-3", "2-3", paste(">", round(2 * (k + 1) / n_multi, 3)), "1", "10", "10"),
  Concern = c(
    ifelse(max(abs(standardized_resid_multi)) > 2, "Potential outliers", "OK"),
    ifelse(max(abs(studentized_resid_multi)) > 2, "Potential outliers", "OK"),
    ifelse(max(leverages) > 2 * (k + 1) / n_multi, "High leverage points", "OK"),
    ifelse(max(cooks_distance) > 1, "Influential observations", "OK"),
    ifelse(VIF_temp > 10, "Serious multicollinearity", "OK"),
    ifelse(VIF_conc > 10, "Serious multicollinearity", "OK")
  )
)

print("Multiple Regression Diagnostic Summary:")
print(diagnostic_summary)

# Identify specific problematic observations
problem_obs <- data.table(
  observation = 1:n_multi,
  standardized_resid = round(standardized_resid_multi, 3),
  studentized_resid = round(studentized_resid_multi, 3),
  leverage = round(leverages, 3),
  cooks_distance = round(cooks_distance, 3)
)

# Flag potential problems
problem_obs[, outlier := abs(standardized_resid) > 2]
problem_obs[, high_leverage := leverage > 2 * (k + 1) / n_multi]
problem_obs[, influential := cooks_distance > 4 / n_multi] # 4/n cutoff is more sensitive than D > 1

flagged_obs <- problem_obs[outlier == TRUE | high_leverage == TRUE | influential == TRUE]

if (nrow(flagged_obs) > 0) {
  print("Flagged observations:")
  print(flagged_obs)
} else {
  print("No observations flagged for potential problems")
}
Figure 5: Multiple regression diagnostic plots
# Multiple regression diagnostic plots
multi_diagnostic_data <- data.table(
  fitted = fitted_multi,
  residuals = residuals_multi_vec,
  standardized_resid = standardized_resid_multi,
  studentized_resid = studentized_resid_multi,
  leverage = leverages,
  cooks_distance = cooks_distance,
  observation = 1:n_multi
)

# 1. Residuals vs Fitted
p1_multi <- ggplot(data = multi_diagnostic_data, aes(x = fitted, y = residuals)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 2. Normal Q-Q plot
p2_multi <- ggplot(data = multi_diagnostic_data, aes(sample = standardized_resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(
    x = "Theoretical Quantiles", y = "Standardized Residuals",
    title = "Normal Q-Q Plot"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 3. Scale-Location
p3_multi <- ggplot(data = multi_diagnostic_data, aes(x = fitted, y = sqrt(abs(standardized_resid)))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    x = "Fitted Values", y = "√|Standardized Residuals|",
    title = "Scale-Location"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 4. Residuals vs Leverage
p4_multi <- ggplot(data = multi_diagnostic_data, aes(x = leverage, y = studentized_resid)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  geom_vline(xintercept = 2 * (k + 1) / n_multi, linetype = "dashed", color = "blue") +
  labs(
    x = "Leverage", y = "Studentized Residuals",
    title = "Residuals vs Leverage"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1_multi, p2_multi, p3_multi, p4_multi, nrow = 2, ncol = 2)

1.4 Other Aspects of Regression

1.4.1 Polynomial Models

NotePolynomial Regression

When the relationship is curved, use polynomial models:

Second-Order (Quadratic):

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon

Third-Order (Cubic):

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon

Interpretation:

  • β₁: Linear effect

  • β₂: Quadratic effect (curvature)

  • β₃: Cubic effect (inflection)

Model Selection:

  • Start with linear model

  • Add higher-order terms if needed

  • Use hypothesis tests to determine necessary degree

  • Be cautious of overfitting
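One practical caution when adding higher-order terms: raw powers of x are often highly correlated with one another, which can make the coefficient estimates numerically unstable. R's poly() generates orthogonal polynomial terms that produce exactly the same fitted curve; a brief sketch with simulated data:

```r
# Raw vs orthogonal polynomial terms: same fit, different parameterization
set.seed(1)
x <- seq(1, 10, length.out = 30)
y <- 2 + 0.5 * x + 0.3 * x^2 + rnorm(30)

fit_raw  <- lm(y ~ x + I(x^2))  # raw powers (as used in this chapter)
fit_orth <- lm(y ~ poly(x, 2))  # orthogonal polynomials

cor(x, x^2)                                  # raw terms are strongly correlated
max(abs(fitted(fit_raw) - fitted(fit_orth))) # fitted values agree (near 0)
```

Because the fitted values are identical, measures such as R² and residuals are unchanged; only the individual coefficients (and their interpretation) differ.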

1.4.2 Example: Polynomial Regression

NoteExample

A process engineer studies how temperature affects reaction rate and suspects a nonlinear relationship:

[1] "Reaction Rate vs Temperature Data:"
    temperature      rate
          <num>     <num>
 1:         100  69.57229
 2:         110  66.01070
 3:         120  77.32496
 4:         130  82.34042
 5:         140  84.94795
 6:         150  88.67155
 7:         160  92.08906
 8:         170  97.41559
 9:         180  98.66312
10:         190 107.59609
# Polynomial regression example
set.seed(789)
reaction_data <- data.table(
  temperature = seq(100, 300, by = 10)
)

# Create nonlinear relationship
reaction_data[, rate := 5 + 0.8 * temperature - 0.002 * temperature^2 +
  0.000003 * temperature^3 + rnorm(21, mean = 0, sd = 3)]

print("Reaction Rate vs Temperature Data:")
print(head(reaction_data, 10))

# Write data to CSV
fwrite(reaction_data, "data/reaction_rate.csv")

Model Comparison:

  1. Linear: Rate = \beta_0 + \beta_1 \times Temperature + \epsilon

  2. Quadratic: Rate = \beta_0 + \beta_1 \times Temperature + \beta_2 \times Temperature^2 + \epsilon

  3. Cubic: Rate = \beta_0 + \beta_1 \times Temperature + \beta_2 \times Temperature^2 + \beta_3 \times Temperature^3 + \epsilon

[1] "Polynomial Model Comparison:"
(model comparison table: R², adjusted R², AIC, BIC, F-statistic, and p-value for the linear, quadratic, and cubic fits)
[1] "F-test: Linear vs Quadratic"
Analysis of Variance Table

Model 1: rate ~ temperature
Model 2: rate ~ temperature + I(temperature^2)
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     19 103.809                              
2     18  86.945  1    16.864 3.4914 0.07805 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "F-test: Quadratic vs Cubic"
Analysis of Variance Table

Model 1: rate ~ temperature + I(temperature^2)
Model 2: rate ~ temperature + I(temperature^2) + I(temperature^3)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     18 86.945                           
2     17 86.571  1   0.37395 0.0734 0.7897
[1] "Best model based on Adjusted R²: Quadratic"
[1] "Quadratic Model Summary:"

Call:
lm(formula = rate ~ temperature + I(temperature^2), data = reaction_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6671 -1.2600  0.1417  1.5729  3.1584 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      20.9024617  5.5825087   3.744  0.00148 ** 
temperature       0.4917544  0.0592276   8.303 1.44e-07 ***
I(temperature^2) -0.0002742  0.0001467  -1.869  0.07805 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.198 on 18 degrees of freedom
Multiple R-squared:  0.9923,    Adjusted R-squared:  0.9915 
F-statistic:  1165 on 2 and 18 DF,  p-value: < 2.2e-16
# Fit polynomial models of different degrees
# Linear model
lm_linear <- lm(rate ~ temperature, data = reaction_data)

# Quadratic model
lm_quad <- lm(rate ~ temperature + I(temperature^2), data = reaction_data)

# Cubic model
lm_cubic <- lm(rate ~ temperature + I(temperature^2) + I(temperature^3), data = reaction_data)

# Model comparison
model_comparison <- data.table(
  Model = c("Linear", "Quadratic", "Cubic"),
  Formula = c(
    "rate ~ temperature",
    "rate ~ temperature + temperature²",
    "rate ~ temperature + temperature² + temperature³"
  ),
  R_squared = c(
    summary(lm_linear)$r.squared,
    summary(lm_quad)$r.squared,
    summary(lm_cubic)$r.squared
  ),
  Adj_R_squared = c(
    summary(lm_linear)$adj.r.squared,
    summary(lm_quad)$adj.r.squared,
    summary(lm_cubic)$adj.r.squared
  ),
  AIC = c(AIC(lm_linear), AIC(lm_quad), AIC(lm_cubic)),
  BIC = c(BIC(lm_linear), BIC(lm_quad), BIC(lm_cubic)),
  F_statistic = c(
    summary(lm_linear)$fstatistic[1],
    summary(lm_quad)$fstatistic[1],
    summary(lm_cubic)$fstatistic[1]
  ),
  p_value = c(
    pf(summary(lm_linear)$fstatistic[1],
      summary(lm_linear)$fstatistic[2],
      summary(lm_linear)$fstatistic[3],
      lower.tail = FALSE
    ),
    pf(summary(lm_quad)$fstatistic[1],
      summary(lm_quad)$fstatistic[2],
      summary(lm_quad)$fstatistic[3],
      lower.tail = FALSE
    ),
    pf(summary(lm_cubic)$fstatistic[1],
      summary(lm_cubic)$fstatistic[2],
      summary(lm_cubic)$fstatistic[3],
      lower.tail = FALSE
    )
  )
)

print("Polynomial Model Comparison:")
num_cols <- setdiff(names(model_comparison), c("Model", "Formula"))
model_comparison[, (num_cols) := lapply(.SD, round, 4), .SDcols = num_cols] # round() fails on character columns
print(model_comparison)

# Sequential F-tests for additional terms
# Test if quadratic term improves linear model
anova_linear_quad <- anova(lm_linear, lm_quad)
print("F-test: Linear vs Quadratic")
print(anova_linear_quad)

# Test if cubic term improves quadratic model
anova_quad_cubic <- anova(lm_quad, lm_cubic)
print("F-test: Quadratic vs Cubic")
print(anova_quad_cubic)

# Detailed summary of best model (based on adj R²)
best_model_idx <- which.max(model_comparison$Adj_R_squared)
best_model_name <- model_comparison$Model[best_model_idx]
print(paste("Best model based on Adjusted R²:", best_model_name))

if (best_model_name == "Cubic") {
  print("Cubic Model Summary:")
  print(summary(lm_cubic))
} else if (best_model_name == "Quadratic") {
  print("Quadratic Model Summary:")
  print(summary(lm_quad))
} else {
  print("Linear Model Summary:")
  print(summary(lm_linear))
}
Figure 6: Polynomial regression models comparison
# Polynomial regression visualization
# Create prediction data for smooth curves
temp_pred <- seq(min(reaction_data$temperature), max(reaction_data$temperature), length.out = 100)
pred_data <- data.table(temperature = temp_pred)

# Predictions from each model
pred_linear <- predict(lm_linear, newdata = pred_data)
pred_quad <- predict(lm_quad, newdata = pred_data)
pred_cubic <- predict(lm_cubic, newdata = pred_data)

# Combine predictions
plot_data <- data.table(
  temperature = rep(temp_pred, 3),
  predicted_rate = c(pred_linear, pred_quad, pred_cubic),
  model = rep(c("Linear", "Quadratic", "Cubic"), each = length(temp_pred))
)

# Plot all models
ggplot() +
  geom_point(
    data = reaction_data, aes(x = temperature, y = rate),
    size = 3, alpha = 0.7, color = "black"
  ) +
  geom_line(
    data = plot_data, aes(x = temperature, y = predicted_rate, color = model),
    linewidth = 1.2
  ) +
  scale_color_manual(values = c("Linear" = "red", "Quadratic" = "blue", "Cubic" = "green")) +
  labs(
    x = "Temperature",
    y = "Reaction Rate",
    title = "Polynomial Regression Models Comparison",
    subtitle = "Comparing linear, quadratic, and cubic fits",
    color = "Model"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

1.4.3 Categorical Regressors

NoteDummy Variables for Categorical Predictors

When predictors are categorical, use dummy variables:

Binary Categorical Variable (2 levels):

For variable with levels A and B:

  • x = 0 for level A

  • x = 1 for level B

Multi-level Categorical Variable (k levels):

Use k-1 dummy variables:

x_1 = \begin{cases} 1 & \text{if level 2} \\ 0 & \text{otherwise} \end{cases}

x_2 = \begin{cases} 1 & \text{if level 3} \\ 0 & \text{otherwise} \end{cases}

\vdots

Interpretation:

  • Reference level: All dummy variables = 0

  • β₀: Mean response for reference level

  • βⱼ: Difference in mean response between level j+1 and reference level
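The dummy coding described above is exactly what R constructs automatically for factor variables, and model.matrix() makes it visible. A small sketch using R's default treatment coding:

```r
# Expand a 3-level factor into an intercept plus two dummy columns
# (treatment coding: the first level, "A", is the reference)
f <- factor(c("A", "B", "C", "B"))
model.matrix(~ f)
# Columns are (Intercept), fB, fC: fB = 1 only for level B, fC = 1 only for level C
# The reference level can be changed with relevel(f, ref = "C") if needed
```

With this coding, the intercept estimates the reference-level mean and each dummy coefficient estimates a difference from that reference.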

1.4.4 Example: Categorical Variables

NoteExample

A quality engineer studies how supplier and heat treatment affect strength:

[1] "Categorical Variables Example - First 15 observations:"
    observation supplier heat_treatment  strength
          <int>   <char>         <char>     <num>
 1:           1        A             No  96.20567
 2:           2        A            Yes 121.76799
 3:           3        A             No  98.99465
 4:           4        A            Yes 112.79896
 5:           5        A             No 106.48668
 6:           6        A            Yes 106.10137
 7:           7        A             No  95.80303
 8:           8        A            Yes 119.79145
 9:           9        A             No 102.14922
10:          10        A            Yes 117.88237
11:          11        A             No 100.10912
12:          12        A            Yes 115.25777
13:          13        A             No  92.78922
14:          14        A            Yes 130.48557
15:          15        A             No 103.01518
[1] "Summary by Supplier and Heat Treatment:"
   supplier heat_treatment     n mean_strength sd_strength min_strength
     <char>         <char> <int>         <num>       <num>        <num>
1:        A             No    10      99.89853    4.317355     92.78922
2:        A            Yes    10     117.80654    6.941354    106.10137
3:        B             No    10     111.01700    5.039307    100.85786
4:        B            Yes    10     125.45412    4.000820    117.65973
5:        C             No    10     102.42515    4.899730     94.05402
6:        C            Yes    10     131.78177    6.143025    119.69643
   max_strength
          <num>
1:     106.4867
2:     130.4856
3:     117.6057
4:     129.8654
5:     110.6014
6:     141.1987
# Categorical variables example
set.seed(101112)
strength_cat_data <- data.table(
  observation = 1:60,
  supplier = rep(c("A", "B", "C"), each = 20),
  heat_treatment = rep(c("No", "Yes"), 30)
)

# Create realistic effects
# Base strength: 100
# Supplier B: +10, Supplier C: +5 (vs A)
# Heat treatment: +15
# Some interaction: Supplier C with heat treatment gets extra +8
strength_cat_data[, strength := 100 +
  ifelse(supplier == "B", 10, 0) +
  ifelse(supplier == "C", 5, 0) +
  ifelse(heat_treatment == "Yes", 15, 0) +
  ifelse(supplier == "C" & heat_treatment == "Yes", 8, 0) +
  rnorm(60, mean = 0, sd = 5)]

print("Categorical Variables Example - First 15 observations:")
print(head(strength_cat_data, 15))

# Summary by groups
group_summary <- strength_cat_data %>%
  fgroup_by(supplier, heat_treatment) %>%
  fsummarise(
    n = fnobs(strength),
    mean_strength = fmean(strength),
    sd_strength = fsd(strength),
    min_strength = fmin(strength),
    max_strength = fmax(strength)
  )

print("Summary by Supplier and Heat Treatment:")
print(group_summary)

# Write data to CSV
fwrite(strength_cat_data, "data/strength_categorical.csv")

Model with Categorical Variables:

Strength = \beta_0 + \beta_1 \times Supplier_B + \beta_2 \times Supplier_C + \beta_3 \times HeatTreat_{Yes} + \epsilon

Interpretation:

  • β₀: Mean strength for Supplier A, no heat treatment

  • β₁: Effect of Supplier B vs. Supplier A

  • β₂: Effect of Supplier C vs. Supplier A

  • β₃: Effect of heat treatment (Yes vs. No)

[1] "Main Effects Model:"

Call:
lm(formula = strength ~ supplier + heat_treatment, data = strength_cat_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.0348  -4.0101   0.5405   3.7302  13.8116 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         98.569      1.596  61.766  < 2e-16 ***
supplierB            9.383      1.954   4.801 1.22e-05 ***
supplierC            8.251      1.954   4.222 8.99e-05 ***
heat_treatmentYes   20.567      1.596  12.888  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.181 on 56 degrees of freedom
Multiple R-squared:  0.7756,    Adjusted R-squared:  0.7636 
F-statistic: 64.52 on 3 and 56 DF,  p-value: < 2.2e-16
[1] "Model with Interaction:"

Call:
lm(formula = strength ~ supplier * heat_treatment, data = strength_cat_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0853  -2.9002   0.1333   3.1805  12.6790 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   99.899      1.683  59.354  < 2e-16 ***
supplierB                     11.118      2.380   4.671 2.03e-05 ***
supplierC                      2.527      2.380   1.061  0.29319    
heat_treatmentYes             17.908      2.380   7.524 5.82e-10 ***
supplierB:heat_treatmentYes   -3.471      3.366  -1.031  0.30709    
supplierC:heat_treatmentYes   11.449      3.366   3.401  0.00127 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.322 on 54 degrees of freedom
Multiple R-squared:  0.8395,    Adjusted R-squared:  0.8247 
F-statistic: 56.51 on 5 and 54 DF,  p-value: < 2.2e-16
[1] "Manual Dummy Variable Model (should match main effects):"

Call:
lm(formula = strength ~ supplier_B + supplier_C + heat_yes, data = strength_cat_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.0348  -4.0101   0.5405   3.7302  13.8116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   98.569      1.596  61.766  < 2e-16 ***
supplier_B     9.383      1.954   4.801 1.22e-05 ***
supplier_C     8.251      1.954   4.222 8.99e-05 ***
heat_yes      20.567      1.596  12.888  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.181 on 56 degrees of freedom
Multiple R-squared:  0.7756,    Adjusted R-squared:  0.7636 
F-statistic: 64.52 on 3 and 56 DF,  p-value: < 2.2e-16
[1] "Coefficient Interpretation:"
      Coefficient Value                                  Interpretation
           <char> <num>                                          <char>
1:      Intercept 98.57 Mean strength for Supplier A, no heat treatment
2:     Supplier B  9.38         Additional strength for Supplier B vs A
3:     Supplier C  8.25         Additional strength for Supplier C vs A
4: Heat Treatment 20.57         Additional strength from heat treatment
[1] "F-test for overall supplier effect:"
Analysis of Variance Table

Model 1: strength ~ heat_treatment
Model 2: strength ~ supplier + heat_treatment
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     58 3188.6                                  
2     56 2139.2  2    1049.3 13.735 1.401e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "F-test for interaction effect:"
Analysis of Variance Table

Model 1: strength ~ supplier + heat_treatment
Model 2: strength ~ supplier * heat_treatment
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     56 2139.2                                  
2     54 1529.7  2    609.52 10.758 0.0001168 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Regression with categorical variables
# R automatically creates dummy variables with factor variables
strength_cat_data[, supplier := factor(supplier)]
strength_cat_data[, heat_treatment := factor(heat_treatment)]

# Main effects model
lm_cat_main <- lm(strength ~ supplier + heat_treatment, data = strength_cat_data)

# Model with interaction
lm_cat_interact <- lm(strength ~ supplier * heat_treatment, data = strength_cat_data)

print("Main Effects Model:")
print(summary(lm_cat_main))

print("Model with Interaction:")
print(summary(lm_cat_interact))

# Manual dummy variable creation for illustration
strength_cat_data[, supplier_B := ifelse(supplier == "B", 1, 0)]
strength_cat_data[, supplier_C := ifelse(supplier == "C", 1, 0)]
strength_cat_data[, heat_yes := ifelse(heat_treatment == "Yes", 1, 0)]

# Manual model with dummy variables
lm_manual_dummy <- lm(strength ~ supplier_B + supplier_C + heat_yes, data = strength_cat_data)

print("Manual Dummy Variable Model (should match main effects):")
print(summary(lm_manual_dummy))

# Interpret coefficients
coef_interpretation <- data.table(
  Coefficient = c("Intercept", "Supplier B", "Supplier C", "Heat Treatment"),
  Value = round(coef(lm_cat_main), 2),
  Interpretation = c(
    "Mean strength for Supplier A, no heat treatment",
    "Additional strength for Supplier B vs A",
    "Additional strength for Supplier C vs A",
    "Additional strength from heat treatment"
  )
)

print("Coefficient Interpretation:")
print(coef_interpretation)

# Test significance of categorical variables
# Test overall supplier effect
anova_supplier <- anova(lm(strength ~ heat_treatment, data = strength_cat_data), lm_cat_main)
print("F-test for overall supplier effect:")
print(anova_supplier)

# Test interaction effect
anova_interaction <- anova(lm_cat_main, lm_cat_interact)
print("F-test for interaction effect:")
print(anova_interaction)
[1] "Model Statistics Comparison:"
(model comparison table: R², adjusted R², AIC, and F-statistic for the main-effects and interaction models)
Figure 7: Categorical variables in regression
# Visualization of categorical variables
# Box plots by groups
p1_cat <- ggplot(data = strength_cat_data, aes(x = supplier, y = strength, fill = heat_treatment)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    x = "Supplier", y = "Strength", fill = "Heat Treatment",
    title = "Strength by Supplier and Heat Treatment"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Interaction plot
interaction_means <- strength_cat_data %>%
  fgroup_by(supplier, heat_treatment) %>%
  fsummarise(mean_strength = fmean(strength))

p2_cat <- ggplot(data = interaction_means, aes(
  x = supplier, y = mean_strength,
  color = heat_treatment, group = heat_treatment
)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 4) +
  labs(
    x = "Supplier", y = "Mean Strength", color = "Heat Treatment",
    title = "Interaction Plot"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Residuals from interaction model
resid_interact <- residuals(lm_cat_interact)
fitted_interact <- fitted(lm_cat_interact)

p3_cat <- ggplot(
  data = data.table(fitted = fitted_interact, residuals = resid_interact),
  aes(x = fitted, y = residuals)
) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Model comparison table
model_stats_cat <- data.table(
  Model = c("Main Effects", "With Interaction"),
  R_squared = c(summary(lm_cat_main)$r.squared, summary(lm_cat_interact)$r.squared),
  Adj_R_squared = c(summary(lm_cat_main)$adj.r.squared, summary(lm_cat_interact)$adj.r.squared),
  AIC = c(AIC(lm_cat_main), AIC(lm_cat_interact)),
  F_statistic = c(summary(lm_cat_main)$fstatistic[1], summary(lm_cat_interact)$fstatistic[1])
)

print("Model Statistics Comparison:")
# round() fails on the character Model column, so round the numeric columns only
num_cols <- c("R_squared", "Adj_R_squared", "AIC", "F_statistic")
model_stats_cat[, (num_cols) := lapply(.SD, round, 3), .SDcols = num_cols]
print(model_stats_cat)

grid.arrange(p1_cat, p2_cat, p3_cat, nrow = 2, ncol = 2)

1.4.5 Variable Selection Techniques

NoteModel Selection Methods

Forward Selection:

  1. Start with no variables

  2. Add variables one by one

  3. Stop when no candidate variable improves the criterion

Backward Elimination:

  1. Start with all variables

  2. Remove variables one by one

  3. Stop when all remaining are significant

Stepwise Selection:

  1. Combine forward and backward

  2. Can add or remove at each step

  3. More flexible approach

Selection Criteria:

  • AIC (Akaike Information Criterion): AIC = n \ln(SS_E/n) + 2(k+1)

  • BIC (Bayesian Information Criterion): BIC = n \ln(SS_E/n) + (k+1)\ln(n)

  • Adjusted R²: Penalizes for additional variables

  • Mallows’ Cp: C_p = \frac{SS_E}{MS_E} + 2(k+1) - n

Best Practices:

  • Use multiple criteria

  • Cross-validate final model

  • Consider subject matter knowledge

  • Avoid overfitting
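As a concrete illustration of these criteria, the sketch below computes AIC, BIC, and Mallows' Cp by hand for a candidate model and checks the AIC against R's built-in. The data and models are simulated purely for illustration. Note that the AIC and BIC formulas above drop additive constants, so the hand-computed values differ from `AIC()`/`BIC()` by the same amount for every model fit to the same data; model rankings are unaffected.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 + rnorm(40)

full <- lm(y ~ x1 + x2, data = d) # full model supplies MS_E for Mallows' Cp
cand <- lm(y ~ x1, data = d)      # candidate model with k = 1 predictor

n        <- nrow(d)
k        <- length(coef(cand)) - 1
SSE      <- sum(residuals(cand)^2)
MSE_full <- sum(residuals(full)^2) / df.residual(full)

aic_manual <- n * log(SSE / n) + 2 * (k + 1)      # AIC (constants dropped)
bic_manual <- n * log(SSE / n) + (k + 1) * log(n) # BIC (constants dropped)
cp_manual  <- SSE / MSE_full + 2 * (k + 1) - n    # Mallows' Cp

c(AIC = aic_manual, BIC = bic_manual, Cp = cp_manual)
```

Because only constants are dropped, differences in the hand-computed AIC between two models agree exactly with differences in `AIC()`.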

1.4.6 Example: Variable Selection

NoteExample

An engineer has multiple potential predictors for product quality and wants to identify the most important ones:

[1] "Quality Prediction Data Summary:"
[1] "Variable Summary:"
[1] "Correlation Matrix:"
              temperature pressure flow_rate     pH catalyst_conc   time
temperature         1.000   -0.143     0.075  0.022         0.142  0.023
pressure           -0.143    1.000    -0.035  0.369        -0.114 -0.172
flow_rate           0.075   -0.035     1.000  0.370         0.084 -0.061
pH                  0.022    0.369     0.370  1.000         0.121 -0.140
catalyst_conc       0.142   -0.114     0.084  0.121         1.000  0.071
time                0.023   -0.172    -0.061 -0.140         0.071  1.000
quality             0.771    0.284     0.157  0.381         0.282 -0.017
              quality
temperature     0.771
pressure        0.284
flow_rate       0.157
pH              0.381
catalyst_conc   0.282
time           -0.017
quality         1.000
# Variable selection example
set.seed(131415)
quality_data <- data.table(
  observation = 1:50,
  temperature = rnorm(50, mean = 200, sd = 20),
  pressure = rnorm(50, mean = 10, sd = 2),
  flow_rate = rnorm(50, mean = 5, sd = 1),
  pH = rnorm(50, mean = 7, sd = 0.5),
  catalyst_conc = rnorm(50, mean = 0.1, sd = 0.02),
  time = rnorm(50, mean = 60, sd = 10),
  quality = NULL
)

# Create realistic relationship where only some variables matter
# Important: temperature, pressure, catalyst_conc
# Less important: pH
# Not important: flow_rate, time
quality_data[, quality := 50 +
  0.3 * temperature +
  2.0 * pressure +
  100 * catalyst_conc +
  3.0 * pH +
  0.1 * flow_rate + # weak effect
  0.05 * time + # very weak effect
  rnorm(50, mean = 0, sd = 3)]

print("Quality Prediction Data Summary:")
predictors <- c("temperature", "pressure", "flow_rate", "pH", "catalyst_conc", "time")
summary_stats <- quality_data[, lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))),
  .SDcols = c(predictors, "quality")
]

print("Variable Summary:")
for (var in names(summary_stats)) {
  cat(sprintf(
    "%s: Mean = %.3f, SD = %.3f\n",
    var, summary_stats[[var]]$mean, summary_stats[[var]]$sd
  ))
}

# Correlation matrix
cor_matrix_quality <- cor(quality_data[, .SD, .SDcols = c(predictors, "quality")])
print("Correlation Matrix:")
print(round(cor_matrix_quality, 3))

# Write data to CSV
fwrite(quality_data, "data/quality_predictors.csv")

Available Predictors:

  • Temperature, Pressure, Flow_Rate, pH, Catalyst_Conc, Time

Selection Process:

  1. All Possible Models: Evaluate all 2^k possible models

  2. Stepwise Selection: Use automated procedures

  3. Criterion-Based: Compare AIC, BIC, Cp values
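With few predictors the all-possible-models step is cheap (here 2^6 = 64 fits). The sketch below enumerates every subset on a small simulated stand-in for the quality data (three predictors to keep the output short; the variable names merely mirror the example) and ranks the fits by AIC.

```r
set.seed(131415)
d <- data.frame(
  temperature = rnorm(50, mean = 200, sd = 20),
  pressure    = rnorm(50, mean = 10, sd = 2),
  flow_rate   = rnorm(50, mean = 5, sd = 1)
)
d$quality <- 50 + 0.3 * d$temperature + 2.0 * d$pressure + rnorm(50, sd = 3)

preds <- c("temperature", "pressure", "flow_rate")

# Every subset of the predictors, including the intercept-only model
subsets <- unlist(
  lapply(0:length(preds), function(m) combn(preds, m, simplify = FALSE)),
  recursive = FALSE
)

all_models <- data.frame(
  model = vapply(subsets, function(s) {
    if (length(s) == 0) "(intercept only)" else paste(s, collapse = " + ")
  }, character(1)),
  AIC = vapply(subsets, function(s) {
    rhs <- if (length(s) == 0) "1" else s
    AIC(lm(reformulate(rhs, response = "quality"), data = d))
  }, numeric(1))
)

all_models[order(all_models$AIC), ]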

[1] "Model Selection Summary:"
[1] "Final Selected Model Summary:"

Call:
lm(formula = quality ~ temperature + pressure + pH + catalyst_conc, 
    data = quality_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5183 -2.0431 -0.2991  2.2480  6.2915 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    64.4306     7.9554   8.099 2.47e-10 ***
temperature     0.2677     0.0208  12.872  < 2e-16 ***
pressure        1.3271     0.2602   5.100 6.58e-06 ***
pH              3.5314     1.0675   3.308  0.00185 ** 
catalyst_conc  78.7116    26.5539   2.964  0.00484 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.074 on 45 degrees of freedom
Multiple R-squared:  0.838, Adjusted R-squared:  0.8236 
F-statistic: 58.19 on 4 and 45 DF,  p-value: < 2.2e-16
[1] "Selected Variables:"
[1] "temperature"   "pressure"      "pH"            "catalyst_conc"
# Variable selection procedures

# Full model with all predictors
lm_full <- lm(quality ~ temperature + pressure + flow_rate + pH + catalyst_conc + time,
  data = quality_data
)

# Stepwise selection using AIC
step_aic <- step(lm_full, direction = "both", trace = FALSE)

# Forward selection starting from intercept only
lm_intercept <- lm(quality ~ 1, data = quality_data)
step_forward <- step(lm_intercept,
  scope = list(lower = lm_intercept, upper = lm_full),
  direction = "forward", trace = FALSE
)

# Backward elimination starting from full model
step_backward <- step(lm_full, direction = "backward", trace = FALSE)

# Compare selected models
model_selection_summary <- data.table(
  Method = c("Full Model", "Stepwise AIC", "Forward AIC", "Backward AIC"),
  R_squared = c(
    summary(lm_full)$r.squared,
    summary(step_aic)$r.squared,
    summary(step_forward)$r.squared,
    summary(step_backward)$r.squared
  ),
  Adj_R_squared = c(
    summary(lm_full)$adj.r.squared,
    summary(step_aic)$adj.r.squared,
    summary(step_forward)$adj.r.squared,
    summary(step_backward)$adj.r.squared
  ),
  AIC = c(AIC(lm_full), AIC(step_aic), AIC(step_forward), AIC(step_backward)),
  BIC = c(BIC(lm_full), BIC(step_aic), BIC(step_forward), BIC(step_backward)),
  n_parameters = c(
    length(coef(lm_full)),
    length(coef(step_aic)),
    length(coef(step_forward)),
    length(coef(step_backward))
  )
)

print("Model Selection Summary:")
# round() fails on the character Method column; round a copy's numeric columns only
selection_rounded <- copy(model_selection_summary)
num_cols <- c("R_squared", "Adj_R_squared", "AIC", "BIC")
selection_rounded[, (num_cols) := lapply(.SD, round, 4), .SDcols = num_cols]
print(selection_rounded)

# Final selected model (based on AIC stepwise)
print("Final Selected Model Summary:")
print(summary(step_aic))

# Extract selected variables
selected_vars <- names(coef(step_aic))[-1] # Remove intercept
print("Selected Variables:")
print(selected_vars)
Figure 8: Variable selection results
Figure 9: Variable selection results
# Variable selection visualization

# Model comparison plot
model_comp_data <- model_selection_summary[, .(Method, AIC, BIC, Adj_R_squared)]
model_comp_long <- melt(model_comp_data,
  id.vars = "Method",
  measure.vars = c("AIC", "BIC"),
  variable.name = "Criterion", value.name = "Value"
)

p1_var <- ggplot(data = model_comp_long, aes(x = Method, y = Value, fill = Criterion)) +
  geom_col(position = "dodge", alpha = 0.7) +
  labs(
    x = "Model Selection Method", y = "Criterion Value",
    title = "Model Selection Criteria", fill = "Criterion"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Adjusted R² comparison
p2_var <- ggplot(data = model_selection_summary, aes(x = reorder(Method, -Adj_R_squared), y = Adj_R_squared)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = round(Adj_R_squared, 3)), vjust = -0.5) +
  labs(
    x = "Model Selection Method", y = "Adjusted R²",
    title = "Model Performance Comparison"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Final model coefficients
final_coefs <- coef(step_aic)[-1] # Remove intercept
coef_data <- data.table(
  variable = names(final_coefs),
  coefficient = as.numeric(final_coefs)
)

p3_var <- ggplot(data = coef_data, aes(x = reorder(variable, abs(coefficient)), y = coefficient)) +
  geom_col(fill = "orange", alpha = 0.7) +
  coord_flip() +
  labs(
    x = "Variable", y = "Coefficient",
    title = "Final Model Coefficients"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Correlation plot of selected variables
# Note: corrplot() draws to the graphics device directly and does not return
# a ggplot object, so it is displayed on its own and excluded from
# grid.arrange() below
selected_cor <- cor_matrix_quality[c(selected_vars, "quality"), c(selected_vars, "quality")]

corrplot(selected_cor,
  method = "color", type = "upper",
  order = "hclust", tl.cex = 0.8, tl.col = "black"
)

grid.arrange(p1_var, p2_var, p3_var, nrow = 2, ncol = 2)

1.5 Comprehensive Example: Complete Regression Analysis

NoteComprehensive Example

A manufacturing engineer conducts a complete regression analysis to optimize a production process:

[1] "Production Optimization Data Summary:"
       n temp_mean  temp_sd pressure_mean pressure_sd time_mean  time_sd
   <int>     <num>    <num>         <num>       <num>     <num>    <num>
1:    40  182.9235 13.28839      7.768272    1.368849  44.61594 6.506526
   yield_mean yield_sd
        <num>    <num>
1:   153.0184 7.917538
[1] "Yield by Operator:"
   operator     n mean_yield sd_yield
     <char> <int>      <num>    <num>
1:        A    12   153.8781 6.764170
2:        B    13   153.7998 8.222506
3:        C    15   151.6535 8.805235
# Comprehensive regression analysis example
set.seed(161718)
production_data <- data.table(
  batch = 1:40,
  temp = rnorm(40, mean = 180, sd = 15),
  pressure = rnorm(40, mean = 8, sd = 1.5),
  time = rnorm(40, mean = 45, sd = 8),
  operator = sample(c("A", "B", "C"), 40, replace = TRUE),
  yield = NULL
)

# Complex relationship with interactions and quadratic terms
production_data[, yield := 60 +
  0.5 * temp +
  3.0 * pressure +
  0.3 * time +
  0.01 * temp * pressure + # interaction
  -0.0015 * temp^2 + # quadratic term
  ifelse(operator == "B", 5, 0) + # operator effect
  ifelse(operator == "C", -2, 0) +
  rnorm(40, mean = 0, sd = 4)]

print("Production Optimization Data Summary:")
prod_summary <- production_data %>%
  fsummarise(
    n = fnobs(batch),
    temp_mean = fmean(temp),
    temp_sd = fsd(temp),
    pressure_mean = fmean(pressure),
    pressure_sd = fsd(pressure),
    time_mean = fmean(time),
    time_sd = fsd(time),
    yield_mean = fmean(yield),
    yield_sd = fsd(yield)
  )
print(prod_summary)

# Operator summary
operator_summary <- production_data %>%
  fgroup_by(operator) %>%
  fsummarise(
    n = fnobs(yield),
    mean_yield = fmean(yield),
    sd_yield = fsd(yield)
  )
print("Yield by Operator:")
print(operator_summary)

# Write data to CSV
fwrite(production_data, "data/production_optimization.csv")

Complete Analysis Steps:

  1. Exploratory Data Analysis

  2. Simple Linear Regression

  3. Multiple Regression

  4. Polynomial Terms

  5. Model Diagnostics

  6. Variable Selection

  7. Final Model Validation

[1] "=== COMPREHENSIVE REGRESSION ANALYSIS ==="
[1] "\n--- STEP 1: EXPLORATORY DATA ANALYSIS ---"
[1] "Correlation Matrix (numeric variables):"
          temp pressure  time yield
temp     1.000    0.202 0.326 0.256
pressure 0.202    1.000 0.081 0.865
time     0.326    0.081 1.000 0.193
yield    0.256    0.865 0.193 1.000
[1] "\n--- STEP 2: SIMPLE LINEAR REGRESSIONS ---"
[1] "Simple Linear Regression Results:"
[1] "\n--- STEP 3: MULTIPLE REGRESSION - MAIN EFFECTS ---"

Call:
lm(formula = yield ~ temp + pressure + time + operator, data = production_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9510 -1.9880 -0.6332  1.6444  7.3494 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.92275    8.32329  12.125 6.75e-14 ***
temp          0.02179    0.04351   0.501   0.6198    
pressure      5.16863    0.40476  12.770 1.58e-14 ***
time          0.18107    0.09638   1.879   0.0689 .  
operatorB     2.70940    1.50947   1.795   0.0816 .  
operatorC    -2.66870    1.39023  -1.920   0.0633 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.339 on 34 degrees of freedom
Multiple R-squared:  0.845, Adjusted R-squared:  0.8222 
F-statistic: 37.06 on 5 and 34 DF,  p-value: 7.99e-13
[1] "\n--- STEP 4: POLYNOMIAL TERMS ---"
[1] "Model with quadratic terms:"

Call:
lm(formula = yield ~ temp + pressure + time + operator + I(temp^2) + 
    I(pressure^2), data = production_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8800 -2.3948 -0.3778  1.8195  6.9391 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    1.016e+02  9.148e+01   1.111   0.2750  
temp           1.551e-01  9.902e-01   0.157   0.8765  
pressure       1.953e+00  4.266e+00   0.458   0.6501  
time           1.820e-01  9.871e-02   1.844   0.0744 .
operatorB      2.769e+00  1.608e+00   1.722   0.0947 .
operatorC     -2.831e+00  1.454e+00  -1.948   0.0603 .
I(temp^2)     -3.837e-04  2.745e-03  -0.140   0.8897  
I(pressure^2)  2.063e-01  2.720e-01   0.758   0.4539  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.41 on 32 degrees of freedom
Multiple R-squared:  0.8478,    Adjusted R-squared:  0.8145 
F-statistic: 25.46 on 7 and 32 DF,  p-value: 2.249e-11
[1] "F-test for polynomial terms:"
Analysis of Variance Table

Model 1: yield ~ temp + pressure + time + operator
Model 2: yield ~ temp + pressure + time + operator + I(temp^2) + I(pressure^2)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     34 379.04                           
2     32 372.12  2    6.9196 0.2975 0.7447
[1] "\n--- STEP 5: INTERACTION TERMS ---"
[1] "Model with interaction:"

Call:
lm(formula = yield ~ temp + pressure + time + operator + I(temp^2) + 
    temp:pressure, data = production_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0145 -1.9143 -0.6736  1.6542  7.3233 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)   88.1805758 90.6278053   0.973   0.3379  
temp           0.1572700  1.0375790   0.152   0.8805  
pressure       5.3005491  7.3022345   0.726   0.4732  
time           0.1820613  0.0996318   1.827   0.0770 .
operatorB      2.7710622  1.6318944   1.698   0.0992 .
operatorC     -2.6475067  1.4768318  -1.793   0.0825 .
I(temp^2)     -0.0003614  0.0032065  -0.113   0.9110  
temp:pressure -0.0006904  0.0393592  -0.018   0.9861  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.441 on 32 degrees of freedom
Multiple R-squared:  0.8451,    Adjusted R-squared:  0.8112 
F-statistic: 24.93 on 7 and 32 DF,  p-value: 2.967e-11
[1] "\n--- STEP 6: MODEL SELECTION ---"
[1] "Model Comparison:"
[1] "Best model by AIC: Main Effects"
[1] "\n--- STEP 7: FINAL MODEL DIAGNOSTICS ---"
[1] "Final Model Diagnostic Summary:"
# Comprehensive regression analysis steps

print("=== COMPREHENSIVE REGRESSION ANALYSIS ===")

# Step 1: Exploratory Data Analysis
print("\n--- STEP 1: EXPLORATORY DATA ANALYSIS ---")
cor_matrix_prod <- cor(production_data[, .(temp, pressure, time, yield)])
print("Correlation Matrix (numeric variables):")
print(round(cor_matrix_prod, 3))

# Step 2: Simple linear regressions
print("\n--- STEP 2: SIMPLE LINEAR REGRESSIONS ---")
lm_temp <- lm(yield ~ temp, data = production_data)
lm_pressure <- lm(yield ~ pressure, data = production_data)
lm_time <- lm(yield ~ time, data = production_data)

simple_models <- data.table(
  Predictor = c("Temperature", "Pressure", "Time"),
  R_squared = c(
    summary(lm_temp)$r.squared,
    summary(lm_pressure)$r.squared,
    summary(lm_time)$r.squared
  ),
  F_statistic = c(
    summary(lm_temp)$fstatistic[1],
    summary(lm_pressure)$fstatistic[1],
    summary(lm_time)$fstatistic[1]
  ),
  p_value = c(
    pf(summary(lm_temp)$fstatistic[1], 1, 38, lower.tail = FALSE),
    pf(summary(lm_pressure)$fstatistic[1], 1, 38, lower.tail = FALSE),
    pf(summary(lm_time)$fstatistic[1], 1, 38, lower.tail = FALSE)
  )
)
print("Simple Linear Regression Results:")
# round() fails on the character Predictor column, so round the numeric columns only
num_cols <- c("R_squared", "F_statistic", "p_value")
simple_models[, (num_cols) := lapply(.SD, round, 4), .SDcols = num_cols]
print(simple_models)

# Step 3: Multiple regression with main effects
print("\n--- STEP 3: MULTIPLE REGRESSION - MAIN EFFECTS ---")
production_data[, operator := factor(operator)]
lm_main <- lm(yield ~ temp + pressure + time + operator, data = production_data)
print(summary(lm_main))

# Step 4: Add polynomial terms
print("\n--- STEP 4: POLYNOMIAL TERMS ---")
lm_poly <- lm(yield ~ temp + pressure + time + operator + I(temp^2) + I(pressure^2),
  data = production_data
)
print("Model with quadratic terms:")
print(summary(lm_poly))

# Test if polynomial terms are needed
anova_poly <- anova(lm_main, lm_poly)
print("F-test for polynomial terms:")
print(anova_poly)

# Step 5: Add interaction terms
print("\n--- STEP 5: INTERACTION TERMS ---")
lm_interact <- lm(yield ~ temp + pressure + time + operator + I(temp^2) + temp:pressure,
  data = production_data
)
print("Model with interaction:")
print(summary(lm_interact))

# Test whether the added quadratic and interaction terms improve on the
# main-effects model. (lm_poly and lm_interact are not nested in each other,
# so they cannot be compared directly with anova().)
anova_interact <- anova(lm_main, lm_interact)
print("F-test for quadratic and interaction terms:")
print(anova_interact)

# Step 6: Model selection
print("\n--- STEP 6: MODEL SELECTION ---")
comprehensive_models <- data.table(
  Model = c("Main Effects", "Polynomial", "With Interaction"),
  R_squared = c(
    summary(lm_main)$r.squared,
    summary(lm_poly)$r.squared,
    summary(lm_interact)$r.squared
  ),
  Adj_R_squared = c(
    summary(lm_main)$adj.r.squared,
    summary(lm_poly)$adj.r.squared,
    summary(lm_interact)$adj.r.squared
  ),
  AIC = c(AIC(lm_main), AIC(lm_poly), AIC(lm_interact)),
  BIC = c(BIC(lm_main), BIC(lm_poly), BIC(lm_interact)),
  RMSE = c(
    sqrt(mean(residuals(lm_main)^2)),
    sqrt(mean(residuals(lm_poly)^2)),
    sqrt(mean(residuals(lm_interact)^2))
  )
)

print("Model Comparison:")
# round() fails on the character Model column; round a copy's numeric columns only
comparison_rounded <- copy(comprehensive_models)
num_cols <- c("R_squared", "Adj_R_squared", "AIC", "BIC", "RMSE")
comparison_rounded[, (num_cols) := lapply(.SD, round, 4), .SDcols = num_cols]
print(comparison_rounded)

# Select best model (lowest AIC)
best_model_comp <- which.min(comprehensive_models$AIC)
print(paste("Best model by AIC:", comprehensive_models$Model[best_model_comp]))

# Step 7: Final model diagnostics
print("\n--- STEP 7: FINAL MODEL DIAGNOSTICS ---")
final_model <- list(lm_main, lm_poly, lm_interact)[[best_model_comp]] # lowest-AIC model

# Basic diagnostics
residuals_final <- residuals(final_model)
fitted_final <- fitted(final_model)
n_final <- nrow(production_data)
k_final <- length(coef(final_model)) - 1

diagnostic_final <- data.table(
  observation = 1:n_final,
  residuals = residuals_final,
  fitted = fitted_final,
  standardized_resid = rstandard(final_model) # internally studentized residuals
)

print("Final Model Diagnostic Summary:")
# print() cannot combine a label with a value; use cat() instead
cat("Residual standard error:", round(summary(final_model)$sigma, 3), "\n")
cat("Multiple R-squared:", round(summary(final_model)$r.squared, 3), "\n")
cat("Adjusted R-squared:", round(summary(final_model)$adj.r.squared, 3), "\n")
Figure 10: Comprehensive regression analysis dashboard
# Comprehensive regression analysis dashboard

# 1. Pairwise scatter plots
p1_comp <- ggplot(data = production_data, aes(x = temp, y = yield, color = operator)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Temperature", y = "Yield", title = "Yield vs Temperature") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

p2_comp <- ggplot(data = production_data, aes(x = pressure, y = yield, color = operator)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Pressure", y = "Yield", title = "Yield vs Pressure") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 2. Model comparison plot
model_comp_plot <- ggplot(
  data = comprehensive_models,
  aes(x = reorder(Model, -Adj_R_squared), y = Adj_R_squared)
) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = round(Adj_R_squared, 3)), vjust = -0.5) +
  labs(x = "Model", y = "Adjusted R²", title = "Model Comparison") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 3. Final model diagnostics
p3_comp <- ggplot(data = diagnostic_final, aes(x = fitted, y = residuals)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE, color = "blue") +
  labs(x = "Fitted Values", y = "Residuals", title = "Final Model Diagnostics") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# 4. Actual vs Predicted
p4_comp <- ggplot(
  data = data.table(actual = production_data$yield, predicted = fitted_final),
  aes(x = predicted, y = actual)
) +
  geom_point(size = 3, alpha = 0.7, color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Predicted Yield", y = "Actual Yield", title = "Actual vs Predicted") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1_comp, p2_comp, model_comp_plot, p3_comp, nrow = 2, ncol = 2)

1.6 Key Formulas and Decision Framework

NoteEssential Formulas Summary

Simple Linear Regression:

\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Multiple Regression:

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Coefficient of Determination:

R^2 = \frac{SS_R}{SS_T}, \quad R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}

Prediction Interval:

\hat{y}_0 \pm t_{\alpha/2, n-k-1} \sqrt{MS_E\left(1 + \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0\right)}

F-Test for Overall Significance:

F_0 = \frac{MS_R}{MS_E} = \frac{SS_R/k}{SS_E/(n-k-1)}

t-Test for Individual Parameters:

t_0 = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}
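These formulas are easy to verify numerically. The sketch below, on simulated data (not one of the chapter's datasets), computes the simple-linear-regression estimates from Sxy/Sxx, then the matrix form (X'X)⁻¹X'y, and checks both against lm(); it then rebuilds a 95% prediction interval from the formula above and compares it with predict().

```r
set.seed(1)
x <- runif(25, 0, 10)
y <- 2 + 1.5 * x + rnorm(25)

# Scalar form for simple linear regression
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1  <- Sxy / Sxx
b0  <- mean(y) - b1 * mean(x)

# Matrix form of the same estimate: beta-hat = (X'X)^{-1} X'y
X        <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x)
rbind(scalar = c(b0, b1), matrix = as.numeric(beta_hat), lm = unname(coef(fit)))

# 95% prediction interval at a new point x0, by the formula and via predict()
x0  <- 5
X0  <- c(1, x0)
MSE <- sum(residuals(fit)^2) / fit$df.residual
se_pred   <- sqrt(MSE * (1 + drop(t(X0) %*% solve(t(X) %*% X) %*% X0)))
manual_pi <- sum(X0 * coef(fit)) + c(-1, 1) * qt(0.975, fit$df.residual) * se_pred
predict(fit, newdata = data.frame(x = x0), interval = "prediction")
```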

[1] "Key Formulas and Functions Summary:"
                Purpose                Key_Formula    R_Function
                 <char>                     <char>        <char>
1: Parameter Estimation             β̂ = (X'X)⁻¹X'y          lm()
2:   Hypothesis Testing              t = β̂ⱼ/SE(β̂ⱼ)     summary()
3:      Model Selection AIC = n ln(SSE/n) + 2(k+1) step(), AIC()
4:           Prediction     ŷ ± t_{α/2} × SE(pred)     predict()
5:          Diagnostics            VIF = 1/(1-R²ⱼ) vif(), plot()
              Interpretation
                      <char>
1:   Least squares estimates
2:   Test significance of βⱼ
3: Balance fit vs complexity
4:     Individual prediction
5:   Check multicollinearity
Table 1: Regression Model Types and Applications

Model_Type          Example               Key_Feature             R_squared_Range  Main_Diagnostic    When_to_Use
Simple Linear       Steel Strength        Single predictor        0.60-0.80        Linearity          Exploring relationships
Multiple Linear     Chemical Yield        Multiple predictors     0.75-0.90        Multicollinearity  Multiple continuous predictors
Polynomial          Reaction Rate         Nonlinear relationship  0.85-0.95        Curvature          Curved relationships
Categorical         Strength by Supplier  Factor variables        0.70-0.85        Group effects      Comparing groups/categories
Variable Selection  Quality Prediction    Automated selection     0.80-0.92        Overfitting        Many potential predictors
# Model selection criteria comparison table
all_models_summary <- data.table(
  Model_Type = c("Simple Linear", "Multiple Linear", "Polynomial", "Categorical", "Variable Selection"),
  Example = c("Steel Strength", "Chemical Yield", "Reaction Rate", "Strength by Supplier", "Quality Prediction"),
  Key_Feature = c("Single predictor", "Multiple predictors", "Nonlinear relationship", "Factor variables", "Automated selection"),
  R_squared_Range = c("0.60-0.80", "0.75-0.90", "0.85-0.95", "0.70-0.85", "0.80-0.92"),
  Main_Diagnostic = c("Linearity", "Multicollinearity", "Curvature", "Group effects", "Overfitting"),
  When_to_Use = c(
    "Exploring relationships",
    "Multiple continuous predictors",
    "Curved relationships",
    "Comparing groups/categories",
    "Many potential predictors"
  )
)

kbl(all_models_summary,
  caption = "Regression Model Types and Applications"
) %>%
  kable_styling() %>%
  column_spec(1, width = "2.5cm") %>%
  column_spec(2, width = "2.5cm") %>%
  column_spec(3, width = "3cm") %>%
  column_spec(4, width = "2cm") %>%
  column_spec(5, width = "2.5cm") %>%
  column_spec(6, width = "3.5cm")

# Summary of all key formulas and criteria
formula_summary <- data.table(
  Purpose = c("Parameter Estimation", "Hypothesis Testing", "Model Selection", "Prediction", "Diagnostics"),
  Key_Formula = c(
    "β̂ = (X'X)⁻¹X'y",
    "t = β̂ⱼ/SE(β̂ⱼ)",
    "AIC = n ln(SSE/n) + 2(k+1)",
    "ŷ ± t_{α/2} × SE(pred)",
    "VIF = 1/(1-R²ⱼ)"
  ),
  R_Function = c("lm()", "summary()", "step(), AIC()", "predict()", "vif(), plot()"),
  Interpretation = c(
    "Least squares estimates",
    "Test significance of βⱼ",
    "Balance fit vs complexity",
    "Individual prediction",
    "Check multicollinearity"
  )
)

print("Key Formulas and Functions Summary:")
print(formula_summary)

1.7 Chapter Summary

This chapter covered the fundamental concepts of building empirical models:

  1. Simple Linear Regression - Modeling relationships between two variables

  2. Parameter Estimation - Using least squares to fit models

  3. Hypothesis Testing - Testing significance of model parameters

  4. Model Validation - Checking assumptions through residual analysis

  5. Multiple Regression - Extending to multiple predictors

  6. Advanced Topics - Polynomial models, categorical variables, variable selection

Key Principles:

  • Always validate model assumptions

  • Use appropriate diagnostic tools

  • Consider both statistical and practical significance

  • Apply subject matter knowledge in model building

  • Validate models on independent data when possible
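The last principle can be made concrete with a simple hold-out split: fit the model on part of the data, then judge it on observations it never saw. The sketch below uses simulated data for illustration; a large gap between training and test RMSE is a warning sign of overfitting.

```r
set.seed(42)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(100, sd = 2)

# 70/30 train/test split
train_idx <- sample(nrow(d), size = 70)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit <- lm(y ~ x, data = train)

rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test$y - predict(fit, newdata = test))^2))

c(train_RMSE = rmse_train, test_RMSE = rmse_test)
```

For small samples, k-fold cross-validation makes better use of the data than a single split, at the cost of fitting the model k times.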