# Nonlinear Regression Analysis

## Polynomial and Spline Regression in R

Posted by Chevy on March 26, 2020

## Background

• Root Mean Squared Error (RMSE) measures the model's prediction error: the average difference between the observed values and the model's predictions. It can be computed as `sqrt(mean((observeds - predicteds)^2))`; the lower the RMSE, the better the model fits.
• R-squared (R²) measures how well the model's predictions track the observed values; higher is better.
• Both metrics can be computed directly with the `RMSE()` and `R2()` functions from the `caret` package
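As a sanity check, both metrics can also be computed by hand in base R. A minimal sketch with toy vectors (`observeds` and `predicteds` are made up for illustration; `caret`'s `R2()` reports the squared correlation by default):

```r
# Toy observed and predicted values (illustrative only)
observeds  <- c(3.0, 5.0, 7.5, 10.0)
predicteds <- c(2.8, 5.4, 7.0, 10.5)

# RMSE: square the differences, average them, then take the square root
rmse <- sqrt(mean((observeds - predicteds)^2))

# R-squared as the squared correlation between observed and predicted
r2 <- cor(observeds, predicteds)^2

rmse
r2
```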

• Randomly split the data into k folds (k-fold), e.g. 5
• Hold out one subset as the test set and fit the model on the remaining k-1 subsets combined
• Evaluate the fitted model on the held-out subset
• Repeat the previous three steps k-1 more times (so that each subset serves as the test set exactly once)
• The mean of the errors from the k runs is called the cross-validation error and is used to assess model performance
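The steps above can be sketched in a few lines of base R. Everything here (the toy data, k = 5, and a plain `lm()` fit) is illustrative:

```r
set.seed(1)
# Toy data with a roughly linear relationship
df <- data.frame(x = runif(100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 0.5)

k <- 5
# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(df)))

# For each fold: hold it out, train on the rest, evaluate on the hold-out
errors <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x, data = train)
  sqrt(mean((test$y - predict(fit, test))^2))  # per-fold RMSE
})

# The mean of the k per-fold errors is the cross-validation error
mean(errors)
```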

## A worked example with linear regression

``````r
# install.packages("caret")
# install.packages("tidyverse")
library(tidyverse)
library(caret)
theme_set(theme_bw())
``````

``````r
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)

#   youtube facebook newspaper sales
# 1  219.12    55.44     70.44 25.44
# 2  332.28    58.68     50.16 32.40
# 3  176.76    28.68     22.92 17.52

# Split the data into training and test set
set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p = 0.8, list = FALSE)

train.data  <- marketing[training.samples, ]
test.data <- marketing[-training.samples, ]
``````

``````r
# Build the model
model <- lm(sales ~., data = train.data)
# Summarize the model
summary(model)

# Call:
#   lm(formula = sales ~ ., data = train.data)
#
# Residuals:
#   Min       1Q   Median       3Q      Max
# -10.7142  -0.9939   0.3684   1.4494   3.3619
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.594142   0.420815   8.541 1.05e-14
# youtube     0.044636   0.001552  28.758  < 2e-16
# facebook    0.188823   0.009529  19.816  < 2e-16
# newspaper   0.002840   0.006442   0.441     0.66
#
# Residual standard error: 2.043 on 158 degrees of freedom
# Multiple R-squared:  0.8955,	Adjusted R-squared:  0.8935
# F-statistic: 451.2 on 3 and 158 DF,  p-value: < 2.2e-16

# Make predictions
predictions <- model %>% predict(test.data)
predictions

# 2         3         6        14        15        24        32        40        41        46        51        58
# 15.036290 15.151964 15.395866 10.563146 22.137863 19.741211 13.715581 24.457935 19.601260 18.178608 15.116375 15.296489
# 65        66        68        81        86        91        93       104       107       116       124       129
# 20.412706  9.400329 14.375762 13.812250 18.335633 11.929636 23.046574 17.616955  7.526894 15.726891 18.069954 26.507800
# 134       137       138       139       148       156       168       178       181       182       183       184
# 23.111689 13.833975 25.006188 11.835838 27.874451  6.461594 15.915354 14.597916 12.599533 16.614622  7.997147 28.986837
# 189       192
# 22.075364 10.105750
``````

• Residual standard error: roughly how many units, on average, the model's predictions deviate from the observed values in the training data (akin to a residual standard deviation); lower is better
• F-statistic: the F test gives an overall assessment of the model's significance, i.e. whether at least one predictor has a non-zero coefficient
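Both quantities can be read directly off a fitted `lm` object. A small base-R sketch on made-up data:

```r
set.seed(42)
# Toy data: y depends linearly on x plus noise
d <- data.frame(x = rnorm(50))
d$y <- 2 + 3 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
s <- summary(fit)

s$sigma       # residual standard error
s$fstatistic  # F statistic plus its numerator/denominator degrees of freedom
```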

``````r
# Model performance

# (a) Prediction error, RMSE
RMSE(predictions, test.data$sales)
# [1] 1.965508

# (b) R-square
R2(predictions, test.data$sales)
# [1] 0.9049049
``````

RMSE: analogous to the residual standard error above, but computed on the test set; lower is better

## Nonlinear regression

• Polynomial regression: adds polynomial terms (squared, cubed, …) of the predictor to the model
• Spline regression: fits multiple polynomial segments to produce a smoother curve; the points that delimit these segments are called knots

### Loading the data

The `tidyverse` package is used for data manipulation and visualization.

The `caret` package provides convenient helper functions for building and evaluating regression models.

``````r
library(tidyverse)
library(caret)
theme_set(theme_classic())

# Load the example data: the Boston data set from the MASS package
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(111)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
``````

``````r
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth()
# `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Note that ggplot2 automatically picked a nonlinear model for us here: loess regression
``````

### Linear model performance

`medv = b_0 + b_1*lstat`

``````r
# Build the model
model <- lm(medv ~ lstat, data = train.data)
summary(model)
# Call:
#   lm(formula = medv ~ lstat, data = train.data)
#
# Residuals:
#   Min      1Q  Median      3Q     Max
# -15.303  -4.086  -1.573   1.960  24.384
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept) 34.78563    0.63794   54.53   <2e-16
# lstat       -0.96223    0.04401  -21.86   <2e-16
#
# Residual standard error: 6.389 on 405 degrees of freedom
# Multiple R-squared:  0.5413,	Adjusted R-squared:  0.5402
# F-statistic:   478 on 1 and 405 DF,  p-value: < 2.2e-16

# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
# RMSE        R2
# 1 5.455969 0.5611306
``````

``````r
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)
``````

### Polynomial regression performance

`medv = b_0 + b_1*lstat + b_2*lstat^2 + ... + b_n*lstat^n`
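With `raw = TRUE`, the columns that `poly()` adds are simply the plain powers of the predictor (rather than orthogonal polynomials). A quick check on a toy vector:

```r
x <- c(1, 2, 3, 4)
p <- poly(x, 2, raw = TRUE)

# Column 1 is x itself, column 2 is x^2
all.equal(as.numeric(p[, 1]), x)    # TRUE
all.equal(as.numeric(p[, 2]), x^2)  # TRUE
```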

``````r
model <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
summary(model)
# Call:
#   lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = train.data)
#
# Residuals:
#   Min       1Q   Median       3Q      Max
# -15.4151  -3.8283  -0.4949   2.3619  25.3261
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)                 43.211355   0.975879   44.28   <2e-16
# poly(lstat, 2, raw = TRUE)1 -2.361731   0.137782  -17.14   <2e-16
# poly(lstat, 2, raw = TRUE)2  0.043711   0.004127   10.59   <2e-16
#
# Residual standard error: 5.659 on 404 degrees of freedom
# Multiple R-squared:  0.641,	Adjusted R-squared:  0.6392
# F-statistic: 360.7 on 2 and 404 DF,  p-value: < 2.2e-16
``````

``````r
model <- lm(medv ~ poly(lstat, 10, raw = TRUE), data = train.data)
summary(model)
# Call:
#   lm(formula = medv ~ poly(lstat, 10, raw = TRUE), data = train.data)
#
# Residuals:
#   Min       1Q   Median       3Q      Max
# -14.2679  -3.0880  -0.5218   2.1185  26.2338
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)                    7.285e+00  2.899e+01   0.251   0.8017
# poly(lstat, 10, raw = TRUE)1   5.325e+01  2.992e+01   1.780   0.0759
# poly(lstat, 10, raw = TRUE)2  -2.601e+01  1.238e+01  -2.101   0.0363
# poly(lstat, 10, raw = TRUE)3   5.865e+00  2.728e+00   2.150   0.0322
# poly(lstat, 10, raw = TRUE)4  -7.584e-01  3.593e-01  -2.111   0.0354
# poly(lstat, 10, raw = TRUE)5   6.108e-02  2.989e-02   2.044   0.0416
# poly(lstat, 10, raw = TRUE)6  -3.168e-03  1.607e-03  -1.971   0.0494
# poly(lstat, 10, raw = TRUE)7   1.058e-04  5.562e-05   1.902   0.0579
# poly(lstat, 10, raw = TRUE)8  -2.197e-06  1.195e-06  -1.838   0.0668
# poly(lstat, 10, raw = TRUE)9   2.578e-08  1.449e-08   1.779   0.0760
# poly(lstat, 10, raw = TRUE)10 -1.305e-10  7.566e-11  -1.725   0.0854
#
# Residual standard error: 5.343 on 396 degrees of freedom
# Multiple R-squared:  0.6863,	Adjusted R-squared:  0.6784
# F-statistic: 86.64 on 10 and 396 DF,  p-value: < 2.2e-16
``````

``````r
# Build the model
model <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = train.data)
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
# RMSE        R2
# 1 4.551925 0.6912013
``````

``````r
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))
``````

### Log transformation

``````r
# Build the model
model <- lm(medv ~ log(lstat), data = train.data)
summary(model)
# Call:
#   lm(formula = medv ~ log(lstat), data = train.data)
#
# Residuals:
#   Min       1Q   Median       3Q      Max
# -14.5104  -3.4671  -0.6271   2.2332  25.9824
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)  52.3915     1.0828   48.39   <2e-16
# log(lstat)  -12.5858     0.4441  -28.34   <2e-16
#
# Residual standard error: 5.462 on 405 degrees of freedom
# Multiple R-squared:  0.6648,	Adjusted R-squared:  0.664
# F-statistic: 803.3 on 1 and 405 DF,  p-value: < 2.2e-16
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
# RMSE        R2
# 1 4.749176 0.6650908

ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log(x))
``````

### Spline regression

The `splines` package provides the `bs()` function (B-Spline Basis for Polynomial Splines) for building spline regressions. You need to specify two things:

1. The degree of the piecewise polynomials
2. The placement of the knots (there is no fixed rule; choose the number of segments based on how complex the data looks)
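The number of basis columns `bs()` produces is the number of interior knots plus the degree (the degree defaults to 3), which is why the model below ends up with six `bs(...)` coefficients. A quick check:

```r
library(splines)

x <- seq(0, 1, length.out = 50)
# Three interior knots with the default cubic degree
b <- bs(x, knots = c(0.25, 0.5, 0.75))

ncol(b)  # 3 knots + degree 3 = 6 basis columns
```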
``````r
library(splines)
# Build the model
knots <- quantile(train.data$lstat, p = c(0.25, 0.5, 0.75))
model <- lm(medv ~ bs(lstat, knots = knots), data = train.data)
summary(model)
# Call:
#   lm(formula = medv ~ bs(lstat, knots = knots), data = train.data)
#
# Residuals:
#   Min       1Q   Median       3Q      Max
# -13.8751  -3.1449  -0.7017   2.1611  26.6803
#
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)                 50.802      2.719  18.687  < 2e-16
# bs(lstat, knots = knots)1  -14.298      4.191  -3.412 0.000711
# bs(lstat, knots = knots)2  -26.684      2.619 -10.189  < 2e-16
# bs(lstat, knots = knots)3  -28.081      3.104  -9.046  < 2e-16
# bs(lstat, knots = knots)4  -41.369      3.305 -12.519  < 2e-16
# bs(lstat, knots = knots)5  -38.778      4.545  -8.532 3.01e-16
# bs(lstat, knots = knots)6  -39.742      4.393  -9.046  < 2e-16
#
# Residual standard error: 5.354 on 400 degrees of freedom
# Multiple R-squared:  0.6818,	Adjusted R-squared:  0.6771
# F-statistic: 142.9 on 6 and 400 DF,  p-value: < 2.2e-16
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
# RMSE        R2
# 1 4.58503 0.6868804
``````

``````r
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 3))
``````

### Generalized additive models

The `mgcv` package fits generalized additive models (GAMs). Note that all of the models above were fitted with `lm()`; here we use the `gam()` function from `mgcv` instead, and its `s()` function marks a term as a smooth whose degree of smoothness (effectively the optimal knots) is selected automatically:

``````r
library(mgcv)
# Build the model
model <- gam(medv ~ s(lstat), data = train.data)
summary(model)
# Family: gaussian
#
# Formula:
#   medv ~ s(lstat)
#
# Parametric coefficients:
#   Estimate Std. Error t value Pr(>|t|)
# (Intercept)  22.6781     0.2646   85.72   <2e-16
#
# Approximate significance of smooth terms:
#   edf Ref.df   F p-value
# s(lstat) 7.513  8.448 102  <2e-16
#
# R-sq.(adj) =  0.679   Deviance explained = 68.5%
# GCV = 29.097  Scale est. = 28.488    n = 407
# Make predictions
predictions <- model %>% predict(test.data)
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  R2 = R2(predictions, test.data$medv)
)
# RMSE        R2
# 1 4.633751 0.6807803
``````

``````r
ggplot(train.data, aes(lstat, medv)) +
  geom_point() +
  stat_smooth(method = gam, formula = y ~ s(x))
``````

## Model comparison

| Model | R2 | RMSE |
| --- | --- | --- |
| Linear | 0.5611306 | 5.455969 |
| Polynomial (degree 5) | 0.6912013 | 4.551925 |
| Log transformation | 0.6650908 | 4.749176 |
| Spline | 0.6868804 | 4.585030 |
| GAM | 0.6807803 | 4.633751 |

On this test set the degree-5 polynomial model achieves the lowest RMSE and the highest R², with the spline and GAM models close behind.
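The comparison can be reproduced with base and recommended packages only (no `caret`), by hand-rolling the metrics. Note that the plain `sample()` split below is not identical to `createDataPartition()`, so the exact numbers will differ slightly from those reported above:

```r
library(splines)

# Boston housing data from the recommended MASS package
data("Boston", package = "MASS")
set.seed(111)
idx <- sample(nrow(Boston), size = round(0.8 * nrow(Boston)))
train <- Boston[idx, ]
test  <- Boston[-idx, ]

# The four lm()-based candidates from this post
models <- list(
  linear     = lm(medv ~ lstat, data = train),
  polynomial = lm(medv ~ poly(lstat, 5, raw = TRUE), data = train),
  log        = lm(medv ~ log(lstat), data = train),
  spline     = lm(medv ~ bs(lstat, df = 6), data = train)
)

# Test-set RMSE and R-squared for each model
perf <- sapply(models, function(m) {
  pred <- predict(m, newdata = test)
  c(RMSE = sqrt(mean((test$medv - pred)^2)),
    R2   = cor(test$medv, pred)^2)
})
perf
```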

## Further reading

Splines, Additive Models, Model Selection and Validation: https://www.andrew.cmu.edu/user/achoulde/95791/lectures/lecture02/lecture02-95791.pdf