In Chapter 1 of their 2016 book Computer Age Statistical Inference, Bradley Efron and Trevor Hastie use a study of kidney function as an example, comparing results from linear regression and lowess (locally weighted scatterplot smoothing). The purpose of this blog post is to reproduce their results using R.
I will not copy the plots from the book here (for copyright reasons), but you can download a pdf of the book for free from a website created by the authors: https://web.stanford.edu/~hastie/CASI/index.html. I am specifically interested in Figures 1.1 and 1.2, as well as Table 1.1.
Let’s first grab the kidney function data (also kindly provided by the authors on their website); the tot column is the study’s composite measure of kidney function:
library(tidyverse)
kidney_data <- read_delim('https://web.stanford.edu/~hastie/CASI_files/DATA/kidney.txt', delim = ' ')
kidney_data
## # A tibble: 157 x 2
##      age   tot
##    <dbl> <dbl>
##  1    18  2.44
##  2    19  3.86
##  3    19 -1.22
##  4    20  2.3
##  5    21  0.98
##  6    21 -0.5
##  7    21  2.74
##  8    22 -0.12
##  9    22 -1.21
## 10    22  0.99
## # ... with 147 more rows
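A quick sanity check that the file parsed as expected:
dim(kidney_data)     # should be 157 rows, 2 columns
summary(kidney_data) # both columns numeric; NAs here would indicate parsing problems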
Just getting the linear regression line is straightforward:
base_plot <- kidney_data %>%
  ggplot(aes(x = age, y = tot)) +
  geom_point(shape = 8, color = 'blue', size = 1) +
  theme_bw() +
  scale_x_continuous(breaks = seq(20, 90, 10)) +
  scale_y_continuous(breaks = seq(-6, 4, 2)) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

linear_plot <- base_plot +
  geom_smooth(method = 'lm', color = 'green', se = FALSE)

linear_plot
Here I separated out the plot of just the data (base_plot), so that I can use it for both the linear regression plot and the lowess plot. I also went to a little bit of trouble to make the plot in a similar style to Figure 1.1 in the book. It is not an exact copy, but close enough.

I removed the standard error band from geom_smooth, since the default is to plot the error as a continuous ribbon, and Figure 1.1 only shows the error at a few discrete points. Let’s next calculate the standard error at those points:
library(broom)
linear_model <- lm(tot ~ age, data = kidney_data)
error_points <- data.frame(age = seq(20, 80, 10))
# se_fit = TRUE is needed in newer versions of broom to get the .se.fit column
linear_error <- augment(linear_model, newdata = error_points, se_fit = TRUE)
linear_error
## # A tibble: 7 x 3
##     age .fitted .se.fit
## * <dbl>   <dbl>   <dbl>
## 1    20   1.29    0.207
## 2    30   0.502   0.155
## 3    40  -0.284   0.147
## 4    50  -1.07    0.189
## 5    60  -1.86    0.258
## 6    70  -2.64    0.337
## 7    80  -3.43    0.420
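As a cross-check, base R’s predict() returns the same fitted values and standard errors without broom:
# Same numbers via base R: predict.lm with se.fit = TRUE
pred <- predict(linear_model, newdata = error_points, se.fit = TRUE)
data.frame(age = error_points$age, fitted = pred$fit, se = pred$se.fit)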
Now add the error bars to the linear plot:
linear_plot +
  geom_segment(data = linear_error,
               aes(x = age, xend = age,
                   y = .fitted - 2*.se.fit, yend = .fitted + 2*.se.fit))
The plot shows \(\pm2\) standard errors (roughly a 95% interval) at each of the 7 selected age values. This is a good replication of Figure 1.1.
For Figure 1.2, the authors use the lowess algorithm in R, which stands for ‘locally weighted scatterplot smoothing.’ There is a newer implementation in R named loess, but the two use different defaults (lowess fits local lines with robustifying iterations, while loess defaults to local quadratics without them), so the mapping between their parameters is not immediately obvious, and I will stick with lowess.
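For what it’s worth, if you did want to stay inside geom_smooth, a loess fit with degree = 1 and family = 'symmetric' (local lines with robustness iterations, as in lowess) should come close to lowess with f = 1/3, though I haven’t verified that the two implementations agree exactly:
# Approximate lowess(f = 1/3) with loess: span plays the role of f,
# degree = 1 fits local lines, family = 'symmetric' adds robustness iterations
base_plot +
  geom_smooth(method = 'loess', se = FALSE, color = 'green', span = 1/3,
              method.args = list(degree = 1, family = 'symmetric'))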
Again, obtaining just the lowess fit is fairly straightforward, except for the fact that geom_smooth uses loess and not lowess, so we have to calculate the fit ‘by hand’ (i.e. outside of ggplot) and then add it to the plot:
lowess_model <- lowess(kidney_data$age, kidney_data$tot, f = 1/3) %>%
  as_tibble()

lowess_plot <- base_plot +
  geom_line(data = lowess_model, aes(x = x, y = y), color = 'green')

lowess_plot
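As an aside, here is a purely illustrative overlay of two other span values, 1/6 and 2/3 (the lowess default), to show how the span changes the fit:
# Purely illustrative: compare a wigglier span (1/6) and a smoother one (2/3)
span_fits <- map_dfr(c(1/6, 2/3), function(f) {
  lowess(kidney_data$age, kidney_data$tot, f = f) %>%
    as_tibble() %>%
    mutate(span = paste('f =', round(f, 2)))
})

base_plot +
  geom_line(data = span_fits, aes(x = x, y = y, color = span))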
The ‘1/3’ in lowess is the smoother span: the proportion of points that influence the smooth at each value, so smaller spans give a wigglier fit, as the overlay above shows.

Now, to get the standard errors for the lowess fit, the authors use the bootstrap. This is not the place for a full-fledged introduction to the bootstrap, but in brief it is a way of approximating the standard error by repeatedly sampling with replacement from the original data. There are many ways to run a bootstrap sample in R; here I’ll use bootstraps from the rsample package:
library(rsample)
bootstrap_samples <- bootstraps(kidney_data, times = 25)
bootstrap_samples
## # Bootstrap sampling
## # A tibble: 25 x 2
##    splits           id
##    <list>           <chr>
##  1 <split [157/65]> Bootstrap01
##  2 <split [157/57]> Bootstrap02
##  3 <split [157/57]> Bootstrap03
##  4 <split [157/59]> Bootstrap04
##  5 <split [157/54]> Bootstrap05
##  6 <split [157/58]> Bootstrap06
##  7 <split [157/62]> Bootstrap07
##  8 <split [157/55]> Bootstrap08
##  9 <split [157/62]> Bootstrap09
## 10 <split [157/60]> Bootstrap10
## # ... with 15 more rows
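Under the hood, bootstraps is doing the sampling-with-replacement just described; a minimal hand-rolled version of a single replicate looks like this:
set.seed(1)
idx <- sample(nrow(kidney_data), replace = TRUE)  # 157 row indices, drawn with replacement
one_replicate <- kidney_data[idx, ]
nrow(one_replicate)  # same size as the original data
length(unique(idx))  # about 63% of the original rows appear in a given replicate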
Each sample (split) contains an “analysis” data set that is the same length as the original kidney_data, and an “assessment” data set containing all the original data points that did not get included in the bootstrap sampling with replacement. Now, for each of these bootstrap samples, we want to apply a lowess fit, so let’s write a function that does that and then map it over the bootstraps:
lowess_fit <- function(split){
  resample <- analysis(split)  # the bootstrap resample for this split
  lowess(resample$age, resample$tot, f = 1/3)
}

bootstrap_samples %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  select(id, lowess_points) %>%
  mutate(lowess_points = map(lowess_points, as_tibble)) %>%  # as_data_frame is deprecated
  unnest(lowess_points)
## # A tibble: 3,925 x 3
##    id              x     y
##    <chr>       <dbl> <dbl>
##  1 Bootstrap01    18 1.21
##  2 Bootstrap01    19 1.16
##  3 Bootstrap01    19 1.16
##  4 Bootstrap01    20 1.07
##  5 Bootstrap01    21 0.955
##  6 Bootstrap01    21 0.955
##  7 Bootstrap01    21 0.955
##  8 Bootstrap01    21 0.955
##  9 Bootstrap01    21 0.955
## 10 Bootstrap01    22 0.953
## # ... with 3,915 more rows
Let’s plot these 25 bootstrapped replications of the lowess fit, along with the original data. I’ll add a distinct because I don’t care about multiple identical values for plotting purposes:
lowess_25 <- bootstrap_samples %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  select(id, lowess_points) %>%
  mutate(lowess_points = map(lowess_points, as_tibble)) %>%
  unnest(lowess_points) %>%
  distinct()

base_plot +
  geom_line(data = lowess_25, aes(x = x, y = y, color = id),
            alpha = 0.5) +
  theme(legend.position = 'none')
This gives a rough idea of the standard error of the lowess fit. Now, as did Efron and Hastie, we’ll take 250 bootstrap samples and calculate the standard error. First, we need a function that will give the lowess fit at the same 7 points where we calculated the standard error of the linear model. Since lowess outputs a series of \(x\) and \(y\) values, I’ll just use linear interpolation between these values:
lowess_predict <- function(fit){
  # Interpolate the lowess (x, y) pairs at the 7 ages used earlier
  approx(fit %>% as_tibble() %>% distinct(),
         xout = seq(20, 80, 10), rule = 2)
}
(rule = 2 means that, for points outside the input data range, approx returns the value at the nearest data extreme instead of NA; this is necessary since a bootstrap sample won’t always contain the extrema of the original data set.)
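A tiny made-up example of the difference:
# rule = 1 (the default) returns NA outside the input range
approx(x = c(25, 75), y = c(1, -3), xout = c(20, 50, 80), rule = 1)$y
## [1] NA -1 NA
# rule = 2 carries the nearest endpoint's value instead
approx(x = c(25, 75), y = c(1, -3), xout = c(20, 50, 80), rule = 2)$y
## [1]  1 -1 -3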
Now, for the full 250 replications. I’ll break this into two steps so it is clearer what is going on.
step_1 <- bootstraps(kidney_data, times = 250) %>%
  mutate(lowess_points = map(splits, lowess_fit)) %>%
  mutate(prediction = map(lowess_points, lowess_predict)) %>%
  select(id, prediction) %>%
  mutate(prediction = map(prediction, as_tibble)) %>%
  unnest(prediction)
step_1
## # A tibble: 1,750 x 3
##    id               x      y
##    <chr>        <dbl>  <dbl>
##  1 Bootstrap001    20  1.74
##  2 Bootstrap001    30  0.723
##  3 Bootstrap001    40 -0.383
##  4 Bootstrap001    50 -1.30
##  5 Bootstrap001    60 -2.22
##  6 Bootstrap001    70 -3.13
##  7 Bootstrap001    80 -4.02
##  8 Bootstrap002    20  1.84
##  9 Bootstrap002    30  0.595
## 10 Bootstrap002    40 -0.933
## # ... with 1,740 more rows
step_2 <- step_1 %>%
  group_by(x) %>%
  summarise(mean = round(mean(y), 2),
            sd = round(sd(y), 2))

step_2
## # A tibble: 7 x 3
##       x  mean    sd
##   <dbl> <dbl> <dbl>
## 1    20  1.56  0.7
## 2    30  0.65  0.25
## 3    40 -0.65  0.34
## 4    50 -1.28  0.33
## 5    60 -1.92  0.38
## 6    70 -2.69  0.49
## 7    80 -3.48  0.71
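Since we already have the linear model’s standard errors at the same seven ages, the two tables from above can be joined for a side-by-side look:
# Compare the linear fit and the bootstrapped lowess fit at the same ages
left_join(linear_error, step_2, by = c('age' = 'x')) %>%
  rename(lowess_mean = mean, lowess_sd = sd)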
These mean and standard deviation values match the values given in Table 1.1 of Computer Age Statistical Inference fairly well, with the caveat that the authors and I each used only 250 bootstrap replications, so some simulation noise remains and we shouldn’t expect exact agreement.
Let’s add the bootstrap standard errors to the plot of the lowess fit:
lowess_plot +
  geom_segment(data = step_2,
               aes(x = x, xend = x,
                   y = mean - 2*sd, yend = mean + 2*sd))
This provides a good replication of Figure 1.2 in the book.
As always, I am sure that there are other ways to accomplish the same result in R, likely including ones that are more concise and clear. If you have a nice approach (or if you find a mistake in what I’ve done), please send me an email and let me know!