Bootstrap

Introduction

The primary goal of statistics is to extract information about a population (described by a distribution function $F$) based on a random sample $\mathcal{X} = \{X_1,\ldots,X_N\}$ from this population. More often than not, one is interested in a particular characteristic of the population, denoted generally by $\theta = \theta(F)$.

Leading Ex.: The mean $\theta = \mathrm{E} X_1 = \int x \, dF(x)$. Δ

Let us focus generally on $F$ rather than specifically on $\theta$. We can estimate $F$

  • parametrically: if we are willing to assume that $F \in \{F_\lambda \mid \lambda \in \Lambda \subset \mathbb{R}^p\}$ for some integer $p$, we can take $\hat F = F_{\hat\lambda}$ for an estimator $\hat\lambda$ of the parameter vector $\lambda$, or
  • non-parametrically, by the empirical distribution function \[ \hat F_N(x) = \frac{1}{N} \sum_{n=1}^N \mathbb{I}_{[X_n \le x]}. \]
# plot the empirical distribution function of N standard Gaussian draws
# and overlay the true distribution function (in red) for comparison
edf_plot <- function(N){
  X <- rnorm(N)
  EDF <- ecdf(X)
  plot(EDF)
  x <- seq(-4,4,by=0.01)
  points(x,pnorm(x),type="l",col="red")
}
set.seed(517)
edf_plot(12)
edf_plot(50)

Leading Ex.: The empirical estimator of the mean is exactly the mean of the empirical distribution: $\hat\theta := \frac{1}{N}\sum_{n=1}^N X_n = \int x \, d\hat F_N(x)$. Δ
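
As a quick sanity check (a minimal sketch; the simulated sample below is purely illustrative), the mean of the empirical distribution can be computed directly from the point masses of $\hat F_N$ and coincides with the sample mean:
# each observation carries mass 1/N under the empirical distribution,
# so the integral of x with respect to F_N-hat is just the sample average
set.seed(517)
N <- 50
X <- rnorm(N)
sum(X * 1/N)   # mean of the empirical distribution
mean(X)        # sample mean -- the same number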

In both cases, the population characteristic of interest $\theta = \theta(F)$ can often be estimated by the corresponding sample characteristic $\hat\theta = \theta(\hat F)$. The sample characteristic is called a statistic, and it has a sampling distribution. Sometimes the statistic of interest is a scaled version of the sample characteristic.

Leading Ex.: For $F$ the distribution function of a Gaussian distribution with variance $\sigma^2$, $\hat\theta \sim \mathcal{N}\big(\theta, \tfrac{\sigma^2}{N}\big)$, which is the sampling distribution. Δ

We are rarely ever interested only in a point estimator of $\theta$: most statistical procedures require some knowledge of the sampling distribution. For example, confidence intervals for $\theta$ or hypothesis testing about $\theta$ require knowledge of quantiles of the sampling distribution. And even in the case of point estimation, we may be interested in the estimator’s bias, variance, or jointly its mean squared error. Those are all characteristics of the sampling distribution.

While the sampling distribution is generally unknown (it can be found and worked with analytically only in simple cases, like in our leading example; otherwise it has to be approximated, e.g., via the CLT), it is identified by the underlying $F$ and the sampling process. The bootstrap embraces the idea that a (re)sampling process from $\hat F$ can mimic the sampling process from $F$ itself that produced the data. But since $\hat F$ is known, the characteristics of the resampling distribution (which will in turn serve as proxies for the characteristics of the sampling distribution) are more readily available. Consider the following diagram:
\[
\begin{aligned}
\text{Sampling (real world):} &\quad F \longrightarrow X_1,\ldots,X_N \longrightarrow \hat\theta = \theta(\hat F_N) \\
\text{Resampling (bootstrap world):} &\quad \hat F_N \longrightarrow X_1^*,\ldots,X_N^* \longrightarrow \hat\theta^* = \theta(\hat F_N^*)
\end{aligned}
\]
The bootstrap idea is that the bootstrap world can serve as a proxy for the real world. The advantage of the bootstrap world is that it is ours to control. Moreover, approximating the sampling process by resampling (i.e. performing the bootstrap) can sometimes have further benefits.
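
In R, drawing from $\hat F_N$ simply means sampling the observed values with replacement; a minimal sketch (the Gaussian sample is only illustrative):
set.seed(517)
X      <- rnorm(20)                  # observed data, a sample from F
X_star <- sample(X, replace = TRUE)  # one bootstrap resample, a sample from F_N-hat
c(mean(X), mean(X_star))             # theta-hat and its bootstrap counterpart theta-hat-star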

Leading Ex.: Assume we are looking for a one-sided CI $[\theta_\alpha, \infty)$ with coverage $1-\alpha$, that is, we search for a random variable $\theta_\alpha$ such that $P(\theta \ge \theta_\alpha) = 1-\alpha$ for $\alpha \in (0,1)$. We know that \[ T = \sqrt{N}\, \frac{\bar X_N - \theta}{\hat\sigma} \sim t_{N-1} \quad \Longrightarrow \quad P\{T \le t_{N-1}(1-\alpha)\} = 1-\alpha, \] and so we get a CI with exact coverage by expressing $\theta$ from the inequality $T \le t_{N-1}(1-\alpha)$: \[ \theta \ge \bar X_N - \frac{\hat\sigma}{\sqrt N}\, t_{N-1}(1-\alpha) =: \hat\theta_\alpha. \] When the Gaussianity assumption is taken away, i.e., when we only assume $X_1,\ldots,X_N$ are i.i.d. with $\mathrm{E} X_1^2 < \infty$, we have from the CLT that $T \xrightarrow{d} \mathcal{N}(0,1)$ and thus \[ P(\theta \ge \hat\theta_\alpha) \to 1-\alpha \quad \text{for} \quad \hat\theta_\alpha = \bar X_N - \frac{\hat\sigma}{\sqrt N}\, z_{1-\alpha}, \] where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard Gaussian. By the Berry-Esseen theorem, \[ P_F(T \le x) - \Phi(x) = O\Big(\tfrac{1}{\sqrt N}\Big) \text{ for all } x \quad \Longrightarrow \quad P(\theta \ge \hat\theta_\alpha) = 1-\alpha + O\Big(\tfrac{1}{\sqrt N}\Big). \] That is, the coverage of the asymptotic CI is exact up to the order $O(N^{-1/2})$.

Now consider the bootstrap version of the CI. Let $X_1^*,\ldots,X_N^*$ be i.i.d. from the empirical distribution $\hat F_N$ obtained from $X_1,\ldots,X_N$ i.i.d. with $\mathrm{E} X_1^2 < \infty$. Then it can be shown using Edgeworth expansions that \[ P_F(T \le x) = \Phi(x) + \frac{1}{\sqrt N}\, a(x)\, \phi(x) + O\Big(\tfrac{1}{N}\Big) \] for all $x \in \mathbb{R}$, with $\Phi$ and $\phi$ being respectively the distribution function and the density of the standard Gaussian, while the bootstrap version satisfies \[ P_{\hat F_N}(T^* \le x) = \Phi(x) + \frac{1}{\sqrt N}\, \hat a(x)\, \phi(x) + O\Big(\tfrac{1}{N}\Big), \] where $\hat a(x) - a(x) = O(N^{-1/2})$ and $T^* = \sqrt{N}\, \frac{\bar X_N^* - \bar X_N}{\hat\sigma^*}$. Hence

\[
P_{\hat F_N}(T^* \le x) - P_F(T \le x) = O\Big(\tfrac{1}{N}\Big) \quad \Longrightarrow \quad P\Big(\theta \ge \bar X_N - \frac{\hat\sigma}{\sqrt N}\, q^*(1-\alpha) =: \hat\theta^*_\alpha\Big) = 1-\alpha + O\Big(\tfrac{1}{N}\Big), \tag{1}
\]

i.e., the (one-sided) bootstrap interval has exact coverage up to the order $O(N^{-1})$!

The previous implication requires some more justification, though. $T$ has a sampling distribution inherited from $F$; let $q(\alpha)$ be its $\alpha$-quantile. Then naturally \[ P\{T \le q(1-\alpha)\} = 1-\alpha \quad \Longleftrightarrow \quad P\Big(\theta \ge \bar X_N - \frac{\hat\sigma}{\sqrt N}\, q(1-\alpha)\Big) = 1-\alpha, \] but the problem is that we do not know the sampling distribution, and hence neither its quantiles. If we use asymptotic normality, we approximate the quantile by the corresponding Gaussian quantile, but we commit a certain error. The bootstrap alternative instead tries to approximate the unknown sampling distribution more directly by the distribution of \[ T^* = \sqrt{N}\, \frac{\bar X_N^* - \bar X_N}{\hat\sigma^*}, \] where $\bar X_N^*$ and $\hat\sigma^*$ are calculated from the bootstrap sample $X_1^*,\ldots,X_N^*$. $T^*$ also has a sampling distribution (inherited from $\hat F_N$); let $q^*(\alpha)$ be its $\alpha$-quantile. Then $P\{T^* \le q^*(1-\alpha)\} = 1-\alpha$, and according to the LHS of (1) we only commit an error of order $O(N^{-1})$ when we replace the quantiles of $T$ by those of $T^*$, which then yields the RHS of (1). Δ
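
To make this concrete, here is a minimal sketch of the leading example's one-sided CI, with the quantile $q^*(1-\alpha)$ approximated by Monte Carlo over bootstrap resamples (the skewed exponential data and the choice $B = 2000$ are purely illustrative; the general workflow is formalized below):
set.seed(517)
N     <- 50
X     <- rexp(N)                        # illustrative skewed data, true mean = 1
alpha <- 0.05
B     <- 2000
T_star <- replicate(B, {
  Xb <- sample(X, replace = TRUE)       # resample from F_N-hat
  sqrt(N) * (mean(Xb) - mean(X)) / sd(Xb)
})
q_star <- quantile(T_star, 1 - alpha)   # Monte Carlo approximation of q*(1-alpha)
mean(X) - sd(X)/sqrt(N) * q_star            # bootstrap lower confidence bound
mean(X) - sd(X)/sqrt(N) * qnorm(1 - alpha)  # CLT-based lower confidence bound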

Remark: As discussed by Davison and Hinkley (see the reference to their book below), the benefit of the bootstrap distribution of $T$ being an order of magnitude closer to the true distribution disappears if $T$ is not pivotal, i.e., if its (limiting) distribution depends on unknowns. The reason is that the unknown parameters on which the (limiting) distribution of $T$ depends need to be estimated to obtain the bootstrap distribution, which then has a different leading term (in the Edgeworth expansion) than the true (limiting) distribution.

Example: Assume $X_1,\ldots,X_N$ are i.i.d. with $\mathrm{E}|X_1|^3 < \infty$ and $\mathrm{E} X_1 = \mu$, and let $\theta = \mu^3$. In the same spirit as the leading example above, the empirical estimator of $\theta$ is $\hat\theta = (\bar X_N)^3$, which is biased: $b := \mathrm{bias}(\hat\theta) = \mathrm{E}\hat\theta - \theta$. We can use the bootstrap to estimate the bias $b$ by $\hat b$ and define a bias-corrected estimator as $\hat\theta_b = \hat\theta - \hat b$. Let us do just that.

In this case, the bias can actually be calculated explicitly: \[ \mathrm{E}_F\, \hat\theta = \mathrm{E}_F\, \bar X_N^3 = \mathrm{E}_F \Big[ \mu + N^{-1} \sum_{n=1}^N (X_n - \mu) \Big]^3 = \mu^3 + N^{-1}\, 3\mu\sigma^2 + N^{-2}\gamma = \theta + b, \] where $\sigma^2 = \mathrm{E}(X_1-\mu)^2$ and $\gamma = \mathrm{E}(X_1-\mu)^3$, so $b = N^{-1}\, 3\mu\sigma^2 + N^{-2}\gamma$. For the bootstrap version, we have \[ \mathrm{E}_{\hat F_N}\, \hat\theta^* = \mathrm{E}_{\hat F_N} (\bar X_N^*)^3 = \mathrm{E}_{\hat F_N} \Big[ \bar X_N + N^{-1} \sum_{n=1}^N (X_n^* - \bar X_N) \Big]^3 = \bar X_N^3 + N^{-1}\, 3\bar X_N \hat\sigma^2 + N^{-2}\hat\gamma = \hat\theta + \hat b, \] where $\hat\sigma^2 = N^{-1}\sum_n (X_n - \bar X_N)^2$ and $\hat\gamma = N^{-1}\sum_n (X_n - \bar X_N)^3$, so $\hat b = N^{-1}\, 3\bar X_N \hat\sigma^2 + N^{-2}\hat\gamma$.

Now, for $\hat\theta_b = \hat\theta - \hat b = \bar X_N^3 - N^{-1}\, 3\bar X_N\hat\sigma^2 - N^{-2}\hat\gamma$ we have \[ \mathrm{E}_F\, \hat\theta_b = \mu^3 + N^{-1}\, 3\big[\mu\sigma^2 - \mathrm{E}_F\, \bar X_N\hat\sigma^2\big] + N^{-2}\big[\gamma - \mathrm{E}_F\, \hat\gamma\big]. \]

At this point, there are several ways to argue that the bias has been decreased (but all of them are a bit painful). For example, it can be calculated explicitly that $\mathrm{E}\, \bar X_N\hat\sigma^2 = \mu\sigma^2 + O(N^{-1})$ and $\mathrm{E}\,\hat\gamma = \gamma + O(N^{-1})$, and hence $\mathrm{bias}(\hat\theta_b) = O(N^{-2})$ as opposed to $\mathrm{bias}(\hat\theta) = O(N^{-1})$. Thus, using the bootstrap, we have successfully reduced the bias by a full order of the sample size. Δ
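
A small simulation sketch of this example (the Exp(1) distribution, the sample size, and the number of repetitions are all illustrative choices): the explicitly corrected estimator $\hat\theta_b = \bar X_N^3 - 3\bar X_N\hat\sigma^2/N - \hat\gamma/N^2$ is compared with the plain estimator over repeated samples.
set.seed(517)
N     <- 30
R     <- 10000                      # Monte Carlo repetitions to estimate the biases
theta <- 1^3                        # for Exp(1): mu = 1, so theta = mu^3 = 1
est   <- matrix(0, R, 2)
for(r in 1:R){
  X        <- rexp(N)
  s2       <- mean((X - mean(X))^2)                  # sigma-hat^2
  g3       <- mean((X - mean(X))^3)                  # gamma-hat
  est[r,1] <- mean(X)^3                              # theta-hat
  est[r,2] <- mean(X)^3 - 3*mean(X)*s2/N - g3/N^2    # bias-corrected theta-hat_b
}
colMeans(est) - theta   # estimated biases; the corrected one should be much closer to 0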

In the examples above, we saw how resampling can be used to construct confidence intervals with better coverage compared to those obtained using the CLT, and how the resampling distribution can be used to partially correct the bias of an estimator. Of course, the second example is special in that it allows for explicit calculations, while an explicit calculation of $q^*(\alpha)$ in the first example is not possible. Nonetheless, the bootstrap can be naturally paired up with Monte Carlo in such situations.

We have already seen how basic Monte Carlo can help us avoid analytic calculations and perform numerous statistical tasks. However, the cases in which the distribution under the null is fully known are quite rare. In many more cases, the distribution under the null is unknown, but it can be approximated by the bootstrap idea above, and Monte Carlo can then be used to approximate desirable characteristics of such a distribution. From this point of view, “bootstrap” commonly refers to the combination of

  • the plug-in principle (i.e. estimating the unknowns), and
  • the Monte Carlo principle (i.e. simulation instead of analytic calculations).

The former amounts to estimating either

  • some parameters in parametric problems, giving the parametric bootstrap, or
  • the whole $F$ via the empirical estimator $\hat F_N$ in non-parametric problems, giving the non-parametric bootstrap.

The Non-Parametric Bootstrap

  • let $\mathcal{X} = \{X_1,\ldots,X_N\}$ be a random sample from $F$
  • we are interested in $\theta = \theta(F)$, a characteristic of $F$
  • the estimator is given as $\hat\theta = \theta(\hat F)$
    • we will also write $\hat\theta = \theta[\mathcal{X}]$, since $\hat F$, and thus the estimator, depends on the sample
    • $\hat F$ can be either a parametric or a non-parametric estimate
  • the distribution of a scaled estimator $T = g(\hat\theta, \theta) = g(\theta[\mathcal{X}], \theta)$ is of interest
    • e.g., $T = \sqrt{N}(\hat\theta - \theta)$

The workflow of the bootstrap is as follows, for some $B \in \mathbb{N}$ (e.g., $B = 1000$): from the data we draw $B$ resamples and compute the corresponding statistics, \[ \mathcal{X} = \{X_1,\ldots,X_N\} \;\longrightarrow\; \begin{cases} \mathcal{X}_1^* = \{X_{1,1}^*,\ldots,X_{1,N}^*\} \;\longrightarrow\; T_1^* = g(\theta[\mathcal{X}_1^*], \theta[\mathcal{X}]) \\ \qquad\qquad\vdots \\ \mathcal{X}_B^* = \{X_{B,1}^*,\ldots,X_{B,N}^*\} \;\longrightarrow\; T_B^* = g(\theta[\mathcal{X}_B^*], \theta[\mathcal{X}]) \end{cases} \] and we estimate the unknown distribution $F_T(x) = P(T \le x)$ by the empirical distribution of $T_1^*,\ldots,T_B^*$, i.e., by \[ \hat F^*_{T,B}(x) = \frac{1}{B} \sum_{b=1}^B \mathbb{I}_{[T_b^* \le x]}. \] Any characteristic of $F_T$ can be estimated by the corresponding characteristic of $\hat F^*_{T,B}$.

For example, the quantiles $q^*(\alpha)$ from the leading example above can be approximated by the quantiles of this empirical distribution.
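
A minimal generic sketch of this workflow (the function name bootstrap_T and its arguments are ours, purely for illustration): it takes the data, the statistic $\theta[\cdot]$, and the scaling $g$, and returns the bootstrap replicates $T_1^*,\ldots,T_B^*$.
# generic non-parametric bootstrap of a scaled statistic T = g(theta[X*], theta[X])
bootstrap_T <- function(X, theta_fun,
                        g = function(t_star, t_hat) sqrt(length(X)) * (t_star - t_hat),
                        B = 1000){
  t_hat <- theta_fun(X)
  replicate(B, {
    X_star <- sample(X, replace = TRUE)   # resample with replacement from the data
    g(theta_fun(X_star), t_hat)
  })
}
# usage: bootstrap distribution of T = sqrt(N) * (median(X*) - median(X))
set.seed(517)
X      <- rexp(100)
T_star <- bootstrap_T(X, median)
quantile(T_star, c(0.025, 0.975))         # e.g., quantiles of the bootstrap distribution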

Note: The bootstrap resamples are obtained by sampling with replacement from the data $X_1,\ldots,X_N$. A natural question is how large the number of resamples $B$ should be. That depends on the characteristic of interest: to approximate quantiles, at least $B = 10^3$, and preferably $B = 10^4$, is advisable, while to estimate a variance, $B \approx 200$ is usually considered sufficient.

Some Theory

Any statistical methodology should be at least consistent. If we have consistency, we may ask for rates of convergence, as sketched in the example above. But with the bootstrap, theoretical research is hard: results have to be derived anew for different kinds of statistics. Here we only provide a consistency result for the case of bootstrapping a smooth transformation of the empirical mean.

Recall that we are using the empirical bootstrap distribution $\hat F^*_{T,B}$ as a proxy for the unknown distribution of the scaled estimator $T$, denoted by $F_{T,N}(x) = P(T \le x)$. As we increase the number of bootstrap samples, $\hat F^*_{T,B}$ naturally approaches $F^*_{T,N}(x) = P(T_1^* \le x)$, the exact bootstrap distribution (conditional on the data). In fact, by the Glivenko-Cantelli theorem, \[ \sup_x \big| \hat F^*_{T,B}(x) - F^*_{T,N}(x) \big| \xrightarrow{a.s.} 0 \quad \text{as } B \to \infty. \] Hence the question is how $\sup_x \big| F^*_{T,N}(x) - F_{T,N}(x) \big|$ behaves as $N \to \infty$.

Theorem. Let $\mathrm{E} X_1^2 < \infty$ and $T = \sqrt{N}\big\{h(\bar X_N) - h(\mu)\big\}$, where $h$ is continuously differentiable at $\mu := \mathrm{E}(X_1)$ and such that $h'(\mu) \ne 0$. Then \[ \sup_x \big| F^*_{T,N}(x) - F_{T,N}(x) \big| \xrightarrow{a.s.} 0 \quad \text{as } N \to \infty, \] where $F^*_{T,N}$ is the distribution of the bootstrap counterpart $T^* = \sqrt{N}\big\{h(\bar X_N^*) - h(\bar X_N)\big\}$.

The previous theorem shows consistency of the bootstrap for smooth transformations of the empirical average. It also holds for random vectors, with $h : \mathbb{R}^p \to \mathbb{R}^q$ having continuous partial derivatives in some neighborhood of $\mu$.
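
The consistency can be visualized by a rough simulation sketch (all choices below, $h(x) = x^2$, exponential data, and the grid of sample sizes, are illustrative): the sup-distance between the bootstrap distribution of $T^*$ and the true distribution of $T$ (itself approximated by Monte Carlo) tends to shrink as $N$ grows, down to the Monte Carlo noise of order $1/\sqrt{B}$.
set.seed(517)
h  <- function(x) x^2                  # smooth transformation, h'(1) = 2 != 0
mu <- 1                                # E X_1 for Exp(1)
sup_dist <- function(N, B = 5000){
  # true sampling distribution of T = sqrt(N)*(h(Xbar) - h(mu)), by Monte Carlo
  T_true <- replicate(B, sqrt(N) * (h(mean(rexp(N))) - h(mu)))
  # bootstrap distribution of T* = sqrt(N)*(h(Xbar*) - h(Xbar)), from one sample
  X      <- rexp(N)
  T_boot <- replicate(B, sqrt(N) * (h(mean(sample(X, replace = TRUE))) - h(mean(X))))
  ks.test(T_boot, T_true)$statistic    # Kolmogorov-Smirnov sup-distance of the two EDFs
}
c(sup_dist(20), sup_dist(200), sup_dist(2000))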

Applications

Here, we will demonstrate how to devise bootstrap procedures for standard statistical tasks. We will not prove the validity of those procedures (i.e., we will not show that the bootstrap works).

Confidence Intervals

Let $T = \sqrt{N}(\hat\theta - \theta)$, where $\theta \in \mathbb{R}$, and thus $T_b^* = \sqrt{N}(\hat\theta_b^* - \hat\theta)$ for $b = 1,\ldots,B$. We can take $q_B^*(\alpha)$ for $\alpha \in (0,1)$ to be the $\alpha$-quantile of the empirical distribution based on $T_b^*$, $b = 1,\ldots,B$, i.e., the $\alpha$-quantile of the bootstrap distribution of $T$. Provided that the bootstrap works (is consistent), we have \[ P\Big( q_B^*(\alpha/2) \le \sqrt{N}(\hat\theta - \theta) \le q_B^*(1-\alpha/2) \Big) \to 1-\alpha, \] and hence \[ \Big( \hat\theta - \frac{q_B^*(1-\alpha/2)}{\sqrt N},\; \hat\theta - \frac{q_B^*(\alpha/2)}{\sqrt N} \Big) \] is a CI for $\theta$ with asymptotically correct coverage (again, provided that the bootstrap works, and hence that the quantiles of the bootstrap distribution converge to those of the true distribution of $T$).
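
A sketch of this basic (non-studentized) bootstrap CI, here for the median of illustrative exponential data (where the asymptotic variance would involve the unknown density at the median):
set.seed(517)
N     <- 100
X     <- rexp(N)                      # illustrative data; true median = log(2)
alpha <- 0.05
B     <- 5000
theta_hat <- median(X)
T_star <- replicate(B, sqrt(N) * (median(sample(X, replace = TRUE)) - theta_hat))
q_star <- quantile(T_star, c(alpha/2, 1 - alpha/2))
unname(c(theta_hat - q_star[2]/sqrt(N), theta_hat - q_star[1]/sqrt(N)))  # basic bootstrap CI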

Note that the interval above is different from the one we worked with in our leading example, where we also standardized for scale. This is called studentization, and it is highly recommended: it can often be shown that bootstrap procedures based on studentized statistics have better properties. For example, in the leading example above we have seen that the studentized CI had a faster rate of convergence (to the correct coverage) than the CI based on asymptotic normality. This would not happen without studentization. So let us change the basic CI above into a studentized CI.

Assume that $\sqrt{N}(\hat\theta - \theta) \to \mathcal{N}(0, v^2)$, i.e., $v^2$ is the asymptotic variance of $T$. Let $\hat v$ be a consistent estimator of $v$. Now we consider the studentized statistics \[ T = \sqrt{N}\, \frac{\hat\theta - \theta}{\hat v} \quad \text{and} \quad T_b^* = \sqrt{N}\, \frac{\hat\theta_b^* - \hat\theta}{\hat v_b^*}, \] where $\hat v_b^*$ is the estimator of the same form as $\hat v$, but calculated from the $b$-th resample $\mathcal{X}_b^*$. Again, let $q_B^*(\alpha)$ for $\alpha \in (0,1)$ be the $\alpha$-quantile of the empirical distribution based on $T_b^*$, $b = 1,\ldots,B$. If the bootstrap works, we now have \[ P\Big( q_B^*(\alpha/2) \le \sqrt{N}\, \frac{\hat\theta - \theta}{\hat v} \le q_B^*(1-\alpha/2) \Big) \to 1-\alpha, \] from which we get a CI with asymptotically correct coverage: \[ \Big( \hat\theta - \frac{q_B^*(1-\alpha/2)}{\sqrt N}\, \hat v,\; \hat\theta - \frac{q_B^*(\alpha/2)}{\sqrt N}\, \hat v \Big). \]
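
A sketch of the studentized version for the mean, where $\hat v$ is simply the sample standard deviation (data again illustrative), compared with the usual CLT-based interval:
set.seed(517)
N     <- 100
X     <- rexp(N)                          # illustrative data, true mean = 1
alpha <- 0.05
B     <- 5000
theta_hat <- mean(X)
v_hat     <- sd(X)
T_star <- replicate(B, {
  Xb <- sample(X, replace = TRUE)
  sqrt(N) * (mean(Xb) - theta_hat) / sd(Xb)   # studentized by the resample's own sd
})
q_star <- quantile(T_star, c(alpha/2, 1 - alpha/2))
unname(c(theta_hat - q_star[2]*v_hat/sqrt(N), theta_hat - q_star[1]*v_hat/sqrt(N)))  # studentized bootstrap CI
theta_hat + c(-1, 1) * qnorm(1 - alpha/2) * v_hat/sqrt(N)                            # CLT-based CI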

Variance Estimation

We often know that $\sqrt{N}(\hat\theta - \theta) \xrightarrow{d} \mathcal{N}_p(0, \Sigma)$, but in these cases $V = N^{-1}\Sigma$ often has a complicated form (e.g., it has a complicated dependency on unknown parameters).

But the bootstrap can be deployed quite easily: \[ \hat V = \frac{1}{B-1} \sum_{b=1}^B \big(\hat\theta_b^* - \bar\theta^*\big)\big(\hat\theta_b^* - \bar\theta^*\big)^\top, \quad \text{where} \quad \bar\theta^* = \frac{1}{B} \sum_{b=1}^B \hat\theta_b^* \] and $\hat\theta_b^*$ is the same estimator as $\hat\theta$, only calculated from the $b$-th bootstrap resample.

Note that $\hat V$ is an estimator of $N^{-1}\Sigma$, not of $\Sigma$ itself. Critically, the bootstrap will fail if $\mathrm{var}(\hat\theta) = \infty$, which can actually happen even when asymptotic normality holds.
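
Before the failure example below, here is a sketch of the bootstrap variance estimator in a well-behaved case (estimating the variance of the sample median of illustrative exponential data, where the asymptotic variance depends on the unknown density at the median):
set.seed(517)
N <- 200
X <- rexp(N)                                    # illustrative data
B <- 1000
theta_star <- replicate(B, median(sample(X, replace = TRUE)))
var(theta_star)                                 # bootstrap estimate of var(sample median)
1 / (4 * N * dexp(log(2))^2)                    # asymptotic variance 1/{4 N f(median)^2}, for comparison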

Example: Let $X_1,\ldots,X_N$ be a random sample from the density $f(x) = 3/x^4\, \mathbb{I}_{[x \ge 1]}$. From the CLT we have $\sqrt{N}(\bar X_N - 3/2) \to \mathcal{N}(0, 3/4)$. Taking $g(x) = \exp(x^4)$, the delta method gives us \[ \sqrt{N}\big( g(\bar X_N) - g(3/2) \big) \to \mathcal{N}\big( 0, \{g'(3/2)\}^2\, 3/4 \big). \] But the bootstrap will not work in this case, since even $\mathrm{E}\, g(\bar X_N) = \infty$; see the code below.

x <- 1+log(1+(0:1000)/10)
# target density f(x) = 3/x^4 on [1, Inf)
target_d <- function(x){
  return(3/x^4)
}
plot(x,target_d(x),type="l")
# proposal: Y = 1 + |C| with C standard Cauchy, i.e., density 2*dcauchy(y-1) for y >= 1
x <- (1:6000)/1001
proposal_d <- dcauchy(x)
points(1+x,2*proposal_d,type="l",col="red")

const <- 3/1^4 / (2*dcauchy(0)) # = 3*pi/2 ~ 4.71; bounds f(y)/{2*dcauchy(y-1)} on [1, Inf)

# rejection sampling from the target
set.seed(317)
N <- 1000
X <- rep(0,N)
j <- 0
while(j < N){
  Y <- 1+abs(rcauchy(1))                       # draw from the proposal
  U <- runif(1)
  if(U < target_d(Y)/(2*dcauchy(Y-1))/const){  # accept with probability f(Y)/{const*2*dcauchy(Y-1)}
    X[j+1] <- Y
    j <- j+1
  }
}
hist(X,breaks=16,freq=F)
xx <- seq(1,6,by=0.01)
points(xx,target_d(xx),type="l")               # overlay the target density (supported on [1, Inf))

# bootstrap estimator of the variance
B <- 1000
boot_stat <- rep(0,B)
for(b in 1:B){
  Xb <- sample(X,N,replace=T)
  boot_stat[b] <- exp(mean(Xb)^4)
}
(sigma_star <- 1/(B-1)*sum((boot_stat-mean(boot_stat))^2)) # overshot
[1] 5482212
(4*exp(1.5^4)*1.5^3)^2*3/4/N # true asymptotic variance we wish to estimate
[1] 3411.618

Bias Reduction

Unbiased estimators are in reality quite rare. But we can estimate the bias by bootstrapping as $\hat b = \bar\theta^* - \hat\theta$, where $\bar\theta^* = B^{-1}\sum_{b=1}^B \hat\theta_b^*$, and define the bias-corrected estimator $\hat\theta_b = \hat\theta - \hat b = 2\hat\theta - \bar\theta^*$.

We have seen in an example above that the bias-corrected estimator can have better properties. In that example, we tried to estimate $g(\mathrm{E} X_1)$ for $g(x) = x^3$ by $g(\bar X_N)$, and we got a faster rate for the bias of the bias-corrected estimator. Something similar happens when working with any smooth enough $g$ instead.
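
A sketch of the generic (Monte Carlo) version of this bias correction for $\hat\theta = \bar X_N^3$, using resamples instead of the explicit formulas from the earlier example (data and constants illustrative):
set.seed(517)
N <- 30
X <- rexp(N)                                    # illustrative data: mu = 1, theta = mu^3 = 1
B <- 2000
theta_hat  <- mean(X)^3
theta_star <- replicate(B, mean(sample(X, replace = TRUE))^3)
b_hat      <- mean(theta_star) - theta_hat      # bootstrap estimate of the bias
c(theta_hat, theta_hat - b_hat)                 # plain and bias-corrected estimators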

Hypothesis Testing

Assume we wish to test a hypothesis $H_0$ using a statistic $T$. Depending on the form of the alternative hypothesis $H_1$, either

  • large values of $T$,
  • small values of $T$, or
  • both large and small values of $T$

testify against the null hypothesis and in favor of the alternative. Bootstrapping the statistic, we obtain $T_1^*,\ldots,T_B^*$, and depending on which values of the statistic testify against the hypothesis, we estimate the p-value as either

  • $\widehat{p\text{-val}} = \frac{1}{B} \sum_{b=1}^B \mathbb{I}\big[T_b^* \ge t_{\mathrm{obs}}\big]$,
  • $\widehat{p\text{-val}} = \frac{1}{B} \sum_{b=1}^B \mathbb{I}\big[T_b^* \le t_{\mathrm{obs}}\big]$, or
  • $\widehat{p\text{-val}} = \frac{1}{B} \sum_{b=1}^B \mathbb{I}\big[|T_b^*| \ge |t_{\mathrm{obs}}|\big]$.

If the observed value $t_{\mathrm{obs}}$ of the statistic $T$ is not too extreme (the form of $H_1$ decides what “extreme” is), we will observe more extreme values coming out of our bootstrap, which in turn increases our estimate of the p-value, which in turn prevents us from rejecting $H_0$.

Example: Let $X_1,\ldots,X_N$ be a random sample with $\mathrm{E}(X_1^2) < \infty$ and consider $H_0: \mu = \mu_0$ against

  • $H_1: \mu > \mu_0$. Taking the studentized statistic $T = \sqrt{N}(\bar X_N - \mu_0)/s$ with $s^2 = (N-1)^{-1}\sum_n (X_n - \bar X_N)^2$, and the corresponding bootstrap statistics $T_b^* = \sqrt{N}(\bar X_b^* - \bar X_N)/s_b^*$ (with $s_b^*$ computed from the $b$-th resample), the bootstrap p-value is $\widehat{p\text{-val}} = \frac{1}{B} \sum_{b=1}^B \mathbb{I}\big[T_b^* \ge t_{\mathrm{obs}}\big]$.
set.seed(517)
N <- 100
X <- rexp(N,1/2) # thus true mean = 2
mu_0 <- 1.78 # hypothesized value
T_stat <- (mean(X)-mu_0)/sd(X)*sqrt(N) #asympt. normal under H0
B <- 10000
boot_stat <- boot_stat_param <- rep(0,B)
for(b in 1:B){
  Xb           <- sample(X,N,replace=T)                  # non-parametric resample from the data
  boot_stat[b] <- (mean(Xb)-mean(X))/sd(Xb)*sqrt(N)      # recentered at mean(X)
  Xb_param           <- rexp(N, rate=1/mu_0)             # parametric resample under H0 (Exp with mean mu_0)
  boot_stat_param[b] <- (mean(Xb_param)-mu_0)/sd(Xb_param)*sqrt(N)
}
p_val       <- mean(boot_stat >= T_stat)
p_val_param <- mean(boot_stat_param >= T_stat)
c(p_val,p_val_param)
[1] 0.0379 0.0394
T_stat > qnorm(0.95) # one-sided asymptotic test at the 5% level: reject for large T_stat
[1] FALSE

Notice that while the bootstrap (both parametric and non-parametric) rejects the hypothesis at the 5 % level, the test based on asymptotic normality does not. This could be expected, since we saw in our leading example above that the bootstrap produces better one-sided CIs, which are naturally dual to one-sided hypothesis tests.

  • $H_1: \mu \ne \mu_0$. Taking the same studentized statistics, the bootstrap p-value is this time $\widehat{p\text{-val}} = \frac{1}{B} \sum_{b=1}^B \mathbb{I}\big[|T_b^*| \ge |t_{\mathrm{obs}}|\big]$.
set.seed(517)
N <- 100
X <- rexp(N,1/2) # thus true mean = 2
mu_0 <- 1.7 # reduced, since harder to reject here
T_stat <- (mean(X)-mu_0)/sd(X)*sqrt(N)
B <- 10000
boot_stat <- rep(0,B)
for(b in 1:B){
  Xb <- sample(X,N,replace=T)
  boot_stat[b] <- (mean(Xb)-mean(X))/sd(Xb)*sqrt(N)
}
p_val <- mean(abs(boot_stat) >= abs(T_stat)) # two-sided bootstrap p-value, matching the formula above
p_val
2*(1-pnorm(T_stat)) # asymptotic test, easy to calculate asymptotic p-val
[1] 0.05592678

Final Thoughts

The name “bootstrap” refers to one of the absurd exploits of Baron Munchausen, who allegedly pulled himself out of a lake by his own bootstraps. At first glance, it also seems absurd that by resampling (particularly with a non-parametric bootstrap) one could actually obtain better results compared to just using the data set once – after all, there cannot be more information in the resampled data sets. Nonetheless, the bootstrap can serve us when direct analytic calculations are not available. In such cases, one often relies on the CLT and the asymptotic normality it provides. However, asymptotic normality is not a very sharp tool: the limiting distribution is always symmetric and thus it cannot work too well, e.g., for skewed distributions (of course, the skewness vanishes asymptotically, but in finite samples skewness is the main term describing the departure of the distribution of a given statistic from normality). A well set-up bootstrap can, on the other hand, provide an effective correction (e.g., for skewness) and thus a better approximation of the distribution of a given statistic than normality.

There is a tendency to use the bootstrap blindly, and there have been claims that the bootstrap (in combination with Monte Carlo simulation) makes mathematical skills obsolete in statistics. While the bootstrap can indeed reduce the mathematical finesse needed to solve a statistical problem in practice, e.g., to come up with a confidence interval or a test, verifying the coverage of such a confidence interval or the validity of such a test actually requires even higher mathematical skills.

We have not dived into such theoretical considerations, since the mathematical finesse required for those is far beyond the scope of this course. However, the folk knowledge is that the bootstrap works when the statistic possesses asymptotic normality with a non-degenerate variance (which generally requires some regularity or smoothness conditions). On the other hand, one has to be careful when working with order statistics, extremal statistics, non-smooth transformations, or non-i.i.d. regimes (e.g., time series or regression settings).

At the same time, it is fair to admit that the theoretical properties of the bootstrap are only asymptotic, and those may not characterize the finite sample performance well. For example, it can be shown that the iterated bootstrap can reduce the asymptotic bias by an order increasing with the number of bootstrap iterations. However, more than two iterations are not practically viable. And even if they were, with more and more bootstrap iterations it would take a larger and larger sample size for the asymptotic rates to “kick in” and result in better finite sample performance. Thus simulation studies examining the finite sample performance of any bootstrap procedure are vital.

But the bottom line is that no bootstrap procedure (take the parametric bootstrap test above as an example) should be embraced without suspicion. Ideally, both the asymptotic properties of a bootstrap procedure and its actual finite sample performance should be investigated. In reality, asymptotic considerations are hard, but careful simulation studies investigating finite sample performance should always be performed.

References

Davison & Hinkley (2009) Bootstrap Methods and their Application

Wasserman (2005) All of Nonparametric Statistics

Shao & Tu (1995) The Jackknife and Bootstrap

Hall (1992) The Bootstrap and Edgeworth Expansion