The primary goal of statistics is to extract information about a population (described by a distribution function $F$) based on a random sample $X_1,\ldots,X_N$ from this population. More often than not, one is interested in a particular characteristic of the population, denoted generally by $\theta = \theta(F)$.
Leading Ex.: The mean $\mu = \mathbb{E} X_1 = \int x \,\mathrm{d}F(x)$.
Let us focus generally on $F$ rather than specifically on $\theta$. We can estimate $F$
parametrically, if we are willing to assume that $F = F_\theta$ for some parameter vector $\theta \in \mathbb{R}^p$ with a fixed integer $p$; we can then take $F_{\widehat{\theta}_N}$ for an estimator $\widehat{\theta}_N$ of the parameter vector $\theta$, or
non-parametrically by the empirical distribution function \[ \widehat{F}_N(x) = \frac{1}{N}\sum_{n=1}^N \mathbb{1}_{[X_n \leq x]}. \]
# EDF of a Gaussian sample vs. the true distribution function (red)
edf_plot <- function(N){
  X <- rnorm(N)
  EDF <- ecdf(X)
  plot(EDF)
  x <- seq(-4, 4, by=0.01)
  points(x, pnorm(x), type="l", col="red")
}
set.seed(517)
edf_plot(12)
edf_plot(50)
Leading Ex.: The empirical estimator of the mean $\mu$ is exactly the mean of the empirical distribution $\widehat{F}_N$: $\bar{X}_N = \int x \,\mathrm{d}\widehat{F}_N(x) = \frac{1}{N}\sum_{n=1}^N X_n$.
In both cases, the population characteristic of interest $\theta = \theta(F)$ can often be estimated by the corresponding sample characteristic of the estimated distribution. The sample characteristic is called a statistic, and it has a sampling distribution. Sometimes the statistic of interest is a scaled version of the sample characteristic.
Leading Ex.: For $F$ the distribution function of a Gaussian distribution with variance $\sigma^2$, we have $\sqrt{N}(\bar{X}_N - \mu) \sim \mathcal{N}(0,\sigma^2)$, which is the sampling distribution.
We are rarely ever interested only in a point estimator of $\theta$: most statistical procedures require some knowledge of the sampling distribution. For example, confidence intervals for $\theta$ or hypothesis testing about $\theta$ require knowledge of quantiles of the sampling distribution. And even in the case of point estimation, we may be interested in the estimator's bias, variance, or jointly its mean squared error. Those are all characteristics of the sampling distribution.
While the sampling distribution is generally unknown (it can be found and worked with analytically only in simple cases, like in our leading example; in other cases it has to be approximated, e.g., by the CLT), it is identified by the underlying $F$ and the sampling process. The bootstrap embraces the idea that a (re)sampling process from $\widehat{F}_N$ can mimic the sampling process from $F$ itself that has produced the data. But since $\widehat{F}_N$ is known, the characteristics of the resampling distribution (which will in turn serve as proxies for the characteristics of the sampling distribution) are much more readily available. One can picture a "real world", in which the sample is drawn from $F$, and a "bootstrap world", in which resamples are drawn from $\widehat{F}_N$. The bootstrap idea is that the bootstrap world can serve as a proxy for the real world. The advantage of the bootstrap world is that it is ours to control. Moreover, approximating the sampling process by resampling (i.e., performing the bootstrap) can sometimes have further benefits.
Leading Ex.: Assume we are looking for a one-sided CI for $\mu$ with coverage $1-\alpha$, that is, we search for a random lower bound $L_N = L_N(X_1,\ldots,X_N)$ such that $P(\mu \geq L_N) = 1-\alpha$, for a given $\alpha \in (0,1)$. Under Gaussianity we know that $\sqrt{N}(\bar{X}_N - \mu)/S_N \sim t_{N-1}$, where $S_N$ denotes the sample standard deviation, and so we get a CI with exact coverage by expressing $\mu$ from the inequality $\sqrt{N}(\bar{X}_N - \mu)/S_N \leq t_{N-1,1-\alpha}$:
\[ P\Big( \mu \geq \bar{X}_N - \frac{S_N}{\sqrt{N}}\, t_{N-1,1-\alpha} \Big) = 1-\alpha . \]
When the Gaussianity assumption is taken away, i.e., when we only assume that $X_1,\ldots,X_N$ are i.i.d. with a finite variance, we have from the CLT that $\sqrt{N}(\bar{X}_N - \mu)/S_N \to \mathcal{N}(0,1)$ in distribution, and thus
\[ P\Big( \mu \geq \bar{X}_N - \frac{S_N}{\sqrt{N}}\, z_{1-\alpha} \Big) \to 1-\alpha , \]
where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard Gaussian. By the Berry–Esseen theorem:
\[ \Big| P\Big( \mu \geq \bar{X}_N - \frac{S_N}{\sqrt{N}}\, z_{1-\alpha} \Big) - (1-\alpha) \Big| = \mathcal{O}(N^{-1/2}) . \]
That is, the coverage of the asymptotic CI is exact only up to the order $N^{-1/2}$.
Now consider the bootstrap version of the CI. Let $X_1^\star,\ldots,X_N^\star$ be i.i.d. from the empirical distribution $\widehat{F}_N$ obtained from $X_1,\ldots,X_N$, i.i.d. with sufficiently many finite moments. Then, it can be shown using Edgeworth expansions that
\[ P\Big( \sqrt{N}\,\frac{\bar{X}_N - \mu}{S_N} \leq x \Big) = \Phi(x) + \frac{1}{\sqrt{N}}\, p(x)\,\varphi(x) + \mathcal{O}(N^{-1}) \]
uniformly for all $x$, with $\Phi$ and $\varphi$ being respectively the distribution function and the density of the standard Gaussian and $p$ a polynomial whose coefficients depend on the low-order moments of $X_1$, while the bootstrap version satisfies
\[ P_\star\Big( \sqrt{N}\,\frac{\bar{X}_N^\star - \bar{X}_N}{S_N^\star} \leq x \Big) = \Phi(x) + \frac{1}{\sqrt{N}}\, \widehat{p}(x)\,\varphi(x) + \mathcal{O}_P(N^{-1}), \]
where $P_\star$ denotes probability under the resampling distribution (conditionally on the data) and $\widehat{p}$ is obtained from $p$ by replacing the moments with the corresponding sample moments, so that $\widehat{p}(x) - p(x) = \mathcal{O}_P(N^{-1/2})$. Hence
\[ \sup_x \Big| P_\star\Big( \sqrt{N}\,\frac{\bar{X}_N^\star - \bar{X}_N}{S_N^\star} \leq x \Big) - P\Big( \sqrt{N}\,\frac{\bar{X}_N - \mu}{S_N} \leq x \Big) \Big| = \mathcal{O}_P(N^{-1}), \]
i.e., the (one-sided) bootstrap interval has exact coverage up to the order $N^{-1}$!
The previous implication probably requires some more justification though. The statistic $\sqrt{N}(\bar{X}_N - \mu)/S_N$ has a sampling distribution inherited from $F$; let $u_{1-\alpha}$ be its $(1-\alpha)$-quantile. Then naturally
\[ P\Big( \mu \geq \bar{X}_N - \frac{S_N}{\sqrt{N}}\, u_{1-\alpha} \Big) = 1-\alpha, \]
but the problem is that we do not know the sampling distribution, and hence neither its quantiles. If we use asymptotic normality, we approximate the quantile by the corresponding Gaussian quantile $z_{1-\alpha}$, but we commit a certain error. The bootstrap alternative instead tries to approximate the unknown sampling distribution more directly by the distribution of $\sqrt{N}(\bar{X}_N^\star - \bar{X}_N)/S_N^\star$, where $\bar{X}_N^\star$ and $S_N^\star$ are calculated from the bootstrap sample $X_1^\star,\ldots,X_N^\star$. This bootstrap statistic also has a sampling distribution (inherited from $\widehat{F}_N$); let $u_{1-\alpha}^\star$ be its $(1-\alpha)$-quantile, which is available to us. According to the left-hand side of the display above, we only commit an error of order $N^{-1}$ when we replace $u_{1-\alpha}$ by $u_{1-\alpha}^\star$, which then yields the claimed coverage error of order $N^{-1}$.
Remark: As discussed by Davison and Hinkley (see the reference to their book below), the benefit of the bootstrap distribution of $\sqrt{N}(\bar{X}_N^\star - \bar{X}_N)/S_N^\star$, and hence of the bootstrap CI, being an order of magnitude closer to the true distribution disappears if the statistic is not (asymptotically) pivotal, i.e., if its (limiting) distribution depends on unknowns. This is justified by the fact that the unknown parameters on which the (limiting) distribution would depend need to be estimated to get the bootstrap distribution, which would then have a different leading term (in the Edgeworth expansion) than the true (limiting) distribution.
Example: Assume $X_1,\ldots,X_N$ are i.i.d. with $\mathbb{E} X_n = \mu$ and $\mathrm{var}(X_n) = \sigma^2 < \infty$, and let $\theta = \mu^2$. In the same spirit as the leading example above, the empirical estimator of $\theta$ is $\bar{X}_N^2$, which is biased: $\mathbb{E}\,\bar{X}_N^2 = \mu^2 + \sigma^2/N \neq \mu^2$. We can use the bootstrap to estimate the bias as $\mathbb{E}_\star (\bar{X}_N^{\star})^2 - \bar{X}_N^2$ and define a bias-corrected estimator as $\bar{X}_N^2 - \big[\mathbb{E}_\star (\bar{X}_N^{\star})^2 - \bar{X}_N^2\big]$. Let us do just that.
In this case, the bias can actually be calculated explicitly:
\[ \mathbb{E}\,\bar{X}_N^2 - \mu^2 = \mathrm{var}(\bar{X}_N) = \frac{\sigma^2}{N}, \]
where $\mu = \mathbb{E} X_1$ and $\sigma^2 = \mathrm{var}(X_1)$. For the bootstrap version, we have
\[ \mathbb{E}_\star (\bar{X}_N^{\star})^2 - \bar{X}_N^2 = \mathrm{var}_\star(\bar{X}_N^\star) = \frac{\widehat{\sigma}_N^2}{N}, \]
where $\widehat{\sigma}_N^2 = \frac{1}{N}\sum_{n=1}^N (X_n - \bar{X}_N)^2$ and $\mathbb{E}_\star$, $\mathrm{var}_\star$ denote the mean and the variance under the resampling distribution, i.e., conditionally on the data.
Now, for the bias-corrected estimator $\widehat{\theta}_N^{corr} = \bar{X}_N^2 - \widehat{\sigma}_N^2/N$ we have
\[ \mathbb{E}\,\widehat{\theta}_N^{corr} = \mathbb{E}\,\bar{X}_N^2 - \frac{\mathbb{E}\,\widehat{\sigma}_N^2}{N}. \]
At this point, there are several ways to argue that the bias has been reduced (but all of them are a bit painful). For example, it can be calculated explicitly that $\mathbb{E}\,\bar{X}_N^2 = \mu^2 + \sigma^2/N$ and $\mathbb{E}\,\widehat{\sigma}_N^2 = (1 - 1/N)\,\sigma^2$, and hence
\[ \mathbb{E}\,\widehat{\theta}_N^{corr} - \mu^2 = \frac{\sigma^2}{N} - \frac{(1-1/N)\,\sigma^2}{N} = \frac{\sigma^2}{N^2} = \mathcal{O}(N^{-2}), \quad \text{as opposed to} \quad \mathbb{E}\,\bar{X}_N^2 - \mu^2 = \mathcal{O}(N^{-1}). \]
Thus, using the bootstrap, we have successfully reduced the bias by a full order of the sample size.
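To complement the calculation, here is a small Monte Carlo sketch (not part of the derivation above; the exponential distribution, N = 30 and the number of repetitions are my illustrative choices) comparing the plain estimator $\bar{X}_N^2$ with the bootstrap bias-corrected version $\bar{X}_N^2 - \widehat{\sigma}_N^2/N$, using the analytic form of the bootstrap bias estimate derived above.

# Monte Carlo sketch of the bias reduction for theta = mu^2
# (illustrative setting: exponential data with mu = 2, so theta = 4)
set.seed(123)
N <- 30
M <- 100000                                  # Monte Carlo repetitions
est_plain <- rep(0, M)
est_corr <- rep(0, M)
for(m in 1:M){
  X <- rexp(N, rate = 1/2)                   # mu = 2, sigma^2 = 4
  sigma2_hat <- mean((X - mean(X))^2)        # plug-in variance estimate
  est_plain[m] <- mean(X)^2                  # plain estimator of mu^2
  est_corr[m] <- mean(X)^2 - sigma2_hat/N    # bootstrap bias-corrected estimator
}
mean(est_plain) - 4   # bias of order 1/N (about sigma^2/N = 0.133)
mean(est_corr) - 4    # bias of order 1/N^2, much smaller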
In the examples above, we saw how resampling can be used to construct confidence intervals with better coverage compared to those obtained from the CLT, and how the resampling distribution can be used to partially correct the bias of an estimator. Of course, the second example is special in that it allows for explicit calculations, while explicit calculation of the bootstrap quantiles $u_{1-\alpha}^\star$ in the first example is not possible. Nonetheless, the bootstrap can be naturally paired up with Monte Carlo in such situations.
We have already seen how basic Monte Carlo can help us avoid analytic calculations and perform numerous statistical tasks. However, the cases in which the distribution under the null is fully known (as in examples seen earlier) are quite rare. In many more cases, the distribution under the null is unknown, but it can be approximated by the bootstrap idea above, and Monte Carlo can then be used to approximate desirable characteristics of such a distribution. From this point of view, "bootstrap" is commonly meant to combine
the plug-in principle (i.e. estimating the unknowns), and
the Monte Carlo principle (i.e. simulation instead of analytic calculations).
The former amounts to estimating either
some parameters $\theta$ in parametric problems, leading to the parametric bootstrap, or
the whole $F$ via the empirical estimator $\widehat{F}_N$ in non-parametric problems, leading to the non-parametric bootstrap.
The Non-Parametric Bootstrap
let $X_1,\ldots,X_N$ be a random sample from $F$
we are interested in $\theta = \theta(F)$, a characteristic of $F$
the estimator is given as the plug-in $\widehat{\theta}_N = \theta(\widehat{F}_N)$
we will also write $\widehat{\theta}_N = \theta_N(X_1,\ldots,X_N)$, since $\widehat{F}_N$ is determined by the sample and thus the estimator depends on the sample
$\widehat{\theta}_N$ can be either a parametric or a non-parametric estimate
the distribution of a scaled estimator $\sqrt{N}\,(\widehat{\theta}_N - \theta)$ is of interest
e.g., $\sqrt{N}\,(\bar{X}_N - \mu)$ in the leading example
The workflow of the bootstrap is as follows. For $b = 1,\ldots,B$, where $B$ is a suitably large number of resamples (see the note below): draw a resample $X_{b,1}^\star,\ldots,X_{b,N}^\star$ with replacement from $X_1,\ldots,X_N$, compute $\widehat{\theta}_{b}^\star = \theta_N(X_{b,1}^\star,\ldots,X_{b,N}^\star)$, and estimate the unknown distribution of $\sqrt{N}(\widehat{\theta}_N - \theta)$ by the empirical distribution of $\sqrt{N}(\widehat{\theta}_{b}^\star - \widehat{\theta}_N)$, $b = 1,\ldots,B$, i.e., by
\[ \widehat{G}_B^\star(x) = \frac{1}{B} \sum_{b=1}^B \mathbb{1}\big[ \sqrt{N}\,(\widehat{\theta}_{b}^\star - \widehat{\theta}_N) \leq x \big]. \]
Any characteristic of the distribution of $\sqrt{N}(\widehat{\theta}_N - \theta)$ can be estimated by the corresponding characteristic of $\widehat{G}_B^\star$.
For example, the quantiles $u_{1-\alpha}^\star$ needed in the leading example above can be approximated by the quantiles of this empirical distribution.
Note: The bootstrap resamples are obtained by sampling with replacement from the data $X_1,\ldots,X_N$. But there is a natural question: how large should the number of resamples $B$ be? That depends on the characteristic of interest. To approximate quantiles, a number of resamples in the thousands is advisable, while a couple of hundred resamples are usually considered sufficient to estimate a variance.
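A generic sketch of this workflow in R (the function name boot_dist and its arguments are mine, for illustration only): it resamples the data with replacement, evaluates a user-supplied statistic on each resample, and returns the scaled bootstrap replicates whose empirical distribution serves as the proxy described above.

# generic non-parametric bootstrap of sqrt(N)*(theta_hat_star - theta_hat)
boot_dist <- function(X, statistic, B = 1000){
  N <- length(X)
  theta_hat <- statistic(X)
  scaled <- rep(0, B)
  for(b in 1:B){
    Xb <- sample(X, N, replace = TRUE)               # resample with replacement
    scaled[b] <- sqrt(N)*(statistic(Xb) - theta_hat)
  }
  scaled   # draws whose EDF estimates the distribution of sqrt(N)*(theta_hat - theta)
}

# usage: bootstrap distribution of sqrt(N)*(median(X*) - median(X))
set.seed(517)
X <- rexp(50, 1/2)
hist(boot_dist(X, median, B = 2000), breaks = 30, freq = FALSE)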
Some Theory
Any statistical methodology should be at least consistent. If we have consistency, we may ask for rates of convergence, as sketched in the example above. But with the bootstrap, theoretical research is hard: results have to be derived anew for different kinds of statistics. Here we will only provide a consistency result for the case of bootstrapping a smooth transformation of the empirical mean.
Recall that we are using the empirical bootstrap distribution $\widehat{G}_B^\star$ as a proxy for the unknown distribution of the scaled estimator $\sqrt{N}(\widehat{\theta}_N - \theta)$, denoted by $G_N(x) = P\big(\sqrt{N}(\widehat{\theta}_N - \theta) \leq x\big)$. As we increase the number of bootstrap samples $B$, $\widehat{G}_B^\star$ naturally approaches the bootstrap distribution $G_N^\star(x) = P_\star\big(\sqrt{N}(\widehat{\theta}_N^\star - \widehat{\theta}_N) \leq x\big)$. In fact, by the Glivenko–Cantelli theorem:
\[ \sup_x \big| \widehat{G}_B^\star(x) - G_N^\star(x) \big| \xrightarrow[B \to \infty]{} 0 \quad \text{almost surely (conditionally on the data).} \]
Hence the question is how $G_N^\star$ behaves compared to $G_N$ as $N \to \infty$.
Theorem. Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E} X_1^2 < \infty$, and let $\theta = g(\mu)$ and $\widehat{\theta}_N = g(\bar{X}_N)$, where $\mu = \mathbb{E} X_1$ and $g$ is continuously differentiable at $\mu$ and such that $g'(\mu) \neq 0$. Then
\[ \sup_x \Big| P_\star\big( \sqrt{N}\,\big(g(\bar{X}_N^\star) - g(\bar{X}_N)\big) \leq x \big) - P\big( \sqrt{N}\,\big(g(\bar{X}_N) - g(\mu)\big) \leq x \big) \Big| \xrightarrow[N \to \infty]{} 0 \quad \text{almost surely.} \]
The previous theorem shows consistency of the bootstrap for smooth transformations of the empirical average. It also holds for random vectors, with $\theta = g(\mu)$ for $\mu = \mathbb{E} X_1 \in \mathbb{R}^p$, where $g: \mathbb{R}^p \to \mathbb{R}$ has continuous partial derivatives on some neighborhood of $\mu$ and a non-vanishing gradient at $\mu$.
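As a quick illustration of this consistency (my own toy setting, with $g = \exp$ and Exp(1) data), one can compare the bootstrap distribution of $\sqrt{N}(g(\bar{X}_N^\star) - g(\bar{X}_N))$, computed from a single sample, with a Monte Carlo approximation of the true distribution of $\sqrt{N}(g(\bar{X}_N) - g(\mu))$, which is available here only because we know $F$.

# bootstrap vs. true sampling distribution of sqrt(N)*(g(Xbar) - g(mu)), g = exp
set.seed(517)
g <- exp
N <- 50
M <- 5000   # Monte Carlo repetitions for the "true" distribution
B <- 5000   # bootstrap resamples

truth <- replicate(M, sqrt(N)*(g(mean(rexp(N))) - g(1)))   # mu = 1 for Exp(1)

X <- rexp(N)                                               # one observed sample
boot <- replicate(B, {
  Xb <- sample(X, N, replace = TRUE)
  sqrt(N)*(g(mean(Xb)) - g(mean(X)))
})

qqplot(truth, boot)          # close to the diagonal when the bootstrap is consistent
abline(0, 1, col = "red")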
Applications
Here, we will demonstrate how to devise bootstrap procedures for standard statistical tasks. We will not prove the validity of those procedures (i.e., we will not show that the bootstrap works).
Confidence Intervals
Let $\widehat{\theta}_N$ be an estimator of $\theta$ such that $\sqrt{N}(\widehat{\theta}_N - \theta)$ converges in distribution. We can take $q_\gamma^\star$, for $\gamma \in (0,1)$, to be the $\gamma$-quantile of the empirical distribution based on $\sqrt{N}(\widehat{\theta}_{b}^\star - \widehat{\theta}_N)$ for $b = 1,\ldots,B$, i.e., the $\gamma$-quantile of the bootstrap distribution of $\sqrt{N}(\widehat{\theta}_N^\star - \widehat{\theta}_N)$. Provided that the bootstrap works (is consistent), we have
\[ P\big( q_{\alpha/2}^\star \leq \sqrt{N}\,(\widehat{\theta}_N - \theta) \leq q_{1-\alpha/2}^\star \big) \to 1-\alpha, \]
and hence
\[ \Big[\, \widehat{\theta}_N - \frac{q_{1-\alpha/2}^\star}{\sqrt{N}},\; \widehat{\theta}_N - \frac{q_{\alpha/2}^\star}{\sqrt{N}} \,\Big] \]
is a CI for $\theta$ with asymptotically correct coverage $1-\alpha$ (again, provided that the bootstrap works and hence that the quantiles of the bootstrap distribution converge to those of the true distribution of $\sqrt{N}(\widehat{\theta}_N - \theta)$).
Note that the interval above is different from the one we worked with in our leading example, where we also standardized for the scale. This is called studentization, and it is highly recommended: it can often be shown that bootstrap procedures based on studentized statistics have better properties. For example, in the leading example above we saw that the studentized CI had a faster rate of convergence (to the correct coverage) than the CI based on asymptotic normality. This would not happen without studentization. So let us change the basic CI above into a studentized CI.
Assume that $\sqrt{N}(\widehat{\theta}_N - \theta) \to \mathcal{N}(0, \sigma_\theta^2)$ in distribution, i.e., $\sigma_\theta^2$ is the asymptotic variance of the estimator. Let $\widehat{\sigma}_N$ be a consistent estimator of $\sigma_\theta$. Now we consider the studentized statistics
\[ T_N = \sqrt{N}\,\frac{\widehat{\theta}_N - \theta}{\widehat{\sigma}_N} \qquad \text{and} \qquad T_{b}^\star = \sqrt{N}\,\frac{\widehat{\theta}_{b}^\star - \widehat{\theta}_N}{\widehat{\sigma}_{b}^\star}, \quad b = 1,\ldots,B, \]
where $\widehat{\sigma}_{b}^\star$ is the estimator of the same form as $\widehat{\sigma}_N$, but calculated from the $b$-th resample $X_{b,1}^\star,\ldots,X_{b,N}^\star$. Again, let $q_\gamma^\star$ for $\gamma \in (0,1)$ be the $\gamma$-quantile of the empirical distribution based on $T_1^\star,\ldots,T_B^\star$. If the bootstrap works, we now have
\[ P\big( q_{\alpha/2}^\star \leq T_N \leq q_{1-\alpha/2}^\star \big) \to 1-\alpha, \]
from which we get the CI with asymptotically correct coverage:
\[ \Big[\, \widehat{\theta}_N - \frac{\widehat{\sigma}_N\, q_{1-\alpha/2}^\star}{\sqrt{N}},\; \widehat{\theta}_N - \frac{\widehat{\sigma}_N\, q_{\alpha/2}^\star}{\sqrt{N}} \,\Big]. \]
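A sketch of this studentized (bootstrap-t) CI for the special case $\theta = \mu$, so that $\widehat{\sigma}_N$ is simply the sample standard deviation; the data, level and $B$ below are illustrative choices.

# studentized bootstrap CI for the mean
set.seed(517)
N <- 100
X <- rexp(N, 1/2)    # true mean = 2
alpha <- 0.05
B <- 2000
T_boot <- rep(0, B)
for(b in 1:B){
  Xb <- sample(X, N, replace = TRUE)
  T_boot[b] <- sqrt(N)*(mean(Xb) - mean(X))/sd(Xb)   # studentized bootstrap statistic
}
q <- quantile(T_boot, c(alpha/2, 1 - alpha/2))       # bootstrap quantiles
c(lower = mean(X) - q[2]*sd(X)/sqrt(N),              # theta_hat - q*_{1-alpha/2} * sigma_hat / sqrt(N)
  upper = mean(X) - q[1]*sd(X)/sqrt(N))              # theta_hat - q*_{alpha/2}   * sigma_hat / sqrt(N)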
Variance Estimation
We often know that $\sqrt{N}(\widehat{\theta}_N - \theta) \to \mathcal{N}(0, \sigma_\theta^2)$ in distribution, but in these cases $\sigma_\theta^2$ often has a complicated form (e.g., it depends in a complicated way on unknown parameters).
But the bootstrap can be deployed quite easily:
\[ \widehat{\sigma}_\star^2 = \frac{1}{B-1} \sum_{b=1}^B \Big( \widehat{\theta}_{b}^\star - \frac{1}{B}\sum_{b'=1}^B \widehat{\theta}_{b'}^\star \Big)^2, \]
where $\widehat{\theta}_{b}^\star$ is the same estimator as $\widehat{\theta}_N$, only calculated from the $b$-th bootstrap resample.
Note that $\widehat{\sigma}_\star^2$ is an estimator of $\mathrm{var}(\widehat{\theta}_N)$, i.e., approximately of $\sigma_\theta^2/N$, not of $\sigma_\theta^2$ itself. Critically, the bootstrap will fail if $\mathrm{var}(\widehat{\theta}_N) = \infty$, which can actually happen even when the asymptotic normality holds.
Example: Let $X_1,\ldots,X_N$ be a random sample from the density $f(x) = 3\,x^{-4}$ for $x \geq 1$ (and $f(x) = 0$ otherwise), so that $\mu = \mathbb{E} X_1 = 3/2$ and $\mathrm{var}(X_1) = 3/4$. From the CLT we have $\sqrt{N}(\bar{X}_N - 3/2) \to \mathcal{N}(0, 3/4)$. Taking $\widehat{\theta}_N = \exp(\bar{X}_N^4)$ as an estimator of $\theta = \exp(\mu^4)$, the delta method gives us
\[ \sqrt{N}\big( \exp(\bar{X}_N^4) - \exp(\mu^4) \big) \to \mathcal{N}\Big(0,\; \big(4\mu^3 e^{\mu^4}\big)^2 \cdot \tfrac{3}{4}\Big). \]
But the bootstrap estimation of the variance will not work in this case, since $\mathrm{var}(\widehat{\theta}_N) = \infty$ (in fact even $\mathbb{E}\,\widehat{\theta}_N = \infty$); see the code below.
# target density f(x) = 3/x^4 on [1, Inf) and a scaled Cauchy-based envelope (red)
x <- 1 + log(1 + (0:1000)/10)
target_d <- function(x){ return(3/x^4) }
plot(x, target_d(x), type="l")
x <- (1:6000)/1001
proposal_d <- dcauchy(x)
points(1 + x, 2*proposal_d, type="l", col="red")
const <- 3/1^4 / (2*dcauchy(0)) # approx. 4.7
# rejection sampling from the target, with proposal Y = 1 + |C|, C ~ Cauchy,
# whose density is g(y) = 2*dcauchy(y - 1) for y >= 1
set.seed(317)
N <- 1000
X <- rep(0, N)
j <- 0
while(j < N){
  Y <- 1 + abs(rcauchy(1))
  U <- runif(1)
  if(U < target_d(Y)/(2*dcauchy(Y - 1))/const){   # accept with prob. f(Y)/(const*g(Y))
    X[j+1] <- Y
    j <- j+1
  }
}
hist(X, breaks=16, freq=F)
points(x, target_d(x), type="l")
# bootstrap estimator of the variance
B <- 1000
boot_stat <- rep(0, B)
for(b in 1:B){
  Xb <- sample(X, N, replace=T)
  boot_stat[b] <- exp(mean(Xb)^4)
}
(sigma_star <- 1/(B-1)*sum((boot_stat - mean(boot_stat))^2)) # overshot
[1] 5482212
(4*exp(1.5^4)*1.5^3)^2*3/4/N # true asymptotic variance we wish to estimate
[1] 3411.618
Bias Reduction
Unbiased estimators are in reality quite rare. But we can estimate the bias by bootstrapping as
\[ \widehat{\mathrm{bias}}_\star = \frac{1}{B}\sum_{b=1}^B \widehat{\theta}_{b}^\star - \widehat{\theta}_N \]
and define the bias-corrected estimator $\widehat{\theta}_N - \widehat{\mathrm{bias}}_\star = 2\,\widehat{\theta}_N - \frac{1}{B}\sum_{b=1}^B \widehat{\theta}_{b}^\star$.
We have seen in an example above that the bias-corrected estimator can have better properties. In that example, we tried to estimate $\theta = \mu^2$ for $\mu = \mathbb{E}X_1$ by $\bar{X}_N^2$, and we obtained a faster rate for the bias of the bias-corrected estimator. Something similar happens when working with any smooth enough transformation $g(\mu)$ instead.
Hypothesis Testing
Assume we wish to test a hypothesis $H_0$ using a statistic $T_N = T_N(X_1,\ldots,X_N)$. Depending on the form of the alternative hypothesis $H_1$, either
large values of $T_N$,
small values of $T_N$, or
both large and small values of $T_N$
testify against the null hypothesis and in favor of the alternative. Bootstrapping the statistic, we obtain $T_1^\star,\ldots,T_B^\star$, and depending on which values of the statistic testify against the hypothesis, we can estimate the p-value as either
$\frac{1}{B}\sum_{b=1}^B \mathbb{1}\big[ T_b^\star \geq T_N \big]$,
$\frac{1}{B}\sum_{b=1}^B \mathbb{1}\big[ T_b^\star \leq T_N \big]$, or
$\frac{1}{B}\sum_{b=1}^B \mathbb{1}\big[ |T_b^\star| \geq |T_N| \big]$.
If the observed value of the statistic is not too extreme (the form of $H_1$ decides what "extreme" means), we will observe more extreme values coming out of our bootstrap, which in turn increases our estimate of the p-value, which in turn prevents us from rejecting $H_0$.
Example: Let $X_1,\ldots,X_N$ be a random sample with $\mathbb{E} X_n = \mu$, and consider testing $H_0\colon \mu = \mu_0$ against
$H_1\colon \mu \neq \mu_0$. Taking the studentized statistic $T_N = \sqrt{N}(\bar{X}_N - \mu_0)/S_N$ and the corresponding bootstrap statistics $T_b^\star = \sqrt{N}(\bar{X}_{b}^\star - \bar{X}_N)/S_{b}^\star$, the bootstrap p-value is $\frac{1}{B}\sum_b \mathbb{1}[|T_b^\star| \geq |T_N|]$; for comparison, the asymptotic test:
I(abs(T_stat) > qnorm(0.95)) # asymptotic test, easy to calculate asymptotic p-val
[1] FALSE
Notice that while the (parametric and non-parametric) bootstrap rejects the hypothesis at the 5 % level, the test based on asymptotic normality does not. This could be expected, since we saw in our leading example above that the bootstrap produces better one-sided CIs, which are naturally dual to one-sided hypothesis tests.
$H_1\colon \mu > \mu_0$. Taking the same studentized statistics, the bootstrap p-value is this time $\frac{1}{B}\sum_b \mathbb{1}[T_b^\star \geq T_N]$:
set.seed(517)
N <- 100
X <- rexp(N, 1/2) # thus true mean = 2
mu_0 <- 1.7       # reduced, since harder to reject here
T_stat <- (mean(X) - mu_0)/sd(X)*sqrt(N)
B <- 10000
boot_stat <- rep(0, B)
for(b in 1:B){
  Xb <- sample(X, N, replace=T)
  boot_stat[b] <- (mean(Xb) - mean(X))/sd(Xb)*sqrt(N)
}
p_val <- mean(boot_stat >= T_stat)
p_val
[1] 0.0139
2*(1-pnorm(T_stat)) # asymptotic test, easy to calculate asymptotic p-val
[1] 0.05592678
Related Methods
Parametric Bootstrap and Goodness-of-Fit Testing
Assume that $X_1,\ldots,X_N$ is a random sample from $F(\cdot\,;\theta)$, i.e., the distribution function is known up to a finite-dimensional parameter $\theta$. The parametric bootstrap is similar to the non-parametric one, only instead of sampling with replacement from $X_1,\ldots,X_N$ we sample from the fitted distribution $F(\cdot\,;\widehat{\theta}_N)$, where $\widehat{\theta}_N$ is a consistent estimator of $\theta$.
The parametric bootstrap is mainly useful for goodness-of-fit (GoF) testing, that is, for the following problem: assume $X_1,\ldots,X_N$ is a random sample from a general $F$, and for a parametric model $\{F(\cdot\,;\theta) : \theta \in \Theta\}$ we would like to test whether the data support the hypothesis that $F$ belongs to this parametric family:
\[ H_0\colon F \in \{F(\cdot\,;\theta) : \theta \in \Theta\} \qquad \text{against} \qquad H_1\colon F \notin \{F(\cdot\,;\theta) : \theta \in \Theta\}. \]
The standard approach is to use the Kolmogorov–Smirnov statistic
\[ KS_N = \sup_x \big| \widehat{F}_N(x) - F(x;\widehat{\theta}_N) \big|, \]
where $\widehat{\theta}_N$ is a specific estimator consistent under the parametric model (e.g., the MLE).
But here we are not comparing the empirical distribution to a fixed reference distribution like in the usual Kolmogorov–Smirnov test; instead, we compare the non-parametric fit ($\widehat{F}_N$) to the parametric fit ($F(\cdot\,;\widehat{\theta}_N)$). Hence the distribution of $KS_N$ is complicated, but doing the parametric bootstrap is easy:
for $b = 1,\ldots,B$:
generate a resample $X_{b,1}^\star,\ldots,X_{b,N}^\star$ i.i.d. from $F(\cdot\,;\widehat{\theta}_N)$
estimate $\widehat{\theta}_{b}^\star$ from the resample
calculate the EDF $\widehat{F}_{N,b}^\star$ from the resample
set $KS_{b}^\star = \sup_x \big| \widehat{F}_{N,b}^\star(x) - F(x;\widehat{\theta}_{b}^\star) \big|$
estimate the p-value of the test by $\frac{1}{B}\sum_{b=1}^B \mathbb{1}\big[ KS_{b}^\star \geq KS_N \big]$ (a code sketch is given after the note below)
Note: When using the bootstrap for hypothesis testing, it has been suggested to use the following bootstrap estimate of the p-value
\[ \frac{1 + \sum_{b=1}^B \mathbb{1}\big[ KS_{b}^\star \geq KS_N \big]}{B + 1} \]
to improve the finite sample performance.
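A sketch of the procedure above for an exponential null model (the family, the MLE and the data are my illustrative choices); note that the parameter is re-estimated on every resample, which is what distinguishes this from the classical Kolmogorov–Smirnov test.

# parametric bootstrap GoF test of H0: the data come from {Exp(rate) : rate > 0}
ks_stat <- function(X, rate){   # sup_x |F_hat_N(x) - F(x; rate)| for the exponential fit
  Xs <- sort(X)
  N <- length(X)
  Fpar <- pexp(Xs, rate)
  max(pmax(abs((1:N)/N - Fpar), abs(Fpar - (0:(N-1))/N)))
}

set.seed(517)
N <- 100
X <- rgamma(N, shape = 2, rate = 1)   # illustrative data, actually not exponential
rate_hat <- 1/mean(X)                 # MLE of the rate under H0
KS <- ks_stat(X, rate_hat)

B <- 999
KS_boot <- rep(0, B)
for(b in 1:B){
  Xb <- rexp(N, rate_hat)             # resample from the *fitted* parametric model
  rate_b <- 1/mean(Xb)                # re-estimate the parameter from the resample
  KS_boot[b] <- ks_stat(Xb, rate_b)
}
(p_val <- (1 + sum(KS_boot >= KS))/(B + 1))   # p-value estimate as in the note above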
Iterated Bootstrap
As we have seen in an example above, the bootstrap can be used to estimate the bias of an estimator $\widehat{\theta}_N$ by $\widehat{\mathrm{bias}}_\star = \frac{1}{B}\sum_{b=1}^B \widehat{\theta}_{b}^\star - \widehat{\theta}_N$, and the estimator of the bias can in turn be used to correct the estimator. But in that example, the bias-corrected estimator was still biased (just the asymptotic order of the bias was lowered), which means that the estimator of the bias is itself biased. How about we iterate the bootstrap idea and use another bootstrap to estimate the bias of $\widehat{\mathrm{bias}}_\star$? Then we can adjust the estimator even further and hopefully reduce its bias even more (i.e., lower the asymptotic order of the bias even more).
In the following, a single star denotes quantities computed from a first-level bootstrap resample, while a double star denotes the corresponding second-level (double) bootstrap quantities, i.e., quantities computed from a resample of a resample.
Example: In this example, consider the estimator of the variance $\widehat{\sigma}_N^2 = \frac{1}{N}\sum_{n=1}^N (X_n - \bar{X}_N)^2$. We know this estimator is biased, and we know that replacing $1/N$ by $1/(N-1)$ would make it unbiased, but let us show how the bootstrap can be used without this analytic knowledge (and use the analytic knowledge only to evaluate what the bootstrap does). The bias is $\mathbb{E}\,\widehat{\sigma}_N^2 - \sigma^2 = -\sigma^2/N$, and the bootstrap estimate of the bias is simply $\mathbb{E}_\star (\widehat{\sigma}_N^{\star})^2 - \widehat{\sigma}_N^2 = -\widehat{\sigma}_N^2/N$. Next, since $\mathbb{E}\,\widehat{\sigma}_N^2 = (1 - 1/N)\,\sigma^2$, we have
\[ \mathbb{E}\Big[ \widehat{\sigma}_N^2 + \frac{\widehat{\sigma}_N^2}{N} \Big] = \Big(1 + \frac{1}{N}\Big)\Big(1 - \frac{1}{N}\Big)\sigma^2 = \Big(1 - \frac{1}{N^2}\Big)\sigma^2, \]
so the bias-corrected estimator $\widehat{\sigma}_N^2\,(1 + 1/N)$ is unbiased up to the order $\mathcal{O}(N^{-2})$ for $\sigma^2$. Iterating the bootstrap for the second time, we have $\mathbb{E}_{\star\star} (\widehat{\sigma}_N^{\star\star})^2 - (\widehat{\sigma}_N^{\star})^2 = -(\widehat{\sigma}_N^{\star})^2/N$ as the bootstrap estimator for the bias of the bias estimate, and the second-iteration estimator for the bias of $\widehat{\sigma}_N^2$ itself is thus
\[ -\frac{\widehat{\sigma}_N^2}{N} - \Big[ \mathbb{E}_\star\Big( -\frac{(\widehat{\sigma}_N^{\star})^2}{N} \Big) - \Big( -\frac{\widehat{\sigma}_N^2}{N} \Big) \Big] = -\frac{\widehat{\sigma}_N^2}{N} - \frac{\widehat{\sigma}_N^2}{N^2}. \]
Subtracting this from $\widehat{\sigma}_N^2$ gives the twice-corrected estimator $\widehat{\sigma}_N^2\,(1 + 1/N + 1/N^2)$. This is unbiased up to the order $\mathcal{O}(N^{-3})$ for $\sigma^2$:
\[ \mathbb{E}\Big[ \widehat{\sigma}_N^2 \Big(1 + \frac{1}{N} + \frac{1}{N^2}\Big) \Big] = \Big(1 - \frac{1}{N}\Big)\Big(1 + \frac{1}{N} + \frac{1}{N^2}\Big)\sigma^2 = \Big(1 - \frac{1}{N^3}\Big)\sigma^2. \]
The example above is again a toy example where the bootstrap calculations can be performed analytically. But similar improvements can be observed even when analytic calculations are not possible. In that case, we need to apply a second-iteration bootstrap to every first-iteration bootstrap sample, i.e., generate second-level resamples $X_{b,c,1}^{\star\star},\ldots,X_{b,c,N}^{\star\star}$ for $c = 1,\ldots,C$ from every first-level resample $X_{b,1}^\star,\ldots,X_{b,N}^\star$, $b = 1,\ldots,B$. This is of course computationally demanding. Thus while we could, in principle, iterate this idea to a third-iteration bootstrap and beyond, this would not be computationally feasible, and also the finite sample properties of the iterated bootstrap estimates may not show the improvements unless $N$ is very large. On the other hand, the iterated bootstrap can be useful in many cases from the methodological viewpoint.
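A sketch of this nested resampling, applied to the double bias correction from the example above but now purely by simulation (the statistic is the plug-in variance; the data, B and C are my illustrative choices and deliberately kept small for speed).

# double bootstrap bias correction, nested resampling version
set.seed(517)
stat <- function(x) mean((x - mean(x))^2)   # plug-in variance, biased by -sigma^2/N

N <- 50
X <- rnorm(N, sd = 2)                       # true variance = 4
theta_hat <- stat(X)

B <- 500   # first-level resamples
C <- 100   # second-level resamples per first-level resample
theta_star <- rep(0, B)
bias_star <- rep(0, B)
for(b in 1:B){
  Xb <- sample(X, N, replace = TRUE)
  theta_star[b] <- stat(Xb)
  theta_2 <- rep(0, C)
  for(cc in 1:C){                           # second-level bootstrap from the b-th resample
    theta_2[cc] <- stat(sample(Xb, N, replace = TRUE))
  }
  bias_star[b] <- mean(theta_2) - theta_star[b]   # bias estimate inside the bootstrap world
}
bias1 <- mean(theta_star) - theta_hat       # first-level bias estimate
bias2 <- 2*bias1 - mean(bias_star)          # second-iteration (double bootstrap) bias estimate
c(plain = theta_hat,
  once_corrected = theta_hat - bias1,
  twice_corrected = theta_hat - bias2)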
Example: Let $X_1,\ldots,X_N$ be i.i.d. from a distribution depending on a parameter $\theta$. We are interested in testing $H_0\colon \theta = \theta_0$ against $H_1\colon \theta \neq \theta_0$. Let $\widehat{\theta}_N$ be an estimator such that $\sqrt{N}(\widehat{\theta}_N - \theta) \to \mathcal{N}(0, \sigma_\theta^2)$ in distribution. Let us perform a bootstrap test based on the studentized statistic $T_N = \sqrt{N}(\widehat{\theta}_N - \theta_0)/\widehat{\sigma}_N$, where $\widehat{\sigma}_N$ is a consistent estimator of $\sigma_\theta$. The asymptotic test can be based on the fact that $T_N \to \mathcal{N}(0,1)$ in distribution under $H_0$. We have several options for how the bootstrap can come into play:
If we do not want to use the asymptotic distribution directly, we can use the standard bootstrap to perform the hypothesis test.
If the estimator for $\sigma_\theta$ is not available, which is often the case, one can use the standard bootstrap to estimate it and carry on with the asymptotic test (based on the $\mathcal{N}(0,1)$ distribution).
If we do not want to use the asymptotic distribution directly and the estimator for $\sigma_\theta$ is not available, we can use the double bootstrap to perform the test: an outer bootstrap provides the null distribution of the studentized statistic, while an inner bootstrap supplies the variance estimate for each resample, as sketched below.
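A sketch of this double bootstrap test for the median of an exponential sample (the statistic, null value and resample sizes are my illustrative choices; no explicit variance formula is used anywhere). The inner bootstrap provides the standard error used for studentization, both for the observed statistic and for every outer resample.

# double bootstrap test of H0: theta = theta_0, theta = the median
set.seed(517)
stat <- function(x) median(x)
boot_se <- function(x, C = 100){            # inner bootstrap standard error of stat
  s <- replicate(C, stat(sample(x, length(x), replace = TRUE)))
  sd(s)
}

N <- 100
X <- rexp(N, 1/2)          # true median = 2*log(2), about 1.39
theta_0 <- 1               # null value to be tested
theta_hat <- stat(X)
T_obs <- (theta_hat - theta_0)/boot_se(X)   # studentized via the inner bootstrap

B <- 499
T_boot <- rep(0, B)
for(b in 1:B){
  Xb <- sample(X, N, replace = TRUE)                 # outer (first-level) resample
  T_boot[b] <- (stat(Xb) - theta_hat)/boot_se(Xb)    # recentred at theta_hat, inner bootstrap se
}
(p_val <- (1 + sum(abs(T_boot) >= abs(T_obs)))/(B + 1))   # two-sided p-value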
Jackknife
The jackknife is a predecessor of the bootstrap that shares some similarities with the bootstrap as well as with cross-validation. It was originally proposed for bias estimation, and later (but still before the bootstrap) for variance estimation. The idea is to recalculate an estimator many times, always leaving out one observation at a time from the data set.
Let $X_1,\ldots,X_N$ be a random sample from a distribution depending on a parameter $\theta$. Let $\widehat{\theta}_N$ be an estimator of $\theta$ and let $\widehat{\theta}_{(n)}$ be the estimator of the same form calculated without the $n$-th observation. Consider the average of the leave-one-out estimators $\bar{\theta}_{(\cdot)} = \frac{1}{N}\sum_{n=1}^N \widehat{\theta}_{(n)}$.
The jackknife estimator of the bias is
\[ \widehat{\mathrm{bias}}_{jack} = (N-1)\big( \bar{\theta}_{(\cdot)} - \widehat{\theta}_N \big), \]
which has a similar form to the bootstrap estimator of the bias, but the factor $N-1$ is surprising at first glance. A heuristic justification for the factor is as follows. Assuming for simplicity that the same expansion of the bias holds for every sample size, i.e., that
\[ \mathbb{E}\,\widehat{\theta}_n = \theta + \frac{a}{n} + \frac{b}{n^2} + o(n^{-2}) \]
for some constants $a$ and $b$, we have
\[ \mathbb{E}\big( \bar{\theta}_{(\cdot)} - \widehat{\theta}_N \big) = a\Big( \frac{1}{N-1} - \frac{1}{N} \Big) + b\Big( \frac{1}{(N-1)^2} - \frac{1}{N^2} \Big) + o(N^{-2}) = \frac{a}{N(N-1)} + \mathcal{O}(N^{-3}). \]
Hence
\[ \mathbb{E}\,\widehat{\mathrm{bias}}_{jack} = \frac{a}{N} + \mathcal{O}(N^{-2}), \]
so $\widehat{\mathrm{bias}}_{jack}$ approximates the bias $a/N$ correctly up to the order $\mathcal{O}(N^{-2})$, which corresponds to the bootstrap. The bias-corrected estimator is then
\[ \widehat{\theta}_{jack} = \widehat{\theta}_N - \widehat{\mathrm{bias}}_{jack} = N\,\widehat{\theta}_N - (N-1)\,\bar{\theta}_{(\cdot)}. \]
Tukey defined the pseudo-values
\[ \widetilde{\theta}_n = N\,\widehat{\theta}_N - (N-1)\,\widehat{\theta}_{(n)}, \qquad n = 1,\ldots,N, \]
and conjectured that in some situations these can be treated as approximately i.i.d., each with approximately the same variance as $\sqrt{N}\,\widehat{\theta}_N$, and hence we can take
\[ \widehat{\mathrm{var}}_{jack}\big(\widehat{\theta}_N\big) = \frac{1}{N(N-1)} \sum_{n=1}^N \Big( \widetilde{\theta}_n - \frac{1}{N}\sum_{m=1}^N \widetilde{\theta}_m \Big)^2. \]
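A sketch of these formulas in R (the function name jackknife and the example statistic are mine): it returns the jackknife bias estimate, the bias-corrected estimator and the pseudo-value based variance estimate.

# jackknife bias and variance estimates for a generic statistic
jackknife <- function(X, statistic){
  N <- length(X)
  theta_hat <- statistic(X)
  theta_minus <- sapply(1:N, function(n) statistic(X[-n]))  # leave-one-out estimates
  theta_bar <- mean(theta_minus)
  bias <- (N - 1)*(theta_bar - theta_hat)                   # jackknife bias estimate
  pseudo <- N*theta_hat - (N - 1)*theta_minus               # Tukey's pseudo-values
  variance <- sum((pseudo - mean(pseudo))^2)/(N*(N - 1))    # jackknife variance estimate
  list(estimate = theta_hat, bias = bias,
       corrected = theta_hat - bias, variance = variance)
}

# usage: the plug-in variance estimator, whose bias -sigma^2/N is exactly removed here
set.seed(517)
X <- rnorm(50, sd = 2)
jackknife(X, function(x) mean((x - mean(x))^2))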
Similarly to the bootstrap, the jackknife has its theoretical version, which allows us to study when it works. Today, the jackknife is often considered outdated, replaced by the bootstrap, but it still has its place. In the example above, where the double bootstrap is used to perform a studentized hypothesis test when a variance estimator is not readily available, the second iteration of the bootstrap can be naturally replaced by the jackknife instead. Depending on the sample size $N$, this might lead to a better trade-off between accuracy and computational costs (note that the jackknife is generally cheaper than the bootstrap).
Final Thoughts
The name “bootstrap” refers to one of the absurd exploits of Baron Munchausen, who allegedly pulled himself out from the bottom of a lake by his own bootstraps. At first glance, it also seems absurd that by resampling (particularly with the non-parametric bootstrap) one could actually obtain better results compared to just using the data set once; after all, there cannot be more information in the resampled data sets. Nonetheless, the bootstrap can serve us well when direct analytic calculations are not available. In such cases, one often relies on the CLT and the asymptotic normality it provides. However, asymptotic normality is not a very sharp tool: the Gaussian approximation is always symmetric and thus it cannot work too well, e.g., for skewed distributions (of course, the skewness vanishes asymptotically, but in finite samples skewness is the main term describing the departure of the distribution of a given statistic from normality). A well set-up bootstrap can, on the other hand, provide an effective correction (e.g., for skewness) and thus a better approximation of the distribution of a given statistic than the normal approximation.
There is a tendency to use the bootstrap blindly, and there have been claims that the bootstrap (in combination with Monte Carlo simulation) makes mathematical skills obsolete in statistics. While the bootstrap can indeed reduce the mathematical finesse needed to solve a statistical problem in practice, e.g., to come up with a confidence interval or a test, verifying the coverage of such a confidence interval or the validity of such a test actually requires even higher mathematical skills.
We have not dived into such theoretical considerations, since the mathematical finesse required for them is far beyond the scope of this course. However, the folk knowledge is that the bootstrap works when the statistic possesses asymptotic normality with a non-degenerate variance (which generally requires some regularity or smoothness conditions). On the other hand, one has to be careful when working with order statistics, extremal statistics, non-smooth transformations, or non-i.i.d. regimes (e.g., time series or regression settings).
At the same time, it is fair to admit that the theoretical properties of the bootstrap are only asymptotic, and those may not serve as a good characterization of the finite sample performance. For example, it can be shown that the iterated bootstrap can reduce the asymptotic bias by an order increasing with the number of bootstrap iterations. However, more than two iterations are not practically viable. And even if they were, with more and more bootstrap iterations it would take a larger and larger sample size for the asymptotic rates to “kick in” and result in better finite sample performance. Thus simulation studies examining the finite sample performance of any bootstrap procedure are vital.
But the bottom line is that no bootstrap procedure (take the goodness-of-fit parametric bootstrap test above for an example) should just be embraced without suspicion. Ideally, both the asymptotic properties of a bootstrap procedure and its actual finite sample performance should be investigated. In reality, asymptotic considerations are hard, but careful simulation studies investigating the finite sample performance should always be performed.
References
Davison & Hinkley (2009) Bootstrap Methods and their Application