Optimal blocking is explored for experiments, such as those incorporating one or more controls, where not all treatment comparisons are of equal interest. Weighted optimality functions are employed in gaining both analytic and enumerative results; a catalogue of smaller optimal designs is provided. It is shown how design selection based on functions of variances, and on functions of efficiency factors, are both subsumed by the weighted approach.
The false discovery rate is a criterion for controlling Type I error in simultaneous testing of multiple hypotheses. For scanning statistics, due to local dependence, clusters of neighbouring hypotheses are likely to be rejected together. In such situations, it is more intuitive and informative to group neighbouring rejections together and count them as a single discovery, with the false discovery rate defined as the proportion of clusters that are falsely declared among all declared clusters. Assuming that the number of false discoveries, under this broader definition of a discovery, is approximately Poisson and independent of the number of true discoveries, we examine approaches for estimating and controlling the false discovery rate, and provide examples from biological applications.
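The cluster-level definition of a discovery can be illustrated with a minimal sketch. The function names `group_rejections` and `cluster_fdp`, the `max_gap` parameter, and the rule that a cluster counts as false when it overlaps no true signal position are all assumptions made for illustration; the abstract does not specify them.

```python
def group_rejections(rejected, max_gap=1):
    """Merge rejected positions into clusters of neighbouring rejections.

    Two rejections belong to the same cluster when their indices differ
    by at most max_gap (a hypothetical tuning choice).
    Returns a list of (start, end) index pairs, both inclusive.
    """
    clusters = []
    for i, is_rejected in enumerate(rejected):
        if not is_rejected:
            continue
        if clusters and i - clusters[-1][1] <= max_gap:
            clusters[-1][1] = i          # extend the current cluster
        else:
            clusters.append([i, i])      # start a new cluster
    return [tuple(c) for c in clusters]


def cluster_fdp(clusters, signal_positions):
    """Proportion of declared clusters that contain no true signal position."""
    if not clusters:
        return 0.0
    signal = set(signal_positions)
    n_false = sum(
        1 for start, end in clusters
        if not any(i in signal for i in range(start, end + 1))
    )
    return n_false / len(clusters)


# Twelve scanned positions; 1 marks a rejected hypothesis.
rejected = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]
clusters = group_rejections(rejected)
fdp = cluster_fdp(clusters, signal_positions=[10])
```

Here seven individual rejections collapse into three discoveries, and the false discovery proportion is computed over clusters rather than over single hypotheses.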
Instrumental variables are widely used for the identification of the causal effect of one random variable on another under unobserved confounding. The distribution of the observable variables for a discrete instrumental variable model satisfies certain inequalities but no conditional independence relations. Such models are usually tested by checking whether the relative frequency estimators of the parameters satisfy the constraints. This ignores sampling uncertainty in the data. Using the observable constraints for the instrumental variable model, a likelihood analysis is conducted. A significance test for its validity is developed, and a bootstrap algorithm for computing confidence intervals for the causal effect is proposed. Applications are given to illustrate the advantage of the suggested approach.
The existing theory of the wild bootstrap has focused on linear estimators. In this note, we broaden its validity by providing a class of weight distributions that is asymptotically valid for quantile regression estimators. As most weight distributions in the literature lead to biased variance estimates for nonlinear estimators of linear regression, we propose a modification of the wild bootstrap that admits a broader class of weight distributions for quantile regression. A simulation study on median regression is carried out to compare various bootstrap methods. With a simple finite-sample correction, the wild bootstrap is shown to account for general forms of heteroscedasticity in a regression model with fixed design points.
We propose a novel quantile regression approach for longitudinal data analysis which naturally incorporates auxiliary information from the conditional mean model to account for within-subject correlations. The efficiency gain is quantified theoretically and demonstrated empirically via simulation studies and the analysis of a real dataset.
We consider a cross-section model that contains an individual component, a deterministic time trend and an unobserved latent common time series component. We show the following oracle property: the parameters of the latent time series and the parameters of the deterministic time trend can be estimated with the same asymptotic accuracy as if the parameters of the individual component were known. We consider this model in two settings: least squares fits of linear specifications of the individual component and the parameters of the deterministic time trend and, more generally, quasilikelihood estimation in a generalized linear time series model.
We construct non-Gaussian processes that vary continuously in space and time with nonseparable covariance functions. Starting from a general and flexible way of constructing valid nonseparable covariance functions through mixing over separable covariance functions, the resulting models are generalized by allowing for outliers as well as regions with larger variances. We induce this through scale mixing with separate positive-valued processes. Smooth mixing processes are applied to the underlying correlated processes in space and in time, thus leading to regions in space and time of increased spread. An uncorrelated mixing process on the nugget effect accommodates outliers. Posterior and predictive Bayesian inference with these models is implemented through a Markov chain Monte Carlo sampler. An application to temperature data in the Basque country illustrates the potential of this model in the identification of outliers and regions with inflated variance, and shows that this improves the predictive performance.
In the study of intrinsically stationary spatial processes, a new nonparametric variogram estimator is proposed through its spectral representation. The methodology is based on estimation of the variogram’s spectrum by solving a regularized inverse problem through quadratic programming. The estimated variogram is guaranteed to be conditionally negative-definite. Simulation shows that our estimator is flexible and generally has smaller mean integrated squared error than the parametric estimator under model misspecification. Our methodology is applied to a spatial dataset of decadal temperature changes.
We propose a pivotal method for estimating high-dimensional sparse linear regression models, where the overall number of regressors p is large, possibly much larger than n, but only s regressors are significant. The method is a modification of the lasso, called the square-root lasso. The method is pivotal in that it neither relies on the knowledge of the standard deviation σ nor does it need to pre-estimate σ. Moreover, the method does not rely on normality or sub-Gaussianity of noise. It achieves near-oracle performance, attaining the convergence rate {(s/n) log p}^{1/2} in the prediction norm, and thus matching the performance of the lasso with known σ. These performance results are valid for both Gaussian and non-Gaussian errors, under some mild moment restrictions. We formulate the square-root lasso as a solution to a convex conic programming problem, which allows us to implement the estimator using efficient algorithmic methods, such as interior-point and first-order methods.
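For orientation, the square-root lasso objective and its conic form can be written out as follows. The abstract does not display these formulas; the notation (responses y_i, regressors x_i, penalty level λ, auxiliary variable t, and the split of β into positive and negative parts) is assumed here and follows the usual presentation of the estimator.

```latex
% Square-root lasso: the lasso loss is replaced by its square root.
\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p}
  \Big\{ \hat Q(\beta)^{1/2} + \frac{\lambda}{n} \lVert \beta \rVert_1 \Big\},
\qquad
\hat Q(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^{\top}\beta)^2 .

% Equivalent conic program, writing \beta = \beta^{+} - \beta^{-}:
\min_{t,\ \beta^{+},\ \beta^{-}}
  \ \frac{t}{n^{1/2}} + \frac{\lambda}{n} \sum_{j=1}^{p} (\beta_j^{+} + \beta_j^{-})
\quad \text{subject to} \quad
  \lVert y - X(\beta^{+} - \beta^{-}) \rVert_2 \le t,
  \quad \beta^{+} \ge 0,\ \beta^{-} \ge 0 .
```

Because the loss is the square root of the residual mean square, rescaling the noise rescales both the loss and its gradient-based penalty requirement together, which is why λ can be chosen without knowing σ.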
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
We propose a simple forward adaptive banding method for estimating large covariance matrices using the modified Cholesky decomposition. This approach requires the fitting of a prespecified set of models due to the adaptive banding structure and can be efficiently implemented. Besides being computationally attractive, the approach comes with a novel Bayes information criterion that gives consistent model selection for estimating high-dimensional covariance matrices. The method compares favourably to its competitors in a simulation study.
In this paper, we consider clustered right-censored time-to-event data. Such data can be analysed either using a marginal model if one is interested in population effects or using so-called frailty models if one is interested in covariate effects on the individual level and in estimation of correlation. The Cox frailty model has been studied extensively in the last decade or so and estimation techniques and large sample results are now available. It is, however, difficult to deal with time-changing covariate effects when using the Cox model. An appealing alternative model is the Aalen additive hazards model, in which it is easy to work with time dynamics. In this paper, we describe an innovative approach to estimation in the Aalen additive gamma frailty hazards model. We give the large sample properties of the estimators and investigate their small sample properties by Monte Carlo simulation. A real example is provided for illustration.
It is a challenge to evaluate experimental treatments where it is suspected that the treatment effect may only be strong for certain subpopulations, such as those having a high initial severity of disease, or those having a particular gene variant. Standard randomized controlled trials can have low power in such situations. They also are not optimized to distinguish which subpopulations benefit from a treatment. With the goal of overcoming these limitations, we consider randomized trial designs in which the criteria for patient enrollment may be changed, in a preplanned manner, based on interim analyses. Since such designs allow data-dependent changes to the population enrolled, care must be taken to ensure strong control of the familywise Type I error rate. Our main contribution is a general method for constructing randomized trial designs that allow changes to the population enrolled based on interim data using a prespecified decision rule, for which the asymptotic, familywise Type I error rate is strongly controlled at a specified level α. As a demonstration of our method, we prove new, sharp results for a simple, two-stage enrichment design. We then compare this design to fixed designs, focusing on each design’s ability to determine the overall and subpopulation-specific treatment effects.
Observational studies in which the effect of a nonrandomized treatment on an outcome of interest is estimated are common in domains such as labour economics and epidemiology. Such studies often rely on an assumption of unconfounded treatment when controlling for a given set of observed pre-treatment covariates. The choice of covariates to control in order to guarantee unconfoundedness should primarily be based on subject matter theories, although the latter typically give only partial guidance. It is tempting to include many covariates in the controlling set to try to make the assumption of an unconfounded treatment realistic. Including unnecessary covariates is suboptimal when the effect of a binary treatment is estimated nonparametrically. For instance, when using an n^{1/2}-consistent estimator, a loss of efficiency may result from using covariates that are irrelevant for the unconfoundedness assumption. Moreover, bias may dominate the variance when many covariates are used. Embracing the Neyman–Rubin model typically used in conjunction with nonparametric estimators of treatment effects, we characterize subsets from the original reservoir of covariates that are minimal in the sense that the treatment ceases to be unconfounded given any proper subset of these minimal sets. These subsets of covariates are shown to be identified under mild assumptions. These results lead us to propose data-driven algorithms for the selection of minimal sets of covariates.
We study goodness-of-fit tests for logistic regression models for case-control data when some covariates are measured with error. We first study the applicability of traditional test methods for this problem, simply ignoring measurement error, and show that in some scenarios they are effective despite the inconsistency of the parameter estimators. We then develop a test procedure based on work of Zhang (2001) that can simultaneously test the validity of logistic regression and correct the bias in parameter estimators for case-control data with nondifferential classical additive normal measurement error. Instead of using the information matrix considered by Zhang (2001), our test statistic uses preselected functions to reduce dimensionality. Simulation studies and an application illustrate its usefulness.
We use p-values to identify the threshold level at which a regression function leaves its baseline value, a problem motivated by applications in toxicological and pharmacological dose-response studies and environmental statistics. We study the problem in two sampling settings: one where multiple responses can be obtained at a number of different covariate levels, and the other the standard regression setting involving a limited number of response values at each covariate. Our procedure involves testing the hypothesis that the regression function is at its baseline at each covariate value and then computing the potentially approximate p-value of the test. An estimate of the threshold is obtained by fitting a piecewise constant function with a single jump discontinuity, known as a stump, to these observed p-values, as they behave in markedly different ways on the two sides of the threshold. The estimate is shown to be consistent and its finite sample properties are studied through simulations. Our approach is computationally simple and extends to estimation of the baseline value of the regression function, to heteroscedastic errors and to time series. It is illustrated on some real data applications.
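The stump-fitting step can be sketched as a small least-squares search. This is a generic illustration, not necessarily the paper's exact criterion: the function name `fit_stump`, the free estimation of both levels, the assumption of distinct covariate values, and the midpoint convention for the jump location are all assumptions.

```python
def fit_stump(x, p):
    """Least-squares fit of a one-jump piecewise constant function to p-values.

    x : covariate values (assumed distinct), p : p-values observed at x.
    Returns (threshold, left_level, right_level), where threshold is the
    midpoint between the two covariate values adjacent to the best split.
    """
    pairs = sorted(zip(x, p))
    xs = [a for a, _ in pairs]
    ps = [b for _, b in pairs]
    n = len(ps)
    best = None
    for k in range(1, n):                      # split: first k points on the left
        left, right = ps[:k], ps[k:]
        m_left = sum(left) / k
        m_right = sum(right) / (n - k)
        sse = (sum((v - m_left) ** 2 for v in left)
               + sum((v - m_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, (xs[k - 1] + xs[k]) / 2.0, m_left, m_right)
    return best[1], best[2], best[3]


# Toy data: p-values scatter around 1/2 at baseline, then collapse towards 0.
x = list(range(10))
p = [0.5, 0.4, 0.6, 0.55, 0.45, 0.01, 0.02, 0.0, 0.01, 0.03]
threshold, left_level, right_level = fit_stump(x, p)
```

The search is quadratic in the number of covariate values, which is adequate for a sketch; running sums would make it linear.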
This paper deals with the dimension reduction of high-dimensional time series based on a lower-dimensional factor process. In particular, we allow the dimension of time series N to be as large as, or even larger than, the length of observed time series T. The estimation of the factor loading matrix and the factor process itself is carried out via an eigenanalysis of an N × N non-negative definite matrix. We show that when all the factors are strong in the sense that the norm of each column in the factor loading matrix is of the order N^{1/2}, the estimator of the factor loading matrix is weakly consistent in L2-norm with the convergence rate independent of N. Thus the curse is cancelled out by the blessing of dimensionality. We also establish the asymptotic properties of the estimators when factors are not strong. The proposed method together with the asymptotic properties are illustrated in a simulation study. An application to an implied volatility data set, with a trading strategy derived from the fitted factor model, is also reported.
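A minimal sketch of loading estimation by eigenanalysis is given below. The abstract does not say which non-negative definite matrix is used; the sketch assumes it is the sum over lags 1 to k0 of products of sample autocovariance matrices with their transposes. The names `estimate_loadings`, `simulate`, and `k0`, and the simulated one-factor AR(1) model, are hypothetical.

```python
import numpy as np


def estimate_loadings(y, r, k0=2):
    """Estimate an N x r loading matrix from a T x N series y.

    Assumed construction: M = sum_{k=1}^{k0} S_k S_k^T, where S_k is the
    lag-k sample autocovariance matrix. The columns returned are the
    eigenvectors of M for its r largest eigenvalues.
    """
    T, N = y.shape
    yc = y - y.mean(axis=0)
    M = np.zeros((N, N))
    for k in range(1, k0 + 1):
        S_k = yc[k:].T @ yc[:-k] / (T - k)   # lag-k autocovariance, N x N
        M += S_k @ S_k.T                     # each term is non-negative definite
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :r]


def simulate(T=500, N=10, seed=0):
    """One strong AR(1) factor with loading vector of norm N^{1/2}, plus white noise."""
    rng = np.random.default_rng(seed)
    a = np.ones(N)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = 0.8 * x[t - 1] + rng.standard_normal()
    y = np.outer(x, a) + rng.standard_normal((T, N))
    return y, a


y, a = simulate()
A_hat = estimate_loadings(y, r=1)
x_hat = y @ A_hat                            # estimated factor process, T x 1
cosine = abs(A_hat[:, 0] @ a) / np.linalg.norm(a)
```

Loadings are identified only up to sign and scale, so the comparison with the true loading vector uses the absolute cosine between directions rather than entrywise error.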
When testing geometrically irregular parametric hypotheses, the bootstrap is an intuitively appealing method to circumvent difficult distribution theory. It has been shown, however, that the usual bootstrap is inconsistent in estimating the asymptotic distributions involved in such problems. This paper is concerned with the asymptotic size of likelihood ratio tests when critical values are computed using the inconsistent bootstrap. We clarify how the asymptotic size of such a test can be obtained from the size of the corresponding bootstrap test in the relevant limiting normal experiment. For boundary problems, that is, hypotheses given by convex cones, we show the bootstrap test to always be anticonservative, and we compute the size numerically for different two-dimensional examples. The examples illustrate that the size can be below or above the nominal level, and reveal that the relationship between the size of the test and the geometry of the considered hypotheses is surprisingly subtle.