Summary Treatment-selection markers are biological molecules or patient characteristics associated with one’s response to treatment. They can be used to predict treatment effects for individual subjects and subsequently help deliver treatment to those most likely to benefit from it. Statistical tools are needed to evaluate a marker’s capacity to help with treatment selection. The commonly adopted criterion for a good treatment-selection marker has been the interaction between marker and treatment. While a strong interaction is important, it is, however, not sufficient for good marker performance. In this article, we develop novel measures for assessing a continuous treatment-selection marker, based on a potential outcomes framework. Under a set of assumptions, we derive the optimal decision rule based on the marker to classify individuals according to treatment benefit, and characterize the marker’s performance using the corresponding classification accuracy as well as the overall distribution of the classifier. We develop a constrained maximum-likelihood method for estimation and testing in a randomized trial setting. Simulation studies are conducted to demonstrate the performance of our methods. Finally, we illustrate the methods using an HIV vaccine trial where we explore the value of the level of preexisting immunity to adenovirus serotype 5 for predicting a vaccine-induced increase in the risk of HIV acquisition.
Summary Knowing which populations are most at risk for severe outcomes from an emerging infectious disease is crucial in deciding the optimal allocation of resources during an outbreak response. The case fatality ratio (CFR) is the fraction of cases that die after contracting a disease. The relative CFR is the factor by which the case fatality in one group is greater or less than that in a second group. Incomplete reporting of the number of infected individuals, both recovered and dead, can lead to biased estimates of the CFR. We define conditions under which the CFR and the relative CFR are identifiable. Furthermore, we propose an estimator for the relative CFR that controls for time-varying reporting rates. We generalize our methods to account for elapsed time between infection and death. To demonstrate the new methodology, we use data from the 1918 influenza pandemic to estimate relative CFRs between counties in Maryland. A simulation study evaluates the performance of the methods in outbreak scenarios. An R software package makes the methods and data presented here freely available. Our work highlights the limitations and challenges associated with estimating absolute and relative CFRs in practice. However, in certain situations, the methods presented here can help identify vulnerable subpopulations early in an outbreak of an emerging pathogen such as pandemic influenza.
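The two definitions above, and the effect of incomplete case reporting, can be illustrated with a small calculation. This is only a sketch under stated assumptions: all counts are hypothetical, deaths are assumed fully reported, and the case reporting rate is assumed identical in both groups. The function names are illustrative and are not taken from the authors' R package.

```python
def naive_cfr(reported_deaths, reported_cases):
    """Fraction of reported cases that died (the naive CFR estimate)."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return reported_deaths / reported_cases


def naive_relative_cfr(deaths_a, cases_a, deaths_b, cases_b):
    """Factor by which the naive CFR in group A exceeds that in group B."""
    return naive_cfr(deaths_a, cases_a) / naive_cfr(deaths_b, cases_b)


# Hypothetical true counts for two groups.
true_cases_a, deaths_a = 1000, 30
true_cases_b, deaths_b = 2000, 20

# Assumption: only half of all cases are reported, in both groups alike,
# while every death is reported.
case_reporting_rate = 0.5
obs_cases_a = true_cases_a * case_reporting_rate
obs_cases_b = true_cases_b * case_reporting_rate

true_cfr_a = naive_cfr(deaths_a, true_cases_a)   # 30 / 1000
obs_cfr_a = naive_cfr(deaths_a, obs_cases_a)     # 30 / 500, biased upward

true_rel = naive_relative_cfr(deaths_a, true_cases_a, deaths_b, true_cases_b)
obs_rel = naive_relative_cfr(deaths_a, obs_cases_a, deaths_b, obs_cases_b)

print(true_cfr_a, obs_cfr_a)
print(true_rel, obs_rel)
```

Under these assumptions the absolute CFR is overstated while the ratio is unchanged, because the common reporting rate cancels. When reporting rates differ between groups or change over time, this cancellation no longer holds.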
Summary DNA methylation has emerged as an important hallmark of epigenetics. Numerous platforms, including tiling arrays and next-generation sequencing, and experimental protocols are available for profiling DNA methylation. As with other tiling array data, DNA methylation data exhibit an inherent correlation structure among nearby probes. However, unlike in gene expression or protein–DNA binding data, the varying CpG density, which gives rise to the CpG island, shore, and shelf definitions, provides exogenous information for detecting differential methylation. This article introduces a robust testing and probe ranking procedure based on a nonhomogeneous hidden Markov model that incorporates these features for detecting differential methylation. We revisit the seminal work of Sun and Cai (2009, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 393–424) and propose modeling the nonnull using a nonparametric symmetric distribution in two-sided hypothesis testing. We show through extensive simulation studies that this model improves probe ranking and is robust to model misspecification. We further illustrate that our proposed framework achieves good operating characteristics compared with commonly used methods when detecting differential methylation sites in real DNA methylation data.
Using a new type of array technology, the reverse phase protein array (RPPA), we measure time-course protein expression for a set of selected markers that are known to coregulate biological functions in a pathway structure. To accommodate the complex dependent nature of the data, including temporal correlation and pathway dependence for the protein markers, we propose a mixed effects model with temporal and protein-specific components. We develop a sequence of random probability measures (RPM) to account for the dependence in time of the protein expression measurements. Marginally, for each RPM we assume a Dirichlet process model. The dependence is introduced by defining multivariate beta distributions for the unnormalized weights of the stick-breaking representation. We also acknowledge the pathway dependence among proteins via a conditionally autoregressive model. Applying our model to the RPPA data, we reveal a pathway-dependent functional profile for the set of proteins as well as marginal expression profiles over time for individual markers.
Array-based group-testing algorithms for case identification are widely used in infectious disease testing, drug discovery, and genetics. In this article, we generalize previous statistical work in array testing to account for heterogeneity among individuals being tested. We first derive closed-form expressions for the expected number of tests (efficiency) and misclassification probabilities (sensitivity, specificity, predictive values) for two-dimensional array testing in a heterogeneous population. We then propose two “informative” array construction techniques which exploit population heterogeneity in ways that can substantially improve testing efficiency when compared to classical approaches that regard the population as homogeneous. Furthermore, a useful byproduct of our methodology is that misclassification probabilities can be estimated on a per-individual basis. We illustrate our new procedures using chlamydia and gonorrhea testing data collected in Nebraska as part of the Infertility Prevention Project.
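The efficiency measure mentioned above (expected number of tests) can be explored with a short simulation. The sketch below assumes a perfect assay, no initial master pool, and a simple protocol in which every row pool and column pool is tested and individuals at the intersection of a positive row and a positive column are retested individually. It does not reproduce the authors' closed-form expressions or their informative array constructions, and the function names are hypothetical.

```python
import random


def array_test_count(status):
    """Number of tests used for one array of true 0/1 infection statuses.

    Assumes perfect assays: a pool is positive exactly when it contains
    at least one positive individual.
    """
    n_rows, n_cols = len(status), len(status[0])
    positive_rows = sum(1 for row in status if any(row))
    positive_cols = sum(
        1 for j in range(n_cols) if any(row[j] for row in status)
    )
    pooled_tests = n_rows + n_cols
    retests = positive_rows * positive_cols  # row/column intersections
    return pooled_tests + retests


def mean_tests(prob_matrix, reps=2000, seed=1):
    """Monte Carlo estimate of the expected number of tests when
    individual (i, j) is positive with probability prob_matrix[i][j]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        status = [[int(rng.random() < p) for p in row] for row in prob_matrix]
        total += array_test_count(status)
    return total / reps


# Hypothetical heterogeneous risks in a 3 x 3 array.
risks = [[0.01, 0.02, 0.05],
         [0.01, 0.10, 0.02],
         [0.03, 0.01, 0.20]]
print(mean_tests(risks))
```

Rearranging the same risk probabilities within the array and comparing the resulting averages gives a rough feel for why the placement of high-risk individuals can matter for efficiency.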
We provide methods that can be used to obtain more accurate environmental exposure assessment. In particular, we propose two modeling approaches to combine monitoring data at point level with numerical model output at grid cell level, yielding improved prediction of ambient exposure at point level. Extending our earlier downscaler model (Berrocal, V. J., Gelfand, A. E., and Holland, D. M. (2010b). A spatio-temporal downscaler for outputs from numerical models. Journal of Agricultural, Biological and Environmental Statistics 15, 176–197), these new models are intended to address two potential concerns with the model output. One recognizes that there may be useful information in the outputs for grid cells that are neighbors of the one in which the location lies. The second acknowledges potential spatial misalignment between a station and its putatively associated grid cell.
The first model is a Gaussian Markov random field smoothed downscaler that relates monitoring station data and computer model output via the introduction of a latent Gaussian Markov random field linked to both sources of data. The second model is a smoothed downscaler with spatially varying random weights defined through a latent Gaussian process and an exponential kernel function, that yields, at each site, a new variable on which the monitoring station data is regressed with a spatial linear model. We applied both methods to daily ozone concentration data for the Eastern US during the summer months of June, July and August 2001, obtaining, respectively, a 5% and a 15% predictive gain in overall predictive mean square error over our earlier downscaler model (Berrocal et al., 2010b). Perhaps more importantly, the predictive gain is greater at hold-out sites that are far from monitoring sites.
Summary The median failure time is often utilized to summarize survival data because it has a more straightforward interpretation for investigators in practice than the popular hazard function. However, existing methods for comparing median failure times for censored survival data either require estimation of the probability density function or involve complicated formulas to calculate the variance of the estimates. In this article, we modify a K-sample median test for censored survival data (Brookmeyer and Crowley, 1982, Journal of the American Statistical Association 77, 433–440) through a simple contingency table approach where each cell counts the number of observations in each sample that lie above or below the pooled median. Under censoring, this approach would generate noninteger entries for the cells in the contingency table. We propose to construct a weighted asymptotic test statistic that aggregates dependent χ²-statistics formed at the nearest integer points to the original noninteger entries. We show that this statistic follows approximately a χ²-distribution with K − 1 degrees of freedom. For a small sample case, we propose a test statistic based on combined p-values from Fisher’s exact tests, which follows a χ²-distribution with 2 degrees of freedom. Simulation studies are performed to show that the proposed method provides reasonable type I error probabilities and power. The proposed method is illustrated with two real datasets from phase III breast cancer clinical trials.
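The contingency table idea can be sketched for the simpler uncensored case, where all cell entries are integers. The code below builds the 2 × K table of counts above and at-or-below the pooled median and computes the usual Pearson χ² statistic. It is an illustration only: it does not handle censoring and does not implement the weighted statistic proposed above. The function name and data are hypothetical.

```python
from statistics import median


def median_test_statistic(samples):
    """Pearson chi-square statistic for a K-sample median test
    on uncensored data (compare with K - 1 degrees of freedom)."""
    pooled = [x for sample in samples for x in sample]
    pooled_median = median(pooled)

    # One column per sample: counts above, and at or below, the pooled median.
    above = [sum(1 for x in sample if x > pooled_median) for sample in samples]
    below = [len(sample) - a for sample, a in zip(samples, above)]

    n = len(pooled)
    total_above = sum(above)
    total_below = n - total_above
    if total_above == 0 or total_below == 0:
        raise ValueError("all observations fall on one side of the median")

    stat = 0.0
    for sample, a, b in zip(samples, above, below):
        expected_above = len(sample) * total_above / n
        expected_below = len(sample) * total_below / n
        stat += (a - expected_above) ** 2 / expected_above
        stat += (b - expected_below) ** 2 / expected_below
    return pooled_median, above, below, stat


# Hypothetical failure times for K = 2 groups.
result = median_test_statistic([[1, 2, 3, 4], [5, 6, 7, 8]])
print(result)
```

With censored observations the counts above the pooled median are no longer known exactly, which is where the noninteger entries described above arise.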
Linking information on a movement network with space–time data on disease incidence is one of the key challenges in infectious disease epidemiology. In this article, we propose and compare two statistical frameworks for this purpose, namely, parameter-driven (PD) and observation-driven (OD) models. Bayesian inference in PD models is done using integrated nested Laplace approximations, while OD models can be easily fitted with existing software using maximum likelihood. The predictive performance of both formulations is assessed using proper scoring rules. As a case study, the impact of cattle trade on the spatiotemporal spread of Coxiellosis in Swiss cows, 2004–2009, is finally investigated.
Temporal boundary misalignment occurs when area boundaries shift across time (e.g., census tract boundaries change at each census year), complicating the modeling of temporal trends across space. Large area-level datasets with temporal boundary misalignment are becoming increasingly common in practice. The few existing approaches for temporally misaligned data do not account for correlation in spatial random effects over time. To overcome issues associated with temporal misalignment, we construct a geostatistical model for aggregate count data by assuming that an underlying continuous risk surface induces spatial correlation between areas. We implement the model within the framework of a generalized linear mixed model using radial basis splines. Using this approach, boundary misalignment becomes a nonissue. Additionally, this disease-mapping framework facilitates fast, easy model fitting by using a penalized quasilikelihood approximation to maximum likelihood estimation. We anticipate that the method will also be useful for large disease-mapping datasets for which fully Bayesian approaches are infeasible. We apply our method to assess socioeconomic trends in breast cancer incidence in Los Angeles between the periods 1988–1992 and 1998–2002.
Summary In survival analyses of medical studies, there are often long-term survivors who can be considered permanently cured. The goals in these studies are to estimate the noncured probability of the whole population and the hazard rate of the susceptible subpopulation. When covariates are present, as often happens in practice, understanding covariate effects on the noncured probability and hazard rate is equally important. Existing methods are limited to parametric and semiparametric models. We propose a two-component mixture cure rate model with nonparametric forms for both the cure probability and the hazard rate function. Identifiability of the model is guaranteed by an additive assumption that allows no time–covariate interactions in the logarithm of the hazard rate. Estimation is carried out by an expectation–maximization algorithm that maximizes a penalized likelihood. For inference, we apply the Louis formula to obtain point-wise confidence intervals for the noncured probability and hazard rate. Asymptotic convergence rates of our function estimates are established. We then evaluate the proposed method by extensive simulations. We analyze the survival data from a melanoma study and find interesting patterns for this study.
Summary. We consider the linear regression of outcome Y on regressors W and Z with some values of W missing, when our main interest is the effect of Z on Y, controlling for W. Three common approaches to regression with missing covariates are (i) complete-case analysis (CC), which discards the incomplete cases; (ii) ignorable likelihood methods, which base inference on the likelihood of the observed data, assuming the missing data are missing at random (Rubin, 1976b); and (iii) nonignorable modeling, which posits a joint distribution of the variables and missing data indicators. Another simple practical approach that has not received much theoretical attention is to drop the regressor variables containing missing values from the regression modeling (DV, for drop variables). DV does not lead to bias when either (i) the regression coefficient of W is zero or (ii) W and Z are uncorrelated. We propose a pseudo-Bayesian approach for regression with missing covariates that compromises between the CC and DV estimates, exploiting information in the incomplete cases when the data support DV assumptions. We illustrate favorable properties of the method by simulation, and apply the proposed method to a liver cancer study. Extension of the method to more than one missing covariate is also discussed.
We establish a connection between a class of chain-binomial models of use in ecology and epidemiology and binomial autoregressive (AR) processes. New results are obtained for the latter, including expressions for the lag-h conditional distribution and related quantities. We focus on two types of chain-binomial model, extinction–colonization and colonization–extinction models, and present two approaches to parameter estimation. The asymptotic distributions of the resulting estimators are studied, as well as their finite-sample performance, and we give an application to real data. A connection is made with standard AR models, which also has implications for parameter estimation.
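A minimal simulation can make the link between a patch-occupancy chain and a binomial AR(1) recursion concrete. The sketch assumes a simplified one-step transition with fixed probabilities: each of the x occupied patches stays occupied with probability `survive`, and each of the n − x empty patches becomes occupied with probability `colonize`. These parameter names are hypothetical, and the two-phase extinction–colonization and colonization–extinction models discussed above are not reproduced in detail.

```python
import random


def binomial_draw(rng, n, p):
    """Draw from Binomial(n, p) as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def step(rng, x, n, survive, colonize):
    """One transition: thin the occupied patches, then colonize empty ones."""
    return binomial_draw(rng, x, survive) + binomial_draw(rng, n - x, colonize)


def conditional_mean(x, n, survive, colonize):
    """E[X_{t+1} | X_t = x], which is linear in x as in an AR(1) model."""
    return survive * x + colonize * (n - x)


def simulate(rng, x0, n, survive, colonize, steps):
    """Simulate a path of occupied-patch counts starting from x0."""
    path = [x0]
    for _ in range(steps):
        path.append(step(rng, path[-1], n, survive, colonize))
    return path


rng = random.Random(2024)
print(simulate(rng, x0=4, n=10, survive=0.8, colonize=0.3, steps=15))
print(conditional_mean(4, 10, 0.8, 0.3))
```

The linear form of the conditional mean is what gives the process its autoregressive character, while the state stays confined to the integers 0, ..., n.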
To develop more targeted intervention strategies, an important research goal is to identify markers predictive of clinical events. A crucial step toward this goal is to characterize the clinical performance of a marker for predicting different types of events. In this article, we present statistical methods for evaluating the performance of a prognostic marker in predicting multiple competing events. To capture the potential time-varying predictive performance of the marker and incorporate competing risks, we define time- and cause-specific accuracy summaries by stratifying cases based on causes of failure. Such a definition allows one to evaluate the predictive accuracy of a marker for each type of event and compare its predictiveness across event types. Extending the nonparametric crude cause-specific receiver operating characteristics curve estimators by Saha and Heagerty (2010), we develop inference procedures for a range of cause-specific accuracy summaries. To estimate the accuracy measures and assess how covariates may affect the accuracy of a marker under the competing risk setting, we consider two forms of semiparametric models through the cause-specific hazard framework. These approaches enable flexible modeling of the relationships between the marker and failure times for each cause, while efficiently accommodating additional covariates. We investigate the asymptotic properties of the proposed accuracy estimators and demonstrate the finite sample performance of these estimators through simulation studies. The proposed procedures are illustrated with data from a prostate cancer prognostic study.
Summary Case–parent trio studies concerned with children affected by a disease and their parents aim to detect single nucleotide polymorphisms (SNPs) showing a preferential transmission of alleles from the parents to their affected offspring. A popular statistical test for detecting such SNPs associated with disease in this study design is the genotypic transmission/disequilibrium test (gTDT) based on a conditional logistic regression model, which usually needs to be fitted by an iterative procedure. In this article, we derive exact closed-form solutions for the parameter estimates of the conditional logistic regression models when testing for an additive, a dominant, or a recessive effect of a SNP, and show that such analytic parameter estimates also exist when considering gene–environment interactions with binary environmental variables. Because the genetic model underlying the association between a SNP and a disease is typically unknown, it might further be beneficial to use the maximum over the gTDT statistics for the possible effects of a SNP as test statistic. We therefore propose a procedure enabling a fast computation of the test statistic and the permutation-based p-value of this MAX gTDT. All these methods are applied to whole-genome scans of the case–parent trios from the International Cleft Consortium. These applications show our procedures dramatically reduce the required computing time compared to the conventional iterative methods allowing, for example, the analysis of hundreds of thousands of SNPs in a few minutes instead of several hours.
Summary Phase II trials in oncology are usually conducted as single-arm two-stage designs with binary endpoints. Currently available adaptive design methods are tailored to comparative studies with continuous test statistics. Direct transfer of these methods to discrete test statistics results in conservative procedures and, therefore, in a loss of power. We propose a method based on the conditional error function principle that directly accounts for the discreteness of the outcome. We show how the method can be used to construct new phase II designs that are more efficient than currently applied designs and that allow flexible mid-course design modifications. The proposed method is illustrated with a variety of frequently used phase II designs.
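The operating characteristics of a conventional single-arm two-stage design with a binary endpoint can be computed exactly from binomial probabilities. The sketch below assumes a standard futility-only rule: stop after stage 1 if at most `r1` responses are seen among `n1` patients, otherwise continue to `n` patients and declare the treatment promising if more than `r` responses are seen in total. The design values are illustrative only, and the code does not implement the conditional error function method proposed above.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_stage_characteristics(r1, n1, r, n, p):
    """Return (probability of early termination,
               probability of declaring the treatment promising,
               expected sample size) at true response rate p."""
    n2 = n - n1
    early_stop = sum(binom_pmf(x1, n1, p) for x1 in range(r1 + 1))

    promising = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        needed = r + 1 - x1  # minimum stage-2 responses required
        if needed <= 0:
            tail = 1.0
        else:
            tail = sum(binom_pmf(x2, n2, p) for x2 in range(needed, n2 + 1))
        promising += binom_pmf(x1, n1, p) * tail

    expected_n = n1 + (1 - early_stop) * n2
    return early_stop, promising, expected_n


# Illustrative design: r1/n1 = 1/10, r/n = 5/29,
# null response rate 0.10, alternative response rate 0.30.
pet0, alpha, en0 = two_stage_characteristics(1, 10, 5, 29, 0.10)
_, power, _ = two_stage_characteristics(1, 10, 5, 29, 0.30)
print(pet0, alpha, en0, power)
```

Because the test statistic is discrete, the attained type I error generally falls below the nominal level, which is the conservativeness that the abstract refers to.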
Summary The current statistical literature on causal inference is primarily concerned with population means of potential outcomes, while the current statistical practice also involves other meaningful quantities such as quantiles. Motivated by the Consortium on Safe Labor (CSL), a large observational study of obstetric labor progression, we propose and compare methods for estimating marginal quantiles of potential outcomes as well as quantiles among the treated. By adapting existing methods and techniques, we derive estimators based on outcome regression (OR), inverse probability weighting, and stratification, as well as a doubly robust (DR) estimator. By incorporating stratification into the DR estimator, we further develop a hybrid estimator with enhanced numerical stability at the expense of a slight bias under misspecification of the OR model. The proposed methods are illustrated with the CSL data and evaluated in simulation experiments mimicking the CSL.
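As a rough illustration of the inverse probability weighting idea for quantiles, the sketch below estimates a marginal quantile of the potential outcome under treatment by inverting a weighted empirical distribution function of the treated subjects, with weights equal to the inverse of the propensity score. It assumes the propensity scores are known, which is a simplification; in practice they would be estimated. The function name and data are hypothetical, and this is not claimed to be the exact estimator derived above.

```python
def ipw_quantile(outcomes, treated, propensity, tau):
    """Weighted tau-quantile of outcomes among treated subjects,
    using normalized inverse-propensity weights."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")

    # Keep treated subjects only; weight each by 1 / P(treated | covariates).
    pairs = sorted(
        (y, 1.0 / e)
        for y, t, e in zip(outcomes, treated, propensity)
        if t == 1
    )
    if not pairs:
        raise ValueError("no treated subjects")

    total_weight = sum(w for _, w in pairs)
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        # Smallest outcome whose weighted CDF reaches tau.
        if cumulative / total_weight >= tau:
            return y
    return pairs[-1][0]


# Hypothetical data: the last subject is untreated and is ignored.
y = [1, 2, 3, 9]
t = [1, 1, 1, 0]
e = [0.5, 0.5, 0.25, 0.5]
print(ipw_quantile(y, t, e, 0.5), ipw_quantile(y, t, e, 0.6))
```

Subjects with a low probability of treatment receive larger weights, so they stand in for similar subjects who were not treated.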
Summary It is common practice to analyze complex longitudinal data using semiparametric nonlinear mixed-effects (SNLME) models with a normal distribution. The normality assumption for model errors may unrealistically obscure important features of subject variations. To partially explain between- and within-subject variations, covariates are usually introduced in such models, but some covariates may often be measured with substantial errors. Moreover, the responses may be missing and the missingness may be nonignorable. Inferential procedures can become dramatically more complicated when data with skewness, missing values, and measurement error are observed. In the literature, there has been considerable interest in accommodating either skewness, incompleteness, or covariate measurement error in such models, but there has been relatively little study concerning all three features simultaneously. In this article, our objective is to address the simultaneous impact of skewness, missingness, and covariate measurement error by jointly modeling the response and covariate processes based on a flexible Bayesian SNLME model. The method is illustrated using a real AIDS data set to compare potential models with various scenarios and different distribution specifications.
Summary This article explores effective implementation of split-plot designs in serial dilution bioassay using robots. We show that the shortest path for a robot to fill plate wells for a split-plot design is equivalent to the shortest common supersequence problem in combinatorics. We develop an algorithm for finding the shortest common supersequence, provide an R implementation, and explore the distribution of the number of steps required to implement split-plot designs for bioassay through simulation. We also show how to construct collections of split plots that can be filled in a minimal number of steps, thereby demonstrating that split-plot designs can be implemented with nearly the same effort as strip-plot designs. Finally, we provide guidelines for modeling data that result from these designs.
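For readers unfamiliar with the shortest common supersequence problem, the textbook dynamic program for two sequences is sketched below. This is the classical two-sequence case only and is not the authors' algorithm or their R implementation. The inputs here are plain strings chosen for illustration, with no claim about how robot actions are encoded in the paper.

```python
def shortest_common_supersequence(a, b):
    """Return one shortest sequence containing both a and b as subsequences."""
    m, n = len(a), len(b)

    # length[i][j] = length of the shortest supersequence of a[i:] and b[j:].
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                length[i][j] = n - j
            elif j == n:
                length[i][j] = m - i
            elif a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = 1 + min(length[i + 1][j], length[i][j + 1])

    # Walk forward through the table to build one optimal supersequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] <= length[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def is_subsequence(short, long):
    """True if short can be obtained from long by deleting elements."""
    it = iter(long)
    return all(item in it for item in short)


scs = shortest_common_supersequence("AGGTAB", "GXTXAYB")
print("".join(scs), len(scs))
```

The table has (m + 1)(n + 1) entries, so the two-sequence case runs in time proportional to the product of the sequence lengths. Finding a shortest common supersequence of many sequences at once is a much harder problem.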
Summary The protein lysate array is an emerging technology for quantifying the protein concentration ratios in multiple biological samples. Statistical inference for a parametric quantification procedure has been inadequately addressed in the literature, mainly because the appropriate asymptotic theory involves a problem with the number of parameters increasing with the number of observations. In this article, we develop a multistep procedure for sigmoidal models, ensuring consistent estimation of the concentration levels with full asymptotic efficiency. The results obtained justify inferential procedures based on large sample approximations. Simulation studies and real data analysis are used to illustrate the performance of the proposed method in finite samples. The multistep procedure is convenient to work with asymptotically, and is recommended for its statistical efficiency in protein concentration estimation and improved numerical stability by focusing on optimization of lower-dimensional objective functions.
Summary Next-generation sequencing technologies are poised to revolutionize the field of biomedical research. The increased resolution of these data promises to provide a greater understanding of the molecular processes that control the morphology and behavior of a cell. However, the increased amounts of data require innovative statistical procedures that are powerful while still being computationally feasible. In this article, we present a method for identifying small RNA molecules, called miRNAs, which regulate genes by targeting their mRNAs for degradation or translational repression. In the first step of our modeling procedure, we apply an innovative dynamic linear model that identifies candidate miRNA genes in high-throughput sequencing data. The model is flexible and can accurately identify interesting biological features while accounting for the read count, read spacing, and sequencing depth. Additionally, miRNA candidates are processed using a modified Smith–Waterman sequence alignment that scores the regions for potential RNA hairpins, one of the defining features of miRNAs. We illustrate our method on simulated datasets as well as on a small RNA Caenorhabditis elegans dataset from the Illumina sequencing platform. These examples show that our method is highly sensitive for identifying known and novel miRNA genes.
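The standard Smith–Waterman local alignment score, on which the hairpin-scoring step builds, can be sketched as follows. The abstract does not describe the modification used, so the code shows only the textbook recursion with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    # h[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair_score = match if s[i - 1] == t[j - 1] else mismatch
            h[i][j] = max(
                0,                               # start a new alignment
                h[i - 1][j - 1] + pair_score,    # align s[i-1] with t[j-1]
                h[i - 1][j] + gap,               # gap in t
                h[i][j - 1] + gap,               # gap in s
            )
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The floor at zero is what makes the alignment local: poorly matching flanks are discarded instead of dragging the score down.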