Wiley InterScience : Biometrics
March 2, 2010 - 9:48am
The Joint United Nations Programme on HIV/AIDS (UNAIDS) has decided to use Bayesian melding as the basis for its probabilistic projections of HIV prevalence in countries with generalized epidemics. This combines a mechanistic epidemiological model, prevalence data, and expert opinion. Initially, the posterior distribution was approximated by sampling-importance-resampling, which is simple to implement, easy to interpret, transparent to users, and gave acceptable results for most countries. For some countries, however, this is not computationally efficient because the posterior distribution tends to be concentrated around nonlinear ridges and can also be multimodal. We propose instead incremental mixture importance sampling (IMIS), which iteratively builds up a better importance sampling function. This retains the simplicity and transparency of sampling importance resampling, but is much more efficient computationally. It also leads to a simple estimator of the integrated likelihood that is the basis for Bayesian model comparison and model averaging. In simulation experiments and on real data, it outperformed both sampling importance resampling and three publicly available generic Markov chain Monte Carlo algorithms for this kind of problem.
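The sampling-importance-resampling step mentioned in this abstract can be illustrated with a minimal sketch. It is not the UNAIDS implementation and not IMIS: it assumes a toy one-parameter problem (a prevalence p with a Uniform(0, 1) prior and a binomial likelihood), and the names `sir_sample` and `toy_log_likelihood` and all numbers are hypothetical.

```python
import math
import random


def sir_sample(prior_draw, log_likelihood, n_draws, n_resample, rng):
    """Sampling-importance-resampling with the prior as importance function."""
    # Step 1: draw candidate parameter values from the prior.
    draws = [prior_draw(rng) for _ in range(n_draws)]
    # Step 2: importance weights are proportional to the likelihood;
    # subtracting the maximum log-likelihood avoids numerical underflow.
    log_w = [log_likelihood(theta) for theta in draws]
    max_log_w = max(log_w)
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    # Step 3: resample with replacement in proportion to the weights.
    return rng.choices(draws, weights=weights, k=n_resample)


def toy_log_likelihood(p, positives=30, tested=100):
    """Binomial log-likelihood (up to a constant) for a prevalence p.

    The counts are made-up illustration values, not data from the article.
    """
    if p <= 0.0 or p >= 1.0:
        return float("-inf")
    return positives * math.log(p) + (tested - positives) * math.log(1.0 - p)


rng = random.Random(0)
posterior = sir_sample(lambda r: r.random(), toy_log_likelihood, 20000, 5000, rng)
posterior_mean = sum(posterior) / len(posterior)
print(round(posterior_mean, 3))
```

In this toy setting the exact posterior is Beta(31, 71), so the resampled mean should land near 31/102. When the posterior is concentrated on a narrow ridge, most prior draws receive negligible weight, which is the inefficiency the abstract attributes to this approach.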
March 2, 2010 - 9:45am
We present a Bayesian model to estimate the time-varying sensitivity of a diagnostic assay when the assay is given repeatedly over time, disease status is changing, and the gold standard is only partially observed. The model relies on parametric assumptions for the distribution of the latent time of disease onset and the time-varying sensitivity. Additionally, we illustrate the incorporation of historical data for constructing prior distributions. We apply the new methods to data collected in a study of mother-to-child transmission of HIV and include a covariate for sensitivity to assess whether two different assays have different sensitivity profiles.
March 2, 2010 - 9:44am
In estimation of the ROC curve, when the true disease status is subject to nonignorable missingness, the observed likelihood involves the missing mechanism given by a selection model. In this article, we proposed a likelihood-based approach to estimate the ROC curve and the area under the ROC curve when the verification bias is nonignorable. We specified a parametric disease model in order to make the nonignorable selection model identifiable. With the estimated verification and disease probabilities, we constructed four types of empirical estimates of the ROC curve and its area based on imputation and reweighting methods. In practice, a reasonably large sample size is required to estimate the nonignorable selection model in our settings. Simulation studies showed that all four estimators of ROC area performed well, and imputation estimators were generally more efficient than the other estimators proposed. We applied the proposed method to a data set from research in Alzheimer's disease.
March 2, 2010 - 9:42am
In this article, we propose a new generalized index to recover relationships between two sets of random vectors by finding the vector projections that minimize an L2 distance between each projected vector and an unknown function of the other. The unknown functions are estimated using the Nadaraya–Watson smoother. Extensions to multiple sets and groups of multiple sets are also discussed, and a bootstrap procedure is developed to detect the number of significant relationships. All the proposed methods are assessed through extensive simulations and real data analyses. In particular, for environmental data from Los Angeles County, we apply our multiple-set methodology to study relationships between mortality, weather, and pollutant vectors. Here, we detect the existence of both linear and nonlinear relationships between the dimension-reduced vectors, which are then used to build nonlinear time-series regression models for the dimension-reduced mortality vector. These findings also illustrate the potential use of our method in many other applications. A comprehensive assessment of our methodologies, along with their theoretical properties, is given in a Web Appendix.
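For readers unfamiliar with the smoother named in this abstract, the following is a minimal sketch of a one-dimensional Nadaraya–Watson estimate. The Gaussian kernel, the function name `nadaraya_watson`, and the bandwidth value are assumptions made for illustration; the abstract does not state which kernel or bandwidth the authors use.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys, with weights centred at x0.

    Uses a Gaussian kernel (an assumption of this sketch). If x0 is so far
    from every x that all weights underflow to zero, the division fails.
    """
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Points on the line y = x + 1, placed symmetrically around x0 = 0.
xs = [-1.0, 0.0, 1.0]
ys = [0.0, 1.0, 2.0]
estimate = nadaraya_watson(0.0, xs, ys, bandwidth=0.5)
```

Because the design points are symmetric around 0 and the responses are linear, the weighted average at 0 recovers the value of the line there.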
March 2, 2010 - 9:33am
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this article, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies.
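As background, here is a minimal sketch of a standard (uncorrected) diagonal linear discriminant rule of the kind this abstract sets out to improve. It is not the bias-corrected rule proposed in the article. The function names `fit_dlda` and `predict_dlda` and the toy data are hypothetical.

```python
import math


def fit_dlda(samples, labels):
    """Estimate class means, pooled per-feature variances, and class priors."""
    classes = sorted(set(labels))
    n, p = len(samples), len(samples[0])
    means, priors = {}, {}
    pooled_ss = [0.0] * p
    for c in classes:
        rows = [x for x, y in zip(samples, labels) if y == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        priors[c] = len(rows) / n
        for r in rows:
            for j in range(p):
                pooled_ss[j] += (r[j] - means[c][j]) ** 2
    # Diagonal covariance: one pooled variance per feature, no covariances.
    variances = [ss / (n - len(classes)) for ss in pooled_ss]
    return means, variances, priors


def predict_dlda(x, means, variances, priors):
    """Assign x to the class with the smallest discriminant score."""
    def score(c):
        distance = sum((xj - mj) ** 2 / vj
                       for xj, mj, vj in zip(x, means[c], variances))
        return distance - 2.0 * math.log(priors[c])
    return min(means, key=score)


# Toy data: two well-separated classes in two dimensions.
samples = [[0, 0], [1, 1], [0, 1], [1, 0], [5, 5], [6, 6], [5, 6], [6, 5]]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
means, variances, priors = fit_dlda(samples, labels)
```

The score plugs sample means and variances into a squared, variance-scaled distance; the bias in such plug-in scores is what the abstract refers to.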
February 16, 2010 - 5:44am
Combining data collected from different sources can potentially enhance statistical efficiency in estimating effects of environmental or genetic factors or gene–environment interactions. However, combining data across studies becomes complicated when data are collected under different study designs, such as family-based and unrelated individual-based case–control design. In this article, we describe likelihood-based approaches that permit the joint estimation of covariate effects on disease risk under study designs that include cases, relatives of cases, and unrelated individuals. Our methods accommodate familial residual correlation and a variety of ascertainment schemes. Extensive simulation experiments demonstrate that the proposed methods for estimation and inference perform well in realistic settings. Efficiencies of different designs are contrasted in the simulation. We applied the methods to data from the Colorectal Cancer Family Registry.
February 16, 2010 - 5:44am
Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row–column associations within high-dimensional data matrices. SSVD seeks a low-rank, checkerboard structured matrix approximation to data matrices. The desired checkerboard structure is achieved by forcing both the left- and right-singular vectors to be sparse, that is, having many zero entries. By interpreting singular vectors as regression coefficient vectors for certain linear regressions, sparsity-inducing regularization penalties are imposed on the least squares regression to produce sparse singular vectors. An efficient iterative algorithm is proposed for computing the sparse singular vectors, along with some discussion of penalty parameter selection. A lung cancer microarray dataset and a food nutrition dataset are used to illustrate SSVD as a biclustering method. SSVD is also compared with some existing biclustering methods using simulated datasets.
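The idea of alternating penalized regressions to obtain sparse singular vectors can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes a plain lasso-type soft threshold with fixed, hand-picked penalties and a crude all-ones starting vector, and the names `soft_threshold`, `normalize`, and `sparse_rank_one` are hypothetical.

```python
import math


def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; small entries become zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]


def normalize(values):
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("penalty too large: every entry was shrunk to zero")
    return [v / norm for v in values]


def sparse_rank_one(X, lam_u, lam_v, n_iter=50):
    """Alternate thresholded regressions for a sparse rank-one approximation."""
    n_rows, n_cols = len(X), len(X[0])
    # Assumed starting value; it fails if every column of X sums to zero.
    u = normalize([1.0] * n_rows)
    for _ in range(n_iter):
        # Regress the columns of X on u, then sparsify the coefficients.
        xtu = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        v = normalize(soft_threshold(xtu, lam_v))
        # Regress the rows of X on v, then sparsify the coefficients.
        xv = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        u = normalize(soft_threshold(xv, lam_u))
    s = sum(u[i] * X[i][j] * v[j] for i in range(n_rows) for j in range(n_cols))
    return u, s, v


# Toy matrix: a 2 x 2 block of signal in the top-left corner plus small noise.
u0, v0 = [1.0, 1.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0, 0.0]
X = [[u0[i] * v0[j] + 0.05 * (((i * 5 + j) % 3) - 1) for j in range(5)]
     for i in range(4)]
u, s, v = sparse_rank_one(X, lam_u=0.3, lam_v=0.3)
```

The nonzero entries of `u` and `v` mark the rows and columns of the recovered block, which is how sparse singular vectors yield a bicluster.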
February 16, 2010 - 5:43am
It is of great practical interest to simultaneously identify the important predictors that correspond to both the fixed and random effects components in a linear mixed-effects (LME) model. Typical approaches perform selection separately on each of the fixed and random effect components. However, changing the structure of one set of effects can lead to different choices of variables for the other set of effects. We propose simultaneous selection of the fixed and random factors in an LME model using a modified Cholesky decomposition. Our method is based on a penalized joint log likelihood with an adaptive penalty for the selection and estimation of both the fixed and random effects. It performs model selection by allowing fixed effects or standard deviations of random effects to be exactly zero. A constrained expectation–maximization algorithm is then used to obtain the final estimates. It is further shown that the proposed penalized estimator enjoys the Oracle property, in that, asymptotically, it performs as well as if the true model were known beforehand. We demonstrate the performance of our method based on a simulation study and a real data example.
February 16, 2010 - 5:42am
Time-varying, individual covariates are problematic in experiments with marked animals because the covariate can typically only be observed when each animal is captured. We examine three methods to incorporate time-varying, individual covariates of the survival probabilities into the analysis of data from mark-recapture-recovery experiments: deterministic imputation, a Bayesian imputation approach based on modeling the joint distribution of the covariate and the capture history, and a conditional approach considering only the events for which the associated covariate data are completely observed (the trinomial model). After describing the three methods, we compare results from their application to the analysis of the effect of body mass on the survival of Soay sheep (Ovis aries) on the Isle of Hirta, Scotland. Simulations based on these results are then used to make further comparisons. We conclude that the trinomial model and the Bayesian imputation method each perform best in different situations. If the capture and recovery probabilities are all high, then the trinomial model produces precise, unbiased estimators that do not depend on any assumptions regarding the distribution of the covariate. In contrast, the Bayesian imputation method performs substantially better when capture and recovery probabilities are low, provided that the specified model of the covariate is a good approximation to the true data-generating mechanism.
February 16, 2010 - 5:41am
Clustering is a widely used method for extracting useful information from gene expression data, where unknown correlation structures in genes are believed to persist even after normalization. Such correlation structures pose a great challenge to conventional clustering methods, such as the Gaussian mixture (GM) model, k-means (KM), and partitioning around medoids (PAM), which are not robust against general dependence within data. Here we use the exponential power mixture model to increase the robustness of clustering against general dependence and nonnormality of the data. An expectation–conditional maximization algorithm is developed to calculate the maximum likelihood estimators (MLEs) of the unknown parameters in these mixtures. The Bayesian information criterion is then employed to determine the numbers of components of the mixture. The MLEs are shown to be consistent under sparse dependence. Our numerical results indicate that the proposed procedure outperforms GM, KM, and PAM when there are strong correlations or non-Gaussian components in the data.
January 29, 2010 - 5:29am
ChIP-chip experiments are procedures that combine chromatin immunoprecipitation (ChIP) and DNA microarray (chip) technology to study a variety of biological problems, including protein–DNA interaction, histone modification, and DNA methylation. The most important feature of ChIP-chip data is that the intensity measurements of probes are spatially correlated because the DNA fragments are hybridized to neighboring probes in the experiments. We propose a simple, but powerful Bayesian hierarchical approach to ChIP-chip data through an Ising model with high-order interactions. The proposed method naturally takes into account the intrinsic spatial structure of the data and can be used to analyze data from multiple platforms with different genomic resolutions. The model parameters are estimated using the Gibbs sampler. The proposed method is illustrated using two publicly available data sets from Affymetrix and Agilent platforms, and compared with three alternative Bayesian methods, namely, Bayesian hierarchical model, hierarchical gamma mixture model, and Tilemap hidden Markov model. The numerical results indicate that the proposed method performs as well as the other three methods for the data from Affymetrix tiling arrays, but significantly outperforms the other three methods for the data from Agilent promoter arrays. In addition, we find that the proposed method has better operating characteristics in terms of sensitivities and false discovery rates under various scenarios.
January 22, 2010 - 8:04am
We introduce a correction for covariate measurement error in nonparametric regression applied to longitudinal binary data arising from a study on human sleep. The data have been surveyed to investigate the association of some hormonal levels and the probability of being asleep. The hormonal effect is modeled flexibly while we account for the error-prone measurement of its concentration in the blood and the longitudinal character of the data. We present a fully Bayesian treatment utilizing Markov chain Monte Carlo inference techniques, and also introduce block updating to improve sampling and computational performance in the binary case. Our model is partly inspired by the relevance vector machine with radial basis functions, where usually very few basis functions are automatically selected for fitting the data. In the proposed approach, we implement such data-driven complexity regulation by adopting the idea of Bayesian model averaging. Besides the general theory and the detailed sampling scheme, we also provide a simulation study for the Gaussian and the binary cases by comparing our method to the naive analysis ignoring measurement error. The results demonstrate a clear gain when using the proposed correction method, particularly for the Gaussian case with medium and large measurement error variances, even if the covariate model is misspecified.
January 22, 2010 - 8:02am
Distance sampling is a widely used methodology for assessing animal abundance. A key requirement of distance sampling is that samplers (lines or points) are placed according to a randomized design, which ensures that samplers are positioned independently of animals. Often samplers are placed along linear features such as roads, so that bias is expected if animals are not uniformly distributed with respect to distance from the linear feature. We present an approach for analyzing distance data from a survey when the samplers are points placed along a linear feature. Based on results from a simulation study and from a survey of Irish hares in Northern Ireland conducted from roads, we conclude that large bias may result if the position of samplers is not randomized, and analysis methods fail to account for nonuniformity.
January 22, 2010 - 8:00am
Given a randomized treatment Z, a clinical outcome Y, and a biomarker S measured some fixed time after Z is administered, we may be interested in addressing the surrogate endpoint problem by evaluating whether S can be used to reliably predict the effect of Z on Y. Several recent proposals for the statistical evaluation of surrogate value have been based on the framework of principal stratification. In this article, we consider two principal stratification estimands: joint risks and marginal risks. Joint risks measure causal associations (CAs) of treatment effects on S and Y, providing insight into the surrogate value of the biomarker, but are not statistically identifiable from vaccine trial data. Although marginal risks do not measure CAs of treatment effects, they nevertheless provide guidance for future research, and we describe a data collection scheme and assumptions under which the marginal risks are statistically identifiable. We show how different sets of assumptions affect the identifiability of these estimands; in particular, we depart from previous work by considering the consequences of relaxing the assumption of no individual treatment effects on Y before S is measured. Based on algebraic relationships between joint and marginal risks, we propose a sensitivity analysis approach for assessment of surrogate value, and show that in many cases the surrogate value of a biomarker may be hard to establish, even when the sample size is large.
January 22, 2010 - 7:56am
The mixture model is a method of choice for modeling heterogeneous random graphs, because it contains most of the known structures of heterogeneity: hubs, hierarchical structures, or community structure. One of the weaknesses of mixture models on random graphs is that, at the present time, there is no computationally feasible estimation method that is completely satisfying from a theoretical point of view. Moreover, mixture models assume that each vertex pertains to one group, so there is no place for vertices being at intermediate positions. The model proposed in this article is a grade of membership model for heterogeneous random graphs, which assumes that each vertex is a mixture of extremal hypothetical vertices. The connectivity properties of each vertex are deduced from those of the extreme vertices. In this new model, the weights in the vector attached to each vertex are fixed continuous parameters. A model with a vector of parameters for each vertex is tractable because the number of observations is proportional to the square of the number of vertices of the network. The estimation of the parameters is given by the maximum likelihood procedure. The model is used to elucidate some of the processes shaping the heterogeneous structure of a well-resolved network of host/parasite interactions.
January 11, 2010 - 8:33am
We examine situations where interest lies in the conditional association between outcome and exposure variables, given potential confounding variables. Concern arises that some potential confounders may not be measured accurately, whereas others may not be measured at all. Some form of sensitivity analysis might be employed, to assess how this limitation in available data impacts inference. A Bayesian approach to sensitivity analysis is straightforward in concept: a prior distribution is formed to encapsulate plausible relationships between unobserved and observed variables, and posterior inference about the conditional exposure–disease relationship then follows. In practice, though, it can be challenging to form such a prior distribution in both a realistic and simple manner. Moreover, it can be difficult to develop an attendant Markov chain Monte Carlo (MCMC) algorithm that will work effectively on a posterior distribution arising from a highly nonidentified model. In this article, a simple prior distribution for acknowledging both poorly measured and unmeasured confounding variables is developed. It requires that only a small number of hyperparameters be set by the user. Moreover, a particular computational approach for posterior inference is developed, because application of MCMC in a standard manner is seen to be ineffective in this problem.
January 11, 2010 - 8:27am
In studies that estimate the short-term effects of air pollution on health, daily measurements of pollution concentrations are often available from a number of monitoring locations within the study area. However, the health data are typically only available in the form of daily counts for the entire area, meaning that a corresponding single daily measure of pollution is required. The standard approach is to average the observed measurements at the monitoring locations, and use this in a log-linear health model. However, as the pollution surface is spatially variable this simple summary is unlikely to be an accurate estimate of the average pollution concentration across the region, which may lead to bias in the resulting health effects. In this article, we propose an alternative approach that jointly models the pollution concentrations and their relationship with the health data using a Bayesian spatio-temporal model. We compare this approach with the simple spatial average using a simulation study, by investigating the impact of spatial variation, monitor placement, and measurement error in the pollution data. An epidemiological study from Greater London is then presented, which estimates the relationship between respiratory mortality and four different pollutants.
January 11, 2010 - 8:26am
Competing risks arise naturally in time-to-event studies. In this article, we propose time-dependent accuracy measures for a marker when we have censored survival times and competing risks. Time-dependent versions of sensitivity or true positive (TP) fraction naturally correspond to consideration of either cumulative (or prevalent) cases that accrue over a fixed time period, or alternatively to incident cases that are observed among event-free subjects at any select time. Time-dependent (dynamic) specificity (1 − false positive (FP)) can be based on the marker distribution among event-free subjects. We extend these definitions to incorporate cause of failure for competing risks outcomes. The proposed estimation for cause-specific cumulative TP/dynamic FP is based on the nearest neighbor estimation of bivariate distribution function of the marker and the event time. On the other hand, incident TP/dynamic FP can be estimated using a possibly nonproportional hazards Cox model for the cause-specific hazards and riskset reweighting of the marker distribution. The proposed methods extend the time-dependent predictive accuracy measures of Heagerty, Lumley, and Pepe (2000, Biometrics 56, 337–344) and Heagerty and Zheng (2005, Biometrics 61, 92–105).
January 11, 2010 - 8:22am
Cluster randomized trials in health care may involve three instead of two levels, for instance, in trials where different interventions to improve quality of care are compared. In such trials, the intervention is implemented in health care units ("clusters") and aims at changing the behavior of health care professionals working in this unit ("subjects"), while the effects are measured at the patient level ("evaluations"). Within the generalized estimating equations approach, we derive a sample size formula that accounts for two levels of clustering: that of subjects within clusters and that of evaluations within subjects. The formula reveals that sample size is inflated, relative to a design with completely independent evaluations, by a multiplicative term that can be expressed as a product of two variance inflation factors, one that quantifies the impact of within-subject correlation of evaluations on the variance of subject-level means and the other that quantifies the impact of the correlation between subject-level means on the variance of the cluster means. Power levels as predicted by the sample size formula agreed well with the simulated power for more than 10 clusters in total, when data were analyzed using bias-corrected estimating equations for the correlation parameters in combination with the model-based covariance estimator or the sandwich estimator with a finite sample correction.
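The multiplicative structure described in this abstract can be illustrated with a short sketch. It assumes each variance inflation factor takes the familiar exchangeable-correlation form 1 + (m − 1)ρ; the abstract does not give the exact expressions, so the function names, parameter names, and numbers below are all hypothetical.

```python
import math


def variance_inflation(n_units, correlation):
    """Inflation of the variance of a mean of n_units equicorrelated values."""
    return 1.0 + (n_units - 1) * correlation


def design_effect(evals_per_subject, rho_evaluations,
                  subjects_per_cluster, rho_subject_means):
    """Product of the two inflation factors (an assumed form, see lead-in)."""
    within_subject = variance_inflation(evals_per_subject, rho_evaluations)
    within_cluster = variance_inflation(subjects_per_cluster, rho_subject_means)
    return within_subject * within_cluster


def inflated_sample_size(n_independent, effect):
    """Evaluations needed once clustering is taken into account."""
    return math.ceil(n_independent * effect)


# Hypothetical design: 5 evaluations per subject, 3 subjects per cluster.
effect = design_effect(5, 0.5, 3, 0.1)
required = inflated_sample_size(200, effect)
```

With one evaluation per subject and one subject per cluster, both factors equal 1 and the design reduces to independent evaluations.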
January 11, 2010 - 8:21am
The reliability of multi-item scales has received a lot of attention in the psychometric literature, where a myriad of measures like Cronbach's α or the Spearman–Brown formula have been proposed. Most of these measures, however, are based on very restrictive models that apply only to unidimensional instruments. In this article, we introduce two measures to quantify the reliability of multi-item scales based on a more general model. We show that they capture two different aspects of the reliability problem and satisfy a minimum set of intuitive properties. The relevance and complementary value of the measures are studied, and earlier approaches are placed in a broader theoretical framework. Finally, we apply them to investigate the reliability of the Positive and Negative Syndrome Scale, a rating scale for the assessment of the severity of schizophrenia.
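For reference, one of the classical measures named in this abstract, Cronbach's alpha, can be computed as in the sketch below. This is the existing coefficient, not either of the two new measures the article introduces; the function name `cronbach_alpha` and the toy scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of items, each a list of per-person scores."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(scores) for scores in item_scores)
    # Sample variance of each person's total score across all items.
    totals = [sum(person) for person in zip(*item_scores)]
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


# Two items scored for four people (made-up numbers).
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

When all items carry identical scores, the coefficient equals 1, its maximum.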