Summary. Functional magnetic resonance imaging (MRI) is an advanced technology for studying brain functions. Owing to the complexity and high cost of functional MRI experiments, high quality multiobjective functional MRI designs are in great demand; they help to render precise statistical inference and are keys to the success of functional MRI experiments. Here, we propose an efficient approach for obtaining multiobjective functional MRI designs. In contrast with existing methods, the approach proposed does not require users to specify weights for the different objectives and can easily handle constraints to fulfil customized requirements. Moreover, the underlying statistical models that we consider are more general. We can thus obtain designs for cases where brief, long or varying stimulus durations are utilized. The usefulness of our approach is illustrated by using various experimental settings.
Summary. Environmental research increasingly uses high dimensional remote sensing and numerical model output to help to fill space–time gaps between traditional observations. Such output is often a noisy proxy for the process of interest. Thus we need to separate and assess the signal and noise (often called discrepancy) in the proxy given complicated spatiotemporal dependences. Here I extend a popular two-likelihood hierarchical model by using a more flexible representation for the discrepancy. I employ the little-used Markov random-field approximation to a thin plate spline, which can capture small-scale discrepancy in a computationally efficient manner while better modelling smooth processes than standard conditional auto-regressive models. The increased flexibility reduces identifiability, but the lack of identifiability is inherent in the scientific context. I model particulate matter air pollution by using satellite aerosol and atmospheric model output proxies. The estimated discrepancies occur at a variety of spatial scales, with small-scale discrepancy particularly important. The examples indicate little predictive improvement over modelling the observations alone. Similarly, in simulations with an informative proxy, the presence of discrepancy and resulting identifiability issues prevent improvement in prediction. The results highlight but do not resolve the critical question of how best to use proxy information while minimizing the potential for proxy-induced error.
Summary. Generalized additive models for location, scale and shape (GAMLSSs) are a popular semiparametric modelling approach that, in contrast with conventional generalized additive models, relate not only the expected mean but also every distribution parameter (e.g. location, scale and shape) to a set of covariates. Current fitting procedures for GAMLSSs are infeasible for high dimensional data set-ups and require variable selection based on (potentially problematic) information criteria. The present work describes a boosting algorithm for high dimensional GAMLSSs that was developed to overcome these limitations. Specifically, the new algorithm was designed to allow the simultaneous estimation of predictor effects and variable selection. The algorithm proposed was applied to Munich rental guide data, which are used by landlords and tenants as a reference for the average rent of a flat depending on its characteristics and spatial features. The net rent predictions that resulted from the high dimensional GAMLSSs were found to be highly competitive and covariate-specific prediction intervals showed a major improvement over classical generalized additive models.
Summary. Climate change may lead to changes in several aspects of the distribution of climate variables, including changes in the mean, increased variability and severity of extreme events. We propose the use of spatiotemporal quantile regression as a flexible and interpretable method for simultaneously detecting changes in several features of the distribution of climate variables. The spatiotemporal quantile regression model assumes that each quantile level changes linearly in time, permitting straightforward inference on the time trend for each quantile level. Unlike classical quantile regression which uses model-free methods to analyse a single quantile or several quantiles separately, we take a model-based approach which jointly models all quantiles, and thus the entire response distribution. In the spatiotemporal quantile regression model, each spatial location has its own quantile function that evolves over time, and the quantile functions are smoothed spatially by using Gaussian process priors. We propose a basis expansion for the quantile function that permits a closed form for the likelihood and allows for residual correlation modelling via a Gaussian spatial copula. We illustrate the methods by using temperature data for the south-east USA from the years 1931–2009. For these data, borrowing information across space identifies more significant time trends than classical non-spatial quantile regression. We find a decreasing time trend for much of the spatial domain for monthly mean and maximum temperatures. For the lower quantiles of monthly minimum temperature, we find a decrease in Georgia and Florida, and an increase in Virginia and the Carolinas.
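As a hedged illustration of the linear-in-time assumption described in the preceding summary, the quantile function at spatial location s and time t might be written as follows. The symbols q, β0, β1, s, t and τ are notation assumed here for illustration and are not taken from the paper itself.

```latex
\[
  q_{\tau}(s, t) \;=\; \beta_{0}(\tau, s) \;+\; \beta_{1}(\tau, s)\, t,
  \qquad \tau \in (0, 1),
\]
% beta_1(tau, s) is the time trend of quantile level tau at location s;
% a positive value means that quantile level is increasing over time there.
```

Under this sketch, inference on the time trend for a given quantile level reduces to inference on the slope β1(τ, s), with the slopes at neighbouring locations smoothed through the spatial prior.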
Summary. An automated approach to extract interpretable features of univariate or multivariate profiles (functional data) is proposed. A landmark alignment algorithm is modified and the alignment is combined with piecewise linear approximations. Least absolute shrinkage and selection operator (lasso) regression is used for selecting the most important intercepts and slopes and yields an alternative to partial least squares to model a response associated with the profiles. Latent variables can be difficult to interpret but our extracted features simply correspond to slopes and intercepts of particular parts of the profiles. Also, features that relate to the degree of warping between a given profile and a reference can be extracted as predictors. Selection criteria for the number of knots and common knot locations between profiles are developed. We apply our proposed method to batch fermentation data where the profiles consist of on-line measurements of process variables and the corresponding yield of the process. The extracted features have good interpretability (with large dimensional reduction) and in combination with the lasso have prediction accuracy which is comparable with that of partial least squares applied to the original profiles. Also our proposed feature extraction method is applied to publicly available data where near infrared spectra define the profiles and the prediction accuracy of our feature lasso method is comparable with those of more complicated alternatives.
Summary. We describe and analyse a longitudinal diffusion tensor imaging study relating changes in the microstructure of intracranial white matter tracts to cognitive disability in multiple-sclerosis patients. In this application the scalar outcome and the functional exposure are measured longitudinally. This data structure is new and raises challenges that cannot be addressed with current methods and software. To analyse the data, we introduce a penalized functional regression model and inferential tools designed specifically for these emerging types of data. Our proposed model extends the generalized linear mixed model by adding functional predictors; this method is computationally feasible and is applicable when the functional predictors are measured densely, sparsely or with error. On-line supplements compare two implementations, one likelihood based and the other Bayesian, and provide the software that is used in simulations; the likelihood-based implementation is included as the lpfr() function in the R package refund that is available in the Comprehensive R Archive Network.
Summary. The paper is concerned with a dynamic factor model for spatiotemporal coupled environmental variables. The model is proposed in a state space formulation which, through Kalman recursions, allows a unified approach to prediction and estimation. Full probabilistic inference for the model parameters is facilitated by adapting standard Markov chain Monte Carlo algorithms for dynamic linear models to our model formulation. The predictive ability of the model is discussed for two different data sets with variables measured at two different scales. Some possibilities for further research are also outlined.
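The Kalman recursions mentioned in the preceding summary can be sketched in their simplest form. The example below is a minimal illustration for a scalar random-walk (local level) state, not the paper's spatiotemporal factor model; the function names and the noise parameters are assumptions made for this sketch.

```python
def kalman_step(mean, var, y, state_noise, obs_noise):
    """One predict/update cycle for a scalar random-walk state (illustrative only)."""
    # Predict: a random-walk state keeps its mean, and its variance
    # grows by the state noise variance.
    pred_mean = mean
    pred_var = var + state_noise
    # Update: the gain weighs the new observation against the prediction.
    gain = pred_var / (pred_var + obs_noise)
    new_mean = pred_mean + gain * (y - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


def kalman_filter(observations, mean0, var0, state_noise, obs_noise):
    """Run the recursion over a sequence and return the filtered (mean, variance) pairs."""
    mean, var = mean0, var0
    path = []
    for y in observations:
        mean, var = kalman_step(mean, var, y, state_noise, obs_noise)
        path.append((mean, var))
    return path
```

The same predict/update pattern serves both prediction (the predicted mean and variance) and estimation (the filtered values), which is the sense in which a state space formulation unifies the two.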
Summary. We investigate the 20-year-average boreal winter temperatures generated by an ensemble of six regional climate models (RCMs) in phase I of the North American Regional Climate Change Assessment Program. We use the long-run average (20-year integration) to smooth out variability and to capture the climate properties from the RCM outputs. We find that, although the RCMs capture the large-scale climate variation from coast to coast and from south to north similarly, their outputs can differ substantially in some regions. We propose a Bayesian hierarchical model to synthesize information from the ensemble of RCMs, and we construct a consensus climate signal with each RCM contributing to the consensus according to its own variability parameter. The Bayesian methodology enables us to make posterior inference on all the unknowns, including the large-scale fixed effects and the small-scale random effects in the consensus climate signal and in each RCM. The joint distributions of the consensus climate and the outputs from the RCMs are also investigated through posterior means, posterior variances and posterior spatial quantiles. We use a spatial random-effects model in the Bayesian hierarchical model and, consequently, we can deal with the large data sets of fine resolution outputs from all the RCMs. Additionally, our model allows a flexible spatial covariance structure without assuming stationarity or isotropy.
Summary. Characterizing the quality of dispersion of nanocomposites presents a challenging statistical problem for which no direct method has been fully adopted. A high precision, statistically well-grounded measure is required which is suitable for dealing with a single small non-homogeneous particle pattern obtained from the material. Our approach uses the Delaunay network of particles to measure the area disorder AD_Del, which can be further used to categorize a material sample into well or poorly dispersed. AD_Del-analysis is applied to several micrographs of nanoparticle-modified materials and found to classify the type of dispersion reliably. Selected spatial point processes are employed to estimate expected imprecision in observed measurements.
Summary. The measurement of human immunodeficiency virus ribonucleic acid levels over time leads to censored longitudinal data. Suitable models for dynamic modelling of these levels need to take this data characteristic into account. If groups of patients with different developments of the levels over time are suspected, the model class of finite mixtures of mixed effects models with censored data is required. We describe the model specification and derive the estimation with a suitable expectation–maximization algorithm. We propose a convenient implementation using closed form formulae for the expected mean and variance of the truncated multivariate distribution. Only efficient evaluation of the cumulative multivariate normal distribution function is required. Model selection as well as methods for inference are discussed. The application is demonstrated on the clinical trial ACTG 315 data.
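To illustrate the kind of closed form moments referred to in the preceding summary, the sketch below gives the standard mean and variance of a univariate normal variable known only to lie below a detection limit. This is a deliberately simplified, one-dimensional analogue of the multivariate formulae, and the function names are assumptions made for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_moments(mu, sigma, limit):
    """Mean and variance of X ~ N(mu, sigma^2) conditional on X < limit."""
    alpha = (limit - mu) / sigma
    ratio = normal_pdf(alpha) / normal_cdf(alpha)
    mean = mu - sigma * ratio
    var = sigma ** 2 * (1.0 - alpha * ratio - ratio ** 2)
    return mean, var
```

In an expectation–maximization scheme, moments of this type would replace the unobserved values in the E-step; only the normal distribution function needs to be evaluated numerically.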
Summary. We propose a randomized phase II clinical trial design based on Bayesian adaptive randomization and predictive probability monitoring. Adaptive randomization assigns more patients to a more efficacious treatment arm by comparing the posterior probabilities of efficacy between different arms. We continuously monitor the trial using the predictive probability. The trial is terminated early when it is shown that one treatment is overwhelmingly superior to others or that all the treatments are equivalent. We develop two methods to compute the predictive probability by considering the uncertainty of the sample size of the future data. We illustrate the proposed Bayesian adaptive randomization and predictive probability design using a phase II lung cancer clinical trial, and we conduct extensive simulation studies to examine the operating characteristics of the design. By coupling adaptive randomization and predictive probability approaches, the trial can treat more patients with a more efficacious treatment and allow for early stopping whenever sufficient information is obtained to conclude treatment superiority or equivalence. The design proposed also controls both the type I and the type II errors and offers an alternative Bayesian approach to the frequentist group sequential design.
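The idea of comparing posterior probabilities of efficacy between arms can be sketched for the simplest case. The example below assumes two arms, a binary response and independent uniform Beta(1, 1) priors, and uses the posterior probability that arm A is better as the allocation probability for the next patient; these choices and the function names are assumptions for illustration and are not the design described in the paper.

```python
import random


def prob_a_better(succ_a, n_a, succ_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + successes, 1 + failures).
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        if p_a > p_b:
            wins += 1
    return wins / draws


def assign_next_patient(succ_a, n_a, succ_b, n_b, rng):
    """Randomize the next patient, favouring the arm that currently looks better."""
    p = prob_a_better(succ_a, n_a, succ_b, n_b)
    return "A" if rng.random() < p else "B"
```

For example, with 8 responses in 10 patients on arm A and 3 in 10 on arm B, `prob_a_better(8, 10, 3, 10)` is close to 1, so most subsequent patients would be assigned to arm A.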
Summary. Clinical data on the location of residence at the time of diagnosis of new lupus cases in Toronto, Canada, for the 40 years to 2007 are modelled with the aim of finding areas of abnormally high risk. Inference is complicated by numerous irregular changes in the census regions on which population is reported. A model is introduced consisting of a continuous random spatial surface and fixed effects for time and ages of individuals. The process is modelled on a fine grid and Bayesian inference performed by using integrated nested Laplace approximations. Predicted risk surfaces and posterior probabilities of exceedance are produced for lupus and, for comparison, psoriatic arthritis data from the same clinic. Simulation studies are also carried out to understand better the performance of the model proposed as well as to compare with existing methods.
Summary. Numerous studies have linked ambient air pollution and adverse health outcomes. Many studies of this nature relate outdoor pollution levels measured at a few monitoring stations with health outcomes. Recently, computational methods have been developed to model the distribution of personal exposures, rather than ambient concentration, and then relate the exposure distribution to the health outcome. Although these methods show great promise, they are limited by the computational demands of the exposure model. We propose a method to alleviate these computational burdens with the eventual goal of implementing a national study of the health effects of air pollution exposure. Our approach is to develop a statistical emulator for the exposure model, i.e. we use Bayesian density estimation to predict the conditional exposure distribution as a function of several variables, such as temperature, human activity and physical characteristics of the pollutant. This poses a challenging statistical problem because there are many predictors of the exposure distribution and density estimation is notoriously difficult in high dimensions. To overcome this challenge, we use stochastic search variable selection to identify a subset of the variables that have more than just additive effects on the mean of the exposure distribution. We apply our method to emulate an ozone exposure model in Philadelphia.
Summary. Multiphase (stage) designs that involve more than two phases are increasingly used by clinicians and psychologists for diagnosis and screening of dementia and many other diseases, e.g. colorectal or breast cancer. The multiphase design is an extension of the commonly used two-phase design, where an inexpensive initial screening test is followed by a gold standard. In a typical three-phase design, the screening test in phase 1 usually has high sensitivity but relatively low specificity. Phase 2 consists of a repeated application of the initial screening test and/or a more confirmatory test and then the gold standard test is used in phase 3. In such designs, both the verification process and the accuracy of each screening test may depend on patients’ characteristics. In addition, multiple-screening tests are correlated and composite decision rules may be used. However, no estimation methods exist for assessing the accuracy of a multiphase diagnosis procedure. To address these problems, we develop a method of estimating the diagnostic accuracy for each individual test and for the whole diagnostic procedure in a multiphase design in the presence of verification bias. Simulation studies are carried out to evaluate the performance of the method proposed and to compare different strategies of combining sequential tests. The method proposed is applied to data from a multiphase study of dementia.
Summary. In oncology, progression-free survival time, which is defined as the minimum of the times to disease progression or death, often is used to characterize treatment and covariate effects. We are motivated by the desire to estimate the progression time distribution on the basis of data from 780 paediatric patients with choroid plexus tumours, which are a rare brain cancer where disease progression always precedes death. In retrospective data on 674 patients, the times to death or censoring were recorded but progression times were missing. In a prospective study of 106 patients, both times were recorded but there were only 20 non-censored progression times and 10 non-censored survival times. Consequently, estimating the progression time distribution is complicated by the problems that, for most of the patients, either the survival time is known but the progression time is not known, or the survival time is right censored and it is not known whether the patient's disease progressed before censoring. For data with these missingness structures, we formulate a family of Bayesian parametric likelihoods and present methods for estimating the progression time distribution. The underlying idea is that estimating the association between the time to progression and subsequent survival time from patients having complete data provides a basis for utilizing covariates and partial event time data of other patients to infer their missing progression times. We illustrate the methodology by analysing the brain tumour data, and we also present a simulation study.
Summary. Estimating the burden of infectious disease is complicated by the general tendency for underreporting of cases. When the reporting rate is unknown, conventional methods have relied on accounting methods that do not make explicit use of surveillance data or the temporal dynamics of transmission and infection. State space models are a framework for various methods that allow dynamic models to be fitted with partially or imperfectly observed surveillance data. State space models are an appealing approach to burden estimation as they incorporate expert knowledge in the form of an underlying dynamic model while making explicit use of surveillance data to estimate parameter values, to predict unobserved elements of the model and to provide standard errors for estimates.
Summary. In longitudinal genetic studies, investigators collect repeated measurements on a trait that changes with time along with genetic markers. For family-based longitudinal studies, since repeated measurements are nested within subjects and subjects are nested within families, both the subject level and the measurement level correlations must be taken into account in the statistical analysis to achieve more accurate estimation. In such studies, the primary interests include testing for a quantitative trait locus effect, and estimating the age-specific quantitative trait locus effect and residual polygenic heritability function. We propose flexible semiparametric models and their statistical estimation and hypothesis testing procedures for longitudinal genetic data. We employ penalized splines to estimate non-parametric functions in the model. We find that misspecifying the baseline function or the genetic effect function in a parametric analysis may lead to a substantially inflated or highly conservative type I error rate on testing and large mean-squared error on estimation. We apply the proposed approaches to examine age-specific effects of genetic variants reported in a recent genomewide association study of blood pressure collected in the Framingham Heart Study.
Summary. The prognosis for patients with high grade gliomas is poor, with a median survival of 1 year. Treatment efficacy assessment is typically unavailable until 5–6 months post diagnosis. Investigators hypothesize that quantitative magnetic resonance imaging can assess treatment efficacy 3 weeks after therapy starts, thereby allowing salvage treatments to begin earlier. The purpose of this work is to build a predictive model of treatment efficacy by using quantitative magnetic resonance imaging data and to assess its performance. The outcome is 1-year survival status. We propose a joint, two-stage Bayesian model. In stage I, we smooth the image data with a multivariate spatiotemporal pairwise difference prior. We propose four summary statistics that are functionals of posterior parameters from the first-stage model. In stage II, these statistics enter a generalized non-linear model as predictors of survival status. We use the probit link and a multivariate adaptive regression spline basis. The hybrid Metropolis-within-Gibbs algorithm and reversible jump Markov chain Monte Carlo methods are applied iteratively between the two stages to estimate the posterior distribution. Through both simulation studies and model performance comparisons we find that we can achieve higher overall correct classification rates by accounting for the spatiotemporal correlation in the images and by allowing for a more complex and flexible decision boundary provided by the generalized non-linear model.
Summary. Healthcare resource allocation decisions are commonly informed by computer model predictions of population mean costs and health effects. It is common to quantify the uncertainty in the prediction due to uncertain model inputs, but methods for quantifying uncertainty due to inadequacies in model structure are less well developed. We introduce an example of a model that aims to predict the costs and health effects of a physical activity promoting intervention. Our goal is to develop a framework in which we can manage our uncertainty about the costs and health effects due to deficiencies in the model structure. We describe the concept of ‘model discrepancy’: the difference between the model evaluated at its true inputs, and the true costs and health effects. We then propose a method for quantifying discrepancy based on decomposing the cost-effectiveness model into a series of subfunctions, and considering potential error at each subfunction. We use a variance-based sensitivity analysis to locate important sources of discrepancy within the model to guide model refinement. The resulting improved model is judged to contain less structural error, and the distribution on the model output better reflects our true uncertainty about the costs and effects of the intervention.
Summary. The paper will help practitioners to select strength 3 designs that are useful for screening both main effects and two-factor interactions. We calculated word length patterns, correlations of four-factor interaction contrast vectors with the intercept and ranks of the two-factor interaction matrices for all non-equivalent two-level orthogonal arrays of strength 3 and run sizes up to 48. On the basis of these characteristics, there are a limited number of designs that can be recommended for practical use.