Wiley InterScience : Biometrical Journal
March 8, 2010 - 2:01am
The author compares 12 hierarchical models with the aim of estimating the abundance of fish in alpine streams from removal sampling data collected at multiple locations. The most expanded model accounts for (i) variability of the abundance among locations, (ii) variability of the catchability among locations, and (iii) residual variability of the catchability among fish. Eleven model reductions are considered, depending on which sources of variability are included in the model. The most restrictive model includes none of the aforementioned sources of variability. Computations for the latter model can be carried out with the algorithm presented by Carle and Strub (Biometrics 1978, 34, 621-630). Maximum a posteriori and interval estimates of the parameters, as well as the Akaike and Bayesian information criteria of model fit, are computed from samples simulated by a Markov chain Monte Carlo method. The models are compared by using a trout (Salmo trutta fario) parr (0+) removal sampling data set collected at three locations in the Pyrénées mountain range (Haute-Garonne, France) in July 2006. Results suggest that, in this case study, variability of the catchability is not significant, either among fish or among locations. Variability of the abundance among locations is significant. The 95% interval estimates of the abundances at the three locations are [0.15, 0.24], [0.26, 0.36], and [0.45, 0.58] parrs per m². Such differences are likely the consequence of habitat variability.
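To make the most restrictive model concrete, the following is a minimal sketch of a removal-sampling fit at a single location with one abundance N and one constant catchability p. It uses plain maximum likelihood with a grid search over N, which is an assumption made for illustration and is not the Carle and Strub algorithm or the MCMC procedure of the article. The function names and the catch counts are hypothetical.

```python
import math

def removal_loglik(n, p, catches):
    """Multinomial log-likelihood of pass-by-pass catches.

    Assumes a closed population of n fish and the same capture
    probability p for every fish on every pass.
    """
    total = sum(catches)
    if n < total or not 0.0 < p < 1.0:
        return float("-inf")
    passes = len(catches)
    ll = math.lgamma(n + 1) - math.lgamma(n - total + 1)
    ll -= sum(math.lgamma(c + 1) for c in catches)
    for i, c in enumerate(catches):
        # A fish caught on pass i+1 escaped i times, then was caught once.
        ll += c * (math.log(p) + i * math.log(1.0 - p))
    # Fish never caught escaped on every pass.
    ll += passes * (n - total) * math.log(1.0 - p)
    return ll

def fit_removal(catches, n_max=None):
    """Grid search over integer n, with p profiled out in closed form."""
    total = sum(catches)
    passes = len(catches)
    if n_max is None:
        n_max = 10 * total  # arbitrary search limit (assumption)
    exposures_caught = sum((i + 1) * c for i, c in enumerate(catches))
    best = (float("-inf"), None, None)
    for n in range(total, n_max + 1):
        # Profile MLE of p: captures divided by capture opportunities.
        p = total / (exposures_caught + passes * (n - total))
        ll = removal_loglik(n, p, catches)
        if ll > best[0]:
            best = (ll, n, p)
    return best[1], best[2]

# Illustrative three-pass catches, not data from the study.
n_hat, p_hat = fit_removal([60, 30, 15])
```

Dividing the estimated abundance by the sampled stream area would give a density comparable in kind to the per-m² figures quoted above.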
March 8, 2010 - 2:01am
Large contingency tables summarizing categorical variables arise in many areas. One example is in biology, where large numbers of biomarkers are cross-tabulated according to their discrete expression level. Interactions of the variables are of great interest and are generally studied with log-linear models. The structure of a log-linear model can be visually represented by a graph from which the conditional independence structure can then be easily read off. However, since the number of parameters in a saturated model grows exponentially in the number of variables, this generally comes with a heavy computational burden. Even if we restrict ourselves to models of lower-order interactions or other sparse structures, we are faced with the problem of a large number of cells which play the role of sample size. This is in sharp contrast to high-dimensional regression or classification procedures because, in addition to a high-dimensional parameter, we also have to deal with the analogue of a huge sample size. Furthermore, high-dimensional tables naturally feature a large number of sampling zeros which often leads to the nonexistence of the maximum likelihood estimate. We therefore present a decomposition approach, where we first divide the problem into several lower-dimensional problems and then combine these to form a global solution. Our methodology is computationally feasible for log-linear interaction models with many categorical variables, each or some of which have many levels. We demonstrate the proposed method on simulated data and apply it to a bio-medical problem in cancer research.
February 23, 2010 - 7:45am
No Abstract
February 17, 2010 - 3:47am
February 17, 2010 - 3:47am
In order to study family-based association in the presence of linkage, we extend a generalized linear mixed model proposed for genetic linkage analysis (Lebrec and van Houwelingen (2007), Human Heredity 64, 5-15) by adding a genotypic effect to the mean. The corresponding score test is a weighted family-based association test statistic, where the weight depends on the linkage effect and on other genetic and shared environmental effects. For testing of genetic association in the presence of gene-covariate interaction, we propose a linear regression method where the family-specific score statistic is regressed on family-specific covariates. Both statistics are straightforward to compute. Simulation results show that adjusting the weight for the within-family variance structure may be a powerful approach in the presence of environmental effects. The test statistic for genetic association in the presence of gene-covariate interaction improved the power for detecting association. For illustration, we analyze the rheumatoid arthritis data from GAW15. Adjusting for smoking and anti-cyclic citrullinated peptide increased the significance of the association with the DR locus.
February 17, 2010 - 3:47am
The Cox proportional hazards regression model is the most popular approach to model covariate information for survival times. In this context, the development of high-dimensional models where the number of covariates is much larger than the number of observations is an ongoing challenge. A practicable approach is to use ridge penalized Cox regression in such situations. Besides focusing on finding the best prediction rule, one is often interested in determining a subset of covariates that are the most important ones for prognosis. This could be a gene set in the biostatistical analysis of microarray data. Covariate selection can then, for example, be done by L1-penalized Cox regression using the lasso (Tibshirani (). Statistics in Medicine 16, 385-395). Several approaches beyond the lasso that incorporate covariate selection have been developed in recent years. These include modifications of the lasso as well as nonconvex variants such as smoothly clipped absolute deviation (SCAD) (Fan and Li (). Journal of the American Statistical Association 96, 1348-1360; Fan and Li (). The Annals of Statistics 30, 74-99). The purpose of this article is to implement them practically into the model building process when analyzing high-dimensional data with the Cox proportional hazards model. To evaluate penalized regression models beyond the lasso, we included SCAD variants and the adaptive lasso (Zou (). Journal of the American Statistical Association 101, 1418-1429). We compare them with "standard" applications such as ridge regression, the lasso, and the elastic net. Predictive accuracy, features of variable selection, and estimation bias will be studied to assess the practical use of these methods. We observed that the performance of SCAD and adaptive lasso is highly dependent on nontrivial preselection procedures. A practical solution to this problem does not yet exist. Since there is a high risk of missing relevant covariates when SCAD or the adaptive lasso is applied after an inappropriate initial selection step, we recommend staying with the lasso or the elastic net in actual data applications. However, given the promising results for truly sparse models, we see some advantage in SCAD and the adaptive lasso if better preselection procedures were available. This requires further methodological research.
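The penalties named in this abstract differ in how strongly they shrink large coefficients, which a short sketch can show for a single coefficient. The SCAD formula below follows the standard piecewise definition with the conventional default a = 3.7. The function names, the elastic net parameterization and the adaptive lasso weight based on an initial estimate are illustrative assumptions, and nothing here reproduces the article's Cox model fitting.

```python
def ridge_penalty(theta, lam):
    """Ridge: quadratic in the coefficient."""
    return lam * theta ** 2

def lasso_penalty(theta, lam):
    """Lasso: linear in the absolute coefficient."""
    return lam * abs(theta)

def elastic_net_penalty(theta, lam, mix=0.5):
    """Elastic net: convex combination of lasso and ridge terms.

    `mix` is a hypothetical mixing parameter in [0, 1].
    """
    return lam * (mix * abs(theta) + (1.0 - mix) * theta ** 2)

def adaptive_lasso_penalty(theta, lam, initial_estimate, gamma=1.0):
    """Adaptive lasso: lasso reweighted by an initial estimate.

    The initial estimate must be nonzero; how to obtain it in high
    dimensions is the preselection problem the abstract refers to.
    """
    return lam * abs(theta) / abs(initial_estimate) ** gamma

def scad_penalty(theta, lam, a=3.7):
    """SCAD: lasso-like near zero, constant for large coefficients."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return lam ** 2 * (a + 1.0) / 2.0

# The lasso keeps growing with the coefficient; SCAD levels off.
for theta in (0.5, 2.0, 10.0):
    print(theta, lasso_penalty(theta, 1.0), scad_penalty(theta, 1.0))
```

Because the SCAD penalty is flat beyond a·λ, large coefficients are not shrunk further, which is the source of the reduced estimation bias that motivates the nonconvex variants.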
February 17, 2010 - 3:47am
February 5, 2010 - 3:58am
Some personal remarks about Hans van Houwelingen's approach to biostatistics in general are followed by a discussion of his article with Koos Zwinderman and Theo Stijnen outlining a bivariate approach to meta-analysis. It is concluded that this is more radical than many may realise in that it permits inter-trial information to be recovered. This has some advantages but in theory opens the door to bias. It is concluded that in practice the size of this bias is likely to be small. I end with some further personal remarks to Hans.
February 5, 2010 - 3:58am
The Aalen-Johansen estimator is the standard nonparametric estimator of the cumulative incidence function in competing risks. Estimating its variance in small samples has attracted some interest recently, together with a critique of the usual martingale-based estimators. We show that the preferred estimator equals a Greenwood-type estimator that has been derived as a recursion formula using counting processes and martingales in a more general multistate framework. We also extend previous simulation studies on estimating the variance of the Aalen-Johansen estimator in small samples to left-truncated observation schemes, which may conveniently be handled within the counting processes framework. This investigation is motivated by a real data example on spontaneous abortion in pregnancies exposed to coumarin derivatives, where both competing risks and left-truncation have recently been shown to be crucial methodological issues (Meister and Schaefer (2008), Reproductive Toxicology 26, 31-35). Multistate-type software and data are available online to perform the analyses. The Greenwood-type estimator is recommended for use in practice.
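For readers unfamiliar with the estimator, the following minimal sketch computes the Aalen-Johansen cumulative incidence functions for right-censored and optionally left-truncated competing risks data. It covers the point estimate only and does not implement the Greenwood-type variance estimator discussed in the abstract. The function name, the coding of censoring as cause 0 and the at-risk convention (entry < t <= exit) are assumptions of this sketch, and it is unrelated to the software mentioned above.

```python
from collections import Counter

def aalen_johansen(times, causes, entry=None):
    """Cumulative incidence functions under competing risks.

    times  : exit time of each subject (event or censoring)
    causes : 0 for censored, otherwise a cause label such as 1, 2, ...
    entry  : optional left-truncation (delayed entry) times
    Returns the distinct event times and, per cause, the estimated
    cumulative incidence at each of those times.
    """
    n = len(times)
    if entry is None:
        entry = [float("-inf")] * n
    event_times = sorted({t for t, c in zip(times, causes) if c != 0})
    cause_ids = sorted({c for c in causes if c != 0})
    cif = {c: [] for c in cause_ids}
    running = {c: 0.0 for c in cause_ids}
    surv = 1.0  # overall event-free probability just before time t
    for t in event_times:
        at_risk = sum(1 for e, x in zip(entry, times) if e < t <= x)
        d = Counter(c for x, c in zip(times, causes) if x == t and c != 0)
        for c in cause_ids:
            # Increment: survive to just before t, then fail from cause c.
            running[c] += surv * d[c] / at_risk
            cif[c].append(running[c])
        surv *= 1.0 - sum(d.values()) / at_risk
    return event_times, cif

# Toy data without censoring: each cause accounts for half the events.
event_times, cif = aalen_johansen([1, 2, 3, 4], [1, 2, 1, 2])
```

Without censoring or truncation the estimate reduces to the empirical proportion of subjects who have failed from each cause, which gives a quick sanity check.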
February 3, 2010 - 6:27am
Simulation-based assessment is a popular and frequently necessary approach for evaluating statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing various ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
December 22, 2009 - 7:13am
No Abstract
December 22, 2009 - 7:13am
Multiple-dose factorial designs may provide confirmatory evidence that (fixed) combination drugs are superior to either component drug alone. Moreover, a useful and safe range of dose combinations may be identified. In our study, we focus on (A) adjustments of the overall significance level made necessary by multiple testing, (B) improvement of conventional statistical methods with respect to power, distributional assumptions and dimensionality, and (C) construction of corresponding simultaneous confidence intervals. We propose novel resampling algorithms, which in a simple way take the correlation of multiple test statistics into account, thus improving power. Moreover, these algorithms can easily be extended to combinations of more than two component drugs and binary outcome data. Published data summaries from a blood pressure reduction trial are analysed and presented as a worked example. An implementation of the proposed methods is available online as an R package.
December 22, 2009 - 7:13am
We extend the Dahlberg and Wang (Biometrics 2007, 63, 1237-1244) proportional hazards (PH) cure model for the analysis of time-to-event data that is subject to a cure rate with masked event to a setting where the PH assumption does not hold. Assuming an accelerated failure time (AFT) model with unspecified error distribution for the time to the event of interest, we propose rank-based estimating equations for the model parameters and use a generalization of the EM algorithm for parameter estimation. Applying our proposed AFT model to the same motivating breast cancer dataset as Dahlberg and Wang (Biometrics 2007, 63, 1237-1244), our results are more intuitive for the treatment arm in which the PH assumption may be violated. We also conduct a simulation study to evaluate the performance of the proposed method.
December 22, 2009 - 7:13am
Analysis of longitudinal data with excessive zeros has gained increasing attention in recent years; however, current approaches have primarily focused on balanced data. Dropouts are common in longitudinal studies, and the analysis of the resulting unbalanced data is complicated by the missing-data mechanism. Our study is motivated by the analysis of longitudinal skin cancer count data presented by Greenberg, Baron, Stukel, Stevens, Mandel, Spencer, Elias, Lowe, Nierenberg, Bayrd, Vance, Freeman, Clendenning, Kwan, and the Skin Cancer Prevention Study Group [New England Journal of Medicine 323, 789-795]. The data consist of a large number of zero responses (83% of the observations) as well as a substantial amount of dropout (about 52% of the observations). To account for both excessive zeros and dropout patterns, we propose a pattern-mixture zero-inflated model with compound Poisson random effects for the unbalanced longitudinal skin cancer data. We also incorporate a first-order autoregressive correlation structure in the model to capture longitudinal correlation of the count responses. A quasi-likelihood approach is developed for estimating the model. We illustrate the method with an analysis of the longitudinal skin cancer data.
December 22, 2009 - 7:13am
An extension of the stochastic susceptible-infectious-recovered (SIR) model is proposed in order to accommodate a regression context for modelling infectious disease data. The proposal is based on a multivariate counting process specified by conditional intensities, which contain an additive epidemic component and a multiplicative endemic component. This allows the analysis of endemic infectious diseases by quantifying risk factors for infection by external sources in addition to infective contacts. Inference can be performed by considering the full likelihood of the stochastic process with additional parameter restrictions to ensure non-negative conditional intensities. Simulation from the model can be performed by Ogata's modified thinning algorithm. As an illustrative example, we analyse data provided by the Federal Research Centre for Virus Diseases of Animals, Wusterhausen, Germany, on the incidence of the classical swine fever virus in Germany during 1993-2004.
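The thinning idea can be sketched on a much simpler process than the one proposed in the article. The example below simulates a univariate self-exciting point process whose conditional intensity is a constant background rate mu plus an exponentially decaying contribution alpha·exp(-beta·(t - s)) from each past event s. This toy intensity, the parameter names and the function names are assumptions chosen for illustration; they are not the article's SIR extension with its endemic and epidemic components.

```python
import math
import random

def intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t given the events so far.

    Called at a time equal to the latest event, it returns the value
    just after that event, which serves as a local upper bound.
    """
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events)

def simulate_by_thinning(mu, alpha, beta, horizon, seed=0):
    """Thinning simulation on (0, horizon]; needs mu > 0 and beta > 0."""
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # Between events the intensity only decays, so its current
        # value bounds it until the next accepted event.
        lam_bar = intensity(t, events, mu, alpha, beta)
        t += rng.expovariate(lam_bar)  # candidate from the bounding process
        if t > horizon:
            return events
        # Accept the candidate with probability intensity / bound.
        if rng.random() * lam_bar <= intensity(t, events, mu, alpha, beta):
            events.append(t)

events = simulate_by_thinning(mu=1.0, alpha=0.5, beta=2.0, horizon=50.0, seed=1)
```

Recomputing the bound after every candidate, whether accepted or rejected, keeps the rejection rate low because the bound follows the decaying intensity instead of staying at a global maximum.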
December 22, 2009 - 7:13am
The ecological theory of the existence of multiple stable states between species, or the spatial heterogeneity of some unobserved environmental factor, supports the idea of multitype interactions between species. These multitype interactions can lead to different assemblages of species abundances. An exploratory tool for the detection of these species assemblages and for their spatial analysis is presented in this article. A two-stage analysis is proposed. First, a classification into types of species assemblages using only the species abundances at each site, regardless of their spatial location, is performed. The clustering procedure is based on multivariate normal mixtures and provides a measure of the classification uncertainty. Second, some tools for the study of the spatial structure of these types of assemblages are presented. We transfer the classification uncertainty to the spatial analysis of the classes in order to draw more accurate conclusions. This classification and spatial analysis method is used to point out a spatial gradient of infection in a host-pathogen system in the Åland Islands in Finland. It can be a useful preliminary tool for ecological studies involving the spatial distributions of several species.
December 22, 2009 - 7:13am
The effective population size Ne is an important parameter in population genetics and conservation biology. In recent years, there has been great interest in the use of molecular markers to estimate Ne. Although the point estimates from molecular markers in general suffer from a low reliability, the use of single nucleotide polymorphism (SNP) markers over a wide range of the genome is expected to remarkably improve the reliability. In this study, expressions were derived for interval estimates of Ne from one published method, the heterozygote-excess method, when it is applied to SNP markers. The conditional variance theory is applied to the derivation of a confidence interval for Ne under random union of gametes, monogamy and polygyny. Stochastic simulation shows that the obtained confidence interval is slightly conservative, but fairly useful for practical applications. The result is illustrated with real data on SNP markers in a pig strain.
December 22, 2009 - 7:13am
No Abstract
December 22, 2009 - 7:13am
Time-dependent covariates are frequently encountered in regression analysis for event history data and competing risks. They are often essential predictors, which cannot be substituted by time-fixed covariates. This study briefly recalls the different types of time-dependent covariates, as classified by Kalbfleisch and Prentice [The Statistical Analysis of Failure Time Data, Wiley, New York, 2002] with the intent of clarifying their role and emphasizing the limitations in standard survival models and in the competing risks setting. If random (internal) time-dependent covariates are to be included in the modeling process, then it is still possible to estimate cause-specific hazards but prediction of the cumulative incidences and survival probabilities based on these is no longer feasible. This article aims at providing some possible strategies for dealing with these prediction problems. In a multi-state framework, a first approach uses internal covariates to define additional (intermediate) transient states in the competing risks model. Another approach is to apply the landmark analysis as described by van Houwelingen [Scandinavian Journal of Statistics 2007, 34, 70-85] in order to study cumulative incidences at different subintervals of the entire study period. The final strategy is to extend the competing risks model by considering all the possible combinations between internal covariate levels and cause-specific events as final states. In all of those proposals, it is possible to estimate the changes/differences of the cumulative risks associated with simple internal covariates. An illustrative example based on bone marrow transplant data is presented in order to compare the different methods.
December 14, 2009 - 8:15am
The two-sided Simes test is known to control the type I error rate with bivariate normal test statistics. For one-sided hypotheses, control of the type I error rate requires that the correlation between the bivariate normal test statistics is non-negative. In this article, we introduce a trimmed version of the one-sided weighted Simes test for two hypotheses which rejects if (i) the one-sided weighted Simes test rejects and (ii) both p-values are below one minus the respective weighted Bonferroni adjusted level. We show that the trimmed version controls the type I error rate at nominal significance level α if (i) the common distribution of test statistics is point symmetric and (ii) the two-sided weighted Simes test at level 2α controls the level. These assumptions apply, for instance, to bivariate normal test statistics with arbitrary correlation. In a simulation study, we compare the power of the trimmed weighted Simes test with the power of the weighted Bonferroni test and the untrimmed weighted Simes test. An additional result of this article ensures type I error rate control of the usual weighted Simes test under a weak version of the positive regression dependence condition for the case of two hypotheses. This condition is shown to apply to the two-sided p-values of one- or two-sample t-tests for bivariate normal endpoints with arbitrary correlation and to the corresponding one-sided p-values if the correlation is non-negative. The Simes test for such types of bivariate t-tests has not been considered before. According to our main result, the trimmed version of the weighted Simes test then also applies to the one-sided bivariate t-test with arbitrary correlation.
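The rejection rule described in this abstract can be written out for two one-sided p-values. The sketch below is one reading of the abstract and not the article's exact definition: it assumes weights w1 and w2 = 1 - w1, takes the weighted Simes test to reject when p1 <= w1·α, or p2 <= w2·α, or both p-values are at most α, and takes the trimming condition to be p_i < 1 - w_i·α for both i. The function names are hypothetical.

```python
def weighted_bonferroni(p1, p2, alpha, w1=0.5):
    """Reject if either p-value meets its weighted Bonferroni level."""
    w2 = 1.0 - w1
    return p1 <= w1 * alpha or p2 <= w2 * alpha

def weighted_simes(p1, p2, alpha, w1=0.5):
    """Weighted Simes test of the intersection of two hypotheses."""
    return weighted_bonferroni(p1, p2, alpha, w1) or max(p1, p2) <= alpha

def trimmed_weighted_simes(p1, p2, alpha, w1=0.5):
    """Weighted Simes test with the trimming condition added.

    Assumed reading: additionally require each p-value to lie below
    one minus its weighted Bonferroni level.
    """
    w2 = 1.0 - w1
    not_trimmed = p1 < 1.0 - w1 * alpha and p2 < 1.0 - w2 * alpha
    return weighted_simes(p1, p2, alpha, w1) and not_trimmed

# A strong effect in one direction with an extreme p-value in the other:
# the untrimmed test rejects, the trimmed test does not.
print(weighted_simes(0.01, 0.999, 0.05), trimmed_weighted_simes(0.01, 0.999, 0.05))
```

Under this reading the trimmed test gives up rejections only when one p-value is very close to one, that is, when the two one-sided test statistics point strongly in opposite directions.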