ORBi

Predicting correlated outcomes from molecular data
Rauschenberger, Armin ; Glaab, Enrico
in Bioinformatics (in press)

Motivation: Multivariate (multi-target) regression has the potential to outperform univariate (single-target) regression at predicting correlated outcomes, which frequently occur in biomedical and clinical research. Here we implement multivariate lasso and ridge regression using stacked generalisation.
Results: Our flexible approach leads to predictive and interpretable models in high-dimensional settings, with a single estimate for each input-output effect. In the simulation, we compare the predictive performance of several state-of-the-art methods for multivariate regression. In the application, we use clinical and genomic data to predict multiple motor and non-motor symptoms in Parkinson’s disease patients. We conclude that stacked multivariate regression, with our adaptations, is a competitive method for predicting correlated outcomes.
Availability and Implementation: The R package joinet is available on GitHub (https://github.com/rauschenberger/joinet) and CRAN (https://CRAN.R-project.org/package=joinet).

Fast cross-validation for multi-penalty ridge regression
; ; Rauschenberger, Armin
in Journal of Computational and Graphical Statistics (in press)

Prediction based on multiple high-dimensional data types needs to account for the potentially strong differences in predictive signal. Ridge regression is a simple, yet versatile and interpretable model for high-dimensional data that has challenged the predictive performance of many more complex models and learners, in particular in dense settings. Moreover, it allows using a specific penalty per data type to account for differences between those. The largest challenge for multi-penalty ridge is then to optimize these penalties efficiently in a cross-validation (CV) setting, in particular for GLM and Cox ridge regression, which require an additional loop for fitting the model by iterative weighted least squares (IWLS). Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly all computations are in the low-dimensional sample space. We show that our approach is several orders of magnitude faster than more naive ones. We developed a very flexible framework that includes prediction of several types of response, allows for unpenalized covariates, can optimize several performance criteria and implements repeated CV. Moreover, extensions to pair data types and to allow a preferential order of data types are included and illustrated on several cancer genomics survival prediction problems. The corresponding R-package, multiridge, serves as a versatile standalone tool, but also as a fast benchmark for other more complex models and multi-view learners.

Predictive and interpretable models via the stacked elastic net
Rauschenberger, Armin ; Glaab, Enrico ;
in Bioinformatics (2021), 37(14), 2012-2016

Motivation: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques.
Results: Here we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularisation. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability.
Availability and Implementation: The R package starnet is available on GitHub: https://github.com/rauschenberger/starnet.

A powerful global test for spliceQTL effects
; Rauschenberger, Armin ; et al
E-print/Working paper (2021)

Statistical methods to test for effects of SNPs on exon inclusion exist, but often rely on testing of associations between multiple exon-SNP pairs, sometimes with subsequent summarization of results at the gene level. Such approaches require heavy multiple testing correction, and detect mostly events with large effect sizes. We propose here a test to find spliceQTL effects which takes all exons and all SNPs into account simultaneously. For any chosen gene, this score-based test looks for association between the set of exon expressions and the set of SNPs, via a random-effects model framework. It is efficient to compute, and can be used if the number of SNPs is larger than the number of samples. In addition, the test is powerful to detect effects that are relatively small for individual exon-SNP pairs, but are observed for many pairs. Furthermore, test results are more often replicated across datasets than pairwise testing results. This is partly because our test is more robust to exon-SNP pair-specific effects that do not extend to multiple pairs within the same gene. We conclude that the test we propose here offers more power and better replicability in the search for spliceQTL effects.

Sparse classification with paired covariates
Rauschenberger, Armin ; ; et al
in Advances in Data Analysis and Classification (2020), 14

This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package palasso is available from CRAN.

Testing for association between RNA-Seq and high-dimensional data
Rauschenberger, Armin ; ; et al
in BMC Bioinformatics (2016), 17

Background: Testing for association between RNA-Seq and other genomic data is challenging due to high variability of the former and high dimensionality of the latter.
Results: Using the negative binomial distribution and a random-effects model, we develop an omnibus test that overcomes both difficulties. It may be conceptualised as a test of overall significance in regression analysis, where the response variable is overdispersed and the number of explanatory variables exceeds the sample size.
Conclusions: The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression. The R package globalSeq is available from Bioconductor.
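
The multi-penalty ridge abstract above states that nearly all computations can be done in the low-dimensional sample space. A minimal sketch of that idea for the plain linear, unweighted case only (not the IWLS, GLM or Cox setting of the paper, and not the multiridge package's code) is given below. The function names and the toy dimensions are hypothetical; the sketch relies on the standard identity X (XᵀX + Λ)⁻¹ Xᵀ = K (K + I)⁻¹ with K = Σ_b X_b X_bᵀ / λ_b.

```python
import numpy as np

def fitted_sample_space(blocks, penalties, y):
    """Multi-penalty ridge fitted values using only n x n matrices.

    blocks    : list of (n x p_b) arrays, one per data type (hypothetical input)
    penalties : list of positive ridge penalties, one per block
    """
    n = y.shape[0]
    # K = sum_b X_b X_b^T / lambda_b; its size does not depend on feature counts
    K = sum(X @ X.T / lam for X, lam in zip(blocks, penalties))
    # Hat matrix H = K (K + I)^{-1}, applied to y via a linear solve
    return K @ np.linalg.solve(K + np.eye(n), y)

def fitted_feature_space(blocks, penalties, y):
    """Reference computation in the (large) feature space, for comparison."""
    X = np.hstack(blocks)
    # Diagonal penalty matrix: each block's penalty repeated for its features
    lam = np.concatenate([np.full(B.shape[1], l) for B, l in zip(blocks, penalties)])
    beta = np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)
    return X @ beta

rng = np.random.default_rng(0)
n = 20
blocks = [rng.standard_normal((n, 50)), rng.standard_normal((n, 80))]
penalties = [2.0, 10.0]
y = rng.standard_normal(n)

same = np.allclose(fitted_sample_space(blocks, penalties, y),
                   fitted_feature_space(blocks, penalties, y))
print("sample-space and feature-space fits agree:", same)
```

Because the block products X_b X_bᵀ do not depend on the penalties, they could be computed once and reused while searching over penalty values, which is where a sample-space formulation would save time in cross-validation.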