Statistics: Multivariate Data Integration Using R; Methods and Applications With the mixOmics Package

Statistics: Multivariate Data Integration Using R; Methods and Applications With the mixOmics Package
Kim‐Anh Lê Cao, Zoe Marie Welham
Chapman & Hall/CRC, 2021, xxi + 308 pages, £84.99/$115.00, hardcover ISBN: 978‐1032128078, eBook ISBN: 9781003026860

International Statistical Review · 2024

Krzysztof Podgórski (corresponding author)

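Summary

The book introduces the mixOmics R package, which integrates different types of omics data such as genes and proteins. It presents multivariate analysis methods together with case studies, and is suitable for biostatisticians and data scientists.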

Abstract

Readership: Biostatisticians, data scientists, statistics graduates, computational biologists, and researchers in the medical and biological sciences.
The book introduces the computational tools recently developed by the authors and their large network of collaborators on the mixOmics project. The goal of the project was to implement multivariate statistical analysis methods for modern high-throughput technologies, which generate data on thousands of molecules at different cellular levels in biological material. This diverse and rapidly developing area of biostatistics is nowadays referred to as omics. The value of the book is at least twofold. First, it provides a compact but well-balanced introduction to the methodology of multivariate analysis in the context of omics data. Second, it shows, with a hands-on approach, how the mixOmics R package can be used effectively to perform statistical analyses in which several variables of different types (e.g. genes, proteins and metabolites) must be integrated into one analytic workflow. The package is part of Bioconductor, a leading open-source software platform within R for bioinformatics in the broad sense. The book is organized in three parts. The first, entitled ‘Modern Biology and Multivariate Analysis’, focuses on methods of multivariate analysis performed under the so-called curse of dimensionality, a situation that is typical of omics studies. The presentation only signals the typical problems that arise whenever one has to deal with data sparsity, increased computation, overfitting, or visualization challenges. The second part, ‘mixOmics Under the Hood’, elaborates on the methods implemented in the package. These are put into action in the last part, ‘mixOmics in Action’.
Probably the most interesting part is the last one, in which the reader can explore a variety of methods by ‘playing with’ simple, short R code that leads directly to the results of an analysis and their visualization. It is quite remarkable how little coding is needed to gain profound insights into the example data sets. The following methods are discussed: principal component analysis, projection to latent structures, canonical correlation analysis, discriminant analysis and various forms of data integration. A significant portion of the presentation considers sparse versions of the methods wherever applicable, as sparsity is a critical aspect in high-dimensional settings. Simple but illustrative case studies show how the methodology can be used for empirical data analysis. The authors not only guide the reader through mixOmics but also provide accurate and valuable references whenever a less well-known method or technique is discussed. Because the book is essentially a presentation of the methodology and its applications for the mixOmics project, the project's webpage http://www.mixOmics.org, with its rich additional material, can be considered complementary to the book. It provides an organized update on the most recent developments. The book also has a sub-page within the project site that provides: (1) excellent case-study vignettes (excluding the methodological aspects) and (2) the R code for each case study. Additionally, the mixOmics community on GitHub and https://mixomics-users.discourse.group/ provide a convenient platform for those who want to participate actively in discussions and receive newsletters on the project. All this gives the book a chance to remain relevant despite rapid and constant developments in the field of omics data analysis. The leading theme throughout the book is dimension reduction, and an accurate bibliography helps the reader to find the theoretical background for this broadly relevant topic.
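To make the idea of a sparse, dimension-reducing method concrete, here is a minimal sketch of a sparse first principal component, computed by alternating power iterations with soft-thresholding of the loadings. It is written in plain Python rather than R and is not the mixOmics implementation or code from the book. The function names (`sparse_pc1`, `soft_threshold_keep`), the `keep` parameter and the simulated data are all assumptions made for illustration.

```python
import math
import random


def center_columns(X):
    """Subtract each column mean so every variable has mean zero."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def normalize(v):
    """Scale v to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def soft_threshold_keep(v, keep):
    """Soft-threshold v so that at most `keep` entries stay non-zero."""
    mags = sorted((abs(x) for x in v), reverse=True)
    lam = mags[keep] if keep < len(v) else 0.0
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]


def sparse_pc1(X, keep, n_iter=100):
    """Hypothetical helper: sparse loadings and scores of one component."""
    Xc = center_columns(X)
    n, p = len(Xc), len(Xc[0])
    v = normalize([1.0] * p)  # arbitrary starting loading vector
    for _ in range(n_iter):
        # Left vector: project the samples onto the current loadings.
        u = normalize([sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)])
        # Right vector: regress variables on u, then shrink small loadings to 0.
        w = [sum(Xc[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = normalize(soft_threshold_keep(w, keep))
    scores = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
    return v, scores


# Simulated "few samples, many variables" data (an assumption for this demo):
# 20 samples, 200 variables, only the first 5 driven by a shared latent factor.
random.seed(0)
n, p, n_signal = 20, 200, 5
X = []
for _ in range(n):
    z = random.gauss(0, 3)  # latent factor for this sample
    row = [z + random.gauss(0, 0.5) for _ in range(n_signal)]
    row += [random.gauss(0, 0.5) for _ in range(p - n_signal)]
    X.append(row)

loadings, scores = sparse_pc1(X, keep=5)
selected = [j for j, x in enumerate(loadings) if x != 0.0]
print("variables with non-zero loading:", selected)
```

With many more variables than samples, an ordinary component would assign a non-zero weight to all 200 variables. The thresholding step instead leaves at most `keep` of them, so the component both reduces the dimension and points to a short list of variables to inspect.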
One could wish for a more self-contained formal exposition of the methods, as it is hard to fully understand the rationale behind the discussed procedures without access to a more comprehensive description. It is clear that the authors made this choice consciously, but to my taste, adding some depth to the presentation here and there would serve the intended audience well by giving a better recap of the foundations. In fact, on rare occasions the book does this to some extent, for example in the appendices on data transformation in Part I, similarity matrices in Part II, and sparse PCA and PLS in Part III. However, doing so consistently across all covered topics would, in my opinion, add to the value of the text. The main challenge for a reader is that there is no established single reference monograph on the topics discussed in the text, the closest one (but far from perfect) being Garzon et al. (2023). Nevertheless, Multivariate Data Integration Using R is a well-written book, a properly balanced and well-designed mix of methodology and applications that meets all the standards of exposition on modern computationally assisted inference methods. The authors' extensive and insightful knowledge of the covered topics gives credibility to the package used for the book, while at the same time the book is a good promotion for the package itself. Because of all this, the book should have broad appeal to those wanting to learn dimension reduction methodology, to practitioners in the omics research area who want to use it, and even to general experts in the field of high-dimensional multivariate analysis, for whom it provides a compact reference to the comprehensive mixOmics package. I highly recommend Multivariate Data Integration Using R to these audiences.

Keywords: biostatistics, multivariate analysis, omics data analysis, R language
