An introduction into distance correlation with applications to genetics
Dominic Edelmann
German Cancer Research Center (Heidelberg, Germany)
Abstract:
Measuring dependence between random variables undoubtedly plays a central role in statistics. By far the most prominent dependence coefficient is Pearson correlation, which measures the strength of linear association. If two random variables are independent, Pearson correlation is zero, signifying the least possible dependence. However, the converse is not true. For Pearson correlation (and virtually all other classical correlation coefficients), there exist random variables that are highly dependent but feature a correlation coefficient of zero. This implies that there can be important dependencies in a data set that are not identified by any of the classical correlation coefficients.
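A standard textbook illustration of this failure, not taken from the talk itself, is a symmetric random variable and its square. As an assumption for the example, let X be standard normal and set Y = X^2, so that Y is completely determined by X:

```latex
\begin{align*}
X &\sim \mathcal{N}(0,1), \qquad Y = X^2, \\
\operatorname{Cov}(X, Y) &= \mathbb{E}[X \cdot X^2] - \mathbb{E}[X]\,\mathbb{E}[X^2]
  = \mathbb{E}[X^3] - 0 \cdot 1 = 0 .
\end{align*}
```

The third moment vanishes by symmetry, so the Pearson correlation is zero although Y is a deterministic function of X.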
In two seminal papers, Székely et al. (2007, 2009) introduced the powerful concept of distance correlation. Unlike classical measures such as Pearson or Spearman correlation, the distance correlation between two random variables is zero only if these random variables are independent. Consequently, distance correlation can potentially identify any association between data sets.
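To make the contrast concrete, the following is a minimal sketch of the sample distance correlation for one-dimensional data, in the double-centering form given by Székely et al. (2007). It is written in plain Python for illustration only and is not the implementation used in dcortools. The function names and the toy data (a symmetric integer grid and its squares) are assumptions made for this example.

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centered by row, column and grand mean."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Average of the entrywise products of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic version) of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)      # squared sample distance covariance
    dvar_x = _mean_product(a, a)     # squared sample distance variance of x
    dvar_y = _mean_product(b, b)     # squared sample distance variance of y
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:                 # one sample is constant
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson_correlation(x, y):
    """Ordinary sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (assumed for illustration): y is a deterministic function of x.
x = list(range(-20, 21))
y = [v * v for v in x]

r = pearson_correlation(x, y)
dcor = distance_correlation(x, y)
print(f"Pearson correlation:  {r:.4f}")
print(f"Distance correlation: {dcor:.4f}")
```

On this toy data the Pearson correlation vanishes up to rounding, while the distance correlation is clearly positive. The sketch uses O(n^2) time and memory, so it is only suitable for small samples.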
In the first part of this talk, we provide an introduction to distance correlation, including a detailed comparison with classical Pearson correlation. We further present the R package dcortools and outline its potential for data analysis.
In the second part of the talk, we develop a family of distance covariance methods for testing the association of single nucleotide polymorphisms (SNPs) with quantitative responses. We show that certain versions of distance covariance correspond to locally most powerful tests for specific statistical models, which yields crucial insights into the situations in which these tests perform well. The performance of the approach is investigated in various simulation studies and a real-world example.