Curve Fitting with Distance Correlation

Status: Selected by R. Schill

Thesis: Master

Field: Genomics, Statistics

Advisors: Spang, Pirkl

Courses Required: Bioinformatik Blockpraktikum (can be combined)

Objective: Dynamic models of biological networks predict the course of molecular quantities over time. Typically these predictions are parameterized nonlinear curves, and the values of the parameters are often unknown and must be estimated from data. Moreover, a fit of the model to real measurements is needed for validation. This is a classical machine learning problem: “curve fitting”. We assume a functional relation of the form y = f(x,p) + err, where f(.,p) is a parameterized family of functions. Curves are traditionally fitted by minimizing the squared difference between the curve and the observations over all possible values of the parameter p. The cost function for fitting is hence L(p) = E((y - f(x,p))^2). If the error term err is independent of x, it also follows from y = f(x,p) + err that the residuals r(x) = y - f(x,p) of a correctly fitted curve must be independent of x. A potential dependence between r(x) and x can be quantified and tested using the distance correlation D(x,r(x)). A significant dependence is strong evidence that the fitted curve is incorrect, even if it is close to the data. Here we investigate whether curve fitting can be improved by including the distance correlation in the cost function: L’(p) = a L(p) + (1-a) D(x,r(x)), where a weights the two terms.
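To make D(x,r(x)) concrete, the following is a minimal sketch of the sample distance correlation in plain Python, following the standard definition by Székely, Rizzo and Bakirov (double-centered pairwise distance matrices). The function names are illustrative only and are not taken from the ‘energy’ package, which provides a tested implementation in R. The last lines show residuals that depend on x through a parabola: their ordinary correlation with x is zero, but the distance correlation is not.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of v."""
    n = len(v)
    d = [[abs(vi - vj) for vj in v] for vi in v]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # d is symmetric, so the column means equal the row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long 1-D sequences."""
    n = len(x)
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a * a for row in A for a in row) / n**2
    dvar_y = sum(b * b for row in B for b in row) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0  # one of the samples is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# Residuals that follow a parabola in x: uncorrelated with x, but dependent.
x = [-1.0, 0.0, 1.0]
r = [xi**2 for xi in x]
print(distance_correlation(x, r))
```

This version needs O(n^2) time and memory, which is fine for a first experiment but may be too slow for large samples.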

First Steps:

Become familiar with curve fitting and distance correlation
Simulate data from e.g. y = a x+err, and fit a polynomial of degree 4 using L(p).
Calculate D(x,r(x))
Modify curve fitting by using L’(p)
Validate both predictions on independently generated data and compare both approaches
Do the same with several other functional forms
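The steps above could be sketched as follows in Python with NumPy and SciPy. This is one possible setup under stated assumptions, not a prescribed implementation: the sample sizes, the noise level, the weight a = 0.5, and the use of the Nelder-Mead optimizer started from the least-squares fit are all arbitrary choices made for illustration. The slope of the simulated line is called `slope` here to avoid confusion with the weight a in L’(p).

```python
import numpy as np
from scipy.optimize import minimize


def dcor(x, y):
    """Sample distance correlation of two 1-D arrays."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return (d - d.mean(axis=0, keepdims=True)
                  - d.mean(axis=1, keepdims=True) + d.mean())
    A = centered(np.asarray(x, dtype=float))
    B = centered(np.asarray(y, dtype=float))
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))


def squared_loss(p, x, y):
    """L(p): mean squared residual of the polynomial with coefficients p."""
    return float(np.mean((y - np.polyval(p, x)) ** 2))


def combined_loss(p, x, y, a):
    """L'(p) = a L(p) + (1 - a) D(x, r(x))."""
    r = y - np.polyval(p, x)
    return a * float(np.mean(r ** 2)) + (1 - a) * dcor(x, r)


def simulate(n, rng, slope=2.0, noise=1.0):
    """Draw n points from y = slope * x + err (assumed Gaussian noise)."""
    x = rng.uniform(-3, 3, n)
    return x, slope * x + rng.normal(0, noise, n)


rng = np.random.default_rng(0)
x_train, y_train = simulate(30, rng)
x_test, y_test = simulate(1000, rng)      # independently generated data

# Degree-4 polynomial minimizing L(p) (ordinary least squares).
p_ls = np.polyfit(x_train, y_train, deg=4)

# Degree-4 polynomial minimizing L'(p), started from the least-squares fit.
a = 0.5                                   # assumed weight, not calibrated
res = minimize(combined_loss, p_ls, args=(x_train, y_train, a),
               method="Nelder-Mead")
p_dc = res.x

for name, p in [("L", p_ls), ("L'", p_dc)]:
    d_train = dcor(x_train, y_train - np.polyval(p, x_train))
    print(name, "D(x,r) on training data:", d_train,
          "test error:", squared_loss(p, x_test, y_test))
```

Note that L(p) scales with the noise variance while D(x,r(x)) always lies between 0 and 1, so the effect of a given value of a depends on the scale of the data. For other functional forms, only `simulate` and the model evaluated inside the two loss functions would need to change.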

Questions: How can curve fitting with the cost function L’(p) be implemented? How must the parameter ‘a’ in L’(p) be calibrated? Are there improvements in detecting that the underlying relation is linear? Are there improvements in prediction accuracy? How much data is needed to see improvements?

Start Reading:

T D G Rossiter Curve fitting with the R Environment for Statistical Computing http://www.itc.nl/~rossiter/teach/R/R_CurveFit.pdf

James et al. Chapter 7

http://www-bcf.usc.edu/~gareth/ISL/

Maria L. Rizzo and Gabor J. Szekely Package ‘energy’

http://cran.r-project.org/web/packages/energy/energy.pdf
