A model-free feature selection method for multi-outcome data based on predictive strength

Fuchs Sebastian, University of Salzburg

In regression analysis, the main objective is to estimate the functional relationship between a set of $q \geq 1$ response variables $\mathbf{Y} = (Y_1, \dots, Y_q)$ and a set of $p \geq 1$ predictor variables $\mathbf{X} = (X_1, \dots, X_p)$. When constructing a good model, the question naturally arises to what extent $\mathbf{Y}$ can be predicted from the information provided by $\mathbf{X}$, and which of the predictor variables $X_1, \dots, X_p$ are relevant for the model at all. We propose a direct and natural extension of Azadkia \& Chatterjee's rank correlation $T$ to a set of $q \geq 1$ response variables. The novel measure $T^q$ quantifies the scale-invariant extent of functional dependence of the response vector $\mathbf{Y}$ on the predictor vector $\mathbf{X}$. It characterizes independence of $\mathbf{X}$ and $\mathbf{Y}$ as well as perfect dependence of $\mathbf{Y}$ on $\mathbf{X}$, and hence fulfils all the desired characteristics of a measure of predictability. Aiming at maximum interpretability, we provide various general invariance and continuity results for $T^q$ as well as novel ordering results for conditional distributions, revealing new insights into the nature of $T$. Building upon the graph-based estimator for $T$, we present a non-parametric estimator for $T^q$ that is strongly consistent in full generality, i.e., without any distributional assumptions. Based on this estimator, we develop a model-free, dependence-based feature ranking and forward feature selection procedure, called MFOCI, for data with multiple response variables, which facilitates the selection of the most relevant predictor variables. Several simulations and real-data examples for multi-response data illustrate the broad applicability of $T^q$ and the superior performance of MFOCI in comparison to existing procedures.
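To make the graph-based estimator for $T$ mentioned above more concrete, the following is a minimal sketch of the nearest-neighbour estimator of Azadkia \& Chatterjee for a single response ($q = 1$) and no conditioning variables. It is not the estimator for $T^q$ and not MFOCI, neither of which is specified in this abstract. The function names `nearest_neighbor_index` and `t_estimate` are hypothetical, and ties in distances are broken by taking the first index, which is a simplification of the random tie-breaking in the original construction.

```python
import numpy as np


def nearest_neighbor_index(X):
    """Index of the nearest neighbour of each row of X (Euclidean, brute force)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)               # a point is not its own neighbour
    # Assumption: distance ties go to the first index (original uses random ties).
    return np.argmin(d2, axis=1)


def t_estimate(y, X):
    """Sketch of the graph-based estimate of T(y | X) for one response variable."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    N = nearest_neighbor_index(X)
    # R[i] = #{j : y_j <= y_i},  L[i] = #{j : y_j >= y_i}
    R = (y[None, :] <= y[:, None]).sum(axis=1)
    L = (y[None, :] >= y[:, None]).sum(axis=1)
    numerator = np.sum(n * np.minimum(R, R[N]) - L**2)
    denominator = np.sum(L * (n - L))
    return numerator / denominator
```

On a sample where `y` is a deterministic function of `X`, the value should be close to 1; for independent `y` and `X`, it should be close to 0. The brute-force neighbour search needs memory quadratic in the sample size, so a tree-based search would be preferable for larger samples.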

Area: IS19 - Dependence Modeling (Elisa Perrone)

Keywords: conditional dependence, nonparametric measures of association, model-free variable selection, multi-output feature selection
