Conditional feature importance for mixed data

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Conditional feature importance for mixed data. / Blesch, Kristin; Watson, David S.; Wright, Marvin N.

In: AStA Advances in Statistical Analysis, 2023.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Blesch, K, Watson, DS & Wright, MN 2023, 'Conditional feature importance for mixed data', AStA Advances in Statistical Analysis. https://doi.org/10.1007/s10182-023-00477-9

APA

Blesch, K., Watson, D. S., & Wright, M. N. (Accepted/In press). Conditional feature importance for mixed data. AStA Advances in Statistical Analysis. https://doi.org/10.1007/s10182-023-00477-9

Vancouver

Blesch K, Watson DS, Wright MN. Conditional feature importance for mixed data. AStA Advances in Statistical Analysis. 2023. https://doi.org/10.1007/s10182-023-00477-9

Author

Blesch, Kristin ; Watson, David S. ; Wright, Marvin N. / Conditional feature importance for mixed data. In: AStA Advances in Statistical Analysis. 2023.

Bibtex

@article{4f90b81bb3aa462091ff7325f1e2b739,
title = "Conditional feature importance for mixed data",
abstract = "Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable{\textquoteright}s importance before and after adjusting for covariates—i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are oftentimes neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs—hence, generating synthetic data with similar statistical properties—for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in-line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.",
keywords = "Explainable artificial intelligence, Feature importance, Interpretable machine learning, Knockoffs",
author = "Kristin Blesch and Watson, {David S.} and Wright, {Marvin N.}",
note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",
year = "2023",
doi = "10.1007/s10182-023-00477-9",
language = "English",
journal = "AStA Advances in Statistical Analysis",
issn = "1863-8171",
publisher = "Springer Verlag",
}

RIS

TY - JOUR

T1 - Conditional feature importance for mixed data

AU - Blesch, Kristin

AU - Watson, David S.

AU - Wright, Marvin N.

N1 - Publisher Copyright: © 2023, The Author(s).

PY - 2023

Y1 - 2023

N2 - Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates—i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are oftentimes neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs—hence, generating synthetic data with similar statistical properties—for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in-line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.

AB - Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates—i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are oftentimes neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs—hence, generating synthetic data with similar statistical properties—for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in-line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.

KW - Explainable artificial intelligence

KW - Feature importance

KW - Interpretable machine learning

KW - Knockoffs

U2 - 10.1007/s10182-023-00477-9

DO - 10.1007/s10182-023-00477-9

M3 - Journal article

AN - SCOPUS:85153716613

JO - AStA Advances in Statistical Analysis

JF - AStA Advances in Statistical Analysis

SN - 1863-8171

ER -
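
The abstract above outlines the proposed workflow: replace a feature of interest with a knockoff copy that preserves its dependence on the remaining covariates, then test whether predictive loss increases. Below is a minimal, illustrative Python sketch of that conditional predictive impact (CPI) idea. It is not the authors' implementation (they provide an R-based workflow), a simple Gaussian conditional sampler stands in for sequential knockoffs, and all function and variable names are hypothetical.

```python
# Illustrative sketch of the conditional predictive impact (CPI) idea with
# knockoff substitution. A crude Gaussian conditional sampler stands in for
# sequential knockoffs; it is only meant to convey the mechanics.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: x0 and x1 are correlated, but only x0 drives y.
n = 1000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = x0 + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

def gaussian_knockoff(X, j, rng):
    """Sample a knockoff for column j from a Gaussian conditional of
    X[:, j] given the remaining columns (stand-in for sequential knockoffs)."""
    rest = np.delete(X, j, axis=1)
    reg = LinearRegression().fit(rest, X[:, j])
    resid_sd = np.std(X[:, j] - reg.predict(rest))
    return reg.predict(rest) + rng.normal(scale=resid_sd, size=X.shape[0])

def cpi(model, X_te, y_te, j, rng):
    """Mean per-sample loss increase when feature j is replaced by its knockoff,
    plus a one-sided paired t-test p-value for a positive shift."""
    loss_orig = (y_te - model.predict(X_te)) ** 2
    X_ko = X_te.copy()
    X_ko[:, j] = gaussian_knockoff(X_te, j, rng)
    loss_ko = (y_te - model.predict(X_ko)) ** 2
    delta = loss_ko - loss_orig
    t_stat, p_value = stats.ttest_1samp(delta, 0.0, alternative="greater")
    return delta.mean(), p_value

for j in range(X.shape[1]):
    est, p = cpi(model, X_te, y_te, j, rng)
    print(f"feature x{j}: CPI = {est:.4f}, p = {p:.3f}")
```

In this toy setup a conditional measure should flag only x0 as important, since x1 carries no information about y beyond what x0 already provides, whereas a marginal measure would also credit x1 because of its correlation with x0.
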
