Pitfalls in machine learning interpretability: Manipulating partial dependence plots to hide discrimination

Insurance Mathematics and Economics · 2025

Cited by 3 · Top 2% among papers in the same journal and year

Renmin University of China (人大) BABS 3

Xin Xi · University of New South Wales
Giles Hooker · University of Pennsylvania
Fei Huang · University of New South Wales

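Summary

The paper proposes an adversarial framework that modifies a black-box model in order to manipulate its partial dependence plots. The manipulated plots hide the model's discriminatory behavior while most of the original predictions are preserved, which serves as a warning to regulators and practitioners.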

Abstract

The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models and interpretation tools for decision making. This paper proposes an adversarial framework to uncover the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. This adversarial framework modifies the original black box model to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that can conceal discriminatory behaviors while preserving most of the original model's predictions. This framework can produce multiple fooled PD plots via a single model. By using real-world datasets including an auto insurance claims dataset and COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools like PD plots while retaining almost all the predictions of the original black-box model. Managerial insights for regulators and practitioners are provided based on the findings.
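
The mechanism the abstract describes can be illustrated with a toy example. The sketch below is not the authors' framework; it is a minimal illustration under stated assumptions: a synthetic dataset with a binary sensitive attribute `s` and a strongly correlated proxy `x1`, a hand-written discriminatory `original_model`, and a hypothetical `fooled_model` that changes predictions only for points outside the observed data region (the extrapolation domain). All names and the 0.5 threshold are invented for illustration.

```python
import numpy as np


def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # creates unrealistic rows when features are correlated
        values.append(predict(X_mod).mean())
    return np.array(values)


# Synthetic data (assumption): column 0 is the sensitive attribute s,
# column 1 is a proxy x1 that always lies within 0.2 of s.
rng = np.random.default_rng(0)
n = 2000
s = rng.integers(0, 2, size=n).astype(float)
x1 = s + rng.uniform(-0.2, 0.2, size=n)
X = np.column_stack([s, x1])


def original_model(X):
    # Discriminatory by construction: the prediction depends directly on s.
    return 1.0 + 2.0 * X[:, 0]


def fooled_model(X):
    # Rows where s and x1 disagree never occur in the data; treat them as
    # the extrapolation domain (hypothetical rule for this toy example).
    in_domain = np.abs(X[:, 1] - X[:, 0]) < 0.5
    # Off-domain, ignore s and predict from the proxy instead.
    off_domain_pred = 1.0 + 2.0 * np.round(X[:, 1])
    return np.where(in_domain, original_model(X), off_domain_pred)


grid = np.array([0.0, 1.0])
pd_original = partial_dependence(original_model, X, feature=0, grid=grid)
pd_fooled = partial_dependence(fooled_model, X, feature=0, grid=grid)

print("PD of s, original model:", pd_original)
print("PD of s, fooled model:  ", pd_fooled)
print("Predictions unchanged on real data:",
      np.allclose(fooled_model(X), original_model(X)))
```

In this toy setting the fooled model returns the same predictions as the original on every observed row, yet its PD curve for `s` is flat, because the PD computation averages over rows with `s` overwritten and those rows fall almost entirely in the region where the model was altered. The paper's actual construction and datasets are more involved than this sketch.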

Machine learning interpretability · AI ethics · Discrimination detection
