AI models deployed on platforms are of interest to at least two distinct groups:
users and attackers. In the first case, it is increasingly clear that the impact of these
models on users' everyday lives must be audited to prevent abuse or bias [LMPT24]. In the
second case, the cost of training these models calls for proper defenses against malicious entities
and offensive competitors [MGW+25]. The ambition of the Cluster SequoIA's FANG chair is
to bridge the gap between these two critical setups: legal auditing and offensive security, in
the domain of modern deployed AI models. From this unique standpoint, and from the body
of work we have helped build in the field of AI auditing (e.g., [BGDV+25, GLMT+24,
GLMP+25, Ric26]), we expect to gain new insights into attacking and defending deployed AI
models.
A key observation from this body of work is that platforms hosting AI models are not passive
actors. We have shown that platforms are incentivized to maintain the utility of their model
despite regulation, and may actively manipulate audit outcomes to their advantage [GLMT+24].
Indeed, audit manipulation, where a platform returns strategically altered responses to an
auditor's queries, can severely disrupt the reliability of black-box audits [LMT20]. This manipulative
capability, currently studied as a threat to auditors, becomes, when viewed from the security
standpoint, a powerful and largely unexplored defensive tool for model owners facing attackers.
This Ph.D. thesis proposes to bring the concepts and techniques of audit manipulation [GLMT+24,
Fuk20, Yan22] to the field of AI security, in order to design novel defenses for deployed AI models.
The central insight is the following: when a platform detects an ongoing attack (e.g., model
extraction, adversarial example crafting, or fingerprinting-based reconnaissance [Ric26]), rather
than simply blocking the attacker (which signals detection and incentivizes the attacker to adapt),
a more effective strategy is to manipulate the responses returned to the attacker. By returning
strategically biased results, the platform can degrade the quality of the attacker's extracted
information, poison surrogate models being built by the attacker, or feed misleading signals that
waste the attacker's resources. This is conceptually analogous to honeypots and deception-based
defenses in classical cybersecurity, but instantiated in the specific context of machine learning
model APIs.
A critical challenge arises when the platform cannot reliably distinguish attackers from
legitimate users or regulators. In this regime of uncertain detection, the platform must navigate a
fundamental tension: manipulated responses, if served to legitimate users, degrade the model's
utility [Kur25], while faithful responses, if served to attackers, leak exploitable information.
Randomized defenses [MFL22] offer a principled framework for this setting: by
injecting controlled noise or perturbations into a fraction of responses, the platform can
probabilistically disrupt attacks while bounding the impact on legitimate users.
We will study how to calibrate such randomized manipulation strategies by characterizing the
trade-off between attack disruption rate and model utility loss.
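To make the calibration question concrete, the following is a minimal sketch and not the method the thesis will develop. It assumes a hypothetical binary detector that flags attacker queries with probability tpr and legitimate queries with probability fpr, and a platform that answers a flagged query with a wrong label with probability p. The names serve, simulate, p, tpr, and fpr are illustrative assumptions that do not come from the cited works. Under this toy model, the expected attack disruption rate is tpr * p and the expected utility loss is fpr * p, which the simulation estimates empirically.

```python
import random


def serve(true_label, flagged, p, num_classes, rng):
    """Return the label sent back for one query (hypothetical platform logic).

    A flagged query is answered with a uniformly chosen wrong label with
    probability p; every other query receives the true label.
    """
    if flagged and rng.random() < p:
        wrong = [c for c in range(num_classes) if c != true_label]
        return rng.choice(wrong)
    return true_label


def simulate(p, tpr, fpr, n=20_000, num_classes=10, seed=0):
    """Estimate (attack disruption rate, utility loss) by simulation.

    tpr and fpr are the assumed rates at which the detector flags attacker
    and legitimate queries, respectively.
    """
    rng = random.Random(seed)
    corrupted = {"attacker": 0, "user": 0}
    for kind, flag_rate in (("attacker", tpr), ("user", fpr)):
        for _ in range(n):
            true_label = rng.randrange(num_classes)
            flagged = rng.random() < flag_rate
            if serve(true_label, flagged, p, num_classes, rng) != true_label:
                corrupted[kind] += 1
    return corrupted["attacker"] / n, corrupted["user"] / n


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        disruption, loss = simulate(p, tpr=0.8, fpr=0.05)
        print(f"p={p:.2f}  disruption={disruption:.3f}  utility_loss={loss:.3f}")
```

In this simplified setting both quantities grow linearly in p, so the ratio between disruption and utility loss is fixed by the detector quality (tpr / fpr); richer attack and utility models would presumably yield a less trivial trade-off curve.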
This thesis will leverage the formal understanding of what information different attacks
extract, and at what query cost, to design defenses that are targeted: manipulating precisely the
dimensions of the model's output that are most valuable to attackers, while preserving the
dimensions that matter for legitimate use and regulatory audits. This cat-and-mouse (or platform
and regulator) defense/audit game might improve our understanding of the limits of what is
achievable by both parties in this black-box scenario.
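As one hedged illustration of what a targeted manipulation could look like, the sketch below assumes a classification API that returns a probability vector. It assumes, for the sake of the example, that legitimate users mostly rely on the top-1 class while an extraction attacker exploits the full vector. The function manipulate_scores and its strength parameter are hypothetical names introduced here, not a technique taken from the cited works.

```python
import random


def manipulate_scores(probs, strength, rng):
    """Perturb a probability vector while keeping its top-1 class unchanged.

    probs    -- the model's true output distribution (sums to 1)
    strength -- mixing weight in [0, 1]; 0 returns the scores unchanged
    rng      -- a random.Random instance used to draw the noise
    """
    top = max(range(len(probs)), key=probs.__getitem__)

    # Draw a random distribution and mix it with the true scores.
    noise = [rng.random() for _ in probs]
    total = sum(noise)
    noise = [x / total for x in noise]
    mixed = [(1 - strength) * p + strength * q for p, q in zip(probs, noise)]

    # If mixing moved the maximum elsewhere, swap it back so that the
    # originally predicted class still carries the largest score.
    best = max(range(len(mixed)), key=mixed.__getitem__)
    mixed[top], mixed[best] = mixed[best], mixed[top]
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    print(manipulate_scores([0.7, 0.2, 0.1], strength=0.9, rng=rng))
```

The design choice here is that the predicted label is left intact, while the remaining scores, assumed to be the dimensions most useful to an attacker training a surrogate model, carry little of the original signal.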