Grzegorz (Greg) Głuch

Simons Institute for the Theory of Computing, Melvin Calvin Laboratory, Room 318 · Berkeley · CA 94720

I am a Simons-Berkeley Postdoctoral Researcher hosted by Shafi Goldwasser, currently at MIT. I completed my PhD at EPFL, where I was co-advised by Rüdiger Urbanke and Michael Kapralov. Before that, I studied at the University of Wroclaw, working with Jerzy Marcinkowski.

I work on the theoretical aspects of AI, focusing on AI safety and resilience. I am a member of the Resilience Research Pod at the Simons Institute. My research interests also include quantum complexity theory and its connections to physics.

Trying to make an impact.

Contact: grzegorzgluch93@gmail.com


Talks and Media

Dec 10, 2025

Quanta Magazine

“On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment” was featured in a Quanta Magazine article.

Read the article

London · Oct 30, 2025

UK AISI Alignment Conference

Invited talk

Reasoning Provably Improves Safety for Generation

Oct 21, 2025

New York University

Crypto and Security Seminar

What Can Cryptography Tell Us About AI Safety?

Oct 14, 2025

Massachusetts Institute of Technology

ML + Crypto Seminar

What Can Cryptography Tell Us About AI Safety?


Publications

Authors appear in alphabetical order (a convention in theoretical computer science).

Private Proofs of When and Where

Preprint
Abstract

Position verification schemes are interactive protocols in which entities prove their physical location to others; this enables interactive proofs for statements of the form "I am at location L." Although secure position verification cannot be achieved with classical protocols (even under computational assumptions), it is feasible with quantum protocols. In this paper we introduce the notion of zero-knowledge position verification, which generalizes position verification in two ways: (1) it enables entities to prove more sophisticated statements about their locations at different times (for example, "I was NOT near location L at noon yesterday"); (2) it maintains privacy for every detail about their true location other than the statement they are proving. We construct zero-knowledge position verification from standard position verification and post-quantum one-way functions. The central tool in our construction is a primitive we call position commitments, which allow entities to privately commit to their physical position at a particular moment and reveal it at some later time.

A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning

Preprint
Abstract

In this paper, we initiate a cryptographically inspired theoretical study of detection versus mitigation of adversarial inputs produced by attackers of machine learning algorithms at inference time. We formally define defense by detection (DbD) and defense by mitigation (DbM). Our definitions take the form of a 3-round protocol between two resource-bounded parties: a trainer/defender and an attacker. The attacker aims to produce inference-time inputs that fool the training algorithm. We define correctness, completeness, and soundness properties to capture successful defense at inference time while not degrading (too much) the performance of the algorithm on inputs from the training distribution. We first show that achieving DbD and achieving DbM are equivalent for ML classification tasks. Surprisingly, this is not the case for ML generative learning tasks, where there are many possible correct outputs for each input. We show a separation between DbD and DbM by exhibiting a generative learning task for which it is possible to defend by mitigation but provably impossible to defend by detection, assuming that Identity-Based Fully Homomorphic Encryption (IB-FHE), publicly verifiable zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARKs), and strongly unforgeable signatures exist. The mitigation phase uses significantly fewer samples than the initial training algorithm.

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

ICLR 2026
Abstract

With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.

The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses

NeurIPS 2025 · ICML 2024 Workshop
Abstract

We formalize and analyze the trade-off between backdoor-based watermarks and adversarial defenses, framing it as an interactive protocol between a verifier and a prover. While previous works have primarily focused on this trade-off, our analysis extends it by identifying transferable attacks as a third, counterintuitive but necessary option. Our main result shows that for all learning tasks, at least one of the three exists: a watermark, an adversarial defense, or a transferable attack. By transferable attack, we refer to an efficient algorithm that generates queries indistinguishable from the data distribution and capable of fooling all efficient defenders. Using cryptographic techniques, specifically fully homomorphic encryption, we construct a transferable attack and prove its necessity in this trade-off. Furthermore, we show that any task that satisfies our notion of a transferable attack implies a cryptographic primitive, thus requiring the underlying task to be computationally complex. Finally, we show that tasks of bounded VC-dimension allow adversarial defenses against all attackers, while a subclass allows watermarks secure against fast adversaries.

4/3 Rectangle Tiling lower bound

IPL 2025
Abstract

Given an n×n array of positive numbers, the goal is to find a tiling using at most p rectangles that minimizes the maximum rectangle weight (the sum of the entries it covers). We prove that it is NP-hard to approximate this problem within a factor of 4/3.

Nonlocality under Computational Assumptions

STOC 2024 · QIP 2025
Abstract

Nonlocality and its connections to entanglement are fundamental features of quantum mechanics that have found numerous applications in quantum information science. A set of correlations is said to be nonlocal if it cannot be reproduced by spacelike-separated parties sharing randomness and performing local operations. An important practical consideration is that the runtime of the parties has to be shorter than the time it takes light to travel between them. One way to model this restriction is to assume that the parties are computationally bounded. We therefore initiate the study of nonlocality under computational assumptions and derive the following results: (a) We define the set 𝖭𝖾𝖫 (not-efficiently-local) as consisting of all bipartite states whose correlations arising from local measurements cannot be reproduced with shared randomness and polynomial-time local operations. (b) Under the assumption that the Learning With Errors problem cannot be solved in quantum polynomial-time, we show that 𝖭𝖾𝖫=𝖤𝖭𝖳, where 𝖤𝖭𝖳 is the set of all bipartite entangled states (pure and mixed). This is in contrast to the standard notion of nonlocality where it is known that some entangled states, e.g. Werner states, are local. (c) We prove that if 𝖭𝖾𝖫=𝖤𝖭𝖳 unconditionally, then 𝙱𝚀𝙿≠𝙿𝙿. (d) Using (c), we show that a certain natural class of 1-round delegated quantum computation protocols that are sound against 𝙿𝙿 provers cannot exist.

Bayes Complexity of Learners vs Overfitting

Under Review
Abstract

We introduce a new notion of complexity of functions and show that it has the following properties: (i) it governs a PAC-Bayes-like generalization bound, (ii) for neural networks it relates to natural notions of complexity of functions (such as variation), and (iii) it explains the generalization gap between neural networks and linear schemes. Moreover, our notion naturally generalizes to neural networks with several layers. Even though computing our complexity is nontrivial in general, an upper bound is often easy to derive, even for larger numbers of layers and for structured functions such as periodic functions. An upper bound we derive allows us to show a separation, in the number of samples needed for good generalization, between 2- and 4-layer neural networks for periodic functions.

Breaking a Classical Barrier for Classifying Arbitrary Test Examples in the Quantum Model

AISTATS 2023 · QIP 2022 Poster
Informal Abstract

This paper presents a surprising connection between recent advances in delegation of quantum computation and adversarial robustness. Imagine an adversary A sitting between a classifier C and a source S that produces samples to be classified. Can we design an interactive protocol between C and A in which A convinces C that the samples being sent for classification really came from S? Using ideas from delegation of quantum computation, we show this is possible if A receives quantum states from S and communicates with C classically. Why do we need quantum mechanics? Assume that S is uniform on {0,1}^n. If A received classical samples from S, how could A convince C that the samples it sent really came from S and were not handcrafted by A? This seems to us to be an almost impossible problem to solve, as we discuss in the paper.

Noether: The more things change, the more stay the same

Under Review
Informal Abstract

The field of theoretical machine learning has produced some initial explanations for why neural networks perform so well in practice. Unfortunately, when reading these papers it is hard to see a common theme or a unifying message. In this paper we show how to look at many of the recent advances through one lens: the lens of symmetries. We show how this concept explains many previous results and how it helps to obtain new ones. To do so, we study a connection to Noether's theorem from physics.

Exponential Separation between Two Learning Models and Adversarial Robustness

NeurIPS 2021 Spotlight
Informal Abstract

This paper has two sides: the technical one and potential applications to adversarial robustness. On the technical side, we show an exponential separation between two models: the standard PAC-learning model and an Equivalence-Query model. Thinking of adversarial examples as counterexamples, our result can be viewed as theoretical confirmation of findings that structured adversarial examples can speed up learning. We also discuss implications for adversarial robustness beyond norm-bounded threat models.

Query Complexity of Adversarial Attacks

ICML 2021
Informal Abstract

How hard is it to adversarially attack a network with only black-box access? We analyze a spectrum of threat models indexed by query access and give lower bounds, in terms of the entropy of the decision boundaries, on the number of queries needed to match white-box attacks. We also analyze classical learning algorithms on synthetic tasks and prove meaningful security guarantees.

Spectral Clustering Oracles in Sublinear Time

SODA 2021
Abstract

Given a graph G that can be partitioned into k disjoint expanders, we construct a small-space data structure (a spectral clustering oracle) that allows vertices to be quickly classified according to their cluster. Our main result gives sublinear query time with strong per-cluster guarantees, using dot-product access to a spectral embedding via estimates of short random walks and a carefully analyzed linear transformation.

Constructing a provably adversarially-robust classifier from a high accuracy one

AISTATS 2020
Abstract

Given black-box access to a high-accuracy classifier f, we show how to construct a classifier g that remains accurate and is provably robust to adversarial L2 perturbations. Our approach builds on randomized smoothing and uses geometric tools like random partitions and doubling dimension to bound adversarial error in terms of optimal error.

Pre-PhD work

The first order truth behind undecidability of regular path queries determinacy

ICDT 2019
Abstract

We show that determinacy remains undecidable even without regularity assumptions: it is undecidable for finite unions of conjunctive path queries, strengthening prior undecidability results for Regular Path Queries.

Can One Escape Red Chains? Regular Path Queries Determinacy is Undecidable

LICS 2018
Abstract

We solve a long-standing open problem by proving that Query Determinacy is undecidable for Regular Path Queries, a foundational query language for graph databases.


Teaching

Videos from a course on "Quantum Interactive Proofs" that I taught at EPFL can be found here


Education

École Polytechnique Fédérale de Lausanne (EPFL)

Doctor of Philosophy
Computer Science
2018 - 2024

University of Wroclaw

Master of Science
Computer Science
2015 - 2018

University of Wroclaw

Bachelor of Science
Mathematics
2012 - 2015

University of Wroclaw

Bachelor of Science
Computer Science
2012 - 2015