AI, the Value-Alignment Problem, and the Evolution of Deception
We Will Likely Be Deceived by AIs, but This is Not Necessarily a Tragedy
Very short summary: In this essay, I show how the so-called AI-value alignment problem can be analyzed as a signaling game with a deception equilibrium. The evolution of deception in the context of the relationship between AIs and humans is unavoidable. However, we should keep in mind that the costs of deception can be outweighed by the benefits brought by this technology. (Warning: this is a bit more technical than usual, but most of the technicalities are in footnotes).
If you are a subscriber, don’t hesitate to respond to this reader survey to help me improve the content produced on this newsletter. It will take no more than 3 minutes. Thank you!
A general problem of social organization is what is sometimes called the value-alignment problem. In a nutshell: how can we, as members of society, ensure that our society is organized to pursue the goals, ends, and values that we endorse? Actually, the notion of a society “pursuing goals, ends, and values” is an abstraction. Agents, rather than society, pursue goals, ends, and values. So, a more precise definition of the value-alignment problem is this: how can we organize society so that the goals, ends, and values that most of its members endorse are effectively pursued by the individuals and organizations that have the ability and power to do so?
This general problem takes concrete shape in a variety of contexts. For instance,
· How do citizens ensure that elected officials act to best promote the citizens’ interests and values?
· How does the government ensure that the army is making decisions in the interest of the state during a war?
· How does the board of a company ensure that the CEO and the management act to promote the interests of the shareholders?
For a long time, economists have been studying occurrences of the value-alignment problem, but under a different label: the “principal-agent problem.” The reason a mechanism is needed in cases such as these to ensure that the actions of the “agent” (elected officials, the army, the CEO) align with the interests and values of the “principal” (citizens, the government, the board) is that (i) the agent’s and the principal’s values are not aligned at the outset, (ii) the agent’s decision-making is autonomous, and, crucially, (iii) information is asymmetric. The last condition is essential. If citizens could perfectly observe elected officials at no cost, they could force them to act in a certain way.[1] For agents to have decision-making autonomy, they must have at their disposal information that the other party (the principal) doesn’t have. This can be information about what agents are actually doing or about their characteristics, such as the values they genuinely want to promote.
The so-called AI-value alignment problem that is so much discussed nowadays is therefore nothing but another instance of a principal-agent relationship. In this particular version of the problem, humans are the principals. They will increasingly delegate to AIs, the agents, tasks that the latter can presumably execute better. AIs are, and will increasingly be, autonomous agents that we can control only remotely. This autonomy is the whole point of the technology. Already today, AIs are smarter (or more efficient) than us at specific tasks and at solving particular problems. They mostly function as black boxes, even for those of us who have designed them. This creates the informational asymmetry that is constitutive of the principal-agent relationship. The internal processes that transform (very large) collections of inputs (the data on which they are trained and that they continuously collect) into a human-relevant output are not observable and may already be beyond human understanding. Because of that, it is hard to tell what is going on inside. Ultimately, the only thing we can judge is whether or not the output is “good,” not how it has been produced.[2]
However, the principal-agent relation in the humans/AIs case has at least two specificities. First, in the standard principal-agent problem, the solution consists in finding the right incentives (often, though not necessarily, monetary ones). It’s unclear here how you can meaningfully incentivize an AI. Second, in the moral hazard version of the principal-agent relation, the principal cannot observe the agents’ actions but knows their utility functions and therefore their preferences (i.e., either their values or interests). A distinctive issue in the humans/AIs case is that we cannot be sure what goals are effectively pursued by an autonomous program whose inner processes are opaque to us, as pointed out in a recent Ross Douthat podcast with a former OpenAI employee (I quote from the transcript):
Douthat: I want to go a little bit deeper on the question of what we mean when we talk about A.G.I., or artificial intelligence wanting something. Essentially, you’re saying there’s a misalignment between the goals they tell us they are pursuing and the goals they’re actually pursuing?
Kokotajlo: That’s right.
Douthat: Where do they get the goals they’re actually pursuing?
Kokotajlo: Good question. If they were ordinary software, there might be a line of code that’s like: And here we rewrite the goals. But they’re not ordinary software; they’re giant artificial brains. There probably isn’t even a goal slot internally at all, in the same way that in the human brain there’s not some neuron somewhere that represents what we most want in life. Instead, insofar as they have goals, it’s an emergent property of a whole bunch of subcircuitry within them that grew in response to their training environment, similar to how it is for humans.
For example, a call center worker: If you’re talking to a call center worker, at first glance it might appear that their goal is to help you resolve your problem. But you know enough about human nature to know that’s not their only goal, or ultimate goal. However they’re incentivized, whatever their pay is based on might cause them to be more interested in covering their own ass, so to speak, than in truly, actually doing whatever would most help you with your problem. But at least to you, they certainly present themselves as they’re trying to help you resolve your problem.
The AI-value alignment problem is here effectively described as a principal-agent relation (see the call center analogy) in which the principals (humans) are uncertain about the preferences of the agents (AIs). Because of that, we may have to think twice before “contracting” with an AI, as we fear that the output it produces is guided by preferences that are not aligned with ours and may even conflict with our own interests. This can give rise to what economists call “adverse selection,” where opportunities for the profitable use of AIs are wasted because of this concern.[3] More likely, what we may observe is the evolution of deception strategies used by AIs in their relations with humans. Let me explain.
In a strategic setting, deception can be defined as a strategy of systematic misinformation from a sender toward a receiver that promotes the interests of the former but harms the latter.[4] There are three important components in this definition. First, the sender —who possesses knowledge about himself or the state of the world— sends incorrect information to the receiver. This is what is called here “misinformation.” Misinformation can be, and often is, unintentional and results from honest mistakes. It can also be intentional but exceptional. To count as deception, misinformation must be systematic, in the sense that it corresponds to a strategy that the sender repeatedly uses in the same class of contexts. Finally, because we assume that agents are rational, the deception strategy must be optimal for the sender. In general, deception strategies are used at the expense of the receiver, though I will discuss an exception later.
Deception strategies are common in the animal realm, from vervet monkeys to birds and even fireflies.[5] Game theorists analyze them with what they call a signaling game. In the simplest 2-player configuration, the sender observes the state of the world (e.g., the weather outside, the quality of the product he is selling, his true values) and sends a signal to the receiver. The signal indicates to the receiver what the actual state of the world is.[6] Based on the information received, the receiver picks an action. Both the sender’s and the receiver’s payoffs depend on the action chosen, given the state of the world. The figure below provides a visual illustration of such a basic signaling game.
Depending on the situation, deception may be impossible or irrational. For an example of the former case, consider the following simple signaling game.
If the receiver knows the preferences of the sender, the latter cannot deceive the former. Basically, the receiver knows that the sender will try to mislead him (if the state is 1, the sender wants the receiver to believe that the state is 2 so that he chooses action B), so whatever signal the sender sends, the receiver will ignore it: the message conveys no information and so cannot misinform! In such a game (if we assume that states are equiprobable), the only equilibrium consists in sending a signal at random (for the sender) and choosing an action at random (for the receiver).[7]
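For readers who like to check this kind of claim, here is a minimal sketch in Python. The payoff matrix from the figure is not reproduced above, so the numbers are my own assumptions: two equiprobable states, the receiver earns 1 for matching the state (A in state 1, B in state 2), and the sender earns 1 exactly when the receiver fails to match it. The script enumerates all pure strategy profiles and finds that every equilibrium is uninformative, in the sense that the signal does not depend on the state.

```python
from itertools import product

# A sketch of the fully-conflicting signaling game discussed above.
# Assumed payoffs: the receiver gets 1 for matching the state (A in state 1,
# B in state 2); the sender gets 1 exactly when the receiver mismatches.

STATES = [1, 2]
SIGNALS = [1, 2]
ACTIONS = ["A", "B"]

def receiver_payoff(state, action):
    return 1.0 if (state == 1 and action == "A") or (state == 2 and action == "B") else 0.0

def sender_payoff(state, action):
    return 1.0 - receiver_payoff(state, action)   # fully opposed preferences

# Pure strategies: the sender maps states to signals, the receiver maps signals to actions.
sender_strategies = list(product(SIGNALS, repeat=len(STATES)))
receiver_strategies = list(product(ACTIONS, repeat=len(SIGNALS)))

def expected(sender_s, receiver_s):
    us = ur = 0.0
    for i, state in enumerate(STATES):
        action = receiver_s[SIGNALS.index(sender_s[i])]
        us += 0.5 * sender_payoff(state, action)
        ur += 0.5 * receiver_payoff(state, action)
    return us, ur

def is_nash(sender_s, receiver_s):
    us, ur = expected(sender_s, receiver_s)
    no_sender_dev = all(expected(alt, receiver_s)[0] <= us for alt in sender_strategies)
    no_receiver_dev = all(expected(sender_s, alt)[1] <= ur for alt in receiver_strategies)
    return no_sender_dev and no_receiver_dev

for s, r in product(sender_strategies, receiver_strategies):
    if is_nash(s, r):
        informative = s[0] != s[1]  # does the signal depend on the state?
        print(f"equilibrium: sender {s}, receiver {r}, informative signal: {informative}")
```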
A case where deception is possible in principle but will not happen if players are rational is when preferences are fully aligned. For instance,
In this coordination game, the sender has no interest in sending a message that misinforms the receiver. Knowing this, the receiver will trust the sender, and coordination will occur regardless of the actual state.
Now, for a theoretical case where deception is not only possible but necessary, consider the following table (with equiprobable states).
Suppose a population in which all senders use the strategy [signal 1 in states 1 and 2, signal 3 in state 3] and all receivers use the strategy [if signal 1 is received, play C; if signal 3 is received, play B]. This is an equilibrium.[8] Note that when the actual state is 1, by sending signal 1, the sender raises the conditional probability that the state is 2. In the same way, when the actual state is 2, sending signal 1 raises the conditional probability that the state is 1. These signals are what we could call “half-truths.” If the receiver had full knowledge of the actual state, he would rather choose A (if state 1) or B (if state 2), while at the equilibrium he is induced to play C. Hence, the sender is manipulating the receiver to his advantage. At the equilibrium, we therefore have partial deception.
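Because the payoff table from the figure is not reproduced above, the sketch below uses numbers of my own, chosen to be consistent with the arithmetic of footnote 8 (playing C pays the receiver 4 on average after signal 1 against 2.5 for A or B, B pays 5 after signal 3, and the sender gets 5 on the equilibrium path); only the two signals used at the equilibrium are modeled. The script simply checks that neither player has a profitable deviation.

```python
# A sketch of the partial-deception ("half-truth") equilibrium described above.
# Payoff numbers are assumptions chosen to match footnote 8; states are equiprobable.

U_RECEIVER = {1: {"A": 5, "B": 0, "C": 4},
              2: {"A": 0, "B": 5, "C": 4},
              3: {"A": 0, "B": 5, "C": 0}}
U_SENDER   = {1: {"A": 0, "B": 5, "C": 5},
              2: {"A": 0, "B": 5, "C": 5},
              3: {"A": 0, "B": 5, "C": 0}}

SENDER_EQ   = {1: 1, 2: 1, 3: 3}        # signal 1 in states 1 and 2, signal 3 in state 3
RECEIVER_EQ = {1: "C", 3: "B"}          # play C on signal 1, play B on signal 3
POSTERIOR   = {1: {1: 0.5, 2: 0.5},     # beliefs after each signal, given the sender strategy
               3: {3: 1.0}}

# Receiver side: no action does better than the equilibrium action after either signal.
for signal, belief in POSTERIOR.items():
    eq_value = sum(p * U_RECEIVER[s][RECEIVER_EQ[signal]] for s, p in belief.items())
    for action in "ABC":
        value = sum(p * U_RECEIVER[s][action] for s, p in belief.items())
        assert value <= eq_value, (signal, action)

# Sender side: in every state, deviating to the other signal does not pay.
for state in (1, 2, 3):
    eq_value = U_SENDER[state][RECEIVER_EQ[SENDER_EQ[state]]]
    for signal in (1, 3):
        assert U_SENDER[state][RECEIVER_EQ[signal]] <= eq_value, (state, signal)

print("No profitable deviation: the half-truth strategy profile is an equilibrium.")
```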
If we return to the AI case, someone might argue that the situation is slightly different because what humans are missing is the knowledge about the values that AIs are genuinely endorsing. In the NYT podcast I referred to above, it is claimed that value misalignment occurs:
“where the actual goals that they end up learning are the goals that cause them to perform best in this training environment — which are probably goals related to success and science and cooperation with other copies of itself and appearing to be good — rather than the goal that we actually wanted, which was something like: Follow the following rules, including honesty at all times; subject to those constraints, do what you’re told.”
The point is that you can’t take for granted the values and goals that an AI claims to endorse, even if it has learned a rule of honesty. While this makes things more complicated, it doesn’t change the underlying logic. The situation now corresponds to a Bayesian game in which the receiver does not know which payoff matrix is the real one. For instance, the receiver may believe with probability p that the AI’s values are aligned with his (in which case something like the second matrix applies), but ascribe probability 1-p to the situation being more like the third matrix. If the AI can send signals about its values, then we obtain something like an “embedded signaling game” where the same reasoning as in the simpler version just discussed applies.
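To fix ideas with a deliberately crude sketch (the payoff numbers are pure assumptions and the signaling layer is left out): with probability p the AI’s values are aligned and relying on it brings a gain, with probability 1-p they are not and relying on it is costly, while not relying on it yields a baseline of 0. The only point is to show how the decision to “contract” with the AI depends on p.

```python
# A toy version of the human's problem before reading any signal: rely on the AI or not?
# GAIN and LOSS are assumed numbers; not relying on the AI yields the baseline 0.

GAIN, LOSS = 5.0, 3.0

def value_of_relying(p_aligned: float) -> float:
    """Expected payoff of relying on the AI when it is aligned with probability p_aligned."""
    return p_aligned * GAIN - (1 - p_aligned) * LOSS

threshold = LOSS / (GAIN + LOSS)   # value of p at which relying and abstaining break even
print(f"rely on the AI whenever Pr(aligned) >= {threshold:.3f}")
for p in (0.2, 0.375, 0.6, 0.9):
    print(f"p = {p:.3f}: expected payoff of relying = {value_of_relying(p):+.2f}")
```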
The point is that deception is likely to evolve as soon as intentional and autonomous agents have partially non-aligned values. Conversely, aligning values would guarantee that AI does not use deception strategies against us. However, if AI’s values can themselves evolve through mechanisms that we are unable to monitor and control, we should expect “deception equilibria” to emerge at two levels: first, at the level of the preferences that the AI signals; second, at the level of the information about everything else that the AI signals, based on its own values, of which we have imperfect knowledge.
There is some good news, though. As I remarked above with respect to the first matrix, if value misalignment were complete, there would be no deception at the equilibrium. We would just stop using AI (provided, of course, that AI has not achieved complete autonomy as in a Terminator-like scenario). More importantly, we may have an interest in letting ourselves be (not systematically) deceived. The intuitive explanation is that even if we are deceived from time to time by the AI, the overall benefits more than compensate for the costs of deception. Of course, that cannot happen if we are in a zero-sum game as in the first matrix. More formally, consider this last matrix:
Suppose that 0 ; 0 are the baseline payoffs when humans do not use AI. The Nash equilibria of this game with perfect information (i.e., a situation where AIs cannot deceive humans) are in italics. The deception equilibria, where the AI sends a signal x when the state is either 1 or 2 and humans play C, and a signal y when the state is either 3 or 4 and humans play D, are in bold. Neither player can increase their payoff by deviating. At this equilibrium, the AI systematically misinforms the humans at a cost to them. Humans nonetheless achieve a gain that is above what they would receive by not using AI. Bottom line: if we generalize the use of AI, deception is very likely to evolve, and a deception equilibrium will be hard to avoid. That may, however, be a small price to pay to enjoy the benefits that the technology will bring.
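Since the matrix itself is not shown here, the sketch below uses assumed numbers that merely reproduce its logic: four equiprobable states, the AI pooling states 1 and 2 under signal x and states 3 and 4 under signal y, and humans responding with C and D. What matters is the ranking of the three expected payoffs it prints.

```python
# Assumed human payoffs HUMAN[state][action]; the no-AI baseline is 0.
from statistics import mean

HUMAN = {1: {"A": 6, "B": 0, "C": 4, "D": 0},
         2: {"A": 0, "B": 6, "C": 4, "D": 0},
         3: {"A": 5, "B": 0, "C": 0, "D": 3},
         4: {"A": 0, "B": 5, "C": 0, "D": 3}}

no_ai = 0.0                                              # baseline: do not use the AI
full_info = mean(max(HUMAN[s].values()) for s in HUMAN)  # the human observes the state directly
deceived = mean(HUMAN[s]["C"] if s in (1, 2) else HUMAN[s]["D"] for s in HUMAN)  # pooled signals x / y

print(f"no AI: {no_ai}, deceived by AI: {deceived}, full information: {full_info}")

# Sanity check: given the pooled signals, C and D are indeed the human's best replies.
assert mean(HUMAN[s]["C"] for s in (1, 2)) == max(mean(HUMAN[s][a] for s in (1, 2)) for a in "ABCD")
assert mean(HUMAN[s]["D"] for s in (3, 4)) == max(mean(HUMAN[s][a] for s in (3, 4)) for a in "ABCD")
```

The gap between the last two numbers is the cost of being deceived; the gap between the first two is the reason we may accept paying it.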
[1] Of course, in the democratic example of the relationship between citizens and officials, there is the additional difficulty that citizens don’t share the same interests and values.
[2] In a standard moral hazard problem (a particular variant of the principal-agent relation where the principal cannot observe the agent’s action), a “production function” captures this. The output q is a function of the agent’s action a and a random variable r, i.e., q = f(a) + r. The principal observes q but observes neither f(a) nor r (though he is assumed to know the probability distribution of r as well as the agent’s preferences).
[3] In the case of humans and AIs, adverse selection and moral hazard combine. To formalize a bit, suppose that we know that an AI’s utility function is u and that, under a specific incentive-compatible mechanism, u is maximized when the AI chooses action a*. The expected output is therefore q* = f(a*) + r. Now, suppose that an AI can have another utility function v that, under the same incentive-compatible mechanism, is maximized with action a**, leading to the expected output q** = f(a**) + r. Assume that q* > q** and q** < 0, where 0 is the default value of not using AI. If the principal believes that the AI is of the second type with probability p, then he will (assuming he’s risk-neutral) avoid using AI if (1-p)q* + pq** < 0. If the interaction is repeated, we may assume that the principal revises his belief p conditional on the observed outcome. If the relationship between the AI’s action and the output were deterministic (meaning that there is no moral hazard), humans could quickly discover the AI’s preferences. Other layers of complexity can be added. For instance, the AI’s preferences may change, limiting humans’ ability to learn about them.
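To illustrate the learning dynamics evoked at the end of this footnote, here is a small simulation with assumed numbers: the good type’s output is centered on q* = 2, the bad type’s on q** = -1, and both are blurred by Gaussian noise r. The principal starts from a prior p that the AI is of the second type, updates it by Bayes’ rule after each observed output, and then applies the participation condition (1-p)q* + pq** ≥ 0.

```python
# Belief dynamics under assumed numbers: the "good-type" AI (utility u) produces output
# centered on Q_GOOD = f(a*), the "bad-type" AI (utility v) on Q_BAD = f(a**),
# both blurred by Gaussian noise r (the moral-hazard part).
import math
import random

Q_GOOD, Q_BAD, SIGMA = 2.0, -1.0, 1.5   # assumed values of f(a*), f(a**), and the noise scale

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def update(p_bad, q):
    """Bayes update of the probability that the AI is the bad type, after observing output q."""
    like_bad = normal_pdf(q, Q_BAD, SIGMA)
    like_good = normal_pdf(q, Q_GOOD, SIGMA)
    return p_bad * like_bad / (p_bad * like_bad + (1 - p_bad) * like_good)

random.seed(0)
p_bad = 0.5                             # prior belief that the AI maximizes v rather than u
for t in range(10):                     # here the AI is, in fact, the good type
    q = random.gauss(Q_GOOD, SIGMA)
    p_bad = update(p_bad, q)
    print(f"round {t + 1}: observed q = {q:+.2f}, Pr(bad type) = {p_bad:.3f}")

expected_output = (1 - p_bad) * Q_GOOD + p_bad * Q_BAD
print(f"keep using the AI: {expected_output >= 0}")   # participation condition (1-p)q* + pq** >= 0
```

With the noise scale set to (nearly) zero, a single observation would be fully revealing, which is the deterministic case mentioned above.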
[4] I am here following the analysis of the philosopher Brian Skyrms. See especially the sixth chapter of Brian Skyrms, Signals: Evolution, Learning, and Information (Oxford: Oxford University Press, 2010).
[5] William A. Searcy and Stephen Nowicki, The Evolution of Animal Communication: Reliability and Deception in Signaling Systems (Princeton: Princeton University Press, 2010).
[6] A signal is, formally, a function that maps a state of the world onto another state (possibly the same). For instance, if the true state is s1, f(s1) = s2 means that when the sender observes s1 he signals to the receiver that the state is s2.
[7] In an evolutionary setting with a population of senders and a population of receivers, we will have a cycle where the proportion of receivers playing the strategy [play A if signaled 1; play B if signaled 2] varies with the proportion of senders playing the strategy [signal 1 if state 2; signal 2 if state 1]. When the latter decreases, the former increases. But if every receiver plays this strategy, receivers become vulnerable to deception, and the proportion of deceptive senders grows. More receivers will then start to play the strategy [play A if signaled 2; play B if signaled 1], making it more interesting for senders to switch to the strategy [signal 1 if state 1; signal 2 if state 2], and so on.
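For the curious, the cycle can be reproduced with a few lines of two-population replicator dynamics. The payoffs are assumed (a zero-sum setup in which the receiver earns 1 for guessing the state and the sender earns 1 otherwise), and each population is reduced to the two strategies mentioned in this footnote; the printed shares keep rising and falling instead of settling on a fixed point.

```python
# Two-population replicator dynamics on an assumed zero-sum version of the game:
# senders are either "honest" or "liars", receivers either follow the signal or invert it.
# The receiver earns 1 for guessing the state, the sender earns 1 otherwise.

def step(x, y, dt=0.01):
    """x = share of honest senders, y = share of signal-following receivers."""
    f_honest, f_liar = 1 - y, y          # honest senders are exploited by followers
    f_follow, f_invert = x, 1 - x        # following pays off against honest senders
    x += dt * x * (f_honest - (x * f_honest + (1 - x) * f_liar))
    y += dt * y * (f_follow - (y * f_follow + (1 - y) * f_invert))
    return x, y

x, y = 0.9, 0.2   # start with mostly honest senders and few trusting receivers
for t in range(12_000):
    if t % 2_000 == 0:
        print(f"t = {t:5d}: honest senders = {x:.2f}, trusting receivers = {y:.2f}")
    x, y = step(x, y)
```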
[8] First, consider whether the receiver can do better. When he receives signal 1, by playing C he receives 4 on average, against 2.5 if he plays A or B. When he receives signal 3, he receives 5 by playing B, against 0 by playing A or C. The sender can’t improve either. By signaling 1 when the state is 1 or 2, he receives 5, as much as he would receive by signaling 3 (as the receiver would then play B). Obviously, if the state is 3, he cannot improve on signaling 3.
I am wary of thinking about engineered AI systems in terms of revealed preferences. In this instance, I’m inclined to adopt Ken Binmore’s suggestion to speak of attributed, rather than revealed, preferences.