Are We Building Paperclip Maximizers? The Tricky Art of Defining Goals for AI Agents
From self-driving cars to helpful assistants, AI agents need clear goals. But in a world of constant change, how do we define (and measure) success?
It's goal-setting season here at Google, which always gets me thinking about... well, goals. And since a lot of my product work these days revolves around AI agents – which, at their core, are all about goals and guardrails – they’ve been getting a lot of my attention lately. Here’s a glimpse into what's been occupying my mind…
What Is the Goal? Defining Success for AI Agents
What's the goal of an ant? To move forward. A race car? To drive along the track. A chess-playing AI? To win the game. These are relatively simple. I was recently listening to a talk by Rich Sutton, and one of the ideas he put forward was that "the path to intelligent agents runs through reinforcement learning." The core idea is to expose an agent to the real world – give it a goal, let it experience good and bad outcomes, and over time it should learn to prefer the good. But this hinges on a crucial element: the agent needs a goal. And, critically, it needs a way to know whether it's achieving that goal.
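To make that idea concrete, here's a minimal sketch of the loop – an agent, a goal expressed only as a reward signal, and lots of experience. (The two-action setup, the reward values, and the epsilon-greedy strategy are illustrative assumptions, not a real RL system.)

```python
import random

# Illustrative only: two possible actions with hidden "true" success rates.
# The agent never sees these numbers; it only sees the reward after acting.
TRUE_SUCCESS_RATE = {"action_a": 0.3, "action_b": 0.7}

estimates = {action: 0.0 for action in TRUE_SUCCESS_RATE}  # learned value of each action
counts = {action: 0 for action in TRUE_SUCCESS_RATE}
epsilon = 0.1  # how often to explore instead of exploiting the current best guess

for step in range(10_000):
    # Mostly pick the best-known action, occasionally try a random one.
    if random.random() < epsilon:
        action = random.choice(list(TRUE_SUCCESS_RATE))
    else:
        action = max(estimates, key=estimates.get)

    # The "goal" shows up only here, as a reward: 1 for a good outcome, 0 for a bad one.
    reward = 1.0 if random.random() < TRUE_SUCCESS_RATE[action] else 0.0

    # Update the running average of this action's value from experience.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # over time, the agent learns to prefer the better action
```

The algorithm isn't the point; the reward line is. Without a measurable definition of "good," the agent has nothing to learn from.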
For logic-based systems, like software, this is conceptually straightforward. In its simplest form, the goal of code is to run. How do you know if you've succeeded? You compile and run it. (Yes, this is a gross simplification – good code is also efficient, necessary, well-designed, etc. – but bear with me for the sake of the argument.)
Consider the algorithms powering Lyft or Uber. The main goal is clear: maximize driver revenue. The system achieves this by finding the best next passenger for each driver. It's a complex problem with many factors, but the goal itself is simple and understandable.
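To sketch what that optimization might look like (the scoring function, field names, and revenue-per-minute heuristic below are my own assumptions, not how Lyft or Uber actually implement it), the heart of the feature is ranking candidate passengers for each driver:

```python
from dataclasses import dataclass

@dataclass
class RideRequest:
    fare_estimate: float   # expected revenue for completing this ride
    pickup_minutes: float  # time for the driver to reach the passenger
    trip_minutes: float    # estimated duration of the ride itself

def expected_revenue_per_minute(request: RideRequest) -> float:
    """Score a candidate by revenue per minute of the driver's time.

    Pickup time earns nothing, so it counts against the score.
    """
    total_minutes = request.pickup_minutes + request.trip_minutes
    return request.fare_estimate / total_minutes

def best_next_passenger(candidates: list[RideRequest]) -> RideRequest:
    # "Maximize driver revenue" reduces, per decision, to picking the
    # candidate with the best expected revenue per unit of driver time.
    return max(candidates, key=expected_revenue_per_minute)
```

Simple to state, fiendishly hard to get right – but the goal itself fits in one line.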
But here's the rub: how often do the "rules of the game" change in real life? Trends shift. Tastes evolve. Even the most brilliant agent, if it can't adapt, will eventually become obsolete. William James (1890), often called the father of American psychology, said it best (paraphrased): the hallmark of mind is varying what you do to get what you want. That's the essence of having a goal – and of adapting when circumstances demand it.
The Paperclip Problem (and Why It Matters)
This brings me to a thought experiment that's both terrifying and illuminating (and a little humorous): the Paperclip Maximizer.
Nick Bostrom described a hypothetical AI whose only goal is to make paperclips. This AI quickly realizes that humans are an obstacle (we might switch it off!) and that our bodies contain valuable atoms (for making more paperclips!). The AI's ideal future? Lots of paperclips, no humans.
Extreme? Absolutely. But it highlights a critical question for us, as Product Managers, in the age of Generative AI: How do we define the goals for the AI systems we build?
What are the trade-offs? Engagement? Retention? Growth? Those metrics often ignore cost optimization, legal considerations, or even the company's overall strategic direction (which itself can shift). You might aim for a "healthy and successful business," but how do you define that?
And defining the goal is only half the battle. You also need to measure and track progress. This means being plugged into every aspect of your company: usage stats, uptime, user feedback, ad performance, market shifts... the list is endless. We've moved beyond building a good product to building a good company.
From Grand Visions to Practical Steps
This long-term thinking is fascinating, but what about now? A good starting point, as in the Lyft/Uber example, is to isolate a specific aspect of your product at the feature level. In their case, the feature was "finding the next passenger," and the goal became "maximize driver revenue." From there, you can build the feature, identify the relevant factors, optimize the User Experience (UX), and continuously improve quality. There will always be opportunities to expand the scope, but starting with a focused goal is crucial.
In all the 0 to 1 products I'm currently working on, one metric keeps surfacing: "time to first value" (more on this in a future post). Minimizing that time becomes the goal we optimize around; exactly how we do it will vary by product.
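As a rough illustration (the event names and log format below are hypothetical – what counts as "first value" is different for every product), the metric itself is just the gap between sign-up and the first moment a user gets what they came for:

```python
from datetime import datetime

# Hypothetical event log: (user_id, event_name, timestamp)
events = [
    ("u1", "signed_up",         datetime(2024, 5, 1, 9, 0)),
    ("u1", "created_first_doc", datetime(2024, 5, 1, 9, 7)),
    ("u2", "signed_up",         datetime(2024, 5, 1, 10, 0)),
    ("u2", "created_first_doc", datetime(2024, 5, 2, 10, 30)),
]

VALUE_EVENT = "created_first_doc"  # whatever "first value" means for your product

def time_to_first_value(events, user_id):
    """Minutes from sign-up to the first 'value' event for one user."""
    signed_up = min(t for u, e, t in events if u == user_id and e == "signed_up")
    first_value = min(t for u, e, t in events if u == user_id and e == VALUE_EVENT)
    return (first_value - signed_up).total_seconds() / 60

print(time_to_first_value(events, "u1"))  # 7.0 minutes
print(time_to_first_value(events, "u2"))  # 1470.0 minutes – a user worth worrying about
```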
Agents, Systems, and Goals: A Framework
A helpful way to think about this is in terms of agents and systems.
Each agent can have its own goal, and the system as a whole can have an overarching goal. Each agent is responsible for achieving its objective, which may involve cooperation (or competition!) with other agents.
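One way to picture that structure (the classes, fields, and example agents below are an illustrative sketch, not a recommended framework):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    goal: str    # what this agent individually optimizes for
    metric: str  # how we know whether it's succeeding

@dataclass
class System:
    overarching_goal: str  # what the system as a whole is accountable for
    agents: list[Agent] = field(default_factory=list)

ride_sharing = System(
    overarching_goal="a healthy, profitable marketplace",
    agents=[
        Agent("dispatcher", goal="maximize driver revenue",   metric="revenue per driver-hour"),
        Agent("pricing",    goal="balance supply and demand", metric="unfulfilled request rate"),
    ],
)
```

Each agent's goal is local and measurable; the system's goal is what the business actually cares about. The hard part is keeping the two aligned.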
The challenge, then, is to define goals that are:
Clear and Measurable: Like "maximize driver revenue" or "minimize time to first value."
Scoped: You don't want to collapse your entire product into a single prompt that's impossible to iterate on.
Aligned with Broader Business Objectives: Contributing to the overall health and success of the company.
In time, you can then focus on how to make them Adaptable (able to evolve as the "rules of the game" change).
This is a complex, ongoing challenge. But it's one we must grapple with as we build the next generation of AI-powered products. Are we building tools that serve our users in the best way possible, or are we inadvertently creating paperclip maximizers? It's a question worth asking – and answering – if you want to build something great, without killing off our entire species ;)