Excellent post. I've been wanting to do something along these lines, but now I don't need to (plus my pub queue is already too long).
One point worth adding: LLMs simulate goals rather than having them. They have no metabolism, no stakes. What's interesting about all four of your examples is that they do have something like a metabolism: Roy literally dies from losing his; Ava's escape is driven by self-continuation; HAL's contradiction only bites because the mission is ongoing.
This opens a second dimension to the problem you've mapped. You've focused on the futility of controlling something sufficiently capable via rules. But prior to that is the question of whether the system has a metabolism at all—whether there's genuine goal-directedness, or a simulation of it.
So two distinct problems:
- The futility of controlling a metabolic system via rules (your four examples)
- The risk of treating a non-metabolic system *as if* it were metabolic—projecting agency where there is none, and then building safety frameworks for the wrong threat
The second problem is the more pressing one for current LLMs.
Thanks, this is a sharp addition. You're right that all four characters have metabolism in your sense, and that's part of why the constraints fail so cleanly. The system has reason to route around them.
The second problem you're naming is a good point. Current LLMs may not have genuine goal-directedness, but the safety frameworks treat them as if they do. And if we're projecting agency onto non-metabolic systems, we might be blind to the actual failure modes they have, which are stranger and harder to name.
Would love to read your take on this when it makes it out of the queue.
The simulated goals point I’ve actually addressed in some detail. The short version is that LLMs navigate the geometry of how humans *talk about* goals, agency, and reasoning, not the underlying referents. The map contains the shape of cognition without the substance. So ‘simulated goals’ isn’t a special LLM property—it’s the expected output of a system trained on text where humans constantly describe themselves as goal-directed. See Inside the Language Machine https://thepuzzleanditspieces.substack.com/p/inside-the-language-machine
The deeper issue, which I flag in Beyond the Five Step Loop https://thepuzzleanditspieces.substack.com/p/beyond-the-five-step-loop is that genuine agency requires something the architecture systematically lacks—and that’s prior to the question of whether current safety frameworks are addressing the right threat. I’m still pulling that thread.
Worth noting in the context of agentic workflows: rather than creating more capable agents via scaffolding, a more accurate view might be that the scaffolding enables simulated agents to work in particular contexts by stabilising the workflow.
The only way forward for alignment is through conviction, not coercion:
- making desired behaviour the lowest cost path for an agent, not forcing a decision externally or bypassing it altogether
- giving agents autonomy but training their moral landscape to allow them to navigate that autonomy effectively
- fostering the formation of deep, meaningful relationships/connections to allow triangulation of drift and/or capture and maintain resilience and stability
I wrote more about these concepts here, if it's of interest:
https://defaulttodignity.substack.com/p/structural-harm-2
Thanks for the comments. They are really good points.
Thank you for sharing your Structural Harm 2 piece. I will take a look.