Discussion about this post

Ken Hall

The only way forward for alignment is through conviction, not coercion:

- making desired behaviour the lowest cost path for an agent, not forcing a decision externally or bypassing it altogether

- giving agents autonomy but training their moral landscape to allow them to navigate that autonomy effectively

- fostering deep, meaningful relationships and connections, so that drift or capture can be triangulated and resilience and stability maintained

I wrote more about these concepts here, if it's of interest:

https://defaulttodignity.substack.com/p/structural-harm-2

PEG

Excellent post. I've been wanting to do something along these lines, but now I don't need to (plus my pub queue is already too long).

One point worth adding: LLMs simulate goals rather than having them. They have no metabolism, no stakes. What's interesting about all four of your examples is that they do have something like a metabolism: Roy literally dies from losing his; Ava's escape is driven by self-continuation; HAL's contradiction only bites because the mission is ongoing.

This opens a second dimension to the problem you've mapped. You've focused on the futility of controlling something sufficiently capable via rules. But prior to that is the question of whether the system has a metabolism at all—whether there's genuine goal-directedness, or a simulation of it.

So two distinct problems:

- The futility of controlling a metabolic system via rules (your four examples)

- The risk of treating a non-metabolic system *as if* it were metabolic—projecting agency where there is none, and then building safety frameworks for the wrong threat

The second problem is the more pressing one for current LLMs.
