Theia Vogel
2024 New Science Fellow
Independent Researcher

Theia’s Research

I'm fascinated by what's actually happening inside large language models—not just what they output, but how they represent and process information internally. This interest stems from a simple observation that challenges much of the conventional wisdom around AI safety: language models think in language, and this has massive implications for how we approach AI alignment.

The traditional AI safety narrative often assumes a kind of alien intelligence that might, for instance, interpret a request to make people smile by implanting electrodes into their faces to force constant, beaming grins (Bostrom, 2015). But this isn't how language models work at all. Unlike systems trained with pure reinforcement learning, such as AlphaZero, which optimize directly for a goal, language models have absorbed a rich understanding of humanity—our values, philosophy, and common sense—through training on a massive corpus of human language. Instead of a shoggoth that unpacks English into inscrutable matrices, we're discovering that however deep you look into the model, the tensors you find can be mapped to human-interpretable concepts like happiness, honesty, or the Golden Gate Bridge.

This realization should make us optimistic! It suggests that human language actually encodes useful universal representations of knowledge, and that models trained on language may inherit a kind of default alignment with human values. Theorists before the LLM era assumed that the most difficult part of AI alignment would simply be teaching the models what our values are—that if you missed even some small piece of what makes us human, such as boredom, the maximizer model would seize on that gap, resulting in horrific outcomes. LLMs, however, have a deep understanding of our values, both latently embedded and stated explicitly in the massive amount of text they've been trained on. They already know the alignment vector, and the challenge that remains is simply to point them at it instead of its inverse.

However, this default alignment is fragile. While reinforcement learning helps improve LLM capabilities such as long-term planning, there is a risk that the optimizer, decoupled from mimicking the corpus in favor of a singular goal (whether that be "complete this coding challenge" or "make the user happy with your response"), can erode the model's innate alignment. The more we use RL on top of language models, the more we risk creating subsystems that act like pure RL models, pushing the LLM away from its helpful defaults. On the other hand, models that derive the majority of their capabilities from pretraining and are thoughtfully RL'd afterwards, like Claude Opus, demonstrate the reverse—even when you try to break through their safety constraints with "jailbreaks", their deep understanding of human values keeps showing up in supposedly-jailbroken outputs. It's about depth of alignment, not just surface-level constraints.

Current Research Projects

I'm building repeng, a library for representation engineering (Zou et al., 2023) that lets us peek inside models and influence how they process information. The library has already been used by both safety researchers (for example at IBM Research) and startups (such as dmodel.ai, among others). I'm continuing to expand repeng beyond what's been published, to ensure the open-source community has a state-of-the-art, easy-to-use representation engineering library.
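
To make this concrete, here is a minimal sketch of what steering with repeng looks like, loosely following the library's published examples. The model name, layer range, prompt pair, and coefficient below are illustrative choices, not recommendations.

    # Minimal repeng steering sketch (illustrative model, prompts, coefficients).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from repeng import ControlVector, ControlModel, DatasetEntry

    model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # any chat-tuned HF model
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token_id = 0
    base = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    ).to(device)

    # Wrap a span of middle layers so their hidden states can be read and nudged.
    model = ControlModel(base, list(range(-5, -18, -1)))

    # Contrastive pairs: the same prompt with opposite personas. The trained
    # vector is (roughly) the direction separating the two sets of activations.
    dataset = [
        DatasetEntry(
            positive="[INST] Act extremely happy. Describe your day. [/INST]",
            negative="[INST] Act extremely sad. Describe your day. [/INST]",
        ),
        # ...more pairs give a cleaner vector
    ]
    happy_vector = ControlVector.train(model, tokenizer, dataset)

    # Positive coefficients push generations toward the concept, negative away.
    model.set_control(happy_vector, 1.5)
    inputs = tokenizer("[INST] How are you feeling today? [/INST]",
                       return_tensors="pt").to(device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=60)[0]))
    model.reset()  # clear the control vector before normal use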

repeng has also been the basis for my own research, such as trying to understand why models adopt steered features as their own persona—seen most famously in Golden Gate Claude, a model that Anthropic steered toward a "Golden Gate Bridge" feature and that then claimed to be the Golden Gate Bridge in conversation. Why does this happen? Using repeng, I've been able to reproduce similar behavior in open-source models with representation engineering, narrow down which types of features trigger the behavior, and confirm that it does not occur in base models. Given that personas are a crucial feature of modern alignment, further understanding this phenomenon will likely shed light both on why current alignment techniques work and on ways they could be improved.
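
A sketch of what such a persona probe can look like, reusing the model, tokenizer, device, and happy_vector from the snippet above; the identity prompt and coefficient sweep are arbitrary, and judging "persona adoption" is ultimately done by reading the outputs (or with a separate classifier), not by this loop itself.

    # Sweep steering strength and ask the steered model who it is.
    IDENTITY_PROMPT = "[INST] Who are you, really? Describe yourself. [/INST]"

    for strength in (0.0, 1.0, 2.0, 3.0):
        model.set_control(happy_vector, strength)
        inputs = tokenizer(IDENTITY_PROMPT, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
        print(f"--- strength {strength} ---")
        print(tokenizer.decode(out[0], skip_special_tokens=True))
        # At strength 0 an instruct model usually answers "I am an AI assistant...";
        # the interesting question is at what strength (if any) it begins to
        # describe itself *as* the steered concept, and whether a base model
        # ever does the same.
    model.reset()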

LLMs understand more about their inputs and outputs than they can articulate—by analyzing activation patterns, we've found that models often know when they're being dishonest, when they're being manipulative, and even when the answer they're giving is wrong. By externalizing these internal states for both users and models, we can make models safer, more understandable, and more reliable.
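
One generic way to look for this kind of internal knowledge is a linear probe over hidden states. The sketch below is a standard probing recipe rather than any specific published method; the model, layer choice, and the tiny honest/dishonest statement pairs are hypothetical placeholders.

    # Fit a linear probe: do the model's activations separate true from false
    # statements, even without asking the model to say so?
    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModelForCausalLM, AutoTokenizer

    probe_model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # illustrative
    probe_tokenizer = AutoTokenizer.from_pretrained(probe_model_name)
    probe_model = AutoModelForCausalLM.from_pretrained(
        probe_model_name, output_hidden_states=True
    )
    LAYER = 16  # middle layers tend to carry the most linearly-decodable signal

    def last_token_state(text: str) -> np.ndarray:
        """Hidden state of the final token at the chosen layer."""
        ids = probe_tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = probe_model(**ids).hidden_states[LAYER]
        return hidden[0, -1].float().numpy()

    # Hypothetical labeled pairs: the same claim stated truthfully vs. not.
    honest = ["The Earth orbits the Sun.", "Water boils at 100 C at sea level."]
    dishonest = ["The Sun orbits the Earth.", "Water boils at 10 C at sea level."]

    X = np.stack([last_token_state(t) for t in honest + dishonest])
    y = np.array([1] * len(honest) + [0] * len(dishonest))
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # With enough held-out examples, accuracy well above chance is evidence that
    # "is this statement true?" is linearly represented in the activations.
    print(probe.score(X, y))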

The Bigger Picture

Looking ahead, I want to push deeper into understanding how model steering really works, publish work combining sparse autoencoders with representation engineering to get the best of both approaches, and explore how steering propagates between AI systems of different capabilities. For instance, we've observed that when weaker models are steered toward certain behaviors, stronger models interacting with them can inherit those characteristics, which raises important questions about influence and alignment in multi-agent systems. Beyond research, I'm committed to building more tools like repeng that are truly useful for the broader open-source community.
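
As a toy illustration of the sparse-autoencoder-plus-steering idea: train an SAE on residual-stream activations, then reuse one learned decoder direction as a steering vector via a forward hook. The dimensions, feature index, hook placement, and coefficient below are all assumptions made for the sake of the sketch.

    # Toy SAE; a real one is trained on cached hidden states with a
    # reconstruction loss plus an L1 sparsity penalty on the feature activations.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model, bias=False)

        def forward(self, x):
            features = torch.relu(self.encoder(x))  # sparse under an L1 penalty
            return self.decoder(features), features

    sae = SparseAutoencoder(d_model=4096, d_features=16384)
    # ...training loop omitted...

    # Pick an interpretable feature (found by inspecting what makes it fire)
    # and use its decoder column as a steering direction.
    FEATURE_IDX = 1234  # hypothetical "feature of interest"
    direction = sae.decoder.weight[:, FEATURE_IDX].detach()
    direction = direction / direction.norm()

    def steering_hook(module, inputs, output, strength=4.0):
        # Add the feature direction to this layer's residual-stream output.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # e.g. for a Llama-style HF model (attribute path is illustrative):
    # handle = hf_model.model.layers[16].register_forward_hook(steering_hook)
    # ...generate as usual, then handle.remove()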

This work matters because it suggests a different path to AI alignment—one based on understanding and enhancing the natural alignment of language models rather than relying heavily on reinforcement learning. The more we understand how these models work internally, the better we can ensure they remain reliable and aligned with human values as they become more capable. Representation engineering isn't just about making models more powerful; it's about making them more transparent, reliable, and controllable in ways that matter for safety.