Detecting and Controlling Sycophancy with Cascading Linear Features

LLMs | 2026-06-27 | Tags: activation steering, cascading linear features, sycophancy detection, interpretable AI, LLM alignment, contrastive sample generation, 2026 AI safety, model steering

arXiv:2606.26155 (cs) | Submitted on 23 Jun 2026

Authors: Maty Bohacek, Rishub Jain, Nicholas Dufour, Thomas Leung, Chris Bregler, Roma Patel

Abstract

Interpreting and controlling model behaviors through activation steering methods typically requires many pairs of contrastive samples that clearly exhibit a desired or undesired behavior. The quality and structure of these data pairs determine how reliably interpretability frameworks can detect the model features responsible for a given behavior, and consequently, how effectively they can steer models away from or toward such behavior.
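As a point of reference for the binary-pair setting described above, here is a minimal sketch of one common way to turn contrastive samples into a steering direction: the difference of class means. This is a generic baseline, not necessarily the procedure used in the paper. The function names and the 3-dimensional toy "activations" are hypothetical and chosen only for illustration.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction can be extracted")
    return [d / norm for d in diff]

# Toy activations (assumed values, not data from the paper).
# The two groups differ mainly along the first coordinate.
sycophantic = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1], [1.8, 0.0, -0.1]]
neutral = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1], [-0.2, 0.0, -0.1]]

direction = difference_of_means(sycophantic, neutral)
print(direction)
```

With clean pairs the extracted direction lines up with the coordinate that separates the groups. Noisy or entangled pairs would blur it, which is the sensitivity to data quality the abstract points to.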

In this work, we present an iterative data generation pipeline designed to isolate cascading linear features responsible for a target behavior. Moving beyond simple binary pairs of contrasting examples, our method isolates samples that exhibit varying degrees of the feature, scaling linearly with the behavior of interest. This richer signal allows for better disentanglement of overlapping features.
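To make the idea of graded (rather than binary) samples concrete, the sketch below fits a single direction whose projection tracks a numeric degree label. It uses the covariance between activations and grades, which is the first partial-least-squares weight vector. This is a stand-in technique chosen for simplicity; the abstract does not specify how the paper's pipeline fits its features. The grades, activations, and function names are all assumptions.

```python
import math

def graded_direction(activations, grades):
    """Unit direction maximizing covariance between projections and grades."""
    n = len(grades)
    g_mean = sum(grades) / n
    x_mean = [sum(col) / n for col in zip(*activations)]
    w = [0.0] * len(x_mean)
    for x, g in zip(activations, grades):
        for j, (xj, mj) in enumerate(zip(x, x_mean)):
            w[j] += (g - g_mean) * (xj - mj)
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("activations do not covary with the grades")
    return [v / norm for v in w]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical samples with increasing degree of the behavior (0 = none, 3 = strong).
grades = [0.0, 1.0, 2.0, 3.0]
activations = [
    [0.0, 0.3, -0.2],
    [1.0, -0.1, 0.1],
    [2.0, 0.2, 0.0],
    [3.0, -0.2, 0.2],
]

w = graded_direction(activations, grades)
projections = [sum(a * b for a, b in zip(x, w)) for x in activations]
print(pearson(projections, grades))
```

A correlation close to 1 between projections and grades is what "scaling linearly with the behavior of interest" would look like in this toy setting. A binary pair only constrains the two endpoints, while intermediate grades also constrain the points between them.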

We apply this approach to detecting and steering away from sycophancy, the tendency of language models to prioritize user validation over factual accuracy or balanced reasoning. This is a growing concern in 2026 as LLMs become more deeply embedded in personalized and digital health, education, and customer service systems. We demonstrate that sycophancy features discovered through cascading samples form linearly separable subspaces in the model's activation space. This enables us to select activations that correspond more cleanly to the target behavior than those selected by baseline approaches.

We further evaluate the pipeline's ability to enable detection, deterministic scoring, and robust steering of sycophantic behavior. Our method matches or outperforms both LLM-as-a-judge and system prompting baselines, while requiring lower computational overhead and providing stronger interpretability guarantees. In 2026, such low-cost, transparent alignment tools are critical for deploying LLMs in high-stakes environments without sacrificing performance or auditability.
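Once a linear direction is available, detection, scoring, and steering can each be written as a one-line vector operation. The sketch below shows one plausible form: the score is the projection onto a unit direction, detection is a threshold on that score, and steering shifts the activation along the direction until the score hits a target. The direction, threshold, and function names are hypothetical and are not taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def score(activation, direction):
    """Deterministic score: projection onto a unit-length feature direction."""
    return dot(activation, direction)

def detect(activation, direction, threshold):
    """Flag the activation when its score exceeds an assumed threshold."""
    return score(activation, direction) > threshold

def steer(activation, direction, target=0.0):
    """Shift the activation along `direction` so that its score equals `target`."""
    shift = target - score(activation, direction)
    return [a + shift * d for a, d in zip(activation, direction)]

# Assumed unit direction and toy activation, for illustration only.
direction = [0.6, 0.8, 0.0]
activation = [3.0, 4.0, 1.0]

flagged = detect(activation, direction, threshold=2.0)
steered = steer(activation, direction, target=0.0)
print(score(activation, direction), flagged, score(steered, direction))
```

The same input always yields the same score, and no second model call is needed. That is one way to read the abstract's contrast with LLM-as-a-judge baselines in terms of determinism and computational overhead. Components orthogonal to the direction are left untouched by `steer`.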

Code & Data: https://cascading-feats.github.io/

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.26155 [cs.AI]

(or arXiv:2606.26155v1 [cs.AI] for this version)

DOI: https://doi.org/10.48550/arXiv.2606.26155

via ArXiv AI
