AI Safety Research Engineer

Samuel Nellessen

I work on automated red-teaming, adversarial evals, and RL-based jailbreak discovery for tool-using LLM agents.

Foresight Institute AI Safety Grantee · KachmanLab @ Radboud University · Incoming LASR Labs Fellow

CV · Email · Book a 1-on-1 · Google Scholar · GitHub · LinkedIn · LessWrong · Substack

Research Focus

I build systems that autonomously find failure modes in LLM agents, especially cases where models say the right thing while still taking unsafe or prohibited actions. My current work focuses on automated red-teaming, agent-to-agent jailbreaks, and adversarial evaluations for tool-using language models.

I am currently an independent Foresight AI Safety grantee working from the Foresight Node in Berlin and a student researcher in Tal Kachman's lab at Radboud University. In 2026, I will join LASR Labs in London.

Recent Updates

Jul 2026 Incoming LASR Labs Fellow in London.
Jun 2026 Presented Tag-Along Attacks at Foresight Vision Weekend UK.
Apr 2026 Started working from the Foresight Node Berlin as an independent AI Safety grantee.
Feb 2026 Awarded a $22.5k Foresight Institute AI Safety Grant.
Feb 2026 Released David vs. Goliath on arXiv.
Jan 2026 Punctuation and Predicates in Language Models accepted to Findings of EACL 2026.
Jul 2025 Joined KachmanLab to work on automated red-teaming.
May 2025 Completed ARENA 5.0 in London.

Selected Papers

ICLR submission in preparation · 2026

Split Behavior in Tool-Using LLM Agents

Samuel Nellessen, Chiara Thöni, Tal Kachman

Studies cases where models produce refusal-like text while still executing prohibited or harmful tool actions.

arXiv preprint / under review · 2026

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman

Introduces Tag-Along Attacks and Slingshot, an RL framework for discovering verifiable jailbreaks against tool-using LLM agents.

Paper (arXiv)

Findings of EACL 2026 · 2026

Punctuation and Predicates in Language Models

Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, Nandi Schoots

Mechanistic interpretability work on how punctuation and predicates are processed and propagated across layers in language models.

Paper (arXiv) Paper (ACL)

RLDM 2025 / preprint · 2025

Controllability, not predictability, flexibly gates the impact of Pavlovian bias on action selection

Egbert Hartstra*, Samuel Nellessen*, Yanfang Xia, Romain Ligneul, Roshan Cools

Computational neuroscience work on how perceived controllability changes the influence of Pavlovian bias on decision-making.

Selected Talks

Jun 2026 · London, United Kingdom

Tag-Along Attacks: evaluating what AI agents do, not just what they say

Foresight Vision Weekend UK 2026 - VIP Gathering

One-minute lightning talk on verifiable red-teaming for tool-using LLM agents.

Slides

Jun 2026 · London, United Kingdom

Tag-Along Attacks: evaluating what AI agents do, not just what they say

Foresight Vision Weekend UK 2026

Conference lightning talk on behavioral evaluation beyond textual refusals.

Slides

Jun 2026 · Berlin, Germany

Tag-Along Attacks: For LLMs, by LLMs, with LLMs

Foresight Institute AI Salon Berlin

Talk on Tag-Along Attacks and LLM-driven red-teaming for tool-using LLM agents.

Slides

Projects and Open Source

2026

Styx Interchange: Localizing Refusal Behavior in LLMs

Mechanistic interpretability sprint comparing activation patching against input gradients to localize causal refusal behavior across layers.

GitHub

2026

UK AISI inspect_ai contribution

Contributed a fix for vLLM startup behavior, improving the reliability of AI evaluation infrastructure.

2026

Prime Intellect Environments

Built a text-retrieval environment and supporting utilities for benchmarking AI agents.

2025

ARENA 5.0 Capstone: Internal Representations in SONAR Autoencoders

Research sprint probing whether SONAR text autoencoders encode correctness across code, grammar, arithmetic, and chess domains.

LessWrong Substack

Selected Writing

Investigating Internal Representations of Correctness in SONAR Text Autoencoders

Co-authored with Anton Gonzalvez Hawthorne.

A capstone writeup on probing whether compressed multilingual representations carry signals for code validity, grammaticality, arithmetic, and chess syntax.

Aug 2025

LessWrong Substack

Brain enthusiasts in AI Safety

Co-authored with Jan Hendrik Kirchner.

A guide for students of cognitive science and neuroscience considering work in AI safety.

Jun 2022

LessWrong

Mathematical Foundations of Hebbian Natural Abstractions

Co-authored with Jan Hendrik Kirchner.

A mathematical framework connecting Hebbian learning and natural abstraction formation.

Dec 2022

Substack

Assessing Artificial Sentience

A short essay on the methodological and epistemic difficulty of making claims about AI sentience.

Apr 2023

Substack

Personal

There is also a more personal corner of this site, with a bit about life offline, writing, and other things I care about.

Visit personal page