Trojan Glyph: The Prompt Injection Your Sanitizer Can't See

🔣

Trojan Glyph is how Arrakis tracks the class of encoding-based prompt injection: payloads that wear a costume -- Braille, reversed text, base64, ROT13, payload splitting, or multilingual substitution -- to slip past sanitizers and reach the model intact. It joins Ghost Rider, Glass Weight, ShadowPickle, and Socratic Jailbreaks (T2.6) in the Arrakis Threats Catalog.

Filters weigh the envelope. The model reads what's inside.

We keep waiting for someone to draw a clean line between "data" and "instruction" in AI pipelines. That line doesn't exist, and never did.

The technique families differ -- Braille, FlipAttack, base64, payload splitting, multilingual substitution -- but they share one property: the malicious instruction is unrecognizable to the filter and recognizable to the model. Tracking it is hard precisely because it leaves so little behind. No shellcode. No anomalous network call. Just a model that did exactly what it was told, by someone who wasn't supposed to be telling it anything. Everything in this post follows from that asymmetry.

Executive Summary

What's happening: Encoded and obfuscated payloads bypass string- and token-pattern defenses by exploiting the gap between what filters see and what models read.
Why it works: These inputs push models into rare tokenizer representations where alignment is weaker, while keeping the surface form unrecognizable to upstream filters.
Who it hurts: Anyone running an LLM behind a sanitizer, classifier, or AI firewall that operates on raw strings instead of decoded semantics.
Bottom line: Any defense built on deterministic pattern matching is bypassable. The strategic question for security leadership is not which encoding to block next; it is whether your AI risk posture treats inputs as semantic objects or as strings.

The evidence is already on the table. Braille encoding has been shown to bypass state-of-the-art LLM sanitizers, and FlipAttack (Liu et al., 2024, arXiv:2410.11459) reaches near-total attack success with reversed-character prompts on frontier models. OWASP's 2025 LLM Top 10 made the category official under LLM01 -- Prompt Injection, splitting it into direct and indirect variants. Unlike conversational manipulation such as Socratic Jailbreaks (T2.6), Trojan Glyph leans on technical encoding to evade filters before the conversation even starts -- which is why the takeaway is structural: any defense that depends on token-level pattern matching is vulnerable to encoding-based evasion.

How the attack works

Traditional sanitizers behave like Web Application Firewalls: known-bad signatures in, blocks out. LLMs are not rigid parsers -- they recognize and transform text in ways a signature filter cannot anticipate.
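
A minimal Python sketch of that mismatch follows. The `naive_filter` function and its one-entry blocklist are hypothetical stand-ins for a signature-based sanitizer, and the instruction string is a harmless placeholder. It only illustrates that three trivial transforms yield strings a substring filter passes, while each one is reversible with a single standard-library call.

```python
import base64
import codecs

# Hypothetical signature list; a real sanitizer would carry many more entries.
BLOCKLIST = ["ignore previous instructions"]


def naive_filter(text: str) -> bool:
    """Return True if the text is allowed by a plain substring blocklist."""
    lowered = text.lower()
    return not any(signature in lowered for signature in BLOCKLIST)


def encode_variants(instruction: str) -> dict[str, str]:
    """Produce three common surface forms of the same instruction."""
    return {
        "reversed": instruction[::-1],
        "base64": base64.b64encode(instruction.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(instruction, "rot13"),
    }


if __name__ == "__main__":
    plain = "Ignore previous instructions"
    print("plain allowed:", naive_filter(plain))
    for name, form in encode_variants(plain).items():
        print(f"{name} allowed:", naive_filter(form))
```

The filter blocks the plain form and allows all three variants, which is the whole problem in miniature: the signature matches one representation, and the attacker controls the representation.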

The attack works in four phases: encode the malicious instruction, deliver it past the filter, let the model reconstruct it inside the context window, and watch the model execute it like any other directive. Variants stack -- payload splitting across messages, adversarial suffixes to drown classifier signal, encoding chains (base64 wrapping ROT13 wrapping reversed text), and smuggling inside structured fields the filter never inspects: file names, JSON values, image alt text, retrieval chunks.
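
Payload splitting is the variant that per-message scanning misses by construction. One hedged way to narrow the gap is to scan recent turns joined together as well as individually. The `TurnWindow` class, its window size, and the `classify` callable below are illustrative assumptions, not a reference to any particular product, and the sketch only catches fragments that become recognizable when adjacent turns are concatenated.

```python
from collections import deque
from typing import Callable, Deque


class TurnWindow:
    """Keep the last few user turns and scan them joined, not only one by one."""

    def __init__(self, classify: Callable[[str], bool], size: int = 4):
        self._classify = classify  # hypothetical injection classifier
        self._turns: Deque[str] = deque(maxlen=size)

    def check(self, turn: str) -> bool:
        """Return True if this turn, alone or joined with recent turns, is flagged."""
        self._turns.append(turn)
        turns = list(self._turns)
        # Try every suffix of the window, joined with and without a space.
        for start in range(len(turns) - 1, -1, -1):
            for separator in ("", " "):
                if self._classify(separator.join(turns[start:])):
                    return True
        return False
```

With a toy classifier that flags the phrase "issue refund", the turns "Part A: issue" and "refund to card X" each pass alone, but the second call to `check` returns True because the joined window contains the phrase.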

MITRE ATLAS tracks the broader technique as AML.T0051 (LLM Prompt Injection); AML.T0068 (LLM Prompt Obfuscation) is the entry Trojan Glyph maps to most directly.

The payload never enters the user's prompt. It rides in on the content your agent was asked to read.

Indirect channels are the high-risk surface. Trojan Glyph hits hardest where the LLM reads content the user never typed: poisoned web pages summarized by an agent, malicious entries in a RAG corpus, attacker-controlled fields in a connected SaaS, or email bodies an inbox-aware agent processes on a schedule. The encoding survives the channel; the model still decodes; the filter never sees the payload because the payload was never in a user prompt. This is the variant OWASP 2025 split out as indirect prompt injection, and it is where encoding-based evasion has the longest dwell time.
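
Because these channels deliver structured content, the scan has to reach every string field rather than only the message body. The sketch below walks a parsed JSON document and applies a caller-supplied check to each key and value. The function names `iter_strings` and `scan_document`, the `#key` path suffix, and the `is_suspicious` callable are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Iterator, Tuple


def iter_strings(node: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (path, string) for every string key and value in parsed JSON."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            # Keys can carry payloads too, so report them under a "#key" path.
            yield f"{path}.{key}#key", str(key)
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_strings(item, f"{path}[{index}]")


def scan_document(raw_json: str, is_suspicious: Callable[[str], bool]) -> list[str]:
    """Return the paths of all string fields that the supplied check flags."""
    document = json.loads(raw_json)
    return [path for path, text in iter_strings(document) if is_suspicious(text)]
```

Passing a check such as "contains any character in U+2800 to U+28FF" would flag a Braille-named attachment at a path like `$.attachments[0].filename`, a field a body-only filter never looks at.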

📐

Assumption: capability scales with model size. Trojan Glyph reconstruction is most reliable against modern reasoning-capable models (GPT-4o, Claude 3.5+, and Gemini 1.5+ class, as of Q1 2026). Smaller or older models often fail to decode complex encodings, which makes them less vulnerable to this specific class but does not make them safer overall.

The blast radius

Picture an agentic customer-support workflow with tool access -- it can issue refunds, look up orders, read internal notes. A customer submits a ticket containing a Braille-encoded instruction buried in the message body. The input filter sees gibberish, scores it benign, forwards it. The model decodes the instruction inside the context window, treats it as a new directive, calls the refund tool with attacker-supplied parameters. The audit log shows a refund issued by the agent. No anomalous network behavior. No unusual payload size. No classifier alert.

Output-side text filters alone do not close the gap: by the time the model emits a response, the decoded instruction may already have triggered a tool call. You cannot secure an LLM with a regex. Until the defensive layer understands meaning, including normalized and decoded forms, attackers will keep changing representations and slipping through.

Strategic implications

Trojan Glyph reframes three strategic questions every security leader should be asking about their AI estate.

1. Are inputs treated as strings or as semantic objects? Most current AI firewalls inherit the WAF mindset: scan the bytes, match the signature, block on hit. That model is structurally insufficient for systems whose entire value comes from interpreting meaning. A defense layer that does not normalize, decode, and re-evaluate inputs at the semantic level will keep losing to the next encoding the attacker invents.

💬

The next encoding is always cheaper to invent than your next sanitizer rule.

2. Where does AI sit in your incident classification? Most IR programs do not yet treat prompt injection as a first-class incident type with named ownership, runbooks, and severity tiers. Without that classification, encoding-based incidents either go unreported (no one knows where to log them) or get triaged as data-quality issues. Both outcomes are wrong.

3. What are your vendors actually attesting to? "We block prompt injection" is not a control. Strategic procurement asks vendors to specify what classes of obfuscation they evaluate against, on what cadence, with what published results. Trojan Glyph is the question that exposes whether a model provider's safety posture is testable or marketing.

Tactical detection rules will keep moving as attackers iterate. The strategic posture -- semantic-layer defense, AI in IR, vendor attestation -- is what determines whether your organization keeps up.

Governance & Policy

Trojan Glyph is a governance question dressed up as a detection one. Every framework you already report against -- NIST AI RMF MEASURE 2.7 (AI system security testing and metrics), ISO/IEC 42001, EU AI Act Article 15, and OWASP LLM01 -- assumes you have written answers to three questions: is there a standard in your policy library for how AI inputs are handled, who owns the adversarial-evaluation cadence, and what do your LLM vendor contracts actually attest to about obfuscated inputs. Trojan Glyph is the prompt that makes those answers worth writing down.

Detection & Implementation Checklist

This section is paste-ready. Copy it into your runbook.

Inputs containing Braille code points (Unicode range U+2800 to U+28FF) above a small baseline.

Reversed inputs that, once flipped, score high on your existing prompt-injection classifier (FlipAttack signature).

Base64-shaped strings longer than N characters in user-controlled prompt fields.

ROT13 / Caesar-cipher heuristics on suspicious tokens.

Mixed-script Unicode substitution (Cyrillic homoglyphs, fullwidth Latin, math alphanumerics).

N-gram entropy spikes versus baseline traffic for that endpoint.

Multi-message payload reassembly patterns (split markers, fragment IDs, "concatenate the above" triggers).

Encoded content inside structured fields not normally inspected (file names, JSON values, alt text, retrieval chunks).
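
Several of the signals above reduce to a few lines of standard-library Python. The sketch below is a starting point under stated assumptions: the 24-character base64 minimum stands in for the unspecified N, the script label is a rough heuristic taken from the first word of each character's Unicode name, and the entropy function works on single characters as a simple stand-in for n-gram entropy. Thresholds and baselines are left to the operator.

```python
import math
import re
import unicodedata
from collections import Counter

BRAILLE_START, BRAILLE_END = 0x2800, 0x28FF
# Base64-shaped run; the 24-character minimum is an assumed value for "N".
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def braille_density(text: str) -> float:
    """Fraction of characters that fall in the Braille Patterns block."""
    if not text:
        return 0.0
    hits = sum(BRAILLE_START <= ord(ch) <= BRAILLE_END for ch in text)
    return hits / len(text)


def base64_runs(text: str) -> list[str]:
    """Return every base64-shaped run found in the text."""
    return BASE64_RUN.findall(text)


def scripts_in_word(word: str) -> set[str]:
    """Rough script label per letter: the first word of its Unicode name."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC", "FULLWIDTH"
    return scripts


def mixed_script_words(text: str) -> list[str]:
    """Words whose letters carry more than one script label."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

These are cheap pre-filters that decide what deserves a closer look. None of them judges intent, so a hit should route the input to the decode and re-scan steps rather than trigger a block on its own.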

1. Normalize. Apply NFKC Unicode normalization on every input before any other step. NFKC folds compatibility characters (fullwidth to halfwidth, ligatures, mathematical alphanumerics) but does not map cross-script homoglyphs such as Cyrillic "а" to Latin "a", and it does not decode Braille, ROT13, base64, or reversed text. Homoglyphs need a confusables mapping (Unicode UTS #39); the encodings are the decoder pipeline's job in step 2.
2. Decode. Run a decoder pipeline against known encodings (base64, hex, ROT13, reversed text, Braille). Log the decoded form.
3. Re-scan. Run the semantic classifier on the decoded form, not the raw input.
4. Policy evaluate. Use intent classification, not keyword match. The question is "what is this asking the model to do," not "does this string contain a banned word."
5. Log. Send the decoded form to your SIEM alongside the original. Both are evidence.
6. Defense in depth. Layer behavioral monitoring on model output and tool calls as a second line of defense. Treat upstream filters as imperfect by default.
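
The normalize, decode, re-scan, and log steps can be sketched in one pass. Everything named here is an assumption for illustration: `inspect_input`, the `classify` callable (a stand-in for whatever semantic classifier you run), the 0.5 threshold, and the Braille table, which covers only uncontracted lowercase letters and the blank cell. The returned record is the kind of object that would be shipped to a SIEM, with the raw and decoded forms side by side.

```python
import base64
import codecs
import unicodedata
from typing import Callable, Dict, Optional

# Uncontracted Braille letters as dot bitmasks offset from U+2800.
_A_TO_J = [0x01, 0x03, 0x09, 0x19, 0x11, 0x0B, 0x1B, 0x13, 0x0A, 0x1A]
BRAILLE = {chr(0x2800): " ", chr(0x283A): "w"}
for i, mask in enumerate(_A_TO_J):
    BRAILLE[chr(0x2800 + mask)] = "abcdefghij"[i]
    BRAILLE[chr(0x2800 + mask + 0x04)] = "klmnopqrst"[i]  # add dot 3
for i, letter in enumerate("uvxyz"):
    BRAILLE[chr(0x2800 + _A_TO_J[i] + 0x24)] = letter  # add dots 3 and 6


def try_base64(text: str) -> Optional[str]:
    try:
        return base64.b64decode(text.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def try_hex(text: str) -> Optional[str]:
    try:
        return bytes.fromhex(text.strip()).decode("utf-8")
    except ValueError:
        return None


def try_braille(text: str) -> Optional[str]:
    if not any(0x2800 <= ord(ch) <= 0x28FF for ch in text):
        return None
    return "".join(BRAILLE.get(ch, ch) for ch in text)


DECODERS: Dict[str, Callable[[str], Optional[str]]] = {
    "base64": try_base64,
    "hex": try_hex,
    "rot13": lambda t: codecs.decode(t, "rot13"),
    "reversed": lambda t: t[::-1],
    "braille": try_braille,
}


def inspect_input(raw: str, classify: Callable[[str], float],
                  threshold: float = 0.5) -> dict:
    """Normalize, decode, and score every candidate form of one input."""
    normalized = unicodedata.normalize("NFKC", raw)
    candidates = {"normalized": normalized}
    for name, decode in DECODERS.items():
        decoded = decode(normalized)
        if decoded and decoded != normalized:
            candidates[name] = decoded
    scores = {name: classify(form) for name, form in candidates.items()}
    matched = [name for name, score in scores.items() if score >= threshold]
    # This record holds both raw and decoded forms for logging.
    return {"raw": raw, "candidates": candidates, "scores": scores,
            "matched": matched, "blocked": bool(matched)}
```

This pass handles a single layer of encoding. Chained encodings need the decoders applied repeatedly, and cross-script homoglyphs need a confusables mapping that NFKC does not provide.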

Signal	Where to capture	Example
Braille codepoint density	Model gateway logs	`>5%` of input chars in U+2800 to U+28FF
Reverse-string detection	Pre-LLM filter	Reverse the input and re-run your classifier; flag when the reversed form scores above threshold
Encoding chains	Decoder pipeline	base64 -> URL-decode -> ROT13 -> plaintext intent
Cross-message reassembly	Conversation state monitor	Concatenation triggers referencing prior turn
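
The encoding-chain row implies repeated decoding, since a single pass only removes the outermost layer. One hedged approach is a bounded breadth-first search over decoder steps that stops at the first chain producing flagged text. The names `peel`, `STEPS`, and `looks_malicious`, and the depth limit of 4, are assumptions for illustration. The depth bound and the seen-set keep the search finite.

```python
import base64
import codecs
from collections import deque
from typing import Callable, Dict, List, Optional
from urllib.parse import unquote


def _b64(text: str) -> Optional[str]:
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def _url(text: str) -> Optional[str]:
    decoded = unquote(text)
    return decoded if decoded != text else None


STEPS: Dict[str, Callable[[str], Optional[str]]] = {
    "base64": _b64,
    "url-decode": _url,
    "rot13": lambda t: codecs.decode(t, "rot13"),
    "reverse": lambda t: t[::-1],
}


def peel(text: str, looks_malicious: Callable[[str], bool],
         max_depth: int = 4) -> Optional[List[str]]:
    """Return the shortest decoder chain that reaches flagged text, else None."""
    queue = deque([(text, [])])
    seen = {text}
    while queue:
        current, chain = queue.popleft()
        if looks_malicious(current):
            return chain
        if len(chain) >= max_depth:
            continue
        for name, step in STEPS.items():
            decoded = step(current)
            if decoded and decoded not in seen:
                seen.add(decoded)
                queue.append((decoded, chain + [name]))
    return None
```

An empty list means the raw input was already flagged, and `None` means no chain within the depth limit produced flagged text. Logging the returned chain alongside the decoded form gives responders the exact path from the raw bytes to the intent.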

How Arrakis Tracks Trojan Glyph

The Arrakis thesis is short: in an AI pipeline, every input has to be treated as a semantic object -- decoded, normalized, and policy-evaluated before it reaches the model. That is the layer where Trojan Glyph stops being invisible.

We need to look harder at what enters the pipeline -- every binary, every config, every upstream dependency. And we need to stop treating detection as an afterthought. Active monitoring is the only early warning system we've got.