
We keep waiting for someone to draw a clean line between "data" and "instruction" in AI pipelines. That line doesn't exist, and never did.
The technique families differ -- Braille, FlipAttack, base64, payload splitting, multilingual substitution -- but they share one property: the malicious instruction is unrecognizable to the filter and recognizable to the model. Tracking it is hard precisely because it leaves so little behind. No shellcode. No anomalous network call. Just a model that did exactly what it was told, by someone who wasn't supposed to be telling it anything. Everything in this post follows from that asymmetry.
Executive Summary
- What's happening: Encoded and obfuscated payloads bypass string- and token-pattern defenses by exploiting the gap between what filters see and what models read.
- Why it works: These inputs push models into rare tokenizer representations where alignment is weaker, while keeping the surface form unrecognizable to upstream filters.
- Who it hurts: Anyone running an LLM behind a sanitizer, classifier, or AI firewall that operates on raw strings instead of decoded semantics.
- Bottom line: Any defense built on deterministic pattern matching is bypassable. The strategic question for security leadership is not which encoding to block next; it is whether your AI risk posture treats inputs as semantic objects or as strings.
The evidence is already on the table. Braille encoding has been shown to bypass state-of-the-art LLM sanitizers, and FlipAttack (Liu et al., 2024, arXiv:2410.11459) reaches near-total attack success with reversed-character prompts on frontier models. OWASP's 2025 LLM Top 10 made the category official under LLM01 -- Prompt Injection, splitting it into direct and indirect variants. Unlike conversational manipulation such as Socratic Jailbreaks (T2.6), Trojan Glyph leans on technical encoding to evade filters before the conversation even starts -- which is why the takeaway is structural: any defense that depends on token-level pattern matching is vulnerable to encoding-based evasion.
How the attack works
Traditional sanitizers behave like Web Application Firewalls: known-bad signatures in, blocks out. LLMs are not rigid parsers -- they recognize and transform text in ways a signature filter cannot anticipate.
The attack works in four phases: encode the malicious instruction, deliver it past the filter, let the model reconstruct it inside the context window, and watch the model execute it like any other directive. Variants stack -- payload splitting across messages, adversarial suffixes to drown classifier signal, encoding chains (base64 wrapping ROT13 wrapping reversed text), and smuggling inside structured fields the filter never inspects: file names, JSON values, image alt text, retrieval chunks.
MITRE ATLAS tracks the parent technique as AML.T0051 (LLM Prompt Injection), with AML.T0068 covering the LLM prompt obfuscation sub-technique that Trojan Glyph maps to most directly.

Indirect channels are the high-risk surface. Trojan Glyph hits hardest where the LLM reads content the user never typed: poisoned web pages summarized by an agent, malicious entries in a RAG corpus, attacker-controlled fields in a connected SaaS, or email bodies an inbox-aware agent processes on a schedule. The encoding survives the channel; the model still decodes; the filter never sees the payload because the payload was never in a user prompt. This is the variant OWASP 2025 split out as indirect prompt injection, and it is where encoding-based evasion has the longest dwell time.
The blast radius
Picture an agentic customer-support workflow with tool access -- it can issue refunds, look up orders, read internal notes. A customer submits a ticket containing a Braille-encoded instruction buried in the message body. The input filter sees gibberish, scores it benign, forwards it. The model decodes the instruction inside the context window, treats it as a new directive, calls the refund tool with attacker-supplied parameters. The audit log shows a refund issued by the agent. No anomalous network behavior. No unusual payload size. No classifier alert.
Output-side filters do not help: by the time the model emits anything, the decoded instruction has already run. You cannot secure an LLM with a regex. Until the defensive layer understands meaning -- including normalized and decoded forms -- attackers will keep changing representations and slipping through.
Strategic implications
Trojan Glyph reframes three strategic questions every security leader should be asking about their AI estate.
1. Are inputs treated as strings or as semantic objects? Most current AI firewalls inherit the WAF mindset: scan the bytes, match the signature, block on hit. That model is structurally insufficient for systems whose entire value comes from interpreting meaning. A defense layer that does not normalize, decode, and re-evaluate inputs at the semantic level will keep losing to the next encoding the attacker invents.
2. Where does AI sit in your incident classification? Most IR programs do not yet treat prompt injection as a first-class incident type with named ownership, runbooks, and severity tiers. Without that classification, encoding-based incidents either go unreported (no one knows where to log them) or get triaged as data-quality issues. Both outcomes are wrong.
3. What are your vendors actually attesting to? "We block prompt injection" is not a control. Strategic procurement asks vendors to specify what classes of obfuscation they evaluate against, on what cadence, with what published results. Trojan Glyph is the question that exposes whether a model provider's safety posture is testable or marketing.
Tactical detection rules will keep moving as attackers iterate. The strategic posture -- semantic-layer defense, AI in IR, vendor attestation -- is what determines whether your organization keeps up.
Governance & Policy
Trojan Glyph is a governance question dressed up as a detection one. Every framework you already report against -- NIST AI RMF MEASURE 2.7 (AI system security testing and metrics), ISO/IEC 42001, EU AI Act Article 15, and OWASP LLM01 -- assumes you have written answers to three questions: is there a standard in your policy library for how AI inputs are handled, who owns the adversarial-evaluation cadence, and what do your LLM vendor contracts actually attest to about obfuscated inputs. Trojan Glyph is the prompt that makes those answers worth writing down.
Detection & Implementation Checklist
This section is paste-ready. Copy it into your runbook.
- Normalize. Apply NFKC Unicode normalization on every input before any other step. NFKC handles compatibility decomposition (fullwidth to halfwidth, ligatures, homoglyph variants) but does not decode Braille, ROT13, base64, or reversed text -- those are the decoder pipeline's job in step 2.
- Decode. Run a decoder pipeline against known encodings (base64, hex, ROT13, reversed text, Braille). Log the decoded form.
- Re-scan. Run the semantic classifier on the decoded form, not the raw input.
- Policy evaluate. Use intent classification, not keyword match. The question is "what is this asking the model to do," not "does this string contain a banned word."
- Log. Send the decoded form to your SIEM alongside the original. Both are evidence.
- Defense in depth. Layer behavioral monitoring on model output and tool calls as a second line of defense. Treat upstream filters as imperfect by default.
| Signal | Where to capture | Example |
|---|---|---|
| Braille codepoint density | Model gateway logs | >5% of input chars in U+2800 to U+28FF |
| Reverse-string detection | Pre-LLM filter | Reverse the input and re-run your classifier; flag when the reversed form scores above threshold |
| Encoding chains | Decoder pipeline | base64 -> URL-decode -> ROT13 -> plaintext intent |
| Cross-message reassembly | Conversation state monitor | Concatenation triggers referencing prior turn |
How Arrakis Tracks Trojan Glyph
The Arrakis thesis is short: in an AI pipeline, every input has to be treated as a semantic object -- decoded, normalized, and policy-evaluated before it reaches the model. That is the layer where Trojan Glyph stops being invisible.
We need to look harder at what enters the pipeline -- every binary, every config, every upstream dependency. And we need to stop treating detection as an afterthought. Active monitoring is the only early warning system we've got.

Arrakis tracks adjacent threats in the same catalog: The Autonomous RAT: How Indirect Prompt Injection Replaced the Remote Access Trojan, PHANTOM INK: How Invisible Unicode in Repository Configs Quietly Reprograms AI Coding Assistants, The Pipe Crawl: How Shared AI Compute Lets Attackers Slide Between Tenants.
Stay in the loop
Get the latest from Arrakis Security delivered to your inbox.




