X-Tokenizer · A Multimodal Action Tokenizer for VLA Pretraining

Abstract

Action tokenization as semantic interface learning.

TL;DR

Existing action tokenizers compress motion into discrete codes — but the codes are opaque to the surrounding vision-language model. X-Tokenizer reframes tokenization as semantic interface learning: we co-train a multimodal encoder, a Semantic Residual Quantizer (SRQ), and a paired decoder so that the top-level codes q₀ are simultaneously grounded in language and faithful to motion.

Pretraining at scale (2.4 M trajectories, 2.0 B action frames, 17 arm families) yields one frozen tokenizer that is reused across (i) RoboTwin 2.0 dual-arm benchmarks, (ii) 5-arm cross-embodiment joint training, and (iii) real-world tabletop manipulation, while leaving the host VLM's reasoning ability intact.

On RoboTwin Hard, the dual-arm setting drops by only 3.8 (vs. 5.9 for π0.5). Cross-embodiment joint training lifts Hard by +10.4. The real-world average is 77.4 across 7 tasks with no per-task tuning. Against FAST, multimodal grounding gains +13.5% and long-horizon execution gains +8.25.

Method · Encoder → SRQ → Decoder

One pretraining pipeline. Three reusable parts.

The encoder aligns each action chunk to the VLM’s compressed features from the same timestep — co-temporal downsampled tri-view video, task-level language, and fine-grained sub-task instructions. SRQ discretises the aligned latent into intent code q₀ and kinematic residuals q₁₋₃. The decoder’s primary job is action reconstruction, with an auxiliary head that predicts next-frame VLM features.

X-Tokenizer · Encoder · SRQ · Decoder

Animated walkthrough (20 seconds) · Encoder → Contrastive → SRQ → MAM → Decode

Multimodal encoder

Each action chunk is aligned to the VLM’s downsampled features at the same timestep — not raw pixels. Context comes from tri-view video (same temporal stride), the task-level language instruction, and fine-grained sub-task instructions. Contrastive InfoNCE pulls the action stream into the VLM’s compressed representation space.
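As a rough illustration of the contrastive step, the sketch below computes a one-directional InfoNCE loss over a batch of paired action and VLM embeddings. It is a minimal sketch under stated assumptions, not the X-Tokenizer implementation: the name `info_nce`, the temperature of 0.07, cosine similarity, and the use of in-batch negatives are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(action_emb, vlm_emb, temperature=0.07):
    """One-directional InfoNCE: row i of action_emb should match row i of vlm_emb.

    All other rows in the batch act as negatives (an assumption of this sketch).
    """
    actions = [_normalize(v) for v in action_emb]
    targets = [_normalize(v) for v in vlm_emb]
    loss = 0.0
    for i, a in enumerate(actions):
        # Cosine similarity of action i against every VLM feature in the batch.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_z - logits[i]
    return loss / len(actions)


# Toy batch: aligned pairs should score a lower loss than mismatched pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this loss raises the similarity of each action chunk to its co-temporal VLM feature relative to the other features in the batch, which is the sense in which the action stream is pulled into the VLM's representation space.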

Semantic Residual Quantizer

4-layer residual VQ on the aligned latent. Top code q₀ captures coarse intent; residuals q₁₋₃ absorb continuous kinematic detail. Codebook usage is kept wide via EMA reset.
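The coarse-to-fine behaviour of a residual quantizer can be sketched in a few lines. This is a generic residual VQ under stated assumptions, not the SRQ itself: the toy 2-D codebooks, the helper names (`nearest`, `rvq_encode`, `rvq_decode`, `make_codebook`), and the nearest-neighbour lookup by squared Euclidean distance are illustrative, and the EMA reset is omitted.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vec, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def rvq_encode(latent, codebooks):
    """Quantize level by level; each level encodes what the previous ones missed."""
    residual = list(latent)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - e for r, e in zip(residual, codebook[index])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Sum the selected entry from every level to rebuild the latent."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + e for o, e in zip(out, codebook[index])]
    return out


def make_codebook(scale):
    # Toy codebook: four axis-aligned 2-D vectors at the given scale (illustrative only).
    return [[scale, 0.0], [-scale, 0.0], [0.0, scale], [0.0, -scale]]


# Four levels, mirroring q0 (coarse) and q1..q3 (progressively finer residuals).
codebooks = [make_codebook(s) for s in (1.0, 0.5, 0.25, 0.125)]
latent = [1.0, 0.5]
codes, residual = rvq_encode(latent, codebooks)
recon = rvq_decode(codes, codebooks)
```

Here `codes[0]` plays the role of q₀ and `codes[1:]` the role of q₁₋₃; the reconstruction plus the leftover residual equals the original latent by construction.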

Decoder + dual heads

Primary: masked action modeling reconstructs the original continuous action chunk. Auxiliary: a lightweight head predicts next-frame VLM features as extra supervision — keeping codes grounded in the VLM space without shifting the main objective away from action reconstruction.
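One way to picture the two heads is as a weighted sum of a primary and an auxiliary loss. The sketch below is an assumption-laden illustration rather than the paper's objective: the function `decoder_loss`, the choice of mean squared error for both terms, scoring reconstruction only at masked timesteps, and the weight `aux_weight=0.1` are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decoder_loss(pred_actions, true_actions, masked_steps,
                 pred_next_feat, true_next_feat, aux_weight=0.1):
    """Combine masked action reconstruction with a down-weighted auxiliary term."""
    # Primary term: reconstruct the continuous action at each masked timestep.
    recon = sum(
        mse(pred_actions[t], true_actions[t]) for t in masked_steps
    ) / len(masked_steps)
    # Auxiliary term: match the next-frame VLM feature vector.
    aux = mse(pred_next_feat, true_next_feat)
    # A small aux_weight keeps action reconstruction the dominant objective.
    return recon + aux_weight * aux, recon, aux


# Toy chunk of three timesteps with two action dimensions; timestep 1 is masked.
true_actions = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]
pred_actions = [[0.0, 0.1], [0.2, 0.5], [0.4, 0.5]]
total, recon, aux = decoder_loss(
    pred_actions, true_actions, [1], [1.0, 0.0], [0.0, 0.0]
)
```

Keeping the auxiliary weight well below 1 is one simple way to add VLM-space supervision without shifting the main objective away from action reconstruction.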

Codebook · diagnostics

The codebook is used. The embedding space is aligned.

Wide and balanced codebook utilisation (no collapse, no dead codes). UMAP of the q₀ embeddings shows that action codes cluster by semantic intent, not by raw kinematic similarity.

Codebook usage

Token frequencies across the full 2.0 B pretraining frames stay within a tight band — no degenerate cluster dominates.

Perplexity

Effective codebook size rises monotonically during pretraining and saturates near the theoretical ceiling.
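Codebook perplexity is commonly computed as the exponential of the entropy of the code-usage distribution; it equals the codebook size when every code is used equally often and falls to 1 when a single code dominates. The sketch below shows that standard definition. Whether the page's plot uses exactly this formula is an assumption, and the function name `codebook_perplexity` is illustrative.

```python
import math
from collections import Counter


def codebook_perplexity(codes):
    """Effective number of codes in use: exp of the Shannon entropy of usage."""
    counts = Counter(codes)
    total = len(codes)
    entropy = -sum(
        (count / total) * math.log(count / total) for count in counts.values()
    )
    return math.exp(entropy)


# Balanced usage of four codes versus a collapsed codebook.
balanced = codebook_perplexity([0, 1, 2, 3] * 10)
collapsed = codebook_perplexity([7] * 40)
```

For a codebook with K entries the theoretical ceiling of this quantity is K, reached only under perfectly uniform usage.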

Semantic alignment

UMAP of q₀ code embeddings: clusters emerge by verb/intent, not by joint trajectory similarity.

Robustness · code stability under noise

Same intent, same codes.

Inject Gaussian noise into the action chunk before re-tokenizing. The top-level q₀ codes barely move — residuals q₁₋₃ absorb the perturbation. This is what makes the semantic interface stable across embodiments and demonstration noise.

Animation: each row corresponds to a different noise level σ.
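The stability check described above can be sketched as an agreement rate: tokenize each chunk, add Gaussian noise, tokenize again, and count how often the code at a given level is unchanged. Everything in the sketch is hypothetical scaffolding: `toy_tokenize` is a stand-in two-level quantizer, not X-Tokenizer, and the chunks and σ values are made up for illustration.

```python
import random


def toy_tokenize(chunk):
    """Hypothetical two-level tokenizer: a coarse code from the rounded mean,
    then a fine code from the remainder at ten times the resolution."""
    mean = sum(chunk) / len(chunk)
    q0 = round(mean)
    q1 = round((mean - q0) * 10)
    return [q0, q1]


def code_agreement(tokenize, chunks, sigma, level, rng):
    """Fraction of chunks whose code at `level` survives Gaussian noise of std sigma."""
    same = 0
    for chunk in chunks:
        noisy = [x + rng.gauss(0.0, sigma) for x in chunk]
        if tokenize(chunk)[level] == tokenize(noisy)[level]:
            same += 1
    return same / len(chunks)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
chunks = [[0.1, 0.2, 0.15], [1.05, 0.95, 1.0], [2.2, 2.1, 2.3]]
for sigma in (0.0, 0.05, 0.2):
    top = code_agreement(toy_tokenize, chunks, sigma, 0, rng)
    fine = code_agreement(toy_tokenize, chunks, sigma, 1, rng)
    print(f"sigma={sigma}: q0 agreement={top:.2f}, q1 agreement={fine:.2f}")
```

With a real tokenizer in place of `toy_tokenize`, a high agreement rate at level 0 alongside a lower rate at the residual levels would correspond to the behaviour described here, where q₀ stays put and q₁₋₃ absorb the perturbation.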

Quantitative results · 3 benchmarks

One tokenizer, three regimes.

RoboTwin 2.0 dual-arm (50 tasks · Easy/Medium/Hard), 5-arm cross-embodiment joint training, and real-world tabletop manipulation across 7 tasks. Numbers below are success rates; all VLAs share the same backbone — only the action tokenizer differs.

RoboTwin 2.0 · dual-arm

50 tasks · Easy / Medium / Hard

Hard drop −3.8

Legend · π₀ · π0.5 · X-VLA · Wall-OSS + X-Tokenizer (ours)

Cross-embodiment · 5-arm joint training

Family-matrix transfer · Easy / Medium / Hard

Hard +10.4

Legend · Single-embodiment · Joint (5-embodiment)

Real-world · 7 tabletop tasks + VQA

Single frozen X-Tokenizer · no per-task tuning

Avg 77.4 · 8 / 9 ⭐

Legend · Wall-OSS · + FAST · + RVQ (no-aux) · + X-Tokenizer (ours)

Real-world · tabletop rollouts

One model. Seven tasks. No per-task tuning.

All clips below are autonomous policy rollouts on the same physical platform. The policy uses a single, frozen X-Tokenizer.

Citation

BibTeX

@article{xtokenizer2026,
  title  = {X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
  author = {Kang, Xirui and Shi, Yanpei and Liang, Lucy and Gan, Roy and Liu, Dongxiu and Zhang, Pushi and Chen, Danpeng and Qin, Xiaoyi and Zheng, Yinan and Zheng, Jinliang and Wang, Hao and Zhan, Xianyuan and Su, Hang},
  year   = {2026},
}