X-Tokenizer

A multimodal action tokenizer that doubles as a semantic interface
between vision-language reasoning and continuous robot control.
One frozen tokenizer, pretrained on 2.4 M trajectories across 17 arm families, is reused for both VLM grounding and downstream VLA pretraining without per-task tuning.
Xirui Kang1,2,3,*· Yanpei Shi1,*,†· Lucy Liang1,†· Roy Gan1· Dongxiu Liu1,3· Pushi Zhang1· Danpeng Chen1· Xiaoyi Qin1· Yinan Zheng3· Jinliang Zheng3· Hao Wang1· Xianyuan Zhan3,‡· Hang Su1,3,‡
1 X Square Robot · 2 City University of Hong Kong · 3 Tsinghua University
* Equal contribution · † Project lead · ‡ Corresponding author
Abstract

Action tokenization as semantic interface learning.

TL;DR

Existing action tokenizers compress motion into discrete codes — but the codes are opaque to the surrounding vision-language model. X-Tokenizer reframes tokenization as semantic interface learning: we co-train a multimodal encoder, a Semantic Residual Quantizer (SRQ), and a paired decoder so that the top-level codes q₀ are simultaneously grounded in language and faithful to motion.

Pretraining at scale (2.4 M trajectories, 2.0 B action frames, 17 arm families) yields one frozen tokenizer that is reused across (i) RoboTwin 2.0 dual-arm benchmarks, (ii) 5-arm cross-embodiment joint training, and (iii) real-world tabletop manipulation, while leaving the host VLM's reasoning ability intact.

On RoboTwin Hard, the dual-arm setting degrades by only 3.8 (vs. 5.9 for π0.5). Cross-embodiment joint training lifts Hard by +10.4. The real-world average is 77.4 across 7 tasks with no per-task tuning. Against FAST, multimodal grounding improves by 13.5% and long-horizon execution by 8.25.

Method · Encoder → SRQ → Decoder

One pretraining pipeline. Three reusable parts.

The encoder aligns each action chunk to the VLM’s compressed features from the same timestep — co-temporal downsampled tri-view video, task-level language, and fine-grained sub-task instructions. SRQ discretises the aligned latent into intent code q₀ and kinematic residuals q₁₋₃. The decoder’s primary job is action reconstruction, with an auxiliary head that predicts next-frame VLM features.
Figure: X-Tokenizer pipeline (Encoder → Contrastive → SRQ → MAM → Decode). Inputs: sparse multi-view video, instructions (Task: "Package the coffee box"; Sub: "Use the right arm to place the coffee box into the carton."), and an action chunk. A frozen context encoder and a trainable action encoder are aligned by contrastive learning. Semantic Residual Quantization (trainable, 4 layers) produces tokens 1 … n. Masked action token prediction (Transformer + softmax) predicts the masked action tokens. A trainable action decoder outputs the reconstructed action chunk, and a trainable future feature predictor outputs the predicted future context feature.

Multimodal encoder

Each action chunk is aligned to the VLM’s downsampled features at the same timestep — not raw pixels. Context comes from tri-view video (same temporal stride), the task-level language instruction, and fine-grained sub-task instructions. Contrastive InfoNCE pulls the action stream into the VLM’s compressed representation space.
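
The contrastive alignment described above can be illustrated with a minimal, one-directional InfoNCE sketch. It assumes that row i of the action embeddings and row i of the context embeddings come from the same timestep (the positive pair) and that all other rows in the batch act as negatives. The function name `info_nce`, the cosine similarity, and the temperature value are illustrative assumptions, not details taken from X-Tokenizer.

```python
import math


def info_nce(action_emb, context_emb, temperature=0.07):
    """One-directional InfoNCE over a batch of paired embeddings.

    Assumption: action_emb[i] and context_emb[i] form the positive pair;
    every other context row in the batch is treated as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    actions = [normalize(v) for v in action_emb]
    contexts = [normalize(v) for v in context_emb]

    loss = 0.0
    for i, a in enumerate(actions):
        # Cosine similarity to every context row, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in contexts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching row as the target class.
        loss += log_denom - logits[i]
    return loss / len(actions)


# Toy batch: aligned pairs should score a much lower loss than shuffled pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this loss pulls each action embedding toward the context embedding of its own timestep and pushes it away from the others in the batch.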

Semantic Residual Quantizer

4-layer residual VQ on the aligned latent. Top code q₀ captures coarse intent; residuals q₁₋₃ absorb continuous kinematic detail. Codebook usage is kept wide via EMA reset.
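
As a rough illustration of how a 4-layer residual quantizer splits a latent into a coarse code and finer residual codes, the sketch below runs a nearest-neighbour lookup per layer and passes the leftover residual to the next layer. The toy codebooks, the 2-D latent, and the helper names are hypothetical; the EMA update and reset logic is not modelled here.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(latent, codebooks):
    """Quantize layer by layer; each layer encodes what earlier layers missed."""
    residual = list(latent)
    recon = [0.0] * len(latent)
    codes = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        codes.append(k)
        recon = [r + c for r, c in zip(recon, codebook[k])]
        residual = [x - c for x, c in zip(residual, codebook[k])]
    return codes, recon


# Hypothetical toy codebooks: a coarse first layer, then finer residual layers.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],            # layer for q0 (coarse)
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],           # layer for q1
    [[0.0, 0.0], [0.0625, 0.0], [0.0, 0.0625]],       # layer for q2
    [[0.0, 0.0], [0.015625, 0.0], [0.0, 0.015625]],   # layer for q3
]
codes, recon = residual_quantize([1.2, 1.05], codebooks)
```

In this toy setup the first code alone places the latent near [1.0, 1.0], and the later layers only nudge the reconstruction, which mirrors the intent-versus-detail split described above.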

Decoder + dual heads

Primary: masked action modeling reconstructs the original continuous action chunk. Auxiliary: a lightweight head predicts next-frame VLM features as extra supervision — keeping codes grounded in the VLM space without shifting the main objective away from action reconstruction.
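
One way to picture the two heads is a random token mask followed by a weighted sum of two losses, sketched below. The `MASK` id, the mask ratio, the mean-squared-error losses, and the `aux_weight` value are all assumptions made for illustration; the page does not specify them.

```python
import random

MASK = -1  # hypothetical id reserved for masked positions


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of token ids with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def total_loss(pred_action, true_action, pred_feat, next_feat, aux_weight=0.1):
    """Primary action reconstruction plus a down-weighted auxiliary term."""
    return mse(pred_action, true_action) + aux_weight * mse(pred_feat, next_feat)


rng = random.Random(0)
tokens = [5, 3, 7, 1, 2, 9]
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
loss = total_loss([1.0, 2.0], [1.0, 2.0], [0.0], [1.0])
```

Keeping `aux_weight` small is one simple way to let the feature-prediction head add supervision while action reconstruction remains the dominant term.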

Codebook · diagnostics

The codebook is used. The embedding space is aligned.

Wide and balanced codebook utilisation (no collapse, no dead codes). UMAP of the q₀ embeddings shows that action codes cluster by semantic intent, not by raw kinematic similarity.
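
Two common diagnostics for "wide and balanced" usage are the fraction of codebook entries that appear at all and the perplexity of the code distribution. The sketch below computes both from a list of code indices; the function name and the toy inputs are illustrative assumptions, and the page does not state which metrics were actually used.

```python
import math
from collections import Counter


def codebook_stats(codes, codebook_size):
    """Return (utilisation, perplexity) for a sequence of code indices.

    utilisation: share of codebook entries used at least once.
    perplexity: exp(entropy) of the empirical code distribution; it equals
    the codebook size when usage is perfectly uniform and 1 when collapsed.
    """
    counts = Counter(codes)
    total = len(codes)
    utilisation = len(counts) / codebook_size
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return utilisation, math.exp(entropy)


balanced = codebook_stats([0, 1, 2, 3] * 5, codebook_size=4)
collapsed = codebook_stats([0] * 20, codebook_size=4)
```

A collapsed codebook shows up as low utilisation and a perplexity near 1, while a healthy one keeps perplexity close to the codebook size.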
Robustness · code stability under noise

Same intent, same codes.

Inject Gaussian noise into the action chunk before re-tokenizing. The top-level q₀ codes barely move — residuals q₁₋₃ absorb the perturbation. This is what makes the semantic interface stable across embodiments and demonstration noise.
Figure: Code stability under noise. A 24-frame action chunk (6 sub-chunks of 4 tokens) is perturbed, then re-tokenized. Each row corresponds to a different noise level σ. As σ rises, only a few q₀ cells flip (circled in red); the rest of the intent stays the same. At σ = 0.006, only 1 / 6 of the top-level q₀ tokens flip, and the residual layers q₁₋₃ absorb the perturbation.
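
The stability check can be expressed as a flip rate: the fraction of top-level codes that change after the chunk is perturbed and re-tokenized. The sketch below uses a hypothetical stand-in tokenizer (`toy_q0`, a coarse bin of each sub-chunk mean), because the real X-Tokenizer encoder is not available here; only the `flip_rate` computation reflects the measurement described above.

```python
import random


def flip_rate(codes_clean, codes_noisy):
    """Fraction of positions whose code changed after perturbation."""
    if len(codes_clean) != len(codes_noisy):
        raise ValueError("code sequences must have equal length")
    flips = sum(a != b for a, b in zip(codes_clean, codes_noisy))
    return flips / len(codes_clean)


def perturb(chunk, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value."""
    return [x + rng.gauss(0.0, sigma) for x in chunk]


def toy_q0(chunk, sub_len=4, step=0.5):
    """Hypothetical stand-in for q0: coarse bin of each sub-chunk mean."""
    codes = []
    for start in range(0, len(chunk), sub_len):
        sub = chunk[start:start + sub_len]
        codes.append(round(sum(sub) / len(sub) / step))
    return codes


rng = random.Random(0)
chunk = [0.1 * t for t in range(24)]   # 24 frames -> 6 sub-chunks of 4
clean = toy_q0(chunk)
rate_no_noise = flip_rate(clean, toy_q0(perturb(chunk, 0.0, rng)))
rate_example = flip_rate([3, 1, 4, 1, 5, 9], [3, 1, 4, 1, 5, 2])  # one of six differs
```

Sweeping `sigma` over a range and plotting the resulting flip rate per quantizer layer would show whether the perturbation lands in the top-level codes or in the residual layers.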
Quantitative results · 3 benchmarks

One tokenizer, three regimes.

RoboTwin 2.0 dual-arm (50 tasks · Easy/Medium/Hard), 5-arm cross-embodiment joint training, and real-world tabletop manipulation across 7 tasks. Numbers below are success rates; all VLAs share the same backbone — only the action tokenizer differs.
RoboTwin 2.0 · dual-arm
50 tasks · Easy / Medium / Hard
Hard drop: −3.8
Compared: π0 · π0.5 · X-VLA · Wall-OSS + X-Tokenizer (ours)

Cross-embodiment · 5-arm joint training
Family-matrix transfer · Easy / Medium / Hard
Hard: +10.4
Compared: Single-embodiment · Joint (5-embodiment)

Real-world · 7 tabletop tasks + VQA
Single frozen X-Tokenizer · no per-task tuning
Avg 77.4 · 8 / 9 ⭐
Compared: Wall-OSS · + FAST · + RVQ (no-aux) · + X-Tokenizer (ours)
Real-world · tabletop rollouts

One model. Seven tasks. No per-task tuning.

All clips below are autonomous policy rollouts on the same physical platform. The policy uses a single, frozen X-Tokenizer.
Citation

BibTeX

@article{xtokenizer2026,
  title  = {X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
  author = {Kang, Xirui and Shi, Yanpei and Liang, Lucy and Gan, Roy and Liu, Dongxiu and Zhang, Pushi and Chen, Danpeng and Qin, Xiaoyi and Zheng, Yinan and Zheng, Jinliang and Wang, Hao and Zhan, Xianyuan and Su, Hang},
  year   = {2026},
}