RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong1,2* Hongyu Li1* Shanyuan Liu2* Bo Cheng2 Yuhang Ma2 Liebucha Wu2

Xiaoyu Wu2 Manyuan Zhang3 Dawei Leng2 Yuhui Yin2 Lijun Zhang1

1Beihang University 2360 AI Research 3The Chinese University of Hong Kong

* Equal contribution. † Corresponding authors.

A representation-based tokenizer that improves both image generation and editing.


Abstract

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity caused by frozen encoders, which in turn degrades editing quality, and from overly high-dimensional latents that make diffusion modeling difficult.

To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that lets a representation-initialized encoder be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a Variational Bridge that compresses the high-dimensional representation features into a compact latent space for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative-tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity.

Introduction

Diffusion tokenizers sit between two competing requirements: they should reconstruct images faithfully for editing, but also provide compact and structured latents that remain easy for diffusion models to learn.

Motivation

Reconstruction fidelity and generative tractability are both necessary.

The introduction argues that better reconstruction helps preserve identity and structure for image editing, while generation benefits from latents that are semantically organized, lower-dimensional, and easier to denoise. Existing representation-based tokenizers often preserve semantics but suffer from frozen encoders or overly high-dimensional latent spaces, making it difficult to improve both generation and editing together.

Motivation of RPiAE

Method

RPiAE addresses this trade-off with a representation-pivoted autoencoder and an objective-decoupled training strategy.

Architecture

Representation-Pivoted Autoencoder

RPiAE uses a trainable representation-model encoder together with a frozen Pivot Replica Encoder for semantic supervision. A Variational Bridge compresses high-dimensional representation features into compact latents, and a decoder reconstructs the image from the bridged features.

Overview of RPiAE
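The components described above can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the paper's implementation: the class names (`VariationalBridge`, `RPiAESketch`), the feature and latent dimensions, and the reparameterized Gaussian bridge with a standard-normal KL term are assumptions chosen for clarity.

```python
import copy
import torch
import torch.nn as nn

class VariationalBridge(nn.Module):
    """Compress high-dimensional representation features into compact
    Gaussian latents, then expand them back for the decoder.
    Dimensions here are illustrative, not the paper's exact sizes."""

    def __init__(self, rep_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.to_stats = nn.Linear(rep_dim, 2 * latent_dim)  # predicts (mu, logvar)
        self.expand = nn.Linear(latent_dim, rep_dim)

    def forward(self, feats):
        mu, logvar = self.to_stats(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL divergence to a standard-normal prior, averaged over elements
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
        return self.expand(z), z, kl

class RPiAESketch(nn.Module):
    """Trainable representation-initialized encoder, frozen Pivot Replica
    Encoder for semantic supervision, Variational Bridge, and decoder."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 rep_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.encoder = encoder               # fine-tuned during training
        self.pivot = copy.deepcopy(encoder)  # frozen copy of the pretrained encoder
        for p in self.pivot.parameters():
            p.requires_grad_(False)
        self.bridge = VariationalBridge(rep_dim, latent_dim)
        self.decoder = decoder

    def forward(self, x):
        feats = self.encoder(x)
        with torch.no_grad():
            pivot_feats = self.pivot(x)      # semantic target for pivot regularization
        bridged, z, kl = self.bridge(feats)
        recon = self.decoder(bridged)
        return recon, feats, pivot_feats, z, kl
```

Because the pivot is a frozen replica of the encoder's initialization, its features give a fixed semantic anchor that the trainable encoder can be regularized toward while it adapts to reconstruction.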

Training Strategy

Objective-decoupled stage-wise optimization

Training is decomposed into three stages: pivot-regularized encoder tuning, variational bridge learning with KL regularization, and decoder specialization under fixed latents. This design separates reconstruction fidelity, semantic preservation, and generative tractability.

Three-stage training of RPiAE
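The three stages above can be expressed as per-stage objectives. A hedged sketch, not the paper's exact losses: the weights `beta` and `lam`, the cosine form of the pivot-regularization term, and the omission of perceptual or adversarial losses are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def stage_loss(stage, recon, target, feats, pivot_feats, kl,
               beta=1e-4, lam=1.0):
    """Per-stage objective for the decoupled schedule (illustrative weights)."""
    rec = F.mse_loss(recon, target)  # reconstruction fidelity, used in every stage
    if stage == 1:
        # Pivot-regularized encoder tuning: keep encoder features close to
        # the frozen Pivot Replica Encoder while improving reconstruction.
        pivot_reg = 1.0 - F.cosine_similarity(feats, pivot_feats, dim=-1).mean()
        return rec + lam * pivot_reg
    if stage == 2:
        # Variational bridge learning: KL regularization toward a
        # standard-normal prior keeps the compact latents easy to model.
        return rec + beta * kl
    if stage == 3:
        # Decoder specialization under fixed latents: reconstruction only.
        return rec
    raise ValueError(f"unknown stage: {stage}")
```

Optimizing one objective family per stage means each component (encoder, bridge, decoder) is tuned against the criterion it is responsible for, rather than balancing all three losses at once.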

Results

Image Reconstruction

Best reconstruction fidelity among internal representation-model (internal-RM) tokenizers

On ImageNet-1K reconstruction, RPiAE achieves rFID 0.50, PSNR 21.3, LPIPS 0.216, and SSIM 0.525, improving reconstruction quality while preserving strong semantic structure.

Class-Conditional Generation

Strong generation quality with compact latents

With 80 training epochs, RPiAE reaches gFID 2.25 without CFG and 1.51 with CFG (IS 225.9 and Recall 0.65 under CFG), the best generation result with CFG in the main table.

Main Takeaway

Generation and editing improve together

The main result supports the paper's central claim: RPiAE improves reconstruction without sacrificing generation, showing that preserving representation semantics and reducing latent modeling difficulty can benefit both sides at once.

Main Table

Reconstruction and class-conditional generation on ImageNet-1K

The main comparison table shows that RPiAE achieves the strongest overall trade-off among internal-RM tokenizers, with the best reconstruction fidelity and the best generation result with CFG.

Method Tokenizer rFID↓ PSNR↑ LPIPS↓ SSIM↑ Epochs gFID↓ IS↑ Rec.↑ gFID↓(CFG) IS↑(CFG) Rec.↑(CFG)
MaskGIT VQGAN 2.23 17.9 0.202 0.422 555 6.18 182.1 0.51 - - -
LlamaGen VQGAN 0.59 24.5 - 0.813 300 9.38 112.9 0.67 2.18 263.3 0.58
REPA-XL SD-VAE 0.61 26.9 0.130 0.736 80 7.90 - - - - -
LightningDiT VA-VAE 0.27 27.7 0.097 0.779 64 5.14 130.2 0.62 - - -
LightningDiT RAE-S 0.64 18.9 0.252 0.489 80 3.05 166.1 0.60 - - -
DiTDH-XL RAE-B 0.57 18.8 0.256 0.483 80 2.16 214.8 0.59 1.74 235.0 0.60
LightningDiT FAE-d32 0.68 - - - 80 2.08 207.6 0.59 1.70 243.8 0.61
LightningDiT RPiAE (60 ep.) 0.50 21.3 0.216 0.525 60 2.46 201.1 0.59 2.06 208.5 0.61
LightningDiT RPiAE (80 ep.) 0.50 21.3 0.216 0.525 80 2.25 208.7 0.60 1.51 225.9 0.65

Appendix Table

Extended class-conditional generation results at 800 epochs

The supplementary experiments extend class-conditional generation to 800 epochs. RPiAE achieves the best Inception Score without CFG, and the best gFID and Recall with CFG among internal-RM models.

Method Tokenizer Epochs gFID↓ IS↑ Prec.↑ Rec.↑ gFID↓(CFG) IS↑(CFG) Prec.↑(CFG) Rec.↑(CFG)
REPA-XL SD-VAE 800 5.90 - - - 1.42 305.7 0.80 0.65
LightningDiT VA-VAE 800 2.17 205.6 0.77 0.65 1.35 295.3 0.79 0.65
LightningDiT RAE-B 800 1.87 209.7 0.80 0.63 1.41 309.4 0.80 0.63
DiTDH-XL RAE-B 800 1.51 242.9 0.79 0.63 1.13 262.6 0.78 0.67
LightningDiT FAE-d32 800 1.48 239.8 0.81 0.63 1.29 268.0 0.80 0.64
LightningDiT RPiAE 800 1.68 254.7 0.79 0.63 1.09 272.1 0.75 0.70

Benchmark and training curve results

Training curves show higher performance ceilings and faster convergence on GenEval, DPG-Bench, and GEdit.

Selected Samples in Class Conditional Image Generation Task

BibTeX


@misc{RPiAE,
  title={RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing},
  author={Yue Gong and Hongyu Li and Shanyuan Liu and Bo Cheng and Yuhang Ma and Liebucha Wu and Xiaoyu Wu and Manyuan Zhang and Dawei Leng and Yuhui Yin and Lijun Zhang},
  year={2026},
  eprint={2603.19206},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.19206},
}