RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong1,2* Hongyu Li1* Shanyuan Liu2* Bo Cheng2 Yuhang Ma2 Liebucha Wu2

Xiaoyu Wu2 Manyuan Zhang3 Dawei Leng2 Yuhui Yin2 Lijun Zhang1

1Beihang University 2360 AI Research 3The Chinese University of Hong Kong

* Equal contribution. † Corresponding authors.

A representation-based tokenizer that improves both image generation and editing.


Abstract

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity caused by frozen encoders, which in turn degrades editing quality, and from overly high-dimensional latents that make diffusion modeling difficult.

To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that lets a representation-initialized encoder be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a Variational Bridge that compresses the high-dimensional representation features into a compact latent space for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative-tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity.

Introduction

Diffusion tokenizers sit between two competing requirements: they should reconstruct images faithfully for editing, but also provide compact and structured latents that remain easy for diffusion models to learn.

Motivation

Reconstruction fidelity and generative tractability are both necessary.

The introduction argues that better reconstruction helps preserve identity and structure for image editing, while generation benefits from latents that are semantically organized, lower-dimensional, and easier to denoise. Existing representation-based tokenizers often preserve semantics but suffer from frozen encoders or overly high-dimensional latent spaces, making it difficult to improve both generation and editing together.

Motivation of RPiAE

Method

RPiAE addresses this trade-off with a representation-pivoted autoencoder and an objective-decoupled training strategy.

Architecture

Representation-Pivoted Autoencoder

RPiAE uses a trainable representation-model encoder together with a frozen Pivot Replica Encoder for semantic supervision. A Variational Bridge compresses high-dimensional representation features into compact latents, and a decoder reconstructs the image from the bridged features.

Overview of RPiAE
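The components described above can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the paper's implementation: the class names (`VariationalBridge`, `RPiAESketch`), the feature and latent dimensions, and the reparameterized Gaussian bridge with a standard-normal KL term are assumptions chosen for clarity.

```python
import copy
import torch
import torch.nn as nn

class VariationalBridge(nn.Module):
    """Compress high-dimensional representation features into compact
    Gaussian latents, then expand them back for the decoder.
    Dimensions here are illustrative, not the paper's exact sizes."""

    def __init__(self, rep_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.to_stats = nn.Linear(rep_dim, 2 * latent_dim)  # predicts (mu, logvar)
        self.expand = nn.Linear(latent_dim, rep_dim)

    def forward(self, feats):
        mu, logvar = self.to_stats(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL divergence to a standard-normal prior, averaged over elements
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
        return self.expand(z), z, kl

class RPiAESketch(nn.Module):
    """Trainable representation-initialized encoder, frozen Pivot Replica
    Encoder for semantic supervision, Variational Bridge, and decoder."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 rep_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.encoder = encoder               # fine-tuned during training
        self.pivot = copy.deepcopy(encoder)  # frozen copy of the pretrained encoder
        for p in self.pivot.parameters():
            p.requires_grad_(False)
        self.bridge = VariationalBridge(rep_dim, latent_dim)
        self.decoder = decoder

    def forward(self, x):
        feats = self.encoder(x)
        with torch.no_grad():
            pivot_feats = self.pivot(x)      # semantic target for pivot regularization
        bridged, z, kl = self.bridge(feats)
        recon = self.decoder(bridged)
        return recon, feats, pivot_feats, z, kl
```

Because the pivot is a frozen replica of the encoder's initialization, its features give a fixed semantic anchor that the trainable encoder can be regularized toward while it adapts to reconstruction.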

Training Strategy

Objective-decoupled stage-wise optimization

Training is decomposed into three stages: pivot-regularized encoder tuning, variational bridge learning with KL regularization, and decoder specialization under fixed latents. This design separates reconstruction fidelity, semantic preservation, and generative tractability.

Three-stage training of RPiAE
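The three stages above can be expressed as per-stage objectives. A hedged sketch, not the paper's exact losses: the weights `beta` and `lam`, the cosine form of the pivot-regularization term, and the omission of perceptual or adversarial losses are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def stage_loss(stage, recon, target, feats, pivot_feats, kl,
               beta=1e-4, lam=1.0):
    """Per-stage objective for the decoupled schedule (illustrative weights)."""
    rec = F.mse_loss(recon, target)  # reconstruction fidelity, used in every stage
    if stage == 1:
        # Pivot-regularized encoder tuning: keep encoder features close to
        # the frozen Pivot Replica Encoder while improving reconstruction.
        pivot_reg = 1.0 - F.cosine_similarity(feats, pivot_feats, dim=-1).mean()
        return rec + lam * pivot_reg
    if stage == 2:
        # Variational bridge learning: KL regularization toward a
        # standard-normal prior keeps the compact latents easy to model.
        return rec + beta * kl
    if stage == 3:
        # Decoder specialization under fixed latents: reconstruction only.
        return rec
    raise ValueError(f"unknown stage: {stage}")
```

Optimizing one objective family per stage means each component (encoder, bridge, decoder) is tuned against the criterion it is responsible for, rather than balancing all three losses at once.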

Results

Image Reconstruction

Best reconstruction fidelity among internal representation-model (internal-RM) tokenizers

On ImageNet-1K reconstruction, RPiAE achieves rFID 0.50, PSNR 21.3, LPIPS 0.216, and SSIM 0.525, improving reconstruction quality while preserving strong semantic structure.

Class-Conditional Generation

Strong generation quality with compact latents

With 80 training epochs, RPiAE reaches gFID 2.25 without CFG and 1.51 with CFG (IS 225.9 and Recall 0.65 under CFG), the best generation result with CFG in the main table.

Main Takeaway

Generation and editing improve together

The main result supports the paper's central claim: RPiAE improves reconstruction without sacrificing generation, showing that preserving representation semantics and reducing latent modeling difficulty can benefit both sides at once.

Main Table

Reconstruction and class-conditional generation on ImageNet-1K

The main comparison table shows that RPiAE achieves the strongest overall trade-off among internal-RM tokenizers, with the best reconstruction fidelity and the best generation result with CFG.

Method Tokenizer rFID↓ PSNR↑ LPIPS↓ SSIM↑ Epochs gFID↓ IS↑ Rec.↑ gFID↓(CFG) IS↑(CFG) Rec.↑(CFG)
MaskGIT VQGAN 2.23 17.9 0.202 0.422 555 6.18 182.1 0.51 - - -
LlamaGen VQGAN 0.59 24.5 - 0.813 300 9.38 112.9 0.67 2.18 263.3 0.58
REPA-XL SD-VAE 0.61 26.9 0.130 0.736 80 7.90 - - - - -
LightningDiT VA-VAE 0.27 27.7 0.097 0.779 64 5.14 130.2 0.62 - - -
LightningDiT RAE-S 0.64 18.9 0.252 0.489 80 3.05 166.1 0.60 - - -
DiTDH-XL RAE-B 0.57 18.8 0.256 0.483 80 2.16 214.8 0.59 1.74 235.0 0.60
LightningDiT FAE-d32 0.68 - - - 80 2.08 207.6 0.59 1.70 243.8 0.61
LightningDiT RPiAE (60 ep.) 0.50 21.3 0.216 0.525 60 2.46 201.1 0.59 2.06 208.5 0.61
LightningDiT RPiAE (80 ep.) 0.50 21.3 0.216 0.525 80 2.25 208.7 0.60 1.51 225.9 0.65

Appendix Table

Extended class-conditional generation results at 800 epochs

The supplementary experiments extend class-conditional generation to 800 epochs. RPiAE achieves the best Inception Score without CFG, and the best gFID and Recall with CFG among internal-RM models.

Method Tokenizer Epochs gFID↓ IS↑ Prec.↑ Rec.↑ gFID↓(CFG) IS↑(CFG) Prec.↑(CFG) Rec.↑(CFG)
REPA-XL SD-VAE 800 5.90 - - - 1.42 305.7 0.80 0.65
LightningDiT VA-VAE 800 2.17 205.6 0.77 0.65 1.35 295.3 0.79 0.65
LightningDiT RAE-B 800 1.87 209.7 0.80 0.63 1.41 309.4 0.80 0.63
DiTDH-XL RAE-B 800 1.51 242.9 0.79 0.63 1.13 262.6 0.78 0.67
LightningDiT FAE-d32 800 1.48 239.8 0.81 0.63 1.29 268.0 0.80 0.64
LightningDiT RPiAE 800 1.68 254.7 0.79 0.63 1.09 272.1 0.75 0.70

Benchmark and training curve results

Training curves show higher performance ceilings and faster convergence on GenEval, DPG-Bench, and GEdit.

Selected Samples in Class Conditional Image Generation Task

BibTeX


@misc{RPiAE,
  title={RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing},
  author={Yue Gong and Hongyu Li and Shanyuan Liu and Bo Cheng and Yuhang Ma and Liebucha Wu and Xiaoyu Wu and Manyuan Zhang and Dawei Leng and Yuhui Yin and Lijun Zhang},
  year={2026},
  eprint={2603.19206},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.19206},
}