ERNIE-Image Team, Baidu · 2026-04-15 release
ERNIE Image
A technical reference for Baidu's 8B single-stream Diffusion Transformer text-to-image model, assembled from the ERNIE Image Hugging Face model card, the public configuration files, and the reference inference code.
Skip the install. The Turbo checkpoint is live in your browser — no account, no download, no cost.
Open the live demo →
Abstract
ERNIE Image is an open-weight text-to-image generator whose distinguishing feature is not raw photorealism but legible in-image text: multi-line marketing copy, bilingual signage, comic dialogue, and labeled infographics, rendered in a single denoising pass rather than composited after the fact. The model is a single-stream Diffusion Transformer, eight billion parameters in the backbone, and it ships alongside a small causal-LM Prompt Enhancer that rewrites terse user inputs into the dense, structured descriptions the denoiser was trained to follow. Two checkpoints are published under Apache 2.0 — a quality-oriented SFT variant sampled in roughly fifty steps and a Turbo variant distilled with Distribution Matching Distillation and reinforcement learning down to eight.
This page is a technical reference. It is written to be read linearly by an engineer who already knows what latent diffusion is and now wants to know the specific numbers, specific config fields, and specific trade-offs of the ERNIE Image release. Sections are numbered, tables are verbatim from the public model card, and figures the model card does not disclose are explicitly flagged rather than inferred.
Not an official document. This page is an independent technical reference compiled from public sources. The canonical product page and free browser demo live at ernie-image.org. For anything authoritative about the ERNIE Image model itself, refer to Baidu's own communications and the Hugging Face model cards linked in §10.
Release & variants
The ERNIE Image release on April 15, 2026 packaged two checkpoints that share an
identical 8B ErnieImageTransformer2DModel backbone but differ in how they
are sampled at inference time. The split follows the now-familiar pattern of a
multi-step quality checkpoint plus a heavily distilled few-step sibling aimed at
latency-sensitive use.
ERNIE Image (SFT)
The quality-oriented checkpoint, trained with supervised fine-tuning on top of the base diffusion weights. Default inference uses roughly fifty denoising steps at a classifier-free guidance scale near four, and it gives the highest Overall score on GenEval among the publicly listed ERNIE Image variants. This is the recommended checkpoint when you are rendering a poster, a multi-panel comic page, or any output where text legibility has to survive one-shot sampling without post-hoc inpainting.
ERNIE Image Turbo
The same backbone, distilled down to eight sampling steps at guidance scale one using Distribution Matching Distillation plus a reinforcement-learning polish pass. The effective speedup relative to the 50-step SFT default is approximately six times at comparable perceptual quality, though the GenEval Overall is slightly lower (0.8667 without the Prompt Enhancer vs. 0.8856 for SFT). Turbo is what you want when you are driving an interactive gallery or serving real users; the in-browser demo linked further down this page runs Turbo against the public Hugging Face Space.
DiT backbone config
The diffusion backbone lives in the Hugging Face repository under the
transformer subfolder and exposes its architecture through a plain
config.json. The declared class is
ErnieImageTransformer2DModel — a single-stream Diffusion Transformer in the
lineage of Peebles & Xie's original DiT paper, applied to a latent-space text-to-image
setting rather than ImageNet class-conditional generation. The core dimensions are as follows:
| Field | Value |
|---|---|
| _class_name | ErnieImageTransformer2DModel |
| num_layers | 36 |
| hidden_dim | 4096 |
| ffn_hidden_size | 12288 |
| num_attention_heads | 32 |
| in_channels | 128 |
| out_channels | 128 |
| text_in_dim | 3072 |
| total parameters | ~8B |
A couple of specifics worth noting. The expansion ratio on the feed-forward block is
ffn_hidden_size / hidden_dim = 3.0, on the lower end of what modern DiT-style
denoisers publish, which aligns with ERNIE Image's positioning as compute-efficient
enough to run on a single 24 GB consumer GPU. The input and output channel counts are
both 128, which implies a 128-channel VAE latent space rather than the more common
16-channel arrangement of earlier latent-diffusion pipelines — a choice that gives the
transformer proportionally more signal per spatial token at the cost of a more
expensive VAE. And the text conditioning enters the backbone at dimension 3072, which matches the hidden size of the Prompt Enhancer discussed in §4, though the tie is indirect: the enhancer emits text, which is re-tokenized and re-encoded by the text encoder before it reaches the backbone.
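As a sanity check on the stated size, the quoted dimensions support a back-of-envelope parameter estimate. The sketch below is illustrative only: it counts just the attention and feed-forward matrices per block and ignores adaLN-style modulation, timestep and text embedders, and the patch projections, none of which the config spells out.

```python
# Back-of-envelope parameter estimate for the ERNIE Image DiT backbone,
# using only the config fields quoted above. This is a lower bound, not
# an exact count: modulation layers, embedders, and projections are omitted.
d, ffn, layers = 4096, 12288, 36

attn = 4 * d * d          # q, k, v, and output projections
ffn_block = 2 * d * ffn   # up- and down-projection (assumes an ungated FFN)
core = layers * (attn + ffn_block)

print(f"FFN expansion ratio: {ffn / d:.1f}")       # 3.0
print(f"Core block params:   {core / 1e9:.2f} B")  # ~6.04 B
# The gap to the stated ~8B total is plausibly conditioning machinery:
# adaLN-style modulation, the 3072-d text projection, and embeddings.
```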
Prompt Enhancer (Ministral3ForCausalLM)
The Prompt Enhancer is an independent component shipped inside the ERNIE Image
repository under the prompt_enhancer subfolder. Its job is narrow: take a
terse user description plus a target output resolution, and emit a rich, structured
visual description for the diffusion backbone to consume. It is a small decoder-only
language model constrained by an external chat template; it does not participate in
denoising and can be toggled off entirely.
| Field | Value |
|---|---|
| architectures | Ministral3ForCausalLM |
| hidden_size | 3072 |
| num_hidden_layers | 26 |
| vocab_size | 131,072 |
| max_position_embeddings | 262,144 |
| rope_scaling | YaRN |
| pipeline toggle | use_pe (True / False) |
The chat template in chat_template.jinja enforces a strict output contract:
the enhancer must return only the rewritten visual description, with no
preamble, no commentary, and no explanation. The resulting text is what the backbone
actually conditions on, and the pipeline exposes it through
output.revised_prompts so production systems can log the enhanced prompts
alongside their outputs for audit, reproducibility, and offline prompt-engineering
review.
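For illustration, the enhancer can plausibly be driven standalone. The repository id and prompt_enhancer subfolder come from the model card; loading through the transformers Auto* classes, the message shape, and the generation settings are assumptions rather than the shipped reference code.

```python
# Hypothetical standalone use of the Prompt Enhancer. Only the repo id and
# subfolder are documented; everything else here is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "baidu/ERNIE-Image"
tok = AutoTokenizer.from_pretrained(repo, subfolder="prompt_enhancer")
lm = AutoModelForCausalLM.from_pretrained(repo, subfolder="prompt_enhancer")

# chat_template.jinja constrains the reply to the rewritten description only.
messages = [{"role": "user", "content": "a cat cafe poster, 1024x1024"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
out = lm.generate(inputs, max_new_tokens=256)
revised = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
print(revised)  # the text the backbone would actually condition on
```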
One subtlety that matters for deployments: turning the Prompt Enhancer on is
not unambiguously better. On GenEval, enabling use_pe raises
Counting from 0.7781 to 0.8187 and Position from 0.8550 to 0.8625, but drops Overall
from 0.8856 to 0.8728 because richer enhancer output trades off some attribute-binding
precision for elaborative detail. The right move is to treat use_pe as a
per-scene switch rather than a global default — enabled for layout-heavy or
detail-dense scenes and disabled when the user's literal wording carries critical
binding constraints.
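A deployment acting on that advice might gate the toggle with a cheap prompt check before each call. The heuristic below is purely hypothetical; the cue lists are illustrative and not derived from any published evaluation.

```python
# Hypothetical per-scene gating of the Prompt Enhancer, following the
# GenEval trade-off above: enhance layout-heavy prompts, pass through
# prompts whose literal wording carries attribute-binding constraints.
def should_use_pe(prompt: str) -> bool:
    layout_cues = ("poster", "infographic", "comic", "signage", "layout")
    binding_cues = ("exactly", "left of", "right of", "only ", "no ")
    p = prompt.lower()
    if any(cue in p for cue in binding_cues):
        return False  # literal wording matters; keep it verbatim
    return any(cue in p for cue in layout_cues)

image = pipe(prompt, use_pe=should_use_pe(prompt))  # assumes a loaded pipeline
```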
The enhancer tokenizer also carries a batch of special tokens that hint at more
ambition than the ERNIE Image model card documents: [SYSTEM_PROMPT],
[IMG], [IMG_END], [THINK], and several
tool-calling markers. Not all of them participate in the current text-to-image
inference path — they are the residue of a more general multimodal tokenizer lineage —
but they are worth flagging for anyone building on top of the pipeline.
Text encoder — Mistral3Model with Pixtral vision config
The single most unusual thing in the ERNIE Image configuration is the text encoder.
It is not a conventional CLIP text tower, and it is not the Prompt Enhancer either. It
is a Mistral3Model whose config.json contains both a
text_config and a vision_config, and the vision half declares
model_type: pixtral.
| Field | Value |
|---|---|
| model_type | pixtral |
| patch_size | 14 |
| num_hidden_layers | 24 |
| hidden_size | 1024 |
| image_token_index | present |
Read literally, this means the module labeled "text encoder" in the ERNIE Image pipeline is structurally capable of ingesting image tokens alongside text tokens, even though the public model card focuses strictly on text-to-image generation and the Quickstart does not expose an image conditioning input. Whether this is residual capacity from an earlier training stage, scaffolding for a future image-to-image release, or a latent capability already active inside the encoder is not something the model card answers. For anyone reverse-engineering the inference path, it is the single most interesting loose thread in the ERNIE Image release.
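Anyone who wants to check the dual config against the live repository can do so without downloading weights by fetching the encoder's config alone. A minimal sketch, assuming the encoder lives in a text_encoder subfolder per Diffusers convention:

```python
# Inspect the text encoder's config.json without pulling any weights.
# The repo id comes from the references; the subfolder name is an assumption.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("baidu/ERNIE-Image", "text_encoder/config.json")
with open(path) as f:
    cfg = json.load(f)

print(cfg.get("architectures"))                        # expect Mistral3Model
print(cfg.get("vision_config", {}).get("model_type"))  # expect "pixtral"
print("image_token_index" in cfg)                      # expect True
```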
Inference pipeline
Assembled from the Diffusers Quickstart and the reference
infer_demo.py, the end-to-end ERNIE Image inference flow is a four-stage
cascade:
- Prompt intake. The caller hands the pipeline a string plus a resolution preset. If use_pe=True, the Prompt Enhancer rewrites the string into a richer visual description and the rewritten text becomes the conditioning input; otherwise the literal string is used.
- Tokenization and encoding. The text (or enhanced text) is tokenized and fed through the Mistral3 text encoder, which emits a 3072-dimensional conditioning tensor that matches the backbone's text_in_dim.
- Latent denoising. A noise latent is sampled and the ErnieImageTransformer2DModel backbone iteratively denoises it. The SFT variant runs ~50 steps at guidance 4.0; the Turbo variant runs 8 steps at guidance 1.0 (no classifier-free guidance is needed after DMD + RL distillation).
- VAE decoding. The final latent is passed through the VAE decoder to produce a pixel image. If the Prompt Enhancer was enabled, the pipeline also returns output.revised_prompts so the enhanced text can be persisted for audit.
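Put together, the cascade maps onto a short Diffusers-style call. The sketch below assumes the generic DiffusionPipeline loader resolves the custom pipeline class; use_pe and revised_prompts are documented above, while the remaining argument names follow standard Diffusers conventions and may differ from the shipped Quickstart.

```python
# End-to-end sketch of the four-stage flow, under the assumptions above.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "baidu/ERNIE-Image", torch_dtype=torch.bfloat16
).to("cuda")

out = pipe(
    prompt="A bilingual cafe poster: '开业大吉 / Grand Opening', warm tones",
    width=848, height=1264,     # one of the seven documented presets
    num_inference_steps=50,     # SFT default; Turbo is distilled to 8
    guidance_scale=4.0,         # SFT default; Turbo runs at 1.0 (no CFG)
    use_pe=True,                # stage 1: Prompt Enhancer rewrite
)
out.images[0].save("poster.png")
print(out.revised_prompts)      # stage 4 extra: persist for audit trails
```

For Turbo, swap the repo id for baidu/ERNIE-Image-Turbo, drop the step count to 8, and set the guidance scale to 1.0.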
The reference implementation documents seven resolution presets baked into the
pipeline: 1024×1024, 848×1264, 1264×848,
768×1376, 1376×768, 896×1200, and
1200×896. Together they span square, portrait, landscape, tall, and
widescreen aspect ratios without forcing post-hoc cropping.
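Expressed as a lookup table, with aspect ratios made explicit (the preset labels and width-first ordering are illustrative; only the pixel dimensions come from the reference implementation):

```python
# The seven documented resolution presets. Labels and ordering are
# assumptions; the dimensions themselves are from the reference code.
PRESETS = {
    "square":         (1024, 1024),
    "portrait":       (848, 1264),
    "landscape":      (1264, 848),
    "tall":           (768, 1376),
    "widescreen":     (1376, 768),
    "portrait_soft":  (896, 1200),
    "landscape_soft": (1200, 896),
}
for name, (w, h) in PRESETS.items():
    print(f"{name:15s} {w}x{h}  aspect {w / h:.3f}")
```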
Try it in-browser
The Turbo checkpoint is hosted as a public Hugging Face Space at
baidu/ERNIE-Image-Turbo, and the Space is embeddable via an iframe. The
block below runs against the upstream Space directly; queue times depend on Space
traffic.
Want the full product instead of a raw Space? The live demo wraps the same Turbo checkpoint with presets, bilingual prompt helpers, and a gallery.
Open the live demo →
Benchmark tables
All four tables below are verbatim transcriptions of the public ERNIE Image and ERNIE
Image Turbo Hugging Face model cards, evaluated on April 15, 2026. Scores are in the
[0, 1] range; higher is better. Rows corresponding to ERNIE Image
checkpoints are highlighted.
On ranking stability: automatic text-to-image benchmarks drift as evaluator scripts age and as new models saturate the scoring heuristics. The GenEval 2 work flags GenEval drift explicitly. Read every number on this page as a point-in-time snapshot from 2026-04-15, not as a durable ranking.
GenEval — compositional text-to-image
| Model | Single object | Two objects | Counting | Colors | Position | Attribute binding | Overall |
|---|---|---|---|---|---|---|---|
| ERNIE Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE Image Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE Image Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
OneIG-EN and OneIG-ZH — bilingual multi-axis
| Track / model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| EN · ERNIE Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| EN · ERNIE Image Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| EN · ERNIE Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| ZH · ERNIE Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| ZH · ERNIE Image Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| ZH · ERNIE Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
The EN–ZH Overall gap on the top ERNIE Image row is a mere
0.5750 − 0.5543 = 0.0207, which is small enough to treat English and
Chinese as a single pipeline rather than as two language tracks stitched together with
a translation layer. This is the most concrete quantitative support for the model
card's "native bilingual" positioning.
LongTextBench — in-image text rendering
| Model | EN | ZH | Avg |
|---|---|---|---|
| ERNIE Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| ERNIE Image Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| ERNIE Image Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
LongTextBench measures how accurately long strings of text are rendered inside a generated image, with separate English and Chinese subsets. The 0.9733 average on the top ERNIE Image row is the single strongest benchmark result in this release, and it is also the only benchmark in this suite that most competing open text-to-image models do not currently report — which makes direct head-to-head impossible, but also makes ERNIE Image the de facto reference point for this capability.
Caveats and unknowns
A technical reference is only as trustworthy as its list of things it does not know. Items the ERNIE Image model card deliberately does not disclose, and which downstream deployments should treat as open questions rather than infer from the model's behavior:
- Training data. Source datasets, licensing provenance, scale (number of image-text pairs), language mix, and synthetic-data fraction are not published. Compliance-sensitive deployments need to design their own filtering, dedup, and audit layer on top of the public weights.
- Training objective details. Whether the diffusion objective is ε-prediction, v-prediction, or something else, what noise schedule was used, and which loss weighting was applied — none of these are stated in the public model card, though they can be partially reconstructed from the Diffusers pipeline code.
- Alignment and reward modeling. ERNIE Image Turbo is described as "DMD + RL distilled," but the reward model, the reward signal, and the RL algorithm are not documented.
- Safety evaluation. Red-team coverage, refusal behavior on disallowed content, and watermarking / provenance are not part of the public release. Operators deploying into regulated markets need to layer their own content governance.
- Multimodal capacity of the text encoder. See §5: the encoder's vision_config declares a pixtral vision tower, but no image input path is exposed in the public Quickstart. Whether this is vestigial, dormant, or accessible via an undocumented API is not something the model card answers.
References
- ERNIE Image model card, Hugging Face — huggingface.co/baidu/ERNIE-Image.
- ERNIE Image Turbo model card, Hugging Face — huggingface.co/baidu/ERNIE-Image-Turbo.
- ERNIE Image Turbo live Space (iframe source) — huggingface.co/spaces/baidu/ERNIE-Image-Turbo.
- ERNIE Image reference code — github.com/baidu/ernie-image.
- Peebles & Xie, Scalable Diffusion Models with Transformers — arXiv:2212.09748.
- Ghosh et al., GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment — arXiv:2310.11513.
- OneIG-Bench — arXiv:2506.07977.
- Yin et al., Distribution Matching Distillation for Diffusion Models — arXiv:2311.18828.
- Canonical product page and free browser demo — ernie-image.org.