ERNIE-Image Team, Baidu · 2026-04-15 release
ERNIE Image
A technical reference for Baidu's 8B single-stream Diffusion Transformer text-to-image model, assembled from the ERNIE Image Hugging Face model card, the public configuration files, and the reference inference code.
Skip the install. The Turbo checkpoint is live in your browser — no account, no download, no cost.
Open the live demo →
Abstract
ERNIE Image is an open-weight text-to-image generator whose distinguishing feature is not raw photorealism but legible in-image text: multi-line marketing copy, bilingual signage, comic dialogue, and labeled infographics, rendered in a single denoising pass rather than composited after the fact. The model is a single-stream Diffusion Transformer, eight billion parameters in the backbone, and it ships alongside a small causal-LM Prompt Enhancer that rewrites terse user inputs into the dense, structured descriptions the denoiser was trained to follow. Two checkpoints are published under Apache 2.0 — a quality-oriented SFT variant sampled in roughly fifty steps and a Turbo variant distilled with Distribution Matching Distillation and reinforcement learning down to eight.
This page is a technical reference. It is written to be read linearly by an engineer who already knows what latent diffusion is and now wants to know the specific numbers, specific config fields, and specific trade-offs of the ERNIE Image release. Sections are numbered, tables are verbatim from the public model card, and figures the model card does not disclose are explicitly flagged rather than inferred.
Not an official document. This page is an independent technical reference compiled from public sources. The canonical product page and free browser demo live at ernie-image.org. For anything authoritative about the ERNIE Image model itself, refer to Baidu's own communications and the Hugging Face model cards linked in §10.
Release & variants
The ERNIE Image release on April 15, 2026 packaged two checkpoints that share an
identical 8B ErnieImageTransformer2DModel backbone but differ in how they
are sampled at inference time. The split follows the now-familiar pattern of a
multi-step quality checkpoint plus a heavily distilled few-step sibling aimed at
latency-sensitive use.
ERNIE Image (SFT)
The quality-oriented checkpoint, trained with supervised fine-tuning on top of the base diffusion weights. Default inference uses roughly fifty denoising steps at a classifier-free guidance scale near four, and it gives the highest Overall score on GenEval among the publicly listed ERNIE Image variants. This is the recommended checkpoint when you are rendering a poster, a multi-panel comic page, or any output where text legibility has to survive one-shot sampling without post-hoc inpainting.
ERNIE Image Turbo
The same backbone, distilled down to eight sampling steps at guidance scale one using Distribution Matching Distillation plus a reinforcement-learning polish pass. The effective speedup relative to the 50-step SFT default is approximately six times at comparable perceptual quality, though the GenEval Overall is slightly lower (0.8667 without the Prompt Enhancer vs. 0.8856 for SFT). Turbo is what you want when you are driving an interactive gallery or serving real users; the in-browser demo linked further down this page runs Turbo against the public Hugging Face Space.
DiT backbone config
The diffusion backbone lives in the Hugging Face repository under the
transformer subfolder and exposes its architecture through a plain
config.json. The declared class is
ErnieImageTransformer2DModel — a single-stream Diffusion Transformer in the
lineage of Peebles & Xie's original DiT paper, applied to a latent-space text-to-image
setting rather than ImageNet class-conditional generation. The core dimensions are as follows:
| Field | Value |
|---|---|
| _class_name | ErnieImageTransformer2DModel |
| num_layers | 36 |
| hidden_dim | 4096 |
| ffn_hidden_size | 12288 |
| num_attention_heads | 32 |
| in_channels | 128 |
| out_channels | 128 |
| text_in_dim | 3072 |
| total parameters | ~8B |
A couple of specifics worth noting. The expansion ratio on the feed-forward block is
ffn_hidden_size / hidden_dim = 3.0, on the lower end of what modern DiT-style
denoisers publish, which aligns with ERNIE Image's positioning as compute-efficient
enough to run on a single 24 GB consumer GPU. The input and output channel counts are
both 128, which implies a 128-channel VAE latent space rather than the more common
16-channel arrangement of earlier latent-diffusion pipelines — a choice that gives the
transformer proportionally more signal per spatial token at the cost of a more
expensive VAE. And the text conditioning enters the backbone at dimension 3072, which matches the hidden size of the Prompt Enhancer discussed in §4, though the tie is indirect: the enhancer emits text, which is re-tokenized and re-encoded by the text encoder before it reaches the backbone.
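As a sanity check on the stated size, the quoted dimensions support a back-of-envelope parameter estimate. The sketch below is illustrative only: it counts just the attention and feed-forward matrices per block and ignores adaLN-style modulation, timestep and text embedders, and the patch projections, none of which the config spells out.

```python
# Back-of-envelope parameter estimate for the ERNIE Image DiT backbone,
# using only the config fields quoted above. This is a lower bound, not
# an exact count: modulation layers, embedders, and projections are omitted.
d, ffn, layers = 4096, 12288, 36

attn = 4 * d * d          # q, k, v, and output projections
ffn_block = 2 * d * ffn   # up- and down-projection (assumes an ungated FFN)
core = layers * (attn + ffn_block)

print(f"FFN expansion ratio: {ffn / d:.1f}")       # 3.0
print(f"Core block params:   {core / 1e9:.2f} B")  # ~6.04 B
# The gap to the stated ~8B total is plausibly conditioning machinery:
# adaLN-style modulation, the 3072-d text projection, and embeddings.
```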
Prompt Enhancer (Ministral3ForCausalLM)
The Prompt Enhancer is an independent component shipped inside the ERNIE Image
repository under the prompt_enhancer subfolder. Its job is narrow: take a
terse user description plus a target output resolution, and emit a rich, structured
visual description for the diffusion backbone to consume. It is a small decoder-only
language model constrained by an external chat template; it does not participate in
denoising and can be toggled off entirely.
| Field | Value |
|---|---|
| architectures | Ministral3ForCausalLM |
| hidden_size | 3072 |
| num_hidden_layers | 26 |
| vocab_size | 131,072 |
| max_position_embeddings | 262,144 |
| rope_scaling | YaRN |
| pipeline toggle | use_pe (True / False) |
The chat template in chat_template.jinja enforces a strict output contract:
the enhancer must return only the rewritten visual description, with no
preamble, no commentary, and no explanation. The resulting text is what the backbone
actually conditions on, and the pipeline exposes it through
output.revised_prompts so production systems can log the enhanced prompts
alongside their outputs for audit, reproducibility, and offline prompt-engineering
review.
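For illustration, the enhancer can plausibly be driven standalone. The repository id and prompt_enhancer subfolder come from the model card; loading through the transformers Auto* classes, the message shape, and the generation settings are assumptions rather than the shipped reference code.

```python
# Hypothetical standalone use of the Prompt Enhancer. Only the repo id and
# subfolder are documented; everything else here is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "baidu/ERNIE-Image"
tok = AutoTokenizer.from_pretrained(repo, subfolder="prompt_enhancer")
lm = AutoModelForCausalLM.from_pretrained(repo, subfolder="prompt_enhancer")

# chat_template.jinja constrains the reply to the rewritten description only.
messages = [{"role": "user", "content": "a cat cafe poster, 1024x1024"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
out = lm.generate(inputs, max_new_tokens=256)
revised = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
print(revised)  # the text the backbone would actually condition on
```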
One subtlety that matters for deployments: turning the Prompt Enhancer on is
not unambiguously better. On GenEval, enabling use_pe raises
Counting from 0.7781 to 0.8187 and Position from 0.8550 to 0.8625, but drops Overall
from 0.8856 to 0.8728 because richer enhancer output trades off some attribute-binding
precision for elaborative detail. The right move is to treat use_pe as a
per-scene switch rather than a global default — enabled for layout-heavy or
detail-dense scenes and disabled when the user's literal wording carries critical
binding constraints.
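A deployment acting on that advice might gate the toggle with a cheap prompt check before each call. The heuristic below is purely hypothetical; the cue lists are illustrative and not derived from any published evaluation.

```python
# Hypothetical per-scene gating of the Prompt Enhancer, following the
# GenEval trade-off above: enhance layout-heavy prompts, pass through
# prompts whose literal wording carries attribute-binding constraints.
def should_use_pe(prompt: str) -> bool:
    layout_cues = ("poster", "infographic", "comic", "signage", "layout")
    binding_cues = ("exactly", "left of", "right of", "only ", "no ")
    p = prompt.lower()
    if any(cue in p for cue in binding_cues):
        return False  # literal wording matters; keep it verbatim
    return any(cue in p for cue in layout_cues)

image = pipe(prompt, use_pe=should_use_pe(prompt))  # assumes a loaded pipeline
```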
The enhancer tokenizer also carries a batch of special tokens that hint at more
ambition than the ERNIE Image model card documents: [SYSTEM_PROMPT],
[IMG], [IMG_END], [THINK], and several
tool-calling markers. Not all of them participate in the current text-to-image
inference path — they are the residue of a more general multimodal tokenizer lineage —
but they are worth flagging for anyone building on top of the pipeline.
Text encoder — Mistral3Model with Pixtral vision config
The single most unusual thing in the ERNIE Image configuration is the text encoder.
It is not a conventional CLIP text tower, and it is not the Prompt Enhancer either. It
is a Mistral3Model whose config.json contains both a
text_config and a vision_config, and the vision half declares
model_type: pixtral.
| Field | Value |
|---|---|
| model_type | pixtral |
| patch_size | 14 |
| num_hidden_layers | 24 |
| hidden_size | 1024 |
| image_token_index | present |
Read literally, this means the module labeled "text encoder" in the ERNIE Image pipeline is structurally capable of ingesting image tokens alongside text tokens, even though the public model card focuses strictly on text-to-image generation and the Quickstart does not expose an image conditioning input. Whether this is residual capacity from an earlier training stage, scaffolding for a future image-to-image release, or a latent capability already active inside the encoder is not something the model card answers. For anyone reverse-engineering the inference path, it is the single most interesting loose thread in the ERNIE Image release.
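Anyone who wants to check the dual config against the live repository can do so without downloading weights by fetching the encoder's config alone. A minimal sketch, assuming the encoder lives in a text_encoder subfolder per Diffusers convention:

```python
# Inspect the text encoder's config.json without pulling any weights.
# The repo id comes from the references; the subfolder name is an assumption.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("baidu/ERNIE-Image", "text_encoder/config.json")
with open(path) as f:
    cfg = json.load(f)

print(cfg.get("architectures"))                        # expect Mistral3Model
print(cfg.get("vision_config", {}).get("model_type"))  # expect "pixtral"
print("image_token_index" in cfg)                      # expect True
```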
Inference pipeline
Assembled from the Diffusers Quickstart and the reference
infer_demo.py, the end-to-end ERNIE Image inference flow is a four-stage
cascade:
- Prompt intake. The caller hands the pipeline a string plus a resolution preset. If use_pe=True, the Prompt Enhancer rewrites the string into a richer visual description and the rewritten text becomes the conditioning input; otherwise the literal string is used.
- Tokenization and encoding. The text (or enhanced text) is tokenized and fed through the Mistral3 text encoder, which emits a 3072-dimensional conditioning tensor that matches the backbone's text_in_dim.
- Latent denoising. A noise latent is sampled and the ErnieImageTransformer2DModel backbone iteratively denoises it. The SFT variant runs ~50 steps at guidance 4.0; the Turbo variant runs 8 steps at guidance 1.0 (no classifier-free guidance is needed after DMD + RL distillation).
- VAE decoding. The final latent is passed through the VAE decoder to produce a pixel image. If the Prompt Enhancer was enabled, the pipeline also returns output.revised_prompts so the enhanced text can be persisted for audit.
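Put together, the cascade maps onto a short Diffusers-style call. The sketch below assumes the generic DiffusionPipeline loader resolves the custom pipeline class; use_pe and revised_prompts are documented above, while the remaining argument names follow standard Diffusers conventions and may differ from the shipped Quickstart.

```python
# End-to-end sketch of the four-stage flow, under the assumptions above.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "baidu/ERNIE-Image", torch_dtype=torch.bfloat16
).to("cuda")

out = pipe(
    prompt="A bilingual cafe poster: '开业大吉 / Grand Opening', warm tones",
    width=848, height=1264,     # one of the seven documented presets
    num_inference_steps=50,     # SFT default; Turbo is distilled to 8
    guidance_scale=4.0,         # SFT default; Turbo runs at 1.0 (no CFG)
    use_pe=True,                # stage 1: Prompt Enhancer rewrite
)
out.images[0].save("poster.png")
print(out.revised_prompts)      # stage 4 extra: persist for audit trails
```

For Turbo, swap the repo id for baidu/ERNIE-Image-Turbo, drop the step count to 8, and set the guidance scale to 1.0.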
The reference implementation documents seven resolution presets baked into the
pipeline: 1024×1024, 848×1264, 1264×848,
768×1376, 1376×768, 896×1200, and
1200×896. Together they span square, portrait, landscape, tall, and
widescreen aspect ratios without forcing post-hoc cropping.
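Expressed as a lookup table, with aspect ratios made explicit (the preset labels and width-first ordering are illustrative; only the pixel dimensions come from the reference implementation):

```python
# The seven documented resolution presets. Labels and ordering are
# assumptions; the dimensions themselves are from the reference code.
PRESETS = {
    "square":         (1024, 1024),
    "portrait":       (848, 1264),
    "landscape":      (1264, 848),
    "tall":           (768, 1376),
    "widescreen":     (1376, 768),
    "portrait_soft":  (896, 1200),
    "landscape_soft": (1200, 896),
}
for name, (w, h) in PRESETS.items():
    print(f"{name:15s} {w}x{h}  aspect {w / h:.3f}")
```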
Try it in-browser
The Turbo checkpoint is hosted as a public Hugging Face Space at
baidu/ERNIE-Image-Turbo, and the Space is embeddable via an iframe. The
block below runs against the upstream Space directly; queue times depend on Space
traffic.
Want the full product instead of a raw Space? The live demo wraps the same Turbo checkpoint with presets, bilingual prompt helpers, and a gallery.
Open the live demo →
Benchmark tables
All four tables below are verbatim transcriptions of the public ERNIE Image and ERNIE
Image Turbo Hugging Face model cards, evaluated on April 15, 2026. Scores are in the
[0, 1] range; higher is better. Rows corresponding to ERNIE Image
checkpoints are highlighted.
On ranking stability: automatic text-to-image benchmarks drift as evaluator scripts age and as new models saturate the scoring heuristics. The GenEval 2 work flags GenEval drift explicitly. Read every number on this page as a point-in-time snapshot from 2026-04-15, not as a durable ranking.
GenEval — compositional text-to-image
| Model | Single object | Two objects | Counting | Colors | Position | Attribute binding | Overall |
|---|---|---|---|---|---|---|---|
| ERNIE Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE Image Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE Image Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
OneIG-EN and OneIG-ZH — bilingual multi-axis
| Track / model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| EN · ERNIE Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| EN · ERNIE Image Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| EN · ERNIE Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| ZH · ERNIE Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| ZH · ERNIE Image Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| ZH · ERNIE Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
The EN–ZH Overall gap on the top ERNIE Image row is a mere
0.5750 − 0.5543 = 0.0207, which is small enough to treat English and
Chinese as a single pipeline rather than as two language tracks stitched together with
a translation layer. This is the most concrete quantitative support for the model
card's "native bilingual" positioning.
LongTextBench — in-image text rendering
| Model | EN | ZH | Avg |
|---|---|---|---|
| ERNIE Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| ERNIE Image Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| ERNIE Image Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
LongTextBench measures how accurately long strings of text are rendered inside a generated image, with separate English and Chinese subsets. The 0.9733 average on the top ERNIE Image row is the single strongest benchmark result in this release, and it is also the only benchmark in this suite that most competing open text-to-image models do not currently report — which makes direct head-to-head impossible, but also makes ERNIE Image the de facto reference point for this capability.
Caveats and unknowns
A technical reference is only as trustworthy as its list of things it does not know. Items the ERNIE Image model card deliberately does not disclose, and which downstream deployments should treat as open questions rather than infer from the model's behavior:
- Training data. Source datasets, licensing provenance, scale (number of image-text pairs), language mix, and synthetic-data fraction are not published. Compliance-sensitive deployments need to design their own filtering, dedup, and audit layer on top of the public weights.
- Training objective details. Whether the diffusion objective is ε-prediction, v-prediction, or something else, what noise schedule was used, and which loss weighting was applied — none of these are stated in the public model card, though they can be partially reconstructed from the Diffusers pipeline code.
- Alignment and reward modeling. ERNIE Image Turbo is described as "DMD + RL distilled," but the reward model, the reward signal, and the RL algorithm are not documented.
- Safety evaluation. Red-team coverage, refusal behavior on disallowed content, and watermarking / provenance are not part of the public release. Operators deploying into regulated markets need to layer their own content governance.
- Multimodal capacity of the text encoder. See §5: the encoder's vision_config declares a pixtral vision tower, but no image input path is exposed in the public Quickstart. Whether this is vestigial, dormant, or accessible via an undocumented API is not something the model card answers.
References
- ERNIE Image model card, Hugging Face — huggingface.co/baidu/ERNIE-Image.
- ERNIE Image Turbo model card, Hugging Face — huggingface.co/baidu/ERNIE-Image-Turbo.
- ERNIE Image Turbo live Space (iframe source) — huggingface.co/spaces/baidu/ERNIE-Image-Turbo.
- ERNIE Image reference code — github.com/baidu/ernie-image.
- Peebles & Xie, Scalable Diffusion Models with Transformers — arXiv:2212.09748.
- Ghosh et al., GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment — arXiv:2310.11513.
- OneIG-Bench — arXiv:2506.07977.
- Yin et al., Distribution Matching Distillation for Diffusion Models — arXiv:2311.18828.
- Canonical product page and free browser demo — ernie-image.org.