Google releases DiffusionGemma, an open-source text diffusion model that generates over 1,000 tokens per second in parallel blocks

Quantized GGUF versions run locally using llama.cpp

Original post

Quentin Berthet#1542

Google Gemma@googlegemma

Meet DiffusionGemma!

An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.

Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

9:06 AM · Jun 10, 2026 · 880.8K Views

Sentiment

Many users praised the open DiffusionGemma model for enabling 4x faster local inference and accessible research advances, while some directed general frustration at AI companies and unrelated stability issues.

Positive: 90.9% · Negative: 9.1%

454 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 259.5K · Likes: 3.1K · Replies: 167

Sundar Pichai@sundarpichai

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It’s a racehorse 🏇achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token (word-by-word) output!

Bookmarks: 923

Google@Google

Meet DiffusionGemma ⚡ Our latest experimental open model (Apache 2.0) that generates text up to 4x faster.

Instead of predicting and typing just one word at a time like most language models, it drafts and refines entire blocks of text simultaneously.

Here’s how it works 🧵 ↓

Retweets: 734

Unsloth AI@UnslothAI

Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM.

It supports high-speed text generation, thinking, image, video and 256K context.

Run and train via Unsloth Studio.

GGUF: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

Guide: https://unsloth.ai/docs/models/diffusiongemma

Google DeepMind@GoogleDeepMind

DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs.

Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time.

Demis Hassabis@demishassabis

Awesome to see this innovation in text diffusion. DiffusionGemma is lightning fast, 4x faster than other Gemma 4 models! Congrats to @bodonoghue85 and the team who worked so hard on this - excited to see what people build with it!

Volodymyr Kuleshov 🇺🇦@volokuleshov

Congratulations to Google on open-sourcing Gemma Diffusion!

I want to give a shout-out to a group of really talented Cornell students who developed in the lab a lot of the new ideas that we see in this model:

@mariannearr -- Block diffusion is what enables Gemma Diffusion to generate arbitrary length sequences and support KV caching.

@mariannearr @SchiffYair -- Efficient encoder-decoder diffusion (E2D2) extends block diffusion and is part of what makes Gemma really fast, speeding up inference by running a smaller decoder model.

@SchiffYair @ssahoo_ @Guanghan__Wang -- Uniform diffusion LMs (UDLMs) are the family of discrete diffusion models that underlie Gemma and define its noise process and training objective. This work builds on our earlier simplified losses in MDLMs.

@ssahoo_ -- Uniform diffusion supports built-in error correction and is especially effective with distilled fast samplers like the ones introduced in Duo.

This is a great overview of Gemma Diffusion: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-diffusiongemma

Check out the students' papers below:

Omar Sanseviero@osanseviero

Introducing DiffusionGemma, our first exploration with open diffusion text generation models

🔥 Generate blocks of text at a time

🤏 26B MoE built on top of Gemma 4

⚡️ Up to 4x faster in popular consumer GPUs

🤗 Apache 2.0

Excited to see what the community builds with it!

Daniel Han@danielhanchen

We made DiffusionGemma run via llama.cpp locally! It works well with Unsloth GGUFs and you can run it in realtime visualization mode or normal chat CLI mode!

See our docs https://unsloth.ai/docs/models/diffusiongemma on how to set it up!

Philipp Schmid@_philschmid

Gemma goes diffusion! DiffusionGemma with up to 1000+ tokens per second! 🌬️

- Built on Gemma 4 as a 26B MoE model.
- 3.8B parameters during inference.
- Generates text in 256-token blocks in parallel.
- Fits within 18 GB VRAM limits when quantized.
- Apache 2.0
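
As a rough sanity check on the 18 GB figure, the weight footprint can be estimated from the parameter count and a quantization width. This is a minimal sketch: the bits-per-weight values are illustrative assumptions rather than the properties of any published GGUF file, and the estimate ignores the KV cache, activations, and runtime overhead. It also assumes that all 26B parameters stay resident in memory, even though only about 3.8B are active per token.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # from the post: 26B MoE

# Assumed quantization widths, for illustration only.
for bits in (16, 8, 4.5):
    size = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>4} bits/weight -> {size:.1f} GB of weights")
```

Under these assumptions, only a width of roughly 4 to 5 bits per weight leaves headroom inside an 18 GB budget, which is consistent with the posts saying the model fits "when quantized".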

Prince Canuma@Prince_Canuma

Massive congrats to @GoogleDeepMind on DiffusionGemma! 🎉

We collaborated closely with the team to Day-0 MLX-VLM — native diffusion decoding on Apple Silicon, release dropping later today (~3-4h), meanwhile you can install from source. ⚡🍎

This is genuinely different beast — instead of token-by-token, it generates 256-token blocks in parallel with bi-directional attention and iteratively self-corrects. 26B MoE, only 3.8B active, fits in 18GB when quantized.
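
The idea of refining a whole block in parallel, rather than appending one token at a time, can be illustrated with a toy loop. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in for a model forward pass, the target string plays the role of what a trained model would predict, and the accuracy schedule is made up. The only point is that every position can change on every pass, so the number of passes does not grow with the block length.

```python
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz "


def toy_denoiser(block, target, rng, accuracy):
    """Hypothetical stand-in for one forward pass over a whole block.

    Every position is re-predicted in the same pass: with probability
    `accuracy` it receives the intended token, otherwise it keeps
    whatever token it currently holds.
    """
    return [want if rng.random() < accuracy else have
            for have, want in zip(block, target)]


def refine_block(target, steps=8, seed=0):
    """Refine a noisy block toward `target` in `steps` parallel passes."""
    rng = random.Random(seed)
    # Start from uniform noise: each position holds a random vocabulary token.
    block = [rng.choice(VOCAB) for _ in target]
    history = ["".join(block)]
    for step in range(1, steps + 1):
        accuracy = step / steps  # made-up schedule; reaches 1.0 on the last pass
        block = toy_denoiser(block, target, rng, accuracy)
        history.append("".join(block))
    return history


if __name__ == "__main__":
    for i, text in enumerate(refine_block("blocks are refined in parallel")):
        print(f"pass {i}: {text}")
```

A token-by-token decoder would need one pass per character of the target here, while this loop always uses `steps` passes regardless of length.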

Google for Developers@googledevs

Want 4x faster local inference on dedicated GPUs for your interactive apps? DiffusionGemma is an experimental, open 26B MoE model that generates entire blocks of text simultaneously instead of token-by-token.

By shifting the local decoding bottleneck from memory-bandwidth to compute, it hits speeds over 700 tokens/sec on a single NVIDIA RTX 5090 GPU. This diffusion unlocks unique local workflows like real-time inline editing, code infilling, and instant self-correction.

📥 Download the Apache 2.0 weights on @HuggingFace: https://goo.gle/4xqzKTA

📖 Read the full technical announcement on the blog: https://goo.gle/4ursgwI

vLLM@vllm_project

Congrats to @GoogleDeepMind on DiffusionGemma 🎉 A 26B diffusion language model on the Gemma4 backbone, and the first dLLM natively supported in vLLM.

It denoises 256-token blocks in parallel instead of generating one token at a time: 1200+ output tok/s at batch size 1 on a single H200 (FP8).

Built on model runner v2's ModelState plus the existing speculative decoding path, with minimal scheduler or runner changes. FP8 and NVFP4 checkpoints are on the @RedHat_AI hub. Thanks to the @GoogleDeepMind, @RedHat_AI, and @NVIDIAAI teams!

🔗 http://vllm.ai/blog/2026-06-10-diffusion-gemma

merve@mervenoyann

DiffusionGemma is out 🔥

it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) 💨

also great on coding, generate and iterate on any code from 3D generation to front-end ⤵️

Unsloth AI@UnslothAI

@googlegemma Google Deepmind once again delivering when it comes to open-source! 🙏🥰

You can run DiffusionGemma locally on 18GB RAM via our GGUFs: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

Google AI Developers@googleaidevs

DiffusionGemma, our experimental open model released under an Apache 2.0 license, explores text diffusion, an exceptionally fast approach to text generation.

Here’s how DiffusionGemma accelerates development:

+ Faster token output: By shifting the bottleneck from memory bandwidth to raw compute, the model generates up to 4x faster token output on dedicated GPUs

+ Accessible hardware footprint: Activates just 3.8B parameters during inference, fitting comfortably within 24GB-VRAM high-end consumer GPUs when quantized

+ Novel workflows: Parallel token generation enables self-correction, making it ideal for code infilling, in-line editing, and non-linear structures

DiffusionGemma prioritizes speed over raw quality and accelerates best on compute-bound hardware (like @NVIDIAAI GPUs). Standard @GoogleGemma 4 remains recommended for production quality and memory-bound devices.

Google@Google

We're releasing DiffusionGemma as an open model under an Apache 2.0 license for anyone to experiment with.

Download the model weights on @huggingface, and learn more about DiffusionGemma → http://goo.gle/3Sy0Is7

Google@Google

Because it generates everything at once, DiffusionGemma unlocks new patterns of model behavior.

⚡ Fast: Generates up to 1,000+ tokens a second for up to 4x faster text generation.

💻 Lightweight: Runs smoothly right on 18GB consumer graphics cards.

🧠 Smart editing: Since it processes larger amounts of information at once, it can easily fill in blanks, format code, and fix its own errors in real time.

Google Gemma@googlegemma

⚡ Blazing Fast: By shifting the decode bottleneck from memory-bandwidth to compute, DiffusionGemma delivers up to a 4x speedup on standard accelerators. (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on NVIDIA GeForce RTX 5090!)
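
The "memory-bandwidth to compute" claim can be made concrete with a back-of-the-envelope bound. In token-by-token decoding, each new token requires reading the active weights from memory once, so throughput is capped at roughly bandwidth divided by active weight bytes. If one pass instead advances many tokens, the same read is shared across them. The sketch below uses hypothetical hardware and sampler numbers (1 TB/s bandwidth, 1 byte per weight, 16 denoising passes per 256-token block); none of these come from the posts, and it ignores that a block of tokens may activate more experts than a single token does.

```python
def memory_bound_tokens_per_sec(bandwidth_bytes_per_s, active_params,
                                bytes_per_weight, tokens_per_pass=1.0):
    """Upper bound on tokens/s when weight reads are the only cost."""
    bytes_per_pass = active_params * bytes_per_weight
    passes_per_sec = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_sec * tokens_per_pass


BANDWIDTH = 1e12        # hypothetical: 1 TB/s of memory bandwidth
ACTIVE_PARAMS = 3.8e9   # from the posts: 3.8B active parameters
BYTES_PER_WEIGHT = 1.0  # assumption: 8-bit weights

BLOCK_SIZE = 256        # from the posts: 256-token blocks
PASSES_PER_BLOCK = 16   # assumption: denoising passes per block

sequential = memory_bound_tokens_per_sec(BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_WEIGHT)
block_wise = memory_bound_tokens_per_sec(
    BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_WEIGHT,
    tokens_per_pass=BLOCK_SIZE / PASSES_PER_BLOCK,
)

print(f"token-by-token ceiling: {sequential:.0f} tok/s")
print(f"block-wise ceiling:     {block_wise:.0f} tok/s")
```

With these made-up inputs the memory-side ceiling rises from about 263 tokens per second to several thousand, so in practice arithmetic throughput, not memory bandwidth, would become the limit. That is one way to read the posts' statement that the model accelerates best on compute-rich GPUs.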

Google@Google

Most large language models predict answers by guessing the single best word to say next, then the next, and so on... 🔎

It's highly capable, but not necessarily fast. The model waits to finish one word before it can think about the next.

DiffusionGemma skips the wait.

It uses "diffusion" to generate text by refining noise step by step — drafting and error-correcting whole blocks simultaneously. This makes it incredibly fast, and helpful for editing complex math and code.

Google Gemma@googlegemma

Read more in our blog: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/

Digg Deeper

Kevin Rose asked

who would use this and why?

AI researchers, local developers, and power users with 18-24GB GPUs would run DiffusionGemma.

It delivers 4x faster block-wise generation (1,000+ tokens/sec) with built-in self-correction and markdown/code editing, all locally under an Apache 2.0 license on Hugging Face.

The trade-off is modestly lower benchmark scores than standard Gemma 4 on tasks like coding and reasoning.
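
To get a feel for what the headline numbers mean for an interactive app, the short calculation below compares wall-clock time for one response. It is only a sketch: the 250 tokens-per-second baseline is implied by combining "1,000+ tokens/sec" with "up to 4x" and is not a measured figure, both rates are peak claims, and the 2,000-token response length is an arbitrary example.

```python
import math


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit `num_tokens` at a steady throughput."""
    return num_tokens / tokens_per_sec


def blocks_needed(num_tokens: int, block_size: int = 256) -> int:
    """Number of fixed-size blocks required to cover `num_tokens`."""
    return math.ceil(num_tokens / block_size)


RESPONSE_TOKENS = 2000   # arbitrary example length
DIFFUSION_RATE = 1000.0  # from the posts: 1,000+ tokens/sec (peak claim)
BASELINE_RATE = DIFFUSION_RATE / 4  # implied by "up to 4x", not measured

print(f"block diffusion: {generation_seconds(RESPONSE_TOKENS, DIFFUSION_RATE):.1f} s "
      f"in {blocks_needed(RESPONSE_TOKENS)} blocks")
print(f"token-by-token:  {generation_seconds(RESPONSE_TOKENS, BASELINE_RATE):.1f} s")
```

At these rates a 2,000-token answer takes about 2 seconds instead of about 8, which is the kind of gap that matters for inline editing and code infilling.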