SynoodoS – The decentralized AI community building the future.

Instructions to use stabilityai/image-text-to-image-model-v2 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use stabilityai/image-text-to-image-model-v2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="stabilityai/image-text-to-image-model-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("stabilityai/image-text-to-image-model-v2")
model = AutoModelForCausalLM.from_pretrained("stabilityai/image-text-to-image-model-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
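# Render the chat messages with the model's chat template and tokenize them
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate up to 40 new tokens, then decode only the tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))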

Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

image-text-to-image-model-v2

👋 Join our WeChat or Discord community.

📖 Check out the image-text-to-image-model-v2 blog and image-text-to-image-model-v2 Technical report.

📍 Use image-text-to-image-model-v2 API services on Z.ai API Platform.

🔜 Try image-text-to-image-model-v2 here.

Introduction

We're introducing image-text-to-image-model-v2, our latest flagship model for long-horizon tasks. It substantially improves on its predecessor's long-horizon task capability and, for the first time, sustains that capability across a 1M-token context.

Solid 1M Context: A 1M-token context window that stays stable throughout long-horizon work.
Advanced Coding with Flexible Effort: Stronger coding capabilities, with multiple thinking-effort levels to trade off performance against latency.
Improved Architecture: We propose IndexShare, which shares one indexer across each group of four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length.
Pure Open: Released under the MIT open-source license, with no regional limits on access.
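
The IndexShare idea above can be illustrated with a toy sketch. This is not the model's implementation. It assumes that "sharing an indexer" means the top-k token selection computed at the first layer of each four-layer group is reused by the other three layers. The names indexer_top_k, sparse_attend, and run_layers are hypothetical, the dot-product scoring is a stand-in for whatever the real indexer does, and the same keys and values are reused at every layer purely to keep the example short.

```python
import math
import random


def indexer_top_k(query, keys, k):
    """Hypothetical indexer: score every key by dot product, keep the top-k positions."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attend(query, keys, values, selected):
    """Softmax attention restricted to the selected positions."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def run_layers(query, keys, values, num_layers=8, group_size=4, k=4):
    """Run a stack of toy sparse-attention layers, calling the indexer once per group."""
    indexer_calls = 0
    selected = None
    hidden = query
    for layer in range(num_layers):
        if layer % group_size == 0:
            # Assumption: only the first layer of each group runs the indexer.
            selected = indexer_top_k(hidden, keys, k)
            indexer_calls += 1
        # The remaining layers in the group reuse the same selected positions.
        hidden = sparse_attend(hidden, keys, values, selected)
    return hidden, indexer_calls


rng = random.Random(0)
dim, seq_len = 8, 32
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
query = [rng.uniform(-1, 1) for _ in range(dim)]

_, shared_calls = run_layers(query, keys, values, num_layers=8, group_size=4)
_, unshared_calls = run_layers(query, keys, values, num_layers=8, group_size=1)
print(shared_calls, unshared_calls)  # indexer invocations with and without sharing
```

With a group size of four, the indexer runs once per four layers instead of once per layer. The sketch only counts indexer invocations; it does not attempt to reproduce the 2.9× per-token FLOPs figure, which depends on the real architecture.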

Benchmark Results

Benchmark	image-text-to-image-model-v2	Qwen3.7-Max	DeepSeek-V4-Pro	Claude Opus 4.8	Gemini 3.1 Pro
Reasoning
HLE	40.5	41.4	37.7	49.8*	45.0
GPQA-Diamond	91.2	90.0	90.1	93.6	94.3
Coding
SWE-bench Pro	62.1	60.6	55.4	69.2	54.2
DeepSWE	46.2	18.0	8.0	58.0	10.0
