stabilityai/image-text-to-image-model-v2

Tags: Text Generation, transformers, safetensors, English, Chinese, glm_moe_dsa, conversational, Eval Results, arxiv:2602.15763, arxiv:2603.12201, License: mit

Instructions to use stabilityai/image-text-to-image-model-v2 with libraries, inference providers, notebooks, and local apps.

Libraries
How to use stabilityai/image-text-to-image-model-v2 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="stabilityai/image-text-to-image-model-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("stabilityai/image-text-to-image-model-v2")
model = AutoModelForCausalLM.from_pretrained("stabilityai/image-text-to-image-model-v2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

image-text-to-image-model-v2

Updated 8 days ago
👋 Join our WeChat or Discord community.
📖 Check out the image-text-to-image-model-v2 blog and image-text-to-image-model-v2 Technical report.
📍 Use image-text-to-image-model-v2 API services on Z.ai API Platform.
🔜 Try image-text-to-image-model-v2 here.

Introduction

We're introducing image-text-to-image-model-v2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor and, for the first time, delivers that capability on a solid 1M-token context.

  • Solid 1M Context: A 1M-token context window that stays stable throughout long-horizon work.
  • Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency.
  • Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length.
  • Pure Open: An MIT open-source license with no regional limits and technical access without borders.
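
The IndexShare item above describes reusing one indexer across every four sparse attention layers. The following is a minimal sketch of that sharing pattern only, not the model's implementation. The names `indexer_owner`, `run_layers`, and `toy_indexer` are hypothetical, and it assumes (the card does not say) that the first layer in each group of four runs the indexer while the other three reuse its output.

```python
# Hypothetical sketch: share one indexer result across each group of four layers.
# The grouping rule (first layer of each group owns the indexer) is an assumption.
from typing import Callable, Dict, List

GROUP_SIZE = 4  # "every four sparse attention layers", per the model card


def indexer_owner(layer_idx: int, group_size: int = GROUP_SIZE) -> int:
    """Return the layer whose indexer output this layer reuses."""
    return (layer_idx // group_size) * group_size


def run_layers(num_layers: int, indexer: Callable[[int], List[int]]) -> Dict[int, List[int]]:
    """Pick token indices for every layer, calling the indexer once per group."""
    cache: Dict[int, List[int]] = {}
    selected: Dict[int, List[int]] = {}
    for layer in range(num_layers):
        owner = indexer_owner(layer)
        if owner not in cache:
            cache[owner] = indexer(owner)  # only the group owner pays for indexing
        selected[layer] = cache[owner]     # the other layers reuse the cached result
    return selected


calls: List[int] = []


def toy_indexer(layer: int) -> List[int]:
    """Stand-in for a real top-k token selector; records which layers invoke it."""
    calls.append(layer)
    return [layer, layer + 1]


selection = run_layers(8, toy_indexer)
print(calls)  # [0, 4]
```

With 8 layers the toy indexer runs twice instead of eight times. The sketch counts indexer invocations only; it does not reproduce the reported 2.9× per-token FLOPs figure, which depends on details not given here.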

Benchmark Results

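| Benchmark | image-text-to-image-model-v2 | Qwen3.7-Max | DeepSeek-V4-Pro | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Reasoning | | | | | |
| HLE | 40.5 | 41.4 | 37.7 | 49.8* | 45.0 |
| GPQA-Diamond | 91.2 | 90.0 | 90.1 | 93.6 | 94.3 |
| Coding | | | | | |
| SWE-bench Pro | 62.1 | 60.6 | 55.4 | 69.2 | 54.2 |
| DeepSWE | 46.2 | 18.0 | 8.0 | 58.0 | 10.0 |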
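
To make the comparison in the benchmark results easier to read, the short sketch below ranks image-text-to-image-model-v2 against the other four listed models on each benchmark. The scores are copied from the results above; the asterisk on the Claude Opus 4.8 HLE score is dropped because its footnote is not given here. The variable names are illustrative only.

```python
# Scores copied from the benchmark results above, in column order.
# The "*" on the 49.8 HLE entry is omitted; its footnote is not available.
MODELS = [
    "image-text-to-image-model-v2",
    "Qwen3.7-Max",
    "DeepSeek-V4-Pro",
    "Claude Opus 4.8",
    "Gemini 3.1 Pro",
]

SCORES = {
    "HLE": [40.5, 41.4, 37.7, 49.8, 45.0],
    "GPQA-Diamond": [91.2, 90.0, 90.1, 93.6, 94.3],
    "SWE-bench Pro": [62.1, 60.6, 55.4, 69.2, 54.2],
    "DeepSWE": [46.2, 18.0, 8.0, 58.0, 10.0],
}

# Rank of the first column (1 = best): one plus the number of strictly higher scores.
ranks = {
    bench: 1 + sum(score > row[0] for score in row)
    for bench, row in SCORES.items()
}

# Name of the top-scoring model on each benchmark.
leaders = {
    bench: MODELS[row.index(max(row))]
    for bench, row in SCORES.items()
}

for bench in SCORES:
    print(f"{bench}: rank {ranks[bench]} of {len(MODELS)}, leader: {leaders[bench]}")
```

On these numbers the model places second on both coding benchmarks, third on GPQA-Diamond, and fourth on HLE.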

Model Stats

Downloads last month: 67,107
Model size: 753B params
Tensor type: BF16 / F32
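
As a rough guide to what the stats above imply for loading the model, the sketch below estimates the memory needed for the weights alone. It assumes exactly 753e9 parameters and a single dtype for all of them. The listing shows a BF16 / F32 mix without giving the split, so the true figure lies somewhere between the two results, and activations and KV cache are not counted. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
# Back-of-the-envelope weight memory, assuming every parameter uses one dtype.
# Excludes activations, KV cache, and framework overhead.

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 753e9  # "753B params" from the model stats

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # BF16 stores 2 bytes per parameter
f32_gb = weight_memory_gb(NUM_PARAMS, 4)   # F32 stores 4 bytes per parameter

print(f"All BF16: {bf16_gb:.0f} GB")
print(f"All F32:  {f32_gb:.0f} GB")
```

At roughly 1.5 TB even in BF16, the unquantized weights would have to be sharded across many accelerators.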

Evaluation Results

SWE Bench Pro (ScaleAI/SWE-bench_Pro): 62.1
Diamond (Idavidrein/gpqa): 91.2
Deep Swe (datacurve/deep-swe): 46.2

Spaces using this model

41 Spaces, including:
smolagents/ml-intern
akhaliq/GLM-5.2
angelorovatti/zai-org-GLM-5.2