Model Pricing Overview

AI Gateway lets you call virtually all mainstream large models (Claude, GPT, Gemini, Qwen, DeepSeek, GLM, Kimi, Doubao, and more) through a single interface and a single bill, with no need to register, integrate, or top up accounts on each provider's platform separately. The tables below list the official list prices for each model, organized by currency zone (domestic in CNY, international in USD), with a separate table per vendor for easy comparison.


Billing in Three Minutes

Billing dimension descriptions

| Dimension | Description |
| --- | --- |
| Input | Unit price per prompt token in the user's request |
| Output | Unit price per token of model-generated content |
| Explicit cache write | Unit price for writing a prompt into the context cache for the first time |
| Explicit cache write · 5m / 1h | Anthropic's two cache write price tiers |
| Cache hit | Discounted price for reusing a cached prompt on subsequent requests |

Large models are not billed per call — they are billed by the volume of text processed, measured in units called tokens.
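As a minimal sketch of token-based billing, the snippet below computes the cost of one request from its token counts and per-million-token prices. The function name `request_cost` is hypothetical, and the example prices (input 2, output 3, in ¥ per million tokens) are how the deepseek-v3.2 row in this document reads; caching is ignored here.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request. Prices are quoted per million tokens."""
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


# Illustrative request: 10,000 prompt tokens, 2,000 generated tokens.
cost = request_cost(10_000, 2_000, input_price=2.0, output_price=3.0)
print(f"¥{cost:.4f}")
```

Note that output tokens are usually priced higher than input tokens, so a short prompt with a long answer can cost more than the reverse.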


What Is Caching and Why Does It Save Money?

If you repeatedly send the same opening content (for example, the same long system prompt or the same knowledge base document every time), the model can "remember" that content and reuse it directly on subsequent calls without recomputing it — this is caching. The cached portion is priced far below the normal input price, resulting in significant savings.

There are two types of caching:

| Cache type | Plain-language explanation | Billing characteristics |
| --- | --- | --- |
| Explicit cache (manual) | You actively tell the model "store this segment." Like renting a locker: storing it incurs a one-time write/create fee (slightly above the input price); each subsequent retrieval incurs a low hit fee; long-term storage may also incur a storage fee. | Write fee + hit fee (+ storage fee) |
| Implicit cache (automatic) | The system automatically detects repeated prefixes and caches them for you, with no action required. Like a store automatically giving a discount to regular customers: no write fee; you simply enjoy the lower price on a cache hit. | Hit fee only, no write fee |

Cache support by provider:

| Provider | Explicit cache | Implicit cache | Notes |
| --- | --- | --- | --- |
| Alibaba Cloud · Qwen | ✅ Supported | ✅ Supported | Both types, highest flexibility |
| OpenAI · GPT | ❌ Not supported | ✅ Supported | Automatic implicit cache only, no manual operation needed |
| Anthropic · Claude | ✅ Supported | ❌ Not supported | Manual explicit cache only, write available in 5-minute / 1-hour tiers |
| Google · Gemini | ✅ Supported | ❌ Not supported | Explicit cache, additional hourly storage fee |
| DeepSeek | Partial | Partial | v3.2 supports both; r1 / v3.1 implicit only; v4 series not yet available |
| Zhipu · GLM | Partial | ✅ Supported | GLM-5.1 supports both; others implicit only |
| Moonshot · Kimi | ✅ Supported | ✅ Supported | Both types |
| MiniMax | ❌ Not supported | ✅ Supported | Implicit only |
| ByteDance · Doubao | ✅ Supported | ✅ Supported | Explicit incurs storage fee; implicit takes effect in batch mode |

When reading the tables: an "Explicit·Write" or "Explicit·Create" column means explicit cache is supported; an "Implicit·Hit" column means implicit cache is supported; a missing column or empty cell means that model does not support that cache type.
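To make the two billing patterns concrete, here is a rough sketch that compares what a repeated prompt prefix costs with no cache, with explicit cache, and with implicit cache. It rests on simplifying assumptions that are not stated in this document: under explicit cache the first call pays the create price and every later call hits; under implicit cache the first call pays the normal input price and every later call hits; storage fees and cache expiry are ignored. The function names are hypothetical, and the prices are how the qwen3-max 0–32K row reads (input 2.5, create 3.125, explicit hit 0.25, implicit hit 0.5, in ¥ per million tokens).

```python
def cost_no_cache(prefix_tokens: int, calls: int, input_price: float) -> float:
    """Every call pays the full input price for the shared prefix."""
    return prefix_tokens * calls * input_price / 1_000_000


def cost_explicit(prefix_tokens: int, calls: int,
                  create_price: float, hit_price: float) -> float:
    """Assumed: one write on the first call, a hit on each later call."""
    return prefix_tokens * (create_price + (calls - 1) * hit_price) / 1_000_000


def cost_implicit(prefix_tokens: int, calls: int,
                  input_price: float, hit_price: float) -> float:
    """Assumed: the first call is a miss at input price, later calls hit."""
    return prefix_tokens * (input_price + (calls - 1) * hit_price) / 1_000_000


prefix, calls = 20_000, 100  # a 20K-token system prompt sent 100 times
print(f"no cache: ¥{cost_no_cache(prefix, calls, 2.5):.4f}")
print(f"explicit: ¥{cost_explicit(prefix, calls, 3.125, 0.25):.4f}")
print(f"implicit: ¥{cost_implicit(prefix, calls, 2.5, 0.5):.4f}")
```

Under these assumptions the explicit cache costs slightly more on the first call but becomes the cheapest option once the prefix is reused more than a few times.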


Domestic Zone (CNY · ¥ / million tokens)

Alibaba Cloud · Qwen Series

> Positioning: China's all-around model, covering general conversation, coding, vision, speech, and multimodal, with a context window up to 1 million tokens. Both explicit and implicit caching are supported.

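| Model | Context window | Input | Output | Explicit·Create | Explicit·Hit | Implicit·Hit |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3.6-max-preview (strongest) | 0–128K | 9 | 54 | 11.25 | 0.9 | |
| | 128K–256K | 15 | 90 | 18.75 | 1.5 | |
| qwen3.6-plus (flagship general) | 0–256K | 2 | 12 | 2.5 | 0.2 | |
| | 256K–1M | 8 | 48 | 10 | 0.8 | |
| qwen3.6-flash (fast & low-cost) | 0–256K | 1.2 | 7.2 | 1.5 | 0.12 | |
| | 256K–1M | 4.8 | 28.8 | 6 | 0.48 | |
| qwen3.5-plus | 0–128K | 0.8 | 4.8 | 1 | 0.08 | 0.16 |
| | 128K–256K | 2 | 12 | 2.5 | 0.2 | 0.4 |
| | 256K–1M | 4 | 24 | 5 | 0.4 | 0.8 |
| qwen3.5-flash | 0–128K | 0.2 | 2 | 0.25 | 0.02 | |
| | 128K–256K | 0.8 | 8 | 1 | 0.08 | |
| | 256K–1M | 1.2 | 12 | 1.5 | 0.12 | |
| qwen3-max | 0–32K | 2.5 | 10 | 3.125 | 0.25 | 0.5 |
| | 32K–128K | 4 | 16 | 5 | 0.4 | 0.8 |
| | 128K–256K | 7 | 28 | 8.75 | 0.7 | 1.4 |
| qwen3-coder-plus (coding) | 0–32K | 4 | 16 | 5 | 0.4 | 0.8 |
| | 32K–128K | 6 | 24 | 7.5 | 0.6 | 1.2 |
| | 128K–256K | 10 | 40 | 12.5 | 1 | 2 |
| | 256K–1M | 20 | 200 | 25 | 2 | 4 |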
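Tiered pricing by context window can be sketched as a simple lookup. The example below assumes (this document does not say so explicitly) that the tier is chosen by the request's input token count, that the whole request is then billed at that tier's rates, and that "K" means 1,000 tokens. The names `Tier`, `pick_tier`, and `tiered_cost` are hypothetical; the prices are how the qwen3-max rows read.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    max_input_tokens: int  # inclusive upper bound of the tier (assumed)
    input_price: float     # ¥ per million input tokens
    output_price: float    # ¥ per million output tokens


QWEN3_MAX_TIERS = [
    Tier(32_000, 2.5, 10.0),
    Tier(128_000, 4.0, 16.0),
    Tier(256_000, 7.0, 28.0),
]


def pick_tier(input_tokens: int, tiers: list[Tier]) -> Tier:
    """Return the first tier whose upper bound covers the input length."""
    for tier in tiers:
        if input_tokens <= tier.max_input_tokens:
            return tier
    raise ValueError("input exceeds the largest listed context window")


def tiered_cost(input_tokens: int, output_tokens: int, tiers: list[Tier]) -> float:
    """Bill the whole request at the rates of the selected tier."""
    tier = pick_tier(input_tokens, tiers)
    total = input_tokens * tier.input_price + output_tokens * tier.output_price
    return total / 1_000_000


print(f"¥{tiered_cost(10_000, 1_000, QWEN3_MAX_TIERS):.4f}")  # first tier
print(f"¥{tiered_cost(50_000, 1_000, QWEN3_MAX_TIERS):.4f}")  # second tier
```

If the provider bills tiers differently (for example, by total context including output), only `pick_tier` would need to change.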

DeepSeek Series

> Positioning: Known for exceptional cost-effectiveness and strong reasoning capabilities, suitable for budget-sensitive scenarios that still demand quality. Cache support varies by model version (see table below; missing columns indicate no support).

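| Model | Input | Output | Explicit·Create | Explicit·Hit | Implicit·Hit |
| --- | --- | --- | --- | --- | --- |
| deepseek-v4-pro (flagship) | 12 | 24 | | | 2.4 |
| deepseek-v4-flash (fast) | 1 | 2 | | | 0.2 |
| deepseek-v3.2 | 2 | 3 | 2.5 | 0.2 | 0.4 |
| deepseek-r1 (deep reasoning) | 4 | 16 | | | 0.8 |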

Zhipu · GLM Series

> Positioning: Balanced domestic general-purpose model; GLM-5 series is the new flagship generation. Most versions support implicit cache only; GLM-5.1 additionally supports explicit cache.

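| Model | Context window | Input | Output | Explicit·Create | Explicit·Hit | Implicit·Hit |
| --- | --- | --- | --- | --- | --- | --- |
| glm-5.1 (flagship) | 0–32K | 6 | 24 | 7.5 | 0.6 | 1.2 |
| | 32K–200K | 8 | 28 | 10 | 0.8 | 1.6 |
| glm-5 | 0–32K | 4 | 18 | | | 0.8 |
| | 32K–198K | 6 | 22 | | | 1.2 |
| glm-4.7 | 0–32K | 3 | 14 | | | 0.6 |
| | 32K–166K | 4 | 16 | | | 0.8 |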

Moonshot · Kimi Series

> Positioning: Excels at understanding and processing ultra-long text. Both explicit and implicit caching are supported.

| Model | Input | Output | Explicit·Create | Explicit·Hit | Implicit·Hit |
| --- | --- | --- | --- | --- | --- |
| kimi-k2.6 | 6.5 | 27 | 8.125 | 0.65 | 1.3 |
| kimi-k2.5 | 4 | 21 | 5 | 0.4 | 0.8 |

MiniMax Series

> Positioning: Cost-effective general-purpose model. Implicit cache only (system discounts automatically, no action required).

| Model | Input | Output | Implicit·Hit |
| --- | --- | --- | --- |
| MiniMax-M2.7 | 2.1 | 8.4 | 0.42 |
| MiniMax-M2.5 | 2.1 | 8.4 | 0.42 |

Volcano Engine (Doubao)

> Positioning: China's high-value all-in-one suite, covering text, vision, video, images, and 3D. Both explicit cache (with an additional storage fee of ¥0.017 per million tokens per hour) and implicit cache (effective in batch mode) are supported. The table below shows standard "online inference" prices; Doubao also offers approximately 50% off for batch inference.

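| Model | Context window | Input | Output | Explicit·Hit |
| --- | --- | --- | --- | --- |
| doubao-seed-2.0-pro (flagship) | [0, 32K] | 3.2 | 16 | 0.64 |
| | (32K, 128K] | 4.8 | 24 | 0.96 |
| | (128K, 256K] | 9.6 | 48 | 1.92 |
| doubao-seed-2.0-code (coding) | [0, 32K] | 3.2 | 16 | 0.64 |
| | (32K, 128K] | 4.8 | 24 | 0.96 |
| | (128K, 256K] | 9.6 | 48 | 1.92 |
| doubao-seed-2.0-lite | [0, 32K] | 0.6 | 3.6 | 0.12 |
| | (32K, 128K] | 0.9 | 5.4 | 0.18 |
| | (128K, 256K] | 1.8 | 10.8 | 0.36 |
| doubao-seed-2.0-mini (cheapest) | [0, 32K] | 0.2 | 2 | 0.04 |
| | (32K, 128K] | 0.4 | 4 | 0.08 |
| | (128K, 256K] | 0.8 | 8 | 0.16 |
| doubao-seed-1.6 | [0, 32K] | 0.8 | 2 / 8 ※ | 0.16 |
| | (32K, 128K] | 1.2 | 16 | 0.16 |
| | (128K, 256K] | 2.4 | 24 | 0.16 |
| doubao-seed-1.6-flash (fast) | [0, 32K] | 0.15 | 1.5 | 0.03 |
| | (32K, 128K] | 0.3 | 3 | 0.03 |
| | (128K, 256K] | 0.6 | 6 | 0.03 |
| doubao-seed-1.6-vision (vision) | [0, 32K] | 0.8 | 8 | 0.16 |
| | (32K, 128K] | 1.2 | 16 | 0.16 |
| | (128K, 256K] | 2.4 | 24 | 0.16 |
| doubao-1.5-pro-32k | | 0.8 | 2 | 0.16 |
| doubao-1.5-lite-32k | | 0.3 | 0.6 | 0.06 |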
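The storage fee mentioned in the Doubao positioning note can be folded into a cost estimate as follows. This is only a sketch under assumptions not confirmed by this document: storage is billed linearly on cached tokens multiplied by hours stored, and no separate write fee applies (the table lists none). The function name `explicit_cache_cost` is hypothetical; the prices are how the doubao-seed-2.0-pro [0, 32K] row reads (input 3.2, explicit hit 0.64).

```python
STORAGE_PRICE = 0.017  # ¥ per million tokens per hour, from the Doubao note


def explicit_cache_cost(prefix_tokens: int, hours_stored: float,
                        hits: int, hit_price: float) -> float:
    """Storage fee plus hit fees for one cached prefix (assumed linear)."""
    storage = prefix_tokens * hours_stored * STORAGE_PRICE / 1_000_000
    hit_fees = prefix_tokens * hits * hit_price / 1_000_000
    return storage + hit_fees


prefix = 20_000  # tokens in the cached prefix
cached = explicit_cache_cost(prefix, hours_stored=24, hits=100, hit_price=0.64)
uncached = prefix * 100 * 3.2 / 1_000_000  # same 100 calls at input price
print(f"cached:   ¥{cached:.5f}")
print(f"uncached: ¥{uncached:.5f}")
```

In this example the storage fee is small next to the hit fees, so the cache pays off as long as the prefix is reused regularly while it is stored.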