Available models - Weights & Biases Documentation

Serverless Inference provides access to several open source foundation models. Each model has different strengths and use cases.

Generally available models

The following models are generally available:

Model	Model ID (for API usage)	Type	Context Window	Parameters	Description
DeepSeek V4-Flash	`deepseek-ai/DeepSeek-V4-Flash`	Text	1049k	13B-284B (Active-Total)	DeepSeek V4-Flash is an MoE model with 1M context length great for coding, reasoning, and agentic workloads.
DeepSeek V4-Pro	`deepseek-ai/DeepSeek-V4-Pro`	Text	1049k	49B-1.6T (Active-Total)	DeepSeek V4-Pro is a 1.6T-parameter MoE model with 49B active parameters excelling at advanced reasoning, coding, and complex agentic workloads.
DeepSeek V3.1	`deepseek-ai/DeepSeek-V3.1`	Text	161k	37B-671B (Active-Total)	A large hybrid model that supports both thinking and non-thinking modes via prompt templates.
Google Gemma 4 31B	`google/gemma-4-31B-it`	Text, Vision	262k	31B (Total)	Gemma 4 31B Dense is designed for advanced reasoning, agentic workflows, and longer context and is natively trained on 140+ languages.
IBM Granite 4.1 8B	`ibm-granite/granite-4.1-8b`	Text	131k	8B (Total)	Granite 4.1 8B is a long-context instruct model capable of enhanced tool calling, instruction following, and chat capabilities.
Meta Llama 3.3 70B	`meta-llama/Llama-3.3-70B-Instruct`	Text	128k	70B (Total)	Multilingual model excelling in conversational tasks, detailed instruction-following, and coding.
Meta Llama 3.1 70B	`meta-llama/Llama-3.1-70B-Instruct`	Text	128k	70B (Total)	Efficient conversational model optimized for responsive multilingual chatbot interactions.
Meta Llama 3.1 8B	`meta-llama/Llama-3.1-8B-Instruct`	Text	128k	8B (Total)	Efficient conversational model optimized for responsive multilingual chatbot interactions.
Microsoft Phi 4 Mini 3.8B	`microsoft/Phi-4-mini-instruct`	Text	128k	3.8B (Total)	Compact, efficient model ideal for fast responses in resource-constrained environments.
MiniMax M2.5	`MiniMaxAI/MiniMax-M2.5`	Text	197k	10B-230B (Active-Total)	MoE model with a highly sparse architecture designed for high-throughput and low latency with strong coding capabilities.
Moonshot AI Kimi K2.6	`moonshotai/Kimi-K2.6`	Text, Vision	262k	32B-1T (Active-Total)	Kimi K2.6 is a multimodal Mixture-of-Experts language model featuring 32 billion activated parameters and a total of 1 trillion parameters.
Moonshot AI Kimi K2.5	`moonshotai/Kimi-K2.5`	Text, Vision	262k	32B-1T (Active-Total)	Kimi K2.5 is a multimodal Mixture-of-Experts language model featuring 32 billion activated parameters and a total of 1 trillion parameters.
NVIDIA Nemotron 3 Super 120B	`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8`	Text	262k	12B-120B (Active-Total)	Nemotron 3 is a LatentMoE model designed to deliver strong agentic, reasoning, and conversational capabilities.
OpenAI GPT OSS 120B	`openai/gpt-oss-120b`	Text	131k	5.1B-117B (Active-Total)	Efficient Mixture-of-Experts model designed for high-reasoning, agentic and general-purpose use cases.
OpenAI GPT OSS 20B	`openai/gpt-oss-20b`	Text	131k	3.6B-20B (Active-Total)	Lower latency Mixture-of-Experts model trained on OpenAI’s Harmony response format with reasoning capabilities.
OpenPipe Qwen3 14B Instruct	`OpenPipe/Qwen3-14B-Instruct`	Text	32.8k	14.8B (Total)	An efficient multilingual, dense, instruction-tuned model, optimized by OpenPipe for building agents with finetuning.
Qwen3.6 35B A3B	`Qwen/Qwen3.6-35B-A3B`	Text, Vision	262k	3B-35B (Active-Total)	Qwen3.6-35B-A3B is an MoE multimodal model with 262K context optimized for agentic coding workflows.
Qwen3.6 27B	`Qwen/Qwen3.6-27B`	Text, Vision	262k	27B (Total)	Qwen3.6-27B is a 27B dense multimodal model with 262K context built for flagship-level agentic coding.
Qwen3.5 35B A3B	`Qwen/Qwen3.5-35B-A3B`	Text, Vision	262k	3B-35B (Active-Total)	Qwen3.5-35B-A3B is an open-weights multimodal MoE model built for efficient, high-throughput inference across chat, reasoning, and agentic tasks.
Qwen3 235B A22B Thinking-2507	`Qwen/Qwen3-235B-A22B-Thinking-2507`	Text	262k	22B-235B (Active-Total)	High-performance Mixture-of-Experts model optimized for structured reasoning, math, and long-form generation.
Qwen3 235B A22B-2507	`Qwen/Qwen3-235B-A22B-Instruct-2507`	Text	262k	22B-235B (Active-Total)	Efficient multilingual, Mixture-of-Experts, instruction-tuned model, optimized for logical reasoning.
Qwen3 30B A3B	`Qwen/Qwen3-30B-A3B-Instruct-2507`	Text	262k	3.3B-30.5B (Active-Total)	Qwen3-30B-A3B-Instruct-2507 is a 30.5B MoE instruction-tuned model with enhanced reasoning, coding, and long-context understanding.
Qwen3 Coder 480B A35B	`Qwen/Qwen3-Coder-480B-A35B-Instruct`	Text	262k	35B-480B (Active-Total)	Mixture-of-Experts model optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning.
Z.AI GLM 5.1	`zai-org/GLM-5.1`	Text	203k	40B-744B (Active-Total)	Powerful MoE model for long-horizon agentic engineering and advanced reasoning.

Experimental models

The following models are experimental:

Model	Model ID (for API usage)	Type	Context Window	Parameters	Description
Qwen3.5 27B	`Qwen/Qwen3.5-27B`	Text, Vision	262k	27B (Total)	Qwen3.5-27B is a dense model from the Qwen3.5 family built for high performance across a large range of benchmarks.

Deprecated models

The following models are deprecated: None currently

Use model IDs

To specify a model when calling the API, use its Model ID from the preceding tables. For example:

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[...]
)

Next steps

After you’ve chosen a model, continue with one of the following resources:

Check usage limits and pricing for each model.
See the API reference for how to use these models.
Try models in the W&B Playground.

Documentation Index

​Generally available models

​Experimental models

​Deprecated models

​Use model IDs

​Next steps

Generally available models

Experimental models

Deprecated models

Use model IDs

Next steps