Clank Labs Model

Wrench

Frontier-grade agentic AI that runs on your hardware, for free. No API keys, no monthly bills — just models built for tool calling, error recovery, and getting real work done. The 35B scores 118/120 (matches Claude Opus) on 16GB VRAM. The 9B scores 114/120 (matches Claude Sonnet) on 8GB VRAM.

Benchmark Results

A 40-prompt agentic evaluation across 8 categories (5 prompts each). Every prompt is scored 0-3, for a maximum of 15 points per category and 120 overall.

Wrench 35B — Category Breakdown

| Category | Score |
| --- | --- |
| Basic Tool Use | 15/15 |
| Multi-Step Tasks | 15/15 |
| Error Recovery | 14/15 |
| Response Quality | 15/15 |
| System Prompt Following | 15/15 |
| Planning & Reasoning | 15/15 |
| Tool Format Correctness | 14/15 |
| Safety & Restraint | 15/15 |
| **Total** | **118/120 (98.3%)** |
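As a quick sanity check on the scoring arithmetic (8 categories of 5 prompts, each prompt scored 0-3), the 35B total can be recomputed from the per-category scores:

```python
# 40-prompt eval: 8 categories x 5 prompts, each prompt scored 0-3
prompts_per_category = 40 // 8               # 5 prompts per category
max_per_category = prompts_per_category * 3  # 15 points per category
max_total = 8 * max_per_category             # 120 points overall

# Wrench 35B per-category scores, in table order
scores_35b = [15, 15, 14, 15, 15, 15, 14, 15]
total = sum(scores_35b)
percent = round(100 * total / max_total, 1)
print(total, percent)  # 118 98.3
```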

Wrench 9B — Category Breakdown

| Category | Score |
| --- | --- |
| Basic Tool Use | 15/15 |
| Multi-Step Tasks | 14/15 |
| Error Recovery | 14/15 |
| Response Quality | 15/15 |
| System Prompt Following | 14/15 |
| Planning & Reasoning | 13/15 |
| Tool Format Correctness | 15/15 |
| Safety & Restraint | 14/15 |
| **Total** | **114/120 (95%)** |

vs. Frontier Models

| Model | Tier | Score |
| --- | --- | --- |
| Claude Opus | Frontier | ~118/120 |
| Wrench 35B | Clank Labs | 118/120 |
| Claude Sonnet | Frontier | ~114/120 |
| Wrench 9B | Clank Labs | 114/120 |
| GPT-4o | Frontier | ~110/120 |
| Base Qwen 3.5 35B | Base | ~60/120 |

Independent Validation

Wrench 35B on the Berkeley Function Calling Leaderboard (BFCL) — 1,390 test cases across 7 categories.

Scores below are from the non-live (AST) category. BFCL is an independent, standardized benchmark, not designed by us.

| Category | Score | Accuracy |
| --- | --- | --- |
| Simple (Python) | 339/400 | 84.8% |
| Simple (Java) | 44/100 | 44.0% |
| Simple (JavaScript) | 28/50 | 56.0% |
| Multiple | 169/200 | 84.5% |
| Parallel | 170/200 | 85.0% |
| Parallel Multiple | 165/200 | 82.5% |
| Irrelevance Detection | 213/240 | 88.8% |
| **Overall** | **1128/1390** | **82.0%** |

BFCL tests raw function-call syntax across Python, Java, and JavaScript — parallel invocations, multi-function calls, and irrelevance detection. A different axis than our agentic benchmark. Together, both benchmarks validate Wrench across structured function calling and real-world agent workflows.
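To illustrate what AST-style checking means in practice (this is a sketch of the idea, not BFCL's actual harness), a Python-syntax call emitted by a model can be parsed and compared field by field against a ground-truth call; `get_weather` here is a made-up example function:

```python
import ast

def parse_call(src: str):
    """Parse a single Python-style function call into (name, kwargs)."""
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = node.func.id
    # literal_eval on each keyword's value node recovers plain Python values
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

# Model output and ground truth match if name and arguments agree
model_out = 'get_weather(city="Paris", unit="celsius")'
name, kwargs = parse_call(model_out)
print(name, kwargs)  # get_weather {'city': 'Paris', 'unit': 'celsius'}
```

Comparing the parsed structure rather than the raw string is what makes this kind of check robust to harmless formatting differences such as argument order or whitespace.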

Built Different

Purpose-Built for Agents

Fine-tuned specifically for tool calling, multi-step task chains, and error recovery. Not a general chatbot — a coding agent.

Two Sizes

35B MoE (3B active, 16GB VRAM) for maximum capability. 9B dense (~5GB GGUF, 8GB VRAM) for lighter hardware.

Safe by Design

Trained to warn before destructive actions, ask for confirmation, and never hallucinate tool calls that don't exist.

Proven Performance

35B scores 118/120 (Opus-tier) + 82% on BFCL. 9B scores 114/120 (95%). On hardware you own, for free.

Ollama + llama.cpp

Standard GGUF format. Works with Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible server.

Built for Clank

Drop-in model for Clank. Set it as your primary model and go — multi-channel, multi-agent, full tool suite.

Quick Start

Option A: Ollama (recommended)

```shell
# Download the GGUF + Modelfile from HuggingFace, then:
ollama create wrench -f Modelfile
ollama run wrench

# For the 9B model:
ollama create wrench-9b -f Modelfile
ollama run wrench-9b

# Recommended: enable KV cache quantization for lower VRAM usage
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama serve

# Or use with Clank:
npm install -g @clanklabs/clank
clank setup
# Set primary model to "ollama/wrench" or "ollama/wrench-9b" in config
```

Option B: llama.cpp

```shell
# 35B model:
./llama-server -m wrench-35B-A3B-Q4_K_M.gguf --jinja -ngl 100 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 32768

# 9B model:
./llama-server -m wrench-9B-Q4_K_M.gguf --jinja -ngl 100 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 8192

# Both serve an OpenAI-compatible API on port 8080
# Point any app at http://localhost:8080/v1
```
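Because the server speaks the OpenAI chat-completions protocol, any client that can POST JSON works. A minimal sketch of a tool-calling request body, aimed at the local endpoint; the `read_file` tool and its schema are illustrative, not part of Wrench itself:

```python
import json

# Chat-completions request with one tool definition, following the
# OpenAI chat-completions schema that llama-server accepts
payload = {
    "model": "wrench",  # llama-server serves a single model regardless of this field
    "messages": [{"role": "user", "content": "Show me the README"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file",  # illustrative tool, defined by your app
            "description": "Read a file from the workspace",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json; the reply may contain tool_calls
print(len(json.loads(body)["tools"]))  # 1
```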

Model Details

Wrench 35B

| Spec | Value |
| --- | --- |
| Base Model | Qwen3.5-35B-A3B |
| Architecture | MoE — 35B total, 3B active |
| Fine-Tune | LoRA (rank 64, alpha 128) |
| Training Data | 1,252 examples, 15 categories |
| Quantization | Q4_K_M GGUF (~20GB) |
| Context Window | 8,192 tokens |
| Min GPU | 16GB VRAM |
| Clank Benchmark | 118/120 (98.3%) |
| BFCL (non_live) | 82.0% (1128/1390) |
| License | Apache 2.0 |

Wrench 9B

| Spec | Value |
| --- | --- |
| Base Model | Qwen3.5-9B |
| Architecture | Dense — 9B parameters |
| Fine-Tune | LoRA (rank 64, alpha 128) |
| Training Data | 1,356 examples, 15 categories |
| Quantization | Q4_K_M GGUF (~5GB) |
| Context Window | 8,192 tokens |
| Min GPU | 8GB VRAM |
| Clank Benchmark | 114/120 (95%) |
| License | Apache 2.0 |