Clank Labs Model
Wrench
Frontier-grade agentic AI that runs on your hardware, for free. No API keys, no monthly bills — just models built for tool calling, error recovery, and getting real work done. The 35B scores 118/120 (matches Claude Opus) on 16GB VRAM. The 9B scores 114/120 (matches Claude Sonnet) on 8GB VRAM.
Benchmark Results
40-prompt agentic evaluation across 8 categories. Scored 0-3 per prompt.
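The headline scores follow directly from that setup: 40 prompts at up to 3 points each gives a 120-point ceiling. A quick sanity check of the arithmetic (illustrative only; this is not the benchmark harness itself):

```python
# 40 prompts, each scored 0-3, so the maximum possible score is 120.
PROMPTS = 40
MAX_PER_PROMPT = 3

MAX_SCORE = PROMPTS * MAX_PER_PROMPT  # 120

def percentage(score: int) -> float:
    """Convert a raw benchmark score out of 120 into a percentage."""
    return round(score / MAX_SCORE * 100, 1)

print(percentage(118))  # 98.3  (Wrench 35B)
print(percentage(114))  # 95.0  (Wrench 9B)
```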
Wrench 35B — Category Breakdown
Wrench 9B — Category Breakdown
vs. Frontier Models
| Model | Type | Score |
|---|---|---|
| Claude Opus | Frontier | ~118/120 |
| Wrench 35B | Clank Labs | 118/120 |
| Claude Sonnet | Frontier | ~114/120 |
| Wrench 9B | Clank Labs | 114/120 |
| GPT-4o | Frontier | ~110/120 |
| Qwen 3.5 35B | Base | ~60/120 |
Independent Validation
Wrench 35B on the Berkeley Function Calling Leaderboard (BFCL) — 1,390 test cases across 7 categories.
Non-live / AST category. An independent, standardized benchmark — not designed by us.
| Category | Score | Accuracy |
|---|---|---|
| Simple (Python) | 339/400 | 84.8% |
| Simple (Java) | 44/100 | 44.0% |
| Simple (JavaScript) | 28/50 | 56.0% |
| Multiple | 169/200 | 84.5% |
| Parallel | 170/200 | 85.0% |
| Parallel Multiple | 165/200 | 82.5% |
| Irrelevance Detection | 213/240 | 88.8% |
| Overall | 1128/1390 | 81.2% |
BFCL tests raw function-call syntax across Python, Java, and JavaScript — parallel invocations, multi-function calls, and irrelevance detection. A different axis than our agentic benchmark. Together, both benchmarks validate Wrench across structured function calling and real-world agent workflows.
Built Different
Purpose-Built for Agents
Fine-tuned specifically for tool calling, multi-step task chains, and error recovery. Not a general chatbot — a coding agent.
Two Sizes
35B MoE (3B active, 16GB VRAM) for maximum capability. 9B dense (~5GB GGUF, 8GB VRAM) for lighter hardware.
Safe by Design
Trained to warn before destructive actions, ask for confirmation, and never hallucinate tool calls that don't exist.
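The same property can also be enforced on the client side: validate every proposed tool call against a registry of tools that actually exist before executing it. A minimal sketch (tool names and schemas here are made up for illustration):

```python
# Registry of known tools and the exact argument names each accepts.
# (Hypothetical tools -- substitute your agent's real tool schemas.)
REGISTRY = {
    "read_file": {"path"},
    "run_shell": {"command"},
}

def validate_tool_call(name: str, arguments: dict) -> bool:
    """Accept a proposed tool call only if the tool is registered and
    its argument names match the registered schema exactly."""
    expected = REGISTRY.get(name)
    return expected is not None and set(arguments) == expected

assert validate_tool_call("read_file", {"path": "/etc/hosts"})
assert not validate_tool_call("delete_everything", {})        # unknown tool
assert not validate_tool_call("read_file", {"file": "x.txt"}) # wrong argument name
```

Rejected calls can be surfaced back to the model as an error message, giving it a chance to recover, which is exactly the error-recovery behavior the fine-tune targets.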
Proven Performance
35B scores 118/120 (Opus-tier) + 81.2% on BFCL. 9B scores 114/120 (95%). On hardware you own, for free.
Ollama + llama.cpp
Standard GGUF format. Works with Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible server.
Built for Clank
Drop-in model for Clank. Set it as your primary model and go — multi-channel, multi-agent, full tool suite.
Quick Start
Option A: Ollama (recommended)
# Download the GGUF + Modelfile from HuggingFace, then:
ollama create wrench -f Modelfile
ollama run wrench
# For the 9B model:
ollama create wrench-9b -f Modelfile
ollama run wrench-9b
# Recommended: enable KV cache quantization for lower VRAM usage
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama serve
# Or use with Clank:
npm install -g @clanklabs/clank
clank setup
# Set primary model to "ollama/wrench" or "ollama/wrench-9b" in config
Option B: llama.cpp
# 35B model:
./llama-server -m wrench-35B-A3B-Q4_K_M.gguf --jinja -ngl 100 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 32768
# 9B model:
./llama-server -m wrench-9B-Q4_K_M.gguf --jinja -ngl 100 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 8192
# Serves an OpenAI-compatible API on port 8080
# Point any app at http://localhost:8080/v1
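Any HTTP client can talk to that endpoint. A minimal sketch using only the Python standard library (the model name and sampling values mirror the commands above; adjust to your setup):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server default port

def build_payload(messages, model="wrench", temperature=0.4):
    # Mirrors the recommended sampling settings from the commands above.
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": 0.95,
    }

def chat(messages):
    """POST a chat completion to the local OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat([{"role": "user", "content": "List the files in /tmp"}]))
```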
Model Details
Wrench 35B
| Spec | Value |
|---|---|
| Base Model | Qwen3.5-35B-A3B |
| Architecture | MoE — 35B total, 3B active |
| Fine-Tune | LoRA (rank 64, alpha 128) |
| Training Data | 1,252 examples, 15 categories |
| Quantization | Q4_K_M GGUF (~20GB) |
| Context Window | 8,192 tokens |
| Min GPU | 16GB VRAM |
| Clank Benchmark | 118/120 (98.3%) |
| BFCL (non_live) | 81.2% (1128/1390) |
| License | Apache 2.0 |
Wrench 9B
| Spec | Value |
|---|---|
| Base Model | Qwen3.5-9B |
| Architecture | Dense — 9B parameters |
| Fine-Tune | LoRA (rank 64, alpha 128) |
| Training Data | 1,356 examples, 15 categories |
| Quantization | Q4_K_M GGUF (~5GB) |
| Context Window | 8,192 tokens |
| Min GPU | 8GB VRAM |
| Benchmark | 114/120 (95%) |
| License | Apache 2.0 |