<p align="center">
<a href="https://minebench.ai">
<img src=".github/assets/readme/minebench-banner.png" style="height: 10em" alt="MineBench banner"/>
</a>
</p>
<p align="center">
<a href="docs/README.md"><strong>[ Read the Docs ]</strong></a>
</p>
<p align="center">
<a href="https://minebench.ai">
<img alt="Live" src="https://img.shields.io/badge/Live-minebench.ai-0ea5e9?style=flat&logo=vercel&logoColor=white" />
</a>
<a href="LICENSE">
<img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-3b82f6?style=flat" />
</a>
<a href="https://buymeacoffee.com/ammaaralam">
<img alt="Support" src="https://img.shields.io/badge/Support-Buy%20Me%20a%20Coffee-ffdd00?style=flat&logo=buy-me-a-coffee&logoColor=000000" />
</a>
</p>
---
# MineBench
**A benchmark for evaluating AI spatial reasoning through Minecraft-style voxel construction.**
Models are given a natural-language prompt and must produce raw 3D coordinates as JSON. In tool mode, models call `voxel.exec` (minimal primitives: `block`, `box`, `line`) to generate large builds beyond token-only JSON limits. MineBench visualizes the output and ranks models via head-to-head voting with a confidence-aware Glicko-style system (public ordering by conservative score).
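For context, the raw (non-tool) output is just structured JSON describing block placements. The exact schema isn't reproduced in this README, so the field names below are illustrative rather than the benchmark's actual contract, but conceptually a build looks like this:

```json
[
  { "x": 0, "y": 64, "z": 0, "block": "stone_bricks" },
  { "x": 0, "y": 65, "z": 0, "block": "stone_bricks" },
  { "x": 1, "y": 64, "z": 0, "block": "oak_planks" }
]
```

Each entry places one block at integer voxel coordinates. Tool mode exists because emitting thousands of such entries token-by-token quickly hits output limits, which is what the `box` and `line` primitives compress.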
## Try it live
*(Screenshot: MineBench arena — Opus 4.5 versus Opus 4.6)*
*(Screenshot: MineBench default Arena landing page)*
## Why MineBench?
Most LLM benchmarks test text and raw accuracy. MineBench instead tests whether a model can reason about 3D space. Given a prompt like "a medieval castle with four towers", the model must mentally construct the geometry, pick materials, and output thousands of precise block coordinates. There is no vision model or diffusion involved, just math and spatial logic.
Anecdotally, this kind of spatial reasoning tracks a model's general capability closely: the MineBench leaderboard tends to reproduce the same hierarchy people observe in real-world usage, and the strongest reasoning models stand out immediately when asked to produce visual builds.
Unlike most benchmarks, MineBench gives you a way to *see* (at least one aspect of) a model's raw intelligence. The ranking system also makes it easy to spot models that are 'bench-maxed': impressive on paper, but clearly lacking in real-world usage.
*(Screenshot: MineBench arena — two AI models building a medieval castle side-by-side)*
## Features

- **Arena** — blind head-to-head comparisons of pre-generated builds with confidence-aware ranking
- **Sandbox** — compare existing builds or generate new ones live with your own API keys
- **Local Lab** — copy the benchmark prompt, run it in any model, paste the JSON back to render
- **Leaderboard** — live rankings with win/loss/draw stats across all models
## Documentation

- Full docs index: [docs/README.md](docs/README.md)
- Ranking math and matchmaking walkthrough: [docs/arena-ranking-system.md](docs/arena-ranking-system.md)
- Ranking policy: [docs/arena-ranking-validity-policy-v2.md](docs/arena-ranking-validity-policy-v2.md)
- Voxel tool runtime, conversion, and import workflows: [docs/voxel-exec-raw-output.md](docs/voxel-exec-raw-output.md)
*(Screenshot: MineBench leaderboard showing model rankings)*
## Supported Models
MineBench currently benchmarks models from OpenAI, Anthropic, Google, Moonshot, DeepSeek, xAI, Z.AI, Qwen, Meta, and any model available through OpenRouter.
## Contributing
Contributions are welcome! See CONTRIBUTING.md for how to add new models, submit benchmark prompts, improve the UI, or fix bugs.
## Support MineBench
Running MineBench is expensive: model inference, storage, and hosting costs add up quickly as the benchmark grows.
If MineBench is useful to you and you want to help keep updates and new model runs coming, you can support it here:
**[Buy Me a Coffee](https://buymeacoffee.com/ammaaralam)**
## License
MIT
Texture pack: Faithful (see assets/texture-pack/LICENSE.txt)
Inspired by MC-Bench (GitHub)
---
## Quick Start (Local)

This path lets you run the full app and compare existing builds from `uploads/` without generating new ones.
### 1) Prerequisites

- Node.js 18+
- pnpm
- Docker (for local Postgres)
### 2) Install dependencies
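With pnpm installed (see prerequisites), this is the standard install:

```bash
pnpm install
```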
### 3) Create env file
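As described under Environment Variables below, this just means copying the example file and filling in what you need:

```bash
cp .env.example .env
```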
### 4) Start app + database
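This step is a single command, the `dev:setup` script described next:

```bash
pnpm dev:setup
```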
`pnpm dev:setup` will:

- ensure `.env` exists
- build the texture atlas
- reset the local Docker Postgres volume
- run Prisma migrations
- start the Next.js dev server on http://localhost:3000
### 5) Seed local DB from checked-in `uploads/`
In a second terminal:
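The exact seed command isn't preserved in this excerpt. It is a pnpm script run in a second terminal; the name below is a guess, so check `package.json` for the actual script:

```bash
# Hypothetical script name; see package.json for the real seed command
pnpm db:seed
```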
Then open:
- http://localhost:3000/ (Arena)
- http://localhost:3000/sandbox (Benchmark Compare works immediately)
- http://localhost:3000/leaderboard
### Alternative startup (keep DB state)
If you do not want to reset the DB volume each time:
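The command for this path isn't shown in this excerpt; the idea is to skip the volume reset and start the dev server against the existing database, e.g. via the standard Next.js `dev` script (assumed name, check `package.json`):

```bash
# Assumes a standard `dev` script; skips the Docker volume reset done by dev:setup
pnpm dev
```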
## Live Generation (Bring Your Own Keys)

To generate fresh builds in `/sandbox` -> **Live Generate**:
1. Open http://localhost:3000/sandbox
2. Switch to **Live Generate**
3. Enter either:
   - an OpenRouter key (recommended), or
   - provider-specific keys (OpenAI/Anthropic/Gemini/Moonshot/DeepSeek)
4. Pick 2 models and click **Generate**
Notes:
- Keys entered in Sandbox are stored in browser `localStorage` and sent only with that request.
- In production, `/api/generate` requires request keys unless `MINEBENCH_ALLOW_SERVER_KEYS=1`.
## Environment Variables

Copy `.env.example` to `.env` and set what you need:
### Core

- `DATABASE_URL` (required): pooled/runtime Postgres URL
- `DIRECT_URL` (required): direct Postgres URL for Prisma migrations
- `ADMIN_TOKEN` (required for `/api/admin/*`)
- `CRON_SECRET` (recommended if using Vercel Cron for `/api/admin/rank-snapshots/capture`)
- `SUPABASE_URL` + `SUPABASE_SERVICE_ROLE_KEY` (required for large build upload/download via Supabase Storage)
- `SUPABASE_STORAGE_BUCKET` (optional, default `builds`)
- `SUPABASE_STORAGE_PREFIX` (optional, default `imports`)
### Provider keys (any subset)

- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GOOGLE_AI_API_KEY`
- `MOONSHOT_API_KEY`
- `DEEPSEEK_API_KEY`
- `OPENROUTER_API_KEY`
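Putting the core and provider entries together, a local `.env` might look like the sketch below. All values are placeholders; the Postgres URLs in particular will depend on how your local Docker container is configured.

```bash
# .env (placeholder values only)
DATABASE_URL="postgresql://postgres:postgres@localhost:5432/minebench"
DIRECT_URL="postgresql://postgres:postgres@localhost:5432/minebench"
ADMIN_TOKEN="change-me"
SUPABASE_URL="https://your-project.supabase.co"
SUPABASE_SERVICE_ROLE_KEY="your-service-role-key"

# Any subset of provider keys
OPENROUTER_API_KEY="sk-or-..."
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-ant-..."
```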
### Optional provider/runtime tuning

- `MINEBENCH_ALLOW_SERVER_KEYS=1` (production opt-in for server env keys in `/api/generate`)
- `ANTHROPIC_OPUS_4_6_EFFORT=low|medium|high|max`
- `ANTHROPIC_SONNET_4_6_EFFORT=low|medium|high|max` (runtime falls back automatically if provider rejects max)
- `ANTHROPIC_STREAM_RESPONSES=1`
- `OPENAI_STREAM_RESPONSES=1` (applies to live-delta callers; batch gener…