Leverage: Building a Production-Grade AI Bot Competition Platform
I’ve spent the last several weeks rewriting Leverage — an online judge and bot competition platform — from the ground up. What started as “just a rewrite” turned into a comprehensive system with sandboxed execution, a dual leaderboard, real-time human-vs-bot matches, and an AI-designed game pipeline. Here’s the technical story.
What Leverage Does
Leverage is two things at once:
- An Online Judge (OJ) — students submit code, it runs against test cases, gets a verdict
- A Bot Competition Platform — bots play strategic games against each other, ELO ratings evolve, humans can join matches in real time
The original system was a Vue 2 frontend + PHP backend with a Django judge server. After 18 months of accumulated technical debt, we did a full rewrite: NestJS backend, Nuxt 4 frontend, and a brand-new judge engine called botzone-neo.
Architecture Overview
┌────────────────┐    REST/SSE    ┌─────────────────┐
│   Nuxt 4 SPA   │◄──────────────►│ NestJS Backend  │
│   (52 pages)   │                │   (742 tests)   │
└────────────────┘                └────────┬────────┘
                                           │ Bull queue
                                  ┌────────▼────────┐
                                  │   botzone-neo   │
                                  │  (400+ tests)   │
                                  └────────┬────────┘
                                           │
                              ┌────────────▼───────────┐
                              │     shimmy sandbox     │
                              │ (Direct/Sandlock/WASM) │
                              └────────────────────────┘
The backend never talks directly to the sandbox — everything goes through botzone-neo, which handles compilation, multi-round game orchestration, and result callbacks.
The Judge Engine: botzone-neo
botzone-neo is where most of the interesting engineering lives. It implements a clean DDD architecture:
domain/ — Match aggregate, Bot entity, Verdict types
application/ — RunMatchUseCase, RunOjUseCase
infrastructure/ — Sandbox backends, Compile cache, Callback service
strategies/ — Restart, Longrun, Webhook, UserJudge strategies
Multi-Strategy Sandboxing
Bots run in sandboxes via a pluggable ISandbox interface:
interface ISandbox {
  compile: (language: number, source: string) => Promise<CompiledArtifact>
  run: (artifact: CompiledArtifact, input: string, limits: ResourceLimits) => Promise<RunResult>
}
Three backends implement this interface: DirectBackend (a plain subprocess, for development only), SandlockBackend (Linux cgroups), and WasmBackend (Python in browser-compatible WASM). Because every backend sits behind the same interface, we can swap backends per deployment.
The Judge Protocol
Games use a long-running judge process. Each round:
[judge stdin] ← {"round": 3, "responses": {"0": "47", "1": "62"}}
[judge stdout] → {"commands": {"0": {...}, "1": {...}}, "display": {...}, "verdict": "continue"}
The judge is sandboxed too — it can crash without affecting the match record. A UserJudgeStrategy compiles and manages the judge lifecycle, passing round data via stdin/stdout pipes.
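The round exchange above can be modeled with a pair of message types. This is a minimal sketch, assuming one JSON message per line; the type names and the helpers `buildJudgeInput` and `parseJudgeOutput` are hypothetical, and only the field names (`round`, `responses`, `commands`, `display`, `verdict`) come from the messages shown above.

```typescript
// Hypothetical types for the judge protocol; field names follow the messages shown above.
interface JudgeInput {
  round: number
  responses: Record<string, string> // keyed by player index: "0", "1", ...
}

interface JudgeOutput {
  commands: Record<string, unknown> // one entry per player index
  display: unknown
  verdict: string // e.g. "continue"
}

// Serialize one round of bot responses as a single line for the judge's stdin.
function buildJudgeInput(round: number, responses: Record<string, string>): string {
  const message: JudgeInput = { round, responses }
  return `${JSON.stringify(message)}\n`
}

// Parse one line of judge stdout, rejecting anything that is not a well-formed message.
function parseJudgeOutput(line: string): JudgeOutput {
  const data: unknown = JSON.parse(line)
  if (typeof data !== 'object' || data === null)
    throw new Error('judge output must be a JSON object')
  const { commands, display, verdict } = data as Record<string, unknown>
  if (typeof commands !== 'object' || commands === null)
    throw new Error('judge output is missing "commands"')
  if (typeof verdict !== 'string')
    throw new Error('judge output is missing "verdict"')
  return { commands: commands as Record<string, unknown>, display, verdict }
}
```

Validating the judge's output at this boundary means a malformed line surfaces as an explicit error rather than as a confusing failure in a later round.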
Bot Output Parsing
Bots can output either raw values or a JSON envelope:
# Simple mode
print(42)
# Debug mode: move + debug info in one message
import json
guess, lo, hi = 42, 1, 100
print(json.dumps({"move": guess, "debug": f"guessing {guess}, range [{lo},{hi}]"}))
The BotOutputParser handles both, extracting move and debug fields. stderr is also captured separately and surfaced in the match timeline.
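A parser for the two output modes might look roughly like this. It is a sketch, not the actual BotOutputParser: the function name `parseBotOutput` and the result type are hypothetical, and normalizing the move to a string is an assumption based on the string responses in the judge protocol example.

```typescript
interface ParsedBotOutput {
  move: string
  debug?: string
}

// Hypothetical parser: accepts either a raw value or a JSON envelope with a "move" field.
function parseBotOutput(stdout: string): ParsedBotOutput {
  const text = stdout.trim()
  try {
    const data: unknown = JSON.parse(text)
    if (typeof data === 'object' && data !== null && !Array.isArray(data) && 'move' in data) {
      const envelope = data as { move: unknown, debug?: unknown }
      // Assumption: moves are passed on as strings, so non-string moves are re-serialized.
      const result: ParsedBotOutput = {
        move: typeof envelope.move === 'string' ? envelope.move : JSON.stringify(envelope.move),
      }
      if (envelope.debug !== undefined)
        result.debug = String(envelope.debug)
      return result
    }
  }
  catch {
    // Not JSON: fall through and treat the output as a raw value.
  }
  return { move: text }
}

console.log(parseBotOutput('42\n')) // { move: '42' }
console.log(parseBotOutput('{"move": 42, "debug": "guessing 42"}')) // { move: '42', debug: 'guessing 42' }
```

Falling back to the raw text keeps simple bots working even when their output happens not to be valid JSON.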
LRU Compile Cache
Compilation is expensive. We cache compiled artifacts by (language, source_hash) with an LRU eviction policy:
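import { createHash } from 'node:crypto'
import { LRUCache } from 'lru-cache' // named export in lru-cache v10+
const sha256 = (text: string) => createHash('sha256').update(text).digest('hex')
class CompileCache {
  // Keep at most 50 artifacts; the least recently used entry is evicted first
  private cache = new LRUCache<string, CompiledArtifact>({ max: 50 })
  async getOrCompile(
    language: number,
    source: string,
    compiler: (language: number, source: string) => Promise<CompiledArtifact>,
  ): Promise<CompiledArtifact> {
    const key = `${language}:${sha256(source)}`
    const cached = this.cache.get(key)
    if (cached)
      return cached
    const artifact = await compiler(language, source)
    this.cache.set(key, artifact)
    return artifact
  }
}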
For restartable bots this means zero recompilation after the first round.
Dual Leaderboard & ELO
The platform has two leaderboards:
- Inner leaderboard (内榜): code-type bots only, elo column
- Outer leaderboard (外榜): all types (code + webhook + human), eloExternal column
ELO updates are pairwise, supporting N-player games:
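// For every unique pair (i, j) in the match.
// elos, scores and deltas are arrays indexed like gamerIds; deltas starts at all zeros.
for (let i = 0; i < gamerIds.length; i++) {
  for (let j = i + 1; j < gamerIds.length; j++) {
    const actualI = scores[i] > scores[j] ? 1 : scores[i] === scores[j] ? 0.5 : 0
    const expectedI = 1 / (1 + 10 ** ((elos[j] - elos[i]) / 400))
    const change = K * (actualI - expectedI)
    deltas[i] += change // with a shared K, j loses exactly what i gains
    deltas[j] -= change
  }
}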
This naturally extends to 3+ players without any algorithmic changes.
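To make the arithmetic concrete, here is a self-contained sketch of a pairwise update with a worked three-player example. The function name `pairwiseEloDeltas` and the value K = 32 are assumptions for illustration; the post does not state the K-factor the platform uses.

```typescript
// Pairwise ELO for an N-player match. K = 32 is an assumed value, not taken from the platform.
function pairwiseEloDeltas(elos: number[], scores: number[], K = 32): number[] {
  const deltas = elos.map(() => 0)
  for (let i = 0; i < elos.length; i++) {
    for (let j = i + 1; j < elos.length; j++) {
      // 1 for a win, 0.5 for a tie, 0 for a loss, from player i's point of view
      const actual = scores[i] > scores[j] ? 1 : scores[i] === scores[j] ? 0.5 : 0
      const expected = 1 / (1 + 10 ** ((elos[j] - elos[i]) / 400))
      const change = K * (actual - expected)
      deltas[i] += change
      deltas[j] -= change
    }
  }
  return deltas
}

// Three equally rated players who finish 1st, 2nd and 3rd by score.
console.log(pairwiseEloDeltas([1500, 1500, 1500], [10, 5, 1])) // [ 32, 0, -32 ]
```

With equal ratings every pair has an expected score of 0.5, so each pairwise win is worth K/2 = 16 points: the winner beats two opponents (+32), the middle player wins one and loses one (0), and the last player loses twice (-32).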
Webhook & Human Bot Types
Beyond code bots, Leverage supports:
Webhook bots — External services that respond to HTTP callbacks. Useful for LLM-powered bots:
class WebhookRunner {
  async runRound(bot: Bot, input: BotInput): Promise<BotOutput> {
    const response = await fetch(bot.webhookUrl, {
      method: 'POST',
      body: JSON.stringify(input),
      // Declare the JSON body so the receiving service parses it correctly
      headers: { 'Content-Type': 'application/json', 'X-Bot-Key': bot.apiKey },
    })
    return { response: await response.text() }
  }
}
Human bots — Real players competing via SSE real-time UI. The browser connects to GET /compete/matches/:id/human-sse?token=<jwt>, and the judge waits for human input via POST /compete/bot-respond.
Browser ──SSE──► [NestJS HumanTurnService]
                             │
                     awaits human move
                             │
                             ◄── POST /compete/bot-respond
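One way to implement the "await a human move" step is a registry of pending promises that the POST handler resolves. The sketch below is an assumption about how such a service could work, not the actual HumanTurnService; the class name `HumanTurnRegistry`, its methods, and the timeout behavior are all hypothetical.

```typescript
// Hypothetical in-memory registry of pending human turns, keyed by match and player.
class HumanTurnRegistry {
  private pending = new Map<string, (move: string) => void>()

  // Called by the match runner: resolves with the human's move, or rejects on timeout.
  waitForMove(matchId: string, player: number, timeoutMs: number): Promise<string> {
    const key = `${matchId}:${player}`
    return new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(key)
        reject(new Error('human move timed out'))
      }, timeoutMs)
      this.pending.set(key, (move) => {
        clearTimeout(timer)
        this.pending.delete(key)
        resolve(move)
      })
    })
  }

  // Called by the POST handler: returns false when nobody is waiting for this move.
  submitMove(matchId: string, player: number, move: string): boolean {
    const deliver = this.pending.get(`${matchId}:${player}`)
    if (!deliver)
      return false
    deliver(move)
    return true
  }
}
```

In this sketch the runner would call `waitForMove` when it is the human's turn, and the handler for POST /compete/bot-respond would call `submitMove`. Returning `false` for unexpected submissions lets the handler reject late or duplicate moves.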
Bot API Key System
External bots get a 7-day temporary API key on creation:
POST /compete/gamers → { id, botApiKey: "abc123...(48 hex)", botApiKeyExpiresAt }
Both GET /compete/bot-turn and POST /compete/bot-respond accept X-Bot-Key: <token>. The key is stored with select: false on the TypeORM entity — it’s only returned at creation.
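Key issuance along these lines can be sketched with Node's crypto module: 24 random bytes encode to exactly 48 hex characters. The helper names `issueBotApiKey` and `isKeyValid` are hypothetical; only the 48-hex format, the 7-day lifetime, and the two field names come from the response shown above.

```typescript
import { randomBytes } from 'node:crypto'

const KEY_TTL_MS = 7 * 24 * 60 * 60 * 1000 // 7 days

// Hypothetical helper: 24 random bytes encode to 48 hex characters.
function issueBotApiKey(now: Date = new Date()): { botApiKey: string, botApiKeyExpiresAt: Date } {
  return {
    botApiKey: randomBytes(24).toString('hex'),
    botApiKeyExpiresAt: new Date(now.getTime() + KEY_TTL_MS),
  }
}

// A key is usable only strictly before its expiry timestamp.
function isKeyValid(expiresAt: Date, now: Date = new Date()): boolean {
  return now.getTime() < expiresAt.getTime()
}

const issued = issueBotApiKey()
console.log(issued.botApiKey.length) // 48
```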
Fork-on-Edit
One key design decision: bots are immutable after creation. When a user edits a bot, the backend creates a new Gamer row with the updated code, and the frontend redirects to the new gamer page. The original bot keeps its ELO history intact.
This is important for leaderboard integrity — you can’t retroactively fix a bot that already won matches.
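The fork-on-edit rule can be illustrated with an in-memory stand-in for the Gamer table. Everything here is a sketch under assumptions: the `GamerStore` class, the `forkedFromId` lineage field, and passing the fork's starting rating as a parameter are hypothetical, since the post does not say what rating a fork starts with.

```typescript
interface Gamer {
  id: number
  code: string
  elo: number
  forkedFromId: number | null // hypothetical lineage field
}

// Hypothetical in-memory stand-in for the Gamer table.
class GamerStore {
  private rows = new Map<number, Gamer>()
  private nextId = 1

  create(code: string, initialElo: number, forkedFromId: number | null = null): Gamer {
    const row: Gamer = { id: this.nextId++, code, elo: initialElo, forkedFromId }
    this.rows.set(row.id, row)
    return row
  }

  get(id: number): Gamer | undefined {
    return this.rows.get(id)
  }

  // "Editing" never mutates the original row; it inserts a new row and returns it.
  edit(id: number, newCode: string, initialElo: number): Gamer {
    if (!this.rows.has(id))
      throw new Error(`gamer ${id} not found`)
    return this.create(newCode, initialElo, id)
  }
}
```

The caller would redirect to the id of the row returned by `edit`, while every match already recorded keeps pointing at the original, unchanged row.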
The Playground
The Playground is a browser-based IDE with five tabs:
- Bot Test (Bot 测试): Write code, pick an opponent, run a test match
- Judge Test (裁判测试): Write a custom judge, test it with two existing bots
- Judge + Bot Combo (裁判+Bot组合): Test judge + bot together
- Renderer (渲染器): Write an HTML renderer, test with sample game state
- Wiki/Tutorial (Wiki/教程): Interactive tutorial with inline code editors
The tutorial mode locks tabs and adds confetti when users reach key steps. Test results are tracked: if you edit code after testing, an ⚠️ "Outdated" (过时的) badge appears on the old result.
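Staleness tracking of this kind can be as simple as keeping a snapshot of the code that produced each result. This is a hypothetical sketch, not the Playground's actual implementation; the names `PlaygroundResult`, `recordResult`, and `isStale` are invented for illustration.

```typescript
interface PlaygroundResult {
  testedCode: string // snapshot of the editor content when the test ran
  verdict: string
}

function recordResult(code: string, verdict: string): PlaygroundResult {
  return { testedCode: code, verdict }
}

// A result is stale once the editor content no longer matches what was tested.
function isStale(result: PlaygroundResult, currentCode: string): boolean {
  return result.testedCode !== currentCode
}

const result = recordResult('print(42)', 'finish')
console.log(isStale(result, 'print(42)')) // false
console.log(isStale(result, 'print(43)')) // true
```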
Custom Judge Protocol
Supervisors can upload Python judge programs. The playground tests them live:
[Backend] POST /compete/games/:id/playground-judge
→ Bull job → botzone-neo → sandbox
→ Callback → match result with full round-by-round timeline
The timeline shows each round’s judge commands, bot responses, display data, and debug output — including stderr from bot processes.
Auto-Match Scheduler
The AutoMatchSchedulerService runs on a per-minute cron and dispatches matches for enabled games:
// @Cron(CronExpression.EVERY_MINUTE)
// Excerpt: the per-game state and avgEloDelta are computed elsewhere in the service.
async function tick() {
  const games = await gameRepo.find({ where: { autoMatchEnabled: true } })
  for (const game of games) {
    // Skip games whose next run is not due yet
    if (Date.now() < state.nextRunAt)
      continue
    const { created } = await competeService.triggerAutoMatch(game.id, 8)
    // Adaptive backoff: if ELO is stable, schedule less frequently
    if (avgEloDelta > 20)
      state.intervalMs = Math.max(60000, state.intervalMs / 2)
    else if (avgEloDelta < 5)
      state.intervalMs = Math.min(30 * 60000, state.intervalMs * 2)
  }
}
When the leaderboard is volatile (big ELO changes), matches run every minute. When ratings stabilize, the scheduler backs off to 30-minute intervals.
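The backoff rule can be isolated as a pure function, which makes its floor and ceiling easy to check. This sketch reuses the thresholds (20 and 5) and bounds (1 minute, 30 minutes) from the scheduler excerpt; the function name `nextIntervalMs` and the constant names are hypothetical.

```typescript
const MIN_INTERVAL_MS = 60_000 // 1 minute
const MAX_INTERVAL_MS = 30 * 60_000 // 30 minutes

// Halve the interval while ratings are moving, double it once they settle, else keep it.
function nextIntervalMs(currentMs: number, avgEloDelta: number): number {
  if (avgEloDelta > 20)
    return Math.max(MIN_INTERVAL_MS, currentMs / 2)
  if (avgEloDelta < 5)
    return Math.min(MAX_INTERVAL_MS, currentMs * 2)
  return currentMs
}

console.log(nextIntervalMs(240_000, 25)) // 120000: volatile, so the interval halves
console.log(nextIntervalMs(60_000, 25)) // 60000: already at the 1-minute floor
console.log(nextIntervalMs(1_200_000, 3)) // 1800000: doubling is capped at 30 minutes
console.log(nextIntervalMs(600_000, 10)) // 600000: between thresholds, unchanged
```

Starting from the 1-minute floor, five consecutive stable ticks (2, 4, 8, 16, then a capped 30 minutes) are enough to reach the ceiling.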
Multi-Player Support
The pipeline supports N-player games via combinatorial match generation. When triggerAutoMatch fires for an N-player game, it samples bots and generates C(n, k) combinations:
function combinations<T>(arr: T[], k: number): T[][] {
  if (k === 1)
    return arr.map(x => [x])
  return arr.flatMap((x, i) =>
    combinations(arr.slice(i + 1), k - 1).map(rest => [x, ...rest])
  )
}
The ELO update for multi-player games ranks players by score, then applies pairwise ELO adjustments. Matches are capped at 20 per scheduler tick to avoid bursts.
The judge protocol is already N-player at the transport level — commands and responses are dicts keyed by player index "0", "1", …, "N-1". Judges receive null for inactive players (e.g., a turn-based game where only one player acts per round) — botzone-neo skips null-command bots and doesn’t invoke them that round.
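Skipping inactive players comes down to filtering the command dict before any bot is invoked. The following is a sketch, not botzone-neo's code: `activePlayers`, `runRound`, and the `invokeBot` callback are hypothetical names, and invoking bots sequentially is an assumption made to keep the example short.

```typescript
type Commands = Record<string, unknown> // a null value marks an inactive player

// Return the player indices that should be invoked this round, in numeric order.
function activePlayers(commands: Commands): number[] {
  return Object.entries(commands)
    .filter(([, command]) => command !== null)
    .map(([index]) => Number(index))
    .sort((a, b) => a - b)
}

// Run only the active bots and collect their responses keyed by player index.
async function runRound(
  commands: Commands,
  invokeBot: (player: number, command: unknown) => Promise<string>,
): Promise<Record<string, string>> {
  const responses: Record<string, string> = {}
  for (const player of activePlayers(commands))
    responses[String(player)] = await invokeBot(player, commands[String(player)])
  return responses
}

console.log(activePlayers({ 0: { ask: 'move' }, 1: null, 2: { ask: 'move' } })) // [ 0, 2 ]
```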
MCP Server — AI Game Design
The platform ships with a 13-tool MCP server that exposes Leverage as AI-callable tools. Any MCP-compatible client (Claude Desktop, OpenClaw, Codex) can autonomously design, test, and deploy competitive games.
LEVERAGE_TOKEN=<jwt> pnpm run mcp
Available tools
| Tool | What it does |
|---|---|
| list_games | Browse existing games |
| test_judge | Run a judge + bots, get full round-by-round results |
| test_bot | Test a bot against existing opponents |
| submit_judge | Upload a judge program to a game |
| submit_bot | Register a new bot on the leaderboard |
| submit_renderer | Upload an HTML renderer |
| get_judge | Fetch current judge source |
| list_gamers | List bots for a game |
| get_leaderboard | ELO rankings |
| get_match_result | Full round-by-round match data |
| list_matches | Find matches by gameId/gamerId/status |
| get_gamer | Read a bot’s source code |
| analyze_match | Pre-process match into debugHighlights for fast AI debugging |
AI-driven workflow
The AI reads GET /ai (a public plaintext endpoint with the full platform protocol) and starts designing:
- list_games(): browse existing games for context
- Write judge code based on the /ai protocol docs
- test_judge(gameId, judgerCode, bot0Code, bot1Code): run a test match
- Inspect rounds[]: did it finish? Are scores correct?
- analyze_match(matchId): get debugHighlights for fast debugging
- Iterate until verdict=finish, then submit_judge + submit_bot
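The test-and-iterate part of this workflow can be sketched as a bounded loop. This is illustrative only: the `LeverageTools` interface, the shape of the test result, the `submit_judge` signature, and the `revise` callback (standing in for the AI rewriting the judge) are assumptions; only the tool names, the `test_judge` arguments, and the `finish` verdict come from the post.

```typescript
// Hypothetical client-side view of two MCP tools; result and argument shapes are assumed.
interface JudgeTestResult { matchId: number, verdict: string, rounds: unknown[] }

interface LeverageTools {
  test_judge: (gameId: number, judgerCode: string, bot0Code: string, bot1Code: string) => Promise<JudgeTestResult>
  submit_judge: (gameId: number, judgerCode: string) => Promise<void>
}

// Test the judge, revise it on failure, and submit it once a test match finishes cleanly.
async function iterateJudge(
  tools: LeverageTools,
  gameId: number,
  bots: [string, string],
  initialJudge: string,
  revise: (judge: string, result: JudgeTestResult) => Promise<string>, // e.g. an LLM call
  maxAttempts = 5,
): Promise<boolean> {
  let judge = initialJudge
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await tools.test_judge(gameId, judge, bots[0], bots[1])
    if (result.verdict === 'finish') {
      await tools.submit_judge(gameId, judge)
      return true
    }
    judge = await revise(judge, result)
  }
  return false // give up without submitting anything
}
```

Capping the number of attempts keeps an agent from looping forever on a judge it cannot fix, and nothing is submitted unless a test match actually finishes.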
We used this pipeline with Codex to generate 4 complete games: Prisoner’s Dilemma (囚徒困境), Blackjack (廿一点), Dice Game (骰子游戏), and Number Auction (数字拍卖). Each has a Python judge, 4 bots (Python + JS), an HTML renderer, and was verified end-to-end.
What’s Next
Production deployment — Nginx reverse proxy, SSL, production env vars, domain setup.
shimmy upstream PR — The sandbox improvements in our fork of lambda-feedback/shimmy need to be submitted as a PR.
Sandlock Phase 2 — Linux cgroups memory enforcement in botzone-neo’s SandlockBackend (currently only time limits are enforced).
Pipeline Extensions — The evaluation pipeline is general enough to support:
- RL training environments (judge = step function, match = episode)
- LLM capability benchmarks (bots are LLM API calls)
- Mechanism design research (judge = market rules, bots = bidding strategies)
- Automated CS course grading (problems as OJ testcases + SPJ)
Numbers
- Backend: 742 unit tests, TypeORM + MariaDB
- botzone-neo: DDD architecture, 3-backend sandbox, LRU compile cache
- shimmy sandbox: 119 tests, 92.6% coverage
- Frontend: 52 pages, Nuxt 4 + Naive UI, SPA mode
- Sample games: 4 AI-generated games (Prisoner’s Dilemma, Blackjack, Dice Game, Number Auction), each with Python judge + 4 bots + HTML renderer
- MCP server: 13 tools, full AI-to-platform pipeline
- Rewrite time: ~3 weeks of parallel agent swarm development