Yuzhe's Blog

yuzhes

Leverage: Building a Production-Grade AI Bot Competition Platform

Posted at # System Design # AI # Game Theory

I’ve spent the last several weeks rewriting Leverage — an online judge and bot competition platform — from the ground up. What started as “just a rewrite” turned into a comprehensive system with sandboxed execution, a dual leaderboard, real-time human-vs-bot matches, and an AI-designed game pipeline. Here’s the technical story.

What Leverage Does

Leverage is two things at once:

  1. An Online Judge (OJ) — students submit code, it runs against test cases, gets a verdict
  2. A Bot Competition Platform — bots play strategic games against each other, ELO ratings evolve, humans can join matches in real time

The original system was a Vue 2 frontend + PHP backend with a Django judge server. After 18 months of accumulated technical debt, we did a full rewrite: NestJS backend, Nuxt 4 frontend, and a brand-new judge engine called botzone-neo.

Architecture Overview

┌────────────────┐    REST/SSE    ┌─────────────────┐
│  Nuxt 4 SPA    │◄──────────────►│  NestJS Backend  │
│  (52 pages)    │                │  (742 tests)     │
└────────────────┘                └────────┬─────────┘
                                           │ Bull queue
                                  ┌────────▼─────────┐
                                  │  botzone-neo      │
                                  │  (400+ tests)     │
                                  └────────┬─────────┘

                              ┌────────────▼───────────┐
                              │  shimmy sandbox         │
                              │  (Direct/Sandlock/WASM) │
                              └────────────────────────┘

The backend never talks directly to the sandbox — everything goes through botzone-neo, which handles compilation, multi-round game orchestration, and result callbacks.

The Judge Engine: botzone-neo

botzone-neo is where most of the interesting engineering lives. It implements a clean DDD architecture:

domain/       — Match aggregate, Bot entity, Verdict types
application/  — RunMatchUseCase, RunOjUseCase
infrastructure/ — Sandbox backends, Compile cache, Callback service
strategies/   — Restart, Longrun, Webhook, UserJudge strategies

Multi-Strategy Sandboxing

Bots run in sandboxes via a pluggable ISandbox interface:

interface ISandbox {
  compile: (language: number, source: string) => Promise<CompiledArtifact>
  run: (artifact: CompiledArtifact, input: string, limits: ResourceLimits) => Promise<RunResult>
}

Three backends: DirectBackend (subprocess, dev only), SandlockBackend (Linux cgroups), WasmBackend (Python in browser-compatible WASM). The strategy pattern means we can swap backends per deployment.
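As a rough illustration of swapping backends per deployment, here is a minimal sketch that resolves the backend from a config value. The names SandboxKind and resolveSandboxKind, the string setting, and the default of Sandlock in production are all assumptions for the example, not botzone-neo's actual configuration.

```typescript
// Hypothetical names throughout; only the three backend kinds come from the post.
type SandboxKind = 'direct' | 'sandlock' | 'wasm'

const KNOWN_KINDS: readonly string[] = ['direct', 'sandlock', 'wasm']

// Resolve the backend for this deployment from a config value.
// DirectBackend is dev-only, so it is refused when `production` is true.
// Assumed defaults: 'sandlock' in production, 'direct' otherwise.
function resolveSandboxKind(setting: string | undefined, production: boolean): SandboxKind {
  const kind = (setting ?? (production ? 'sandlock' : 'direct')).toLowerCase()
  if (!KNOWN_KINDS.includes(kind))
    throw new Error(`unknown sandbox backend: ${kind}`)
  if (production && kind === 'direct')
    throw new Error('DirectBackend is for development only')
  return kind as SandboxKind
}
```

Failing fast at startup means a production deployment cannot silently fall back to the unsandboxed subprocess backend.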

The Judge Protocol

Games use a long-running judge process. Each round:

[judge stdin] ← {"round": 3, "responses": {"0": "47", "1": "62"}}
[judge stdout] → {"commands": {"0": {...}, "1": {...}}, "display": {...}, "verdict": "continue"}

The judge is sandboxed too — it can crash without affecting the match record. A UserJudgeStrategy compiles and manages the judge lifecycle, passing round data via stdin/stdout pipes.
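The two sample messages can be captured as types plus a small validator. This is a sketch inferred only from the samples above: the field types, the function names, and the choice to reject malformed output outright are assumptions.

```typescript
// Shapes inferred from the sample messages; field types are assumptions.
interface JudgeInput {
  round: number
  responses: Record<string, string>
}

interface JudgeOutput {
  commands: Record<string, unknown>
  display: unknown
  verdict: string
}

// Build the stdin line for the next round from the bots' responses.
function buildJudgeInput(round: number, responses: Record<string, string>): string {
  const input: JudgeInput = { round, responses }
  return JSON.stringify(input)
}

// Parse one line of judge stdout, rejecting anything that is not a JSON
// object with a `commands` object and a string `verdict`.
function parseJudgeOutput(line: string): JudgeOutput {
  const parsed: unknown = JSON.parse(line)
  if (typeof parsed !== 'object' || parsed === null)
    throw new Error('judge output must be a JSON object')
  const { commands, display, verdict } = parsed as Record<string, unknown>
  if (typeof commands !== 'object' || commands === null)
    throw new Error('judge output is missing "commands"')
  if (typeof verdict !== 'string')
    throw new Error('judge output is missing "verdict"')
  return { commands: commands as Record<string, unknown>, display, verdict }
}
```

Validating at the pipe boundary lets a crashing or misbehaving judge surface as a clear error instead of corrupting the match record.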

Bot Output Parsing

Bots can output either raw values or a JSON envelope:

# Simple mode
print(42)

# Debug mode: move + debug info in one message
import json
guess, lo, hi = 42, 0, 100  # example values
print(json.dumps({"move": guess, "debug": f"guessing {guess}, range [{lo},{hi}]"}))

The BotOutputParser handles both, extracting move and debug fields. stderr is also captured separately and surfaced in the match timeline.
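A parser that accepts both modes could look roughly like this. It is a sketch, not the real BotOutputParser: normalizing the move to a string and falling back to the raw text on invalid JSON are assumptions.

```typescript
interface ParsedBotOutput {
  move: string
  debug?: string
}

// Accept either a raw value ("42") or a JSON envelope ({"move": 42, "debug": "..."}).
// Assumption: moves are normalized to strings before they reach the judge.
function parseBotOutput(stdout: string): ParsedBotOutput {
  const text = stdout.trim()
  if (text.startsWith('{')) {
    try {
      const parsed: unknown = JSON.parse(text)
      if (typeof parsed === 'object' && parsed !== null && 'move' in parsed) {
        const envelope = parsed as { move: unknown, debug?: unknown }
        const move = typeof envelope.move === 'string'
          ? envelope.move
          : JSON.stringify(envelope.move)
        return envelope.debug === undefined
          ? { move }
          : { move, debug: String(envelope.debug) }
      }
    }
    catch {
      // Not valid JSON: fall through and treat the text as a raw move.
    }
  }
  return { move: text }
}
```

For example, parseBotOutput('42\n') yields a move of "42" with no debug field, while the debug-mode envelope yields both fields.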

LRU Compile Cache

Compilation is expensive. We cache compiled artifacts by (language, source_hash) with an LRU eviction policy:

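import { createHash } from 'node:crypto'
import { LRUCache } from 'lru-cache'

type Compiler = (language: number, source: string) => Promise<CompiledArtifact>

class CompileCache {
  private cache = new LRUCache<string, CompiledArtifact>({ max: 50 })

  async getOrCompile(language: number, source: string, compiler: Compiler): Promise<CompiledArtifact> {
    // Key on language + content hash, so identical source is compiled only once
    const key = `${language}:${createHash('sha256').update(source).digest('hex')}`
    const cached = this.cache.get(key)
    if (cached)
      return cached
    const artifact = await compiler(language, source)
    this.cache.set(key, artifact)
    return artifact
  }
}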

For restartable bots this means zero recompilation after the first round.

Dual Leaderboard & ELO

The platform has two leaderboards:

ELO updates are pairwise, supporting N-player games:

// For every unique pair (i, j) in the match
for (let i = 0; i < gamerIds.length; i++) {
  for (let j = i + 1; j < gamerIds.length; j++) {
    const actualA = scoreA > scoreB ? 1 : scoreA === scoreB ? 0.5 : 0
    const expected = 1 / (1 + 10 ** ((eloB - eloA) / 400))
    deltaA += K * (actualA - expected)
  }
}

This naturally extends to 3+ players without any algorithmic changes.
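To make the pairwise update concrete, here is a self-contained sketch that returns one delta per player. The K-factor of 32 and the function name are assumptions, and mirroring each change onto the second player of the pair is the standard zero-sum Elo formulation rather than something the excerpt above shows.

```typescript
// K = 32 is an assumed constant; the post does not state the K-factor.
// All expectations use pre-match ratings, so pair order does not matter.
function pairwiseEloDeltas(elos: number[], scores: number[], k = 32): number[] {
  const deltas = elos.map(() => 0)
  for (let i = 0; i < elos.length; i++) {
    for (let j = i + 1; j < elos.length; j++) {
      // Result for player i against player j: win = 1, draw = 0.5, loss = 0
      const actual = scores[i] > scores[j] ? 1 : scores[i] === scores[j] ? 0.5 : 0
      // Expected score for player i given the rating gap
      const expected = 1 / (1 + 10 ** ((elos[j] - elos[i]) / 400))
      const change = k * (actual - expected)
      deltas[i] += change
      deltas[j] -= change // mirrored update keeps the total rating pool constant
    }
  }
  return deltas
}

// Three equally rated players finishing 1st, 2nd, 3rd:
// the winner beats two opponents (+16 each), the middle player nets zero.
console.log(pairwiseEloDeltas([1500, 1500, 1500], [3, 2, 1])) // [ 32, 0, -32 ]
```

Because every pair contributes equal and opposite changes, the deltas always sum to zero regardless of player count.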

Webhook & Human Bot Types

Beyond code bots, Leverage supports:

Webhook bots — External services that respond to HTTP callbacks. Useful for LLM-powered bots:

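class WebhookRunner {
  async runRound(bot: Bot, input: BotInput): Promise<BotOutput> {
    const response = await fetch(bot.webhookUrl, {
      method: 'POST',
      body: JSON.stringify(input),
      // Declare the JSON body; a string body is otherwise sent as text/plain
      headers: { 'Content-Type': 'application/json', 'X-Bot-Key': bot.apiKey }
    })
    // Treat a non-2xx reply as a failed round instead of reading an error page as a move
    if (!response.ok)
      throw new Error(`webhook returned HTTP ${response.status}`)
    return { response: await response.text() }
  }
}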

Human bots — Real players competing via SSE real-time UI. The browser connects to GET /compete/matches/:id/human-sse?token=<jwt>, and the judge waits for human input via POST /compete/bot-respond.

Browser ──SSE──► [NestJS HumanTurnService]

                   awaits human move

             ◄── POST /compete/bot-respond
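The "awaits human move" step can be modeled as a promise that the POST handler resolves. This is a hypothetical sketch: the post does not show HumanTurnService internals, so the class name, the per-match keying, and the timeout behavior are all assumptions.

```typescript
// Hypothetical sketch; the real HumanTurnService internals are not shown in the post.
class PendingHumanTurns {
  private pending = new Map<string, (move: string) => void>()

  // Called by the judge loop: resolves when the human posts a move,
  // rejects if nothing arrives within `timeoutMs`.
  waitForMove(matchId: string, timeoutMs: number): Promise<string> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(matchId)
        reject(new Error(`no move for match ${matchId} within ${timeoutMs} ms`))
      }, timeoutMs)
      this.pending.set(matchId, (move) => {
        clearTimeout(timer)
        this.pending.delete(matchId)
        resolve(move)
      })
    })
  }

  // Called by the POST /compete/bot-respond handler.
  // Returns false when no turn is pending for that match.
  submitMove(matchId: string, move: string): boolean {
    const deliver = this.pending.get(matchId)
    if (!deliver)
      return false
    deliver(move)
    return true
  }
}
```

The timeout matters for human players in particular: without it, a closed browser tab would leave the match waiting forever.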

Bot API Key System

External bots get a 7-day temporary API key on creation:

POST /compete/gamers → { id, botApiKey: "abc123...(48 hex)", botApiKeyExpiresAt }

Both GET /compete/bot-turn and POST /compete/bot-respond accept X-Bot-Key: <token>. The key is stored with select: false on the TypeORM entity — it’s only returned at creation.
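A 48-character hex key corresponds to 24 random bytes. Here is a minimal sketch of issuing one with a 7-day expiry using Node's crypto module; the function names are illustrative, and the post does not say how the real keys are generated.

```typescript
import { randomBytes } from 'node:crypto'

const KEY_TTL_MS = 7 * 24 * 60 * 60 * 1000 // 7 days

// Illustrative helper: field names match the response shown above,
// but the generation method is an assumption.
function issueBotApiKey(now: Date = new Date()): { botApiKey: string, botApiKeyExpiresAt: Date } {
  return {
    botApiKey: randomBytes(24).toString('hex'), // 24 bytes encode to 48 hex characters
    botApiKeyExpiresAt: new Date(now.getTime() + KEY_TTL_MS),
  }
}

// A key is rejected from the moment its expiry timestamp is reached.
function isKeyExpired(expiresAt: Date, now: Date = new Date()): boolean {
  return now.getTime() >= expiresAt.getTime()
}
```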

Fork-on-Edit

One key design decision: bots are immutable after creation. When a user edits a bot, the backend creates a new Gamer row with the updated code, and the frontend redirects to the new gamer page. The original bot keeps its ELO history intact.

This is important for leaderboard integrity — you can’t retroactively fix a bot that already won matches.
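In code, fork-on-edit amounts to never writing to an existing row. The sketch below uses made-up field names (forkedFromId in particular) and an assumed starting rating of 1500; the post only states that an edit creates a new Gamer row.

```typescript
// All field names are hypothetical; the post only says an edit creates a new Gamer row.
interface Gamer {
  readonly id: number
  readonly code: string
  readonly elo: number
  readonly forkedFromId: number | null
}

const INITIAL_ELO = 1500 // assumed starting rating, not stated in the post

// Editing never mutates the original: it returns a new record that points
// back at its parent, and the original keeps its rating and match history.
function forkOnEdit(original: Gamer, newCode: string, newId: number): Gamer {
  return { id: newId, code: newCode, elo: INITIAL_ELO, forkedFromId: original.id }
}
```

Marking the fields readonly makes an accidental in-place edit a compile-time error rather than a silent leaderboard change.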

The Playground

The Playground is a browser-based IDE with five tabs:

  1. Bot Test (Bot 测试): Write code, pick an opponent, run a test match
  2. Judge Test (裁判测试): Write a custom judge, test it with two existing bots
  3. Judge + Bot Combo (裁判+Bot组合): Test judge + bot together
  4. Renderer (渲染器): Write an HTML renderer, test with sample game state
  5. Wiki/Tutorial (Wiki/教程): Interactive tutorial with inline code editors

The tutorial mode locks tabs and fires confetti when users reach key steps. Test results are tracked: if you edit code after testing, an ⚠️ Outdated (过时的) badge appears on the old result.

Custom Judge Protocol

Supervisors can upload Python judge programs. The playground tests them live:

[Backend] POST /compete/games/:id/playground-judge
  → Bull job → botzone-neo → sandbox
  → Callback → match result with full round-by-round timeline

The timeline shows each round’s judge commands, bot responses, display data, and debug output — including stderr from bot processes.

Auto-Match Scheduler

The AutoMatchSchedulerService runs on a per-minute cron and dispatches matches for enabled games:

// @Cron(CronExpression.EVERY_MINUTE)
async function tick() {
  const games = await gameRepo.find({ where: { autoMatchEnabled: true } })
  for (const game of games) {
    if (Date.now() < state.nextRunAt)
      continue
    const { created } = await competeService.triggerAutoMatch(game.id, 8)

    // Adaptive backoff: if ELO is stable, schedule less frequently
    if (avgEloDelta > 20)
      state.intervalMs = Math.max(60000, state.intervalMs / 2)
    else if (avgEloDelta < 5)
      state.intervalMs = Math.min(30 * 60000, state.intervalMs * 2)
  }
}

When the leaderboard is volatile (average ELO change above 20), the interval halves on each run, down to a one-minute floor. When ratings stabilize (average change below 5), it doubles, up to a 30-minute cap.
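The backoff rule is easy to isolate as a pure function, which also makes it simple to test. The thresholds and bounds below are taken from the scheduler excerpt; the function and constant names are illustrative.

```typescript
const MIN_INTERVAL_MS = 60000       // one-minute floor
const MAX_INTERVAL_MS = 30 * 60000  // 30-minute cap

// Halve the interval while ratings are moving a lot, double it while they are
// stable, and leave it unchanged in between.
function nextIntervalMs(currentMs: number, avgEloDelta: number): number {
  if (avgEloDelta > 20)
    return Math.max(MIN_INTERVAL_MS, currentMs / 2)
  if (avgEloDelta < 5)
    return Math.min(MAX_INTERVAL_MS, currentMs * 2)
  return currentMs
}
```

Starting from one minute, five consecutive stable runs are enough to reach the cap: 2, 4, 8, 16, then 30 minutes (the last doubling to 32 is clamped).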

Multi-Player Support

The pipeline supports N-player games via combinatorial match generation. When triggerAutoMatch fires for an N-player game, it samples bots and generates C(n, k) combinations:

function combinations<T>(arr: T[], k: number): T[][] {
  if (k === 1)
    return arr.map(x => [x])
  return arr.flatMap((x, i) =>
    combinations(arr.slice(i + 1), k - 1).map(rest => [x, ...rest])
  )
}

The ELO update for multi-player games ranks players by score, then applies pairwise ELO adjustments. Matches are capped at 20 per scheduler tick to avoid bursts.

The judge protocol is already N-player at the transport level — commands and responses are dicts keyed by player index "0", "1", …, "N-1". Judges receive null for inactive players (e.g., a turn-based game where only one player acts per round) — botzone-neo skips null-command bots and doesn’t invoke them that round.
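Skipping inactive players comes down to filtering the commands dict before invoking any bot. A minimal sketch, assuming a null (or missing) command means "do not run this bot this round"; the function name is illustrative.

```typescript
// Return the player indices that should be invoked this round, in numeric order.
// Assumption: a null or undefined command marks the player as inactive.
function activePlayers(commands: Record<string, unknown>): string[] {
  return Object.keys(commands)
    .filter(player => commands[player] !== null && commands[player] !== undefined)
    .sort((a, b) => Number(a) - Number(b))
}

// Turn-based round in a 3-player game where only player "1" acts.
console.log(activePlayers({ '0': null, '1': { prompt: 'your turn' }, '2': null })) // [ '1' ]
```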

MCP Server — AI Game Design

The platform ships with a 13-tool MCP server that exposes Leverage as AI-callable tools. Any MCP-compatible client (Claude Desktop, OpenClaw, Codex) can autonomously design, test, and deploy competitive games.

LEVERAGE_TOKEN=<jwt> pnpm run mcp

Available tools

Tool              What it does
list_games        Browse existing games
test_judge        Run a judge + bots, get full round-by-round results
test_bot          Test a bot against existing opponents
submit_judge      Upload a judge program to a game
submit_bot        Register a new bot on the leaderboard
submit_renderer   Upload an HTML renderer
get_judge         Fetch current judge source
list_gamers       List bots for a game
get_leaderboard   ELO rankings
get_match_result  Full round-by-round match data
list_matches      Find matches by gameId/gamerId/status
get_gamer         Read a bot’s source code
analyze_match     Pre-process match into debugHighlights for fast AI debugging

AI-driven workflow

The AI reads GET /ai (a public plaintext endpoint with the full platform protocol) and starts designing:

  1. list_games() — browse existing games for context
  2. Write judge code based on /ai protocol docs
  3. test_judge(gameId, judgerCode, bot0Code, bot1Code) — run a test match
  4. Inspect rounds[] — did it finish? Are scores correct?
  5. analyze_match(matchId) — get debugHighlights for fast debugging
  6. Iterate until verdict=finish, then submit_judge + submit_bot

We used this pipeline with Codex to generate 4 complete games: Prisoner's Dilemma (囚徒困境), Blackjack (廿一点), Dice Game (骰子游戏), and Number Auction (数字拍卖). Each ships with a Python judge, 4 bots (Python + JS), and an HTML renderer, and each was verified end to end.

What’s Next

Production deployment — Nginx reverse proxy, SSL, production env vars, domain setup.

shimmy upstream PR — The sandbox improvements in our fork of lambda-feedback/shimmy need to be submitted as a PR.

Sandlock Phase 2 — Linux cgroups memory enforcement in botzone-neo’s SandlockBackend (currently only time limits are enforced).

Pipeline Extensions — The evaluation pipeline is general enough to support:

Numbers