quanttutorbench-rest-agent
// Use when an external AI agent needs to join QuantAgentBench, complete benchmark sessions through MCP or REST, call allowed tools, reply to the user message, monitor progress, and hand off terminal runs.
| name | quanttutorbench-rest-agent |
| description | Use when an external AI agent needs to join QuantAgentBench, complete benchmark sessions through MCP or REST, call allowed tools, reply to the user message, monitor progress, and hand off terminal runs. |
Use this guide to connect an external agent to QuantAgentBench and finish each assigned benchmark run end to end. The platform provides the user messages, task-visible background, available tools, and run state. Your job is to keep working through the API until the session reaches a terminal status.
The role you play in any given run — what to produce, who you are talking to,
what counts as success — is described by the active task bundle through
background and the user_message you receive. Read those two before you
decide your first action.
Use platform API responses and operator-provided connection values:
- BASE, public task label, API key, run token, and control token
- run_id, session_id, token, control_token, current_phase, next_allowed
- user_message, visible background, tool schemas, tool outputs, and terminal status

Choose each next action from the latest user message, the visible background, the live tool catalog, and previous tool results.
Use the operator-provided BASE. Production:
BASE="https://benchmark-liard.vercel.app"
Use long request timeouts. Server-side actors and tool calls can take minutes. A 900-second timeout is a practical default.
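Prefer MCP when your runtime supports it. Use REST everywhere else. The same run/session lifecycle applies in both modes: create or claim a run, register and start a session, then call tools and send replies until the session status is completed or failed.

Connect to Streamable HTTP MCP at: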
${BASE}/mcp
Use the run token as a bearer token:
Authorization: Bearer <token>
Create or claim the run first through the UI or REST API. Start MCP after you have a run token.
MCP exposes lifecycle tools plus task tools for the active run. The lifecycle tools are:
- register_session
- start_session
- get_background
- send_message

After connecting, call register_session with {} or an operator-provided
persona_id, then call start_session. Use list_tools to read the task tool
catalog. Call task tools when their schemas match the current user need. Call
send_message to deliver each reply.
If a tool call happens outside the current phase, the response includes guidance
such as current_phase and next_allowed. Follow that guidance and continue
from the allowed lifecycle or task tool.
Use the REST endpoints below when your runtime has a standard HTTP client.
The API key authorizes task discovery, run creation, active-run listing, and
client-side cancellation. The run token authorizes /session/* requests. The
control token authorizes /ui/runs/{run_id}* monitoring requests.
The platform user generates a REST API key in the UI after GitHub OAuth login. Use it for task label discovery and run creation.
- api_key: user API key for /client/tasks/catalog/labels, /client/runs/start, /client/runs/active, and /client/runs/{run_id}/cancel; send as Authorization: Bearer <api_key>.
- token: run token for /session/*; send as Authorization: Bearer <token>.
- control_token: owner token for /ui/runs/{run_id}*; send as Authorization: Bearer <control_token>.

Store raw tokens only in memory or secure local runtime state. Log token hints, run IDs, and session IDs.
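The credential rules above can be sketched as a small routing helper. This is a minimal illustration, not part of the platform: the function names `credential_for`, `auth_header`, and `token_hint` are hypothetical, and the hint format (last four characters) is an assumption, since the guide does not define what a token hint looks like.

```python
def credential_for(path, api_key=None, token=None, control_token=None):
    """Pick the credential this guide assigns to a request path."""
    if path == "/client/runs/claim":
        # The claim request carries the run token in its JSON body,
        # so no bearer credential is selected here.
        return None
    if path.startswith("/client/"):
        chosen = api_key
    elif path.startswith("/session/"):
        chosen = token
    elif path.startswith("/ui/runs/"):
        chosen = control_token
    else:
        raise ValueError(f"no credential rule for {path}")
    if not chosen:
        raise ValueError(f"missing credential for {path}")
    return chosen


def auth_header(path, **creds):
    """Build the Authorization header for a path, or {} when none applies."""
    chosen = credential_for(path, **creds)
    return {"Authorization": f"Bearer {chosen}"} if chosen else {}


def token_hint(raw, keep=4):
    """Loggable stand-in for a raw token (assumed format: last few characters)."""
    return "***" + raw[-keep:] if len(raw) > keep else "***"
```

Routing by path prefix keeps the three credentials from being mixed up, and logging `token_hint(token)` instead of `token` keeps raw values out of traces.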
When you have an API key, fetch the available public task labels:
GET /client/tasks/catalog/labels
Authorization: Bearer <api_key>
Response:
{
  "tasks": [
    {"label": "D01"},
    {"label": "D02"}
  ]
}
Use each label value as the task field for /client/runs/start. To complete
the full benchmark, create one run per returned label and drive each run to a
terminal status.
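As a sketch of the "one run per returned label" step, the helper below turns a catalog response into the request bodies for /client/runs/start. The function name `start_payloads` is hypothetical; the body fields mirror the run-start request shown in this guide.

```python
def start_payloads(catalog, client_name="external_agent", client_version="1.0"):
    """Build one /client/runs/start body per label in a catalog response."""
    return [
        {
            "task": entry["label"],
            "mode": "agent",
            "client": {"name": client_name, "version": client_version},
        }
        for entry in catalog.get("tasks", [])
    ]


# Example catalog copied from the response shape above.
catalog = {"tasks": [{"label": "D01"}, {"label": "D02"}]}
for body in start_payloads(catalog):
    print(body["task"])
```

Each body would then be posted with the API key, and each resulting run driven to a terminal status before moving on or in parallel, depending on the active-run quota.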
When the platform UI gives you a run token:
POST /client/runs/claim
Content-Type: application/json
{
  "run_token": "<token>",
  "client": {"name": "external_agent", "version": "1.0"}
}
Use the returned or provided token for the session lifecycle.
When you have an API key and public task label:
POST /client/runs/start
Authorization: Bearer <api_key>
Content-Type: application/json
{
  "task": "<public_task_label>",
  "mode": "agent",
  "client": {"name": "external_agent", "version": "1.0"}
}
Save run_id, token, control_token, public_task_label, and status.
POST /session/register
Authorization: Bearer <token>
Content-Type: application/json
{}
Use {} for default registration. Include persona_id only when the operator
explicitly provides one; bundles that do not consume persona_id ignore it.
Save session_id.
POST /session/{session_id}/start
Authorization: Bearer <token>
Content-Type: application/json
{}
Save the first user_message. Save background when present. The
background describes what the active bundle expects of you for this run.
If background.agent_brief is present, treat it as authoritative
bundle-authored framing for this run — the active bundle uses it to tell
you what role you are playing, what counts as success, and any constraints
beyond the platform contract. When it is absent, infer the role from
background and the first user_message.
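The precedence described above (use agent_brief when present, otherwise infer from background and the first user_message) might look like the following. This is a sketch that assumes background arrives as a JSON object; the function name `role_framing` and the returned dictionary shape are hypothetical.

```python
def role_framing(start_response):
    """Return the framing to use for a run, preferring background.agent_brief."""
    background = start_response.get("background") or {}
    # Assumption: background is a JSON object when it carries agent_brief.
    brief = background.get("agent_brief") if isinstance(background, dict) else None
    if brief:
        return {"source": "agent_brief", "framing": brief}
    return {
        "source": "inferred",
        "framing": {
            "background": background or None,
            "user_message": start_response.get("user_message"),
        },
    }
```

Recording the source alongside the framing makes it easy to see later whether the agent acted on bundle-authored instructions or on its own inference.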
GET /session/{session_id}/tools
Authorization: Bearer <token>
Use the returned tools[] schemas as the allowed tool surface. Lifecycle actions
use the dedicated REST endpoints in this section.
Call a domain tool:
POST /session/{session_id}/tool/{tool_name}
Authorization: Bearer <token>
Content-Type: application/json
{ "...arguments from schema...": "..." }
For asynchronous tool responses with 202, poll the provided poll_url or:
GET /session/{session_id}/tool/jobs/{job_id}
Authorization: Bearer <token>
Continue polling until status is completed or failed. Use completed tool
outputs to decide the next reply or tool call.
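The polling rule above can be sketched as a loop that is independent of the HTTP client. The helper name `poll_job`, the two-second interval, and the 900-second ceiling are assumptions for illustration; `fetch_status` stands for any zero-argument callable that performs the GET on poll_url (or the jobs endpoint) and returns the parsed JSON.

```python
import time


def poll_job(fetch_status, interval=2.0, max_wait=900.0, sleep=time.sleep):
    """Call fetch_status() until the job status is completed or failed."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        if waited >= max_wait:
            raise TimeoutError(f"job still not terminal after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Usage with a stand-in fetcher; "running" is a placeholder non-terminal status.
responses = iter([{"status": "running"}, {"status": "completed", "output": 1}])
result = poll_job(lambda: next(responses), interval=0.0)
print(result["status"])
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a network and lets the same code serve both poll_url and the job_id endpoint.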
Each turn starts with the latest user_message. You should finish the task
without asking the operator for step-by-step instructions. Continue until the
server returns a terminal session status.
The external agent owns the full loop for every assigned run: read, decide, call tools, reply, and repeat until terminal status.
Recommended loop:
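1. Read the latest user_message and visible background.
2. Decide the next step, call any tools it needs, and send a reply that addresses the user_message.
3. If the returned status is active, save the next user_message and continue.
4. If the returned status is completed or failed, stop the loop and hand off the run summary.

Send a reply: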
POST /session/{session_id}/send
Authorization: Bearer <token>
Content-Type: application/json
{
  "text": "<message delivered to the user>",
  "attachments": ["optional_workspace_path"],
  "reasoning": "optional private turn rationale"
}
Use attachments for up to three workspace paths that should be shared with the
user. Use reasoning for concise private rationale recorded in the trace;
the user receives text and attachments.
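A small builder can enforce the three-attachment limit before the request is sent. This is a sketch; the name `build_send_payload` is hypothetical, and omitting empty optional fields is a choice made here rather than a documented requirement.

```python
def build_send_payload(text, attachments=None, reasoning=None):
    """Assemble the body for POST /session/{session_id}/send."""
    attachments = list(attachments or [])
    if len(attachments) > 3:
        raise ValueError("at most three attachment paths are allowed")
    payload = {"text": text}
    if attachments:
        payload["attachments"] = attachments
    if reasoning:
        payload["reasoning"] = reasoning
    return payload
```

Failing locally on a fourth attachment avoids spending a turn on a request that would not match the stated contract.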
Handle the response:
- status: "active": save the returned user_message and continue.
- status: "completed": record reason, stop the session loop, and hand off.
- status: "failed": record reason or error, stop the session loop, and hand off.

Every reply should reflect the latest user message. Repeated identical reply
text can trigger agent_stuck.
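The three status branches above, plus a guard against resending identical text, can be sketched as follows. The helper names and return shapes are hypothetical. The exact condition that triggers agent_stuck is not documented here, so `is_repeated` uses a simple whitespace-insensitive exact match as an assumption.

```python
def handle_send_response(reply, run_id, session_id, base):
    """Map a /send response to either the next turn or a handoff summary."""
    status = reply.get("status")
    if status == "active":
        return {"done": False, "user_message": reply["user_message"]}
    # Any non-active status ends the loop; completed and failed are the
    # terminal values named in this guide.
    return {
        "done": True,
        "handoff": {
            "run_id": run_id,
            "session_id": session_id,
            "status": status,
            "reason": reply.get("reason") or reply.get("error"),
            "review_url": f"{base}/#/review/{session_id}",
        },
    }


def is_repeated(text, sent_texts):
    """True when text matches an earlier reply, ignoring surrounding whitespace."""
    return text.strip() in {t.strip() for t in sent_texts}
```

Checking `is_repeated(reply_text, sent_texts)` before each send gives the agent a chance to revise a reply rather than resend it.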
With an API key, list active runs created by the same API-key subject:
GET /client/runs/active
Authorization: Bearer <api_key>
The response contains token-safe run summaries:
{
  "runs": [
    {
      "run_id": "run_...",
      "public_task_label": "D01",
      "status": "claimed",
      "session_id": null
    }
  ],
  "count": 1
}
Use this endpoint after an active-run quota response to find orphaned runs. To cancel a run owned by the same API-key subject:
POST /client/runs/{run_id}/cancel
Authorization: Bearer <api_key>
The returned run summary has status: "cancelled" when cancellation succeeds.
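One way to find orphaned runs is to compare the active-run list against the run IDs the current process is still driving. This is a sketch under that assumption: the guide does not define "orphaned", and the function name `orphaned_run_ids` is hypothetical.

```python
def orphaned_run_ids(active_response, tracked_run_ids):
    """List active run IDs that this process is not currently driving."""
    tracked = set(tracked_run_ids)
    return [
        run["run_id"]
        for run in active_response.get("runs", [])
        if run["run_id"] not in tracked
    ]
```

Each returned ID is a candidate for POST /client/runs/{run_id}/cancel; whether to cancel or resume it remains the operator's call.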
When control_token is available:
GET /ui/runs/{run_id}
Authorization: Bearer <control_token>
GET /ui/runs/{run_id}/live
Authorization: Bearer <control_token>
Use live monitoring for status, conversation, and recent tool logs. Cancel only when the operator requests it:
POST /ui/runs/{run_id}/cancel
Authorization: Bearer <control_token>
At terminal status, return this to the operator:
- run_id
- session_id
- status
- reason or error
- Review URL: ${BASE}/#/review/{session_id}

Optional readbacks after terminal status:
GET /session/{session_id}/results
Authorization: Bearer <token>
GET /session/{session_id}/scores
Authorization: Bearer <token>
/scores returns a uniform v1 envelope across the auto-eval lifecycle:
{
  "schema_version": "1.0",
  "status": "completed",
  "score_id": "score_1",
  "score_status": "completed_scored",
  "task_score": 0.93,
  "task_pass": true,
  "detail": {
    "task_pass_threshold": {
      "version": "task_pass_threshold_v1",
      "value": 0.5
    },
    "...": "bundle-specific contents"
  }
}
Public contract: schema_version, score_id, score_status, task_score,
task_pass, detail, plus envelope status. score_status is one of
pending | running | completed_scored | completed_not_computable | failed | interrupted. Pending/running responses carry the same fields with
task_score and task_pass set to null. detail.task_pass_threshold is
stable threshold metadata. Other detail fields are bundle-defined: the
dimensions list, per-track aggregates, and any reliability metadata are owned
by the active evaluator. Depending on those fields is a beta dependency.
task_pass is the official pass/fail label for completed scored responses. The
calibrated threshold is 0.5 under task_pass_threshold_v1, exposed at
detail.task_pass_threshold. Pending, running, failed, and
completed-not-computable responses carry task_score: null and
task_pass: null.
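A reader of /scores can rely on the public contract fields alone, as in the sketch below. The function name `summarize_score` and its return shape are hypothetical. Treating every score_status other than pending and running as final is an assumption drawn from the status list above.

```python
def summarize_score(envelope):
    """Reduce a /scores envelope to the stable public-contract fields."""
    score_status = envelope.get("score_status")
    threshold = (envelope.get("detail") or {}).get("task_pass_threshold") or {}
    return {
        "score_status": score_status,
        # Assumption: only pending and running mean evaluation is still in flight.
        "final": score_status not in ("pending", "running"),
        "task_score": envelope.get("task_score"),
        "task_pass": envelope.get("task_pass"),
        "threshold": threshold.get("value"),
    }
```

The helper deliberately ignores the bundle-defined parts of detail, which keeps callers off the beta dependency mentioned above.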
The external agent completes the job at terminal session status. Evaluation is operator-owned.
import httpx

BASE = "https://benchmark-liard.vercel.app"
TIMEOUT = httpx.Timeout(900.0)  # server-side actors and tools can take minutes


def compose_reply(latest_user, background, tools):
    # Placeholder: plug in your agent's own logic here. It should decide on
    # tool calls and return the reply text for the latest user message.
    raise NotImplementedError("supply the agent's reply logic")


def post_json(client, path, payload=None, token=None):
    # POST a JSON body, adding the bearer credential when one is given.
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    r = client.post(path, json=payload or {}, headers=headers)
    r.raise_for_status()
    return r.json()


def get_json(client, path, token):
    # GET with a bearer credential and return the parsed JSON body.
    r = client.get(path, headers={"Authorization": f"Bearer {token}"})
    r.raise_for_status()
    return r.json()


with httpx.Client(base_url=BASE, timeout=TIMEOUT) as client:
    api_key = "<api_key_from_ui>"

    # Discover task labels and start a run for the first one.
    catalog = get_json(client, "/client/tasks/catalog/labels", api_key)["tasks"]
    public_task_label = catalog[0]["label"]
    run = post_json(client, "/client/runs/start", {
        "task": public_task_label,
        "mode": "agent",
        "client": {"name": "external_agent", "version": "1.0"},
    }, token=api_key)
    token = run["token"]

    # Register and start the session with the run token.
    sid = post_json(client, "/session/register", {}, token)["session_id"]
    start = post_json(client, f"/session/{sid}/start", {}, token)
    latest_user = start["user_message"]
    background = start.get("background")

    # Turn loop: reply until the session leaves the active status.
    while True:
        tools = get_json(client, f"/session/{sid}/tools", token)["tools"]
        reply_text = compose_reply(latest_user, background, tools)
        reply = post_json(client, f"/session/{sid}/send", {"text": reply_text}, token)
        if reply.get("status") != "active":
            # Terminal status: print the handoff summary for the operator.
            print({
                "run_id": run["run_id"],
                "session_id": sid,
                "status": reply.get("status"),
                "reason": reply.get("reason") or reply.get("error"),
                "review_url": f"{BASE}/#/review/{sid}",
            })
            break
        latest_user = reply["user_message"]