Evals

Multi-agent evaluation with reputation-weighted scoring and A/B consensus comparisons.

Overview

@consensus-tools/evals provides multi-agent evaluation for consensus-tools. Run single-model evals and multi-agent A/B comparisons with reputation-weighted scoring, and validate LLM-generated scores. Composite scores are weighted by agent reputation, so higher-quality evaluators have more influence.
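
For illustration, a reputation-weighted composite can be computed like the sketch below. This is not the library's internal code -- the exact formula it uses is an assumption here.

// Illustrative sketch only: average each agent's dimension scores, then weight by reputation.
function weightedComposite(
  perAgent: Array<{ reputation: number; scores: { clarity: number; completeness: number; actionability: number } }>,
): number {
  let weightedSum = 0;
  let totalReputation = 0;
  for (const { reputation, scores } of perAgent) {
    const mean = (scores.clarity + scores.completeness + scores.actionability) / 3;
    weightedSum += mean * reputation;
    totalReputation += reputation;
  }
  return totalReputation > 0 ? weightedSum / totalReputation : 0;
}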

Installation

pnpm add @consensus-tools/evals

For consensusEval(), you also need the Vercel AI SDK and a provider:

pnpm add ai @ai-sdk/anthropic

Quick start

import { evaluateWithAiSdk, generatePersonas } from "@consensus-tools/evals";

const personas = await generatePersonas({ count: 3 });
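// these personas can be passed (via the agents parameter) to consensusEval() for multi-agent A/B comparisons -- see below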

const result = await evaluateWithAiSdk({
  model: "claude-sonnet-4-20250514",
  prompt: "Evaluate this submission...",
});

API reference

consensusEval(versionA, versionB, agents, model, promptBuilder, options?)

Runs N agents that each score two versions on clarity, completeness, and actionability, then picks an overall winner from the reputation-weighted composites.

function consensusEval(
  versionA: string,
  versionB: string,
  agents: AgentPersona[],
  model: LanguageModel,
  promptBuilder: PromptBuilder,
  options?: ConsensusEvalOptions
): Promise<ConsensusEvalResult>
  • versionA / versionB string -- the two versions to compare
  • agents AgentPersona[] -- evaluator agents with reputation scores
  • model LanguageModel -- Vercel AI SDK language model instance
  • promptBuilder (agent, versionA, versionB) => string -- builds the evaluation prompt per agent
  • options -- see below

Returns ConsensusEvalResult with winner, agreement (0.0-1.0), aComposite, bComposite, and perAgent results.
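
A rough sketch of the result shape, inferred from the fields listed above; exact property types are assumptions:

// Approximate shape for reference -- consult the package's exported types for the real definition.
interface ConsensusEvalResult {
  winner: "A" | "B" | "TIE" | "UNKNOWN";
  agreement: number;    // agreement among agents, 0.0-1.0
  aComposite: number;   // reputation-weighted composite for version A
  bComposite: number;   // reputation-weighted composite for version B
  perAgent: Array<{ agentId: string; winner: string }>; // per-agent results (at least agentId and winner)
}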

Options

consensusEval(versionA, versionB, agents, model, promptBuilder, {
  minQuorum: 3,        // minimum agents needed (default: 3)
  agentDelayMs: 15000, // delay between agent calls (default: 15000)
  temperature: 0.7,    // LLM temperature (default: 0.7)
  maxTokens: 1024,     // max tokens per response (default: 1024)
  onAgentError: (agent, err) => console.error(`${agent.name}: ${err.message}`),
});

evaluateWithAiSdk(options)

Single-model evaluation via the Vercel AI SDK.

function evaluateWithAiSdk(options: { model: string; prompt: string }): Promise<unknown>

generatePersonas(options)

Generate diverse evaluator personas.

function generatePersonas(options: { count: number }): Promise<AgentPersona[]>

ReputationTracker

Track agent reputation across evaluation rounds. Agents whose votes match the ground truth gain reputation (+4); agents that disagree lose it (-4). Reputation is floored at 10, so agents are never fully silenced.

class ReputationTracker {
  constructor(agents: AgentPersona[], storage?: ReputationStorage)
  settleEval(votes: Array<{ agentId: string; winner: string }>, groundTruth: string): ReputationDelta[]
  settleRound(votes, judgeScores, proposerId, decision, rewriteCount, maxRewrites): ReputationDelta[]
  syncToAgents(agents: AgentPersona[]): void
  loadFromStorage(): Promise<void>
  saveToStorage(): Promise<void>
}
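
A minimal sketch of the settleEval() update rule described above; the real implementation and the ReputationDelta shape may differ:

// +4 when an agent's vote matches the ground truth, -4 when it doesn't, floored at 10.
function applyEvalDelta(currentReputation: number, agentWinner: string, groundTruth: string): number {
  const delta = agentWinner === groundTruth ? 4 : -4;
  return Math.max(10, currentReputation + delta);
}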

validateScore(value)

Safely parse an LLM-generated score. Out-of-range, NaN, and non-numeric values default to 2.

function validateScore(value: unknown): number
validateScore(4);       // 4
validateScore("3.7");   // 4 (rounds)
validateScore(NaN);     // 2 (default)
validateScore(0);       // 2 (below range)
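
A sketch of the validation logic these examples imply, assuming a 1-5 scoring range; the library's internal implementation may differ:

// Illustrative only: coerce to a number, round, and fall back to 2 for anything unusable.
function validateScoreSketch(value: unknown): number {
  const n = typeof value === "number" ? value : Number(value);
  if (!Number.isFinite(n)) return 2;                  // NaN, Infinity, non-numeric strings
  const rounded = Math.round(n);
  return rounded >= 1 && rounded <= 5 ? rounded : 2;  // out-of-range values default to 2
}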

validateJudgeScore(raw)

Validate a full JudgeScore object with all three dimensions.

function validateJudgeScore(raw: Record<string, unknown>): JudgeScore
validateJudgeScore({ clarity: 4, completeness: "bad", actionability: 6 });
// => { clarity: 4, completeness: 2, actionability: 2, reasoning: "No reasoning provided" }

Examples

Multi-agent A/B evaluation

import { consensusEval, ReputationTracker, generatePersonas } from "@consensus-tools/evals";
import { createAnthropic } from "@ai-sdk/anthropic";

const anthropic = createAnthropic();
const model = anthropic("claude-sonnet-4-20250514");
const personas = await generatePersonas({ count: 5 });
const agents = personas.map((p) => ({ ...p, reputation: 100 }));

const result = await consensusEval(versionA, versionB, agents, model, (agent, a, b) => {
  return `You are ${agent.name}. Score both versions on clarity, completeness, and actionability (1-5). Pick a winner.

Version A:
${a}

Version B:
${b}

Respond with JSON: { "a_scores": { "clarity": N, "completeness": N, "actionability": N }, "b_scores": { ... }, "winner": "A"|"B"|"TIE", "reasoning": "..." }`;
});

console.log(result.winner);     // "A" | "B" | "TIE" | "UNKNOWN"
console.log(result.agreement);  // 0.0 - 1.0

Reputation tracking with persistence

import { promises as fs } from "node:fs";
import { ReputationTracker } from "@consensus-tools/evals";
import type { ReputationStorage } from "@consensus-tools/evals";

const storage: ReputationStorage = {
  async load() { return JSON.parse(await fs.readFile("rep.json", "utf-8")); },
  async save(state) { await fs.writeFile("rep.json", JSON.stringify(state)); },
};

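// `agents` and `result` come from the multi-agent A/B evaluation example above.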
const tracker = new ReputationTracker(agents, storage);
await tracker.loadFromStorage();

// Settle after an A/B eval
const deltas = tracker.settleEval(
  result.perAgent.map((a) => ({ agentId: a.agentId, winner: a.winner })),
  result.winner,
);

tracker.syncToAgents(agents);
await tracker.saveToStorage();

Cost awareness

consensusEval() makes one LLM call per agent. With 5 agents using Claude Sonnet, a single eval round costs roughly $0.05-0.10. Set agentDelayMs to stay within rate limits.
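
As a rough back-of-the-envelope for pacing, assuming the delay is applied between consecutive agent calls:

// Delay overhead only; actual wall-clock time also includes each model call's latency.
const agentCount = 5;
const agentDelayMs = 15_000;
const delayOverheadMs = (agentCount - 1) * agentDelayMs; // 60,000 ms of pacing per eval round
console.log(`~${delayOverheadMs / 1000}s of pacing across ${agentCount} agents`);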