@consensus-tools/evals provides multi-agent evaluation for consensus-tools. Run single-model evals, multi-agent A/B comparisons with reputation-weighted scoring, and validate LLM-generated scores. Composite scores are weighted by agent reputation so higher-quality evaluators have more influence.
pnpm add @consensus-tools/evals
For consensusEval(), you also need the Vercel AI SDK and a provider:
pnpm add ai @ai-sdk/anthropic
import { evaluateWithAiSdk, generatePersonas } from "@consensus-tools/evals";
const personas = await generatePersonas({ count: 3 });
const result = await evaluateWithAiSdk({
model: "claude-sonnet-4-20250514",
prompt: "Evaluate this submission...",
});
Run N agents that each score two versions on clarity, completeness, and actionability, then pick a winner.
function consensusEval(
versionA: string,
versionB: string,
agents: AgentPersona[],
model: LanguageModel,
promptBuilder: PromptBuilder,
options?: ConsensusEvalOptions
): Promise<ConsensusEvalResult>
- versionA / versionB -- string -- the two versions to compare
- agents -- AgentPersona[] -- evaluator agents with reputation scores
- model -- LanguageModel -- Vercel AI SDK language model instance
- promptBuilder -- (agent, versionA, versionB) => string -- builds the evaluation prompt per agent
- options -- ConsensusEvalOptions -- see below
Returns ConsensusEvalResult with winner, agreement (0.0-1.0), aComposite, bComposite, and perAgent results.
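The exact compositing formula is internal to the package, but a reputation-weighted composite is presumably a weighted mean: each agent's score counts in proportion to its reputation. A minimal sketch of that idea (the `AgentScore` shape and `weightedComposite` name are illustrative, not the package's API):

```typescript
// Sketch of reputation-weighted compositing -- not the library's actual code.
// Assumes each per-agent result holds an averaged 1-5 score plus the agent's
// reputation; the composite is the reputation-weighted mean of those scores.
interface AgentScore {
  score: number;      // e.g. mean of clarity/completeness/actionability, 1-5
  reputation: number; // evaluator reputation
}

function weightedComposite(results: AgentScore[]): number {
  const totalRep = results.reduce((sum, r) => sum + r.reputation, 0);
  if (totalRep === 0) return 0;
  return results.reduce((sum, r) => sum + r.score * r.reputation, 0) / totalRep;
}

// A high-reputation agent pulls the composite toward its score:
weightedComposite([
  { score: 5, reputation: 150 },
  { score: 3, reputation: 50 },
]); // 4.5
```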
consensusEval(versionA, versionB, agents, model, promptBuilder, {
minQuorum: 3, // minimum agents needed (default: 3)
agentDelayMs: 15000, // delay between agent calls (default: 15000)
temperature: 0.7, // LLM temperature (default: 0.7)
maxTokens: 1024, // max tokens per response (default: 1024)
onAgentError: (agent, err) => console.error(`${agent.name}: ${err.message}`),
});
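To make the option semantics concrete, here is an illustrative sketch of how they interact (this is not the package's internals): agents run sequentially with agentDelayMs between calls, failed agents are reported to onAgentError rather than aborting the run, and the eval fails if fewer than minQuorum agents succeed.

```typescript
// Illustrative sketch only -- the real consensusEval implementation may differ.
interface Agent { name: string }

async function runAgents<T>(
  agents: Agent[],
  call: (agent: Agent) => Promise<T>,
  opts: {
    minQuorum: number;
    agentDelayMs: number;
    onAgentError?: (agent: Agent, err: Error) => void;
  },
): Promise<T[]> {
  const results: T[] = [];
  for (const [i, agent] of agents.entries()) {
    // Space out calls to stay within provider rate limits.
    if (i > 0) await new Promise((r) => setTimeout(r, opts.agentDelayMs));
    try {
      results.push(await call(agent));
    } catch (err) {
      // A single failing agent is reported, not fatal.
      opts.onAgentError?.(agent, err as Error);
    }
  }
  if (results.length < opts.minQuorum) {
    throw new Error(`quorum not met: ${results.length}/${opts.minQuorum}`);
  }
  return results;
}
```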
Single-model evaluation via the Vercel AI SDK.
function evaluateWithAiSdk(options: { model: string; prompt: string }): Promise<unknown>
Generate diverse evaluator personas.
function generatePersonas(options: { count: number }): Promise<AgentPersona[]>
Track agent reputation across evaluation rounds. Agents that align with ground truth gain reputation (+4), agents that disagree lose it (-4). Floor at 10 -- agents are never fully silenced.
class ReputationTracker {
constructor(agents: AgentPersona[], storage?: ReputationStorage)
settleEval(votes: Array<{ agentId: string; winner: string }>, groundTruth: string): ReputationDelta[]
settleRound(votes, judgeScores, proposerId, decision, rewriteCount, maxRewrites): ReputationDelta[]
syncToAgents(agents: AgentPersona[]): void
loadFromStorage(): Promise<void>
saveToStorage(): Promise<void>
}
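The settlement rule described above (+4 for agreeing with ground truth, -4 for disagreeing, floored at 10) can be sketched as a pure function; this is a standalone illustration of the rule, not the class's code:

```typescript
// Sketch of the per-agent reputation update rule: agreeing with ground truth
// earns +4, disagreeing costs -4, and reputation never drops below 10.
const REWARD = 4;
const PENALTY = 4;
const FLOOR = 10;

function settle(reputation: number, vote: string, groundTruth: string): number {
  const next = vote === groundTruth ? reputation + REWARD : reputation - PENALTY;
  return Math.max(FLOOR, next);
}

settle(100, "A", "A"); // 104
settle(100, "B", "A"); // 96
settle(12, "B", "A");  // 10 (floored -- never fully silenced)
```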
Safely parse an LLM-generated score into the 1-5 range. Non-numeric, NaN, and out-of-range values default to 2; numeric strings are coerced and in-range floats are rounded.
function validateScore(value: unknown): number
validateScore(4); // 4
validateScore("3.7"); // 4 (rounds)
validateScore(NaN); // 2 (default)
validateScore(0); // 2 (below range)
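The documented behavior is simple enough to sketch. A plausible implementation, assuming the valid range is 1-5 (matching the scoring prompts) with 2 as the fallback; the real function may differ in details such as whether it rounds before or after the range check:

```typescript
// Sketch of validateScore: coerce, reject NaN/non-numbers, round, range-check.
const DEFAULT_SCORE = 2;

function validateScore(value: unknown): number {
  const n = typeof value === "string" ? Number(value) : value;
  if (typeof n !== "number" || Number.isNaN(n)) return DEFAULT_SCORE;
  const rounded = Math.round(n);
  return rounded >= 1 && rounded <= 5 ? rounded : DEFAULT_SCORE;
}
```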
Validate a full JudgeScore object with all three dimensions.
function validateJudgeScore(raw: Record<string, unknown>): JudgeScore
validateJudgeScore({ clarity: 4, completeness: "bad", actionability: 6 });
// => { clarity: 4, completeness: 2, actionability: 2, reasoning: "No reasoning provided" }
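A sketch of validateJudgeScore consistent with the example above, assuming each dimension goes through the same 1-to-5, default-2 rule as validateScore and that missing or non-string reasoning falls back to "No reasoning provided". A local copy of the scoring rule (`toScore`) is inlined so the sketch is self-contained:

```typescript
// Illustrative sketch, not the package's implementation.
function toScore(value: unknown): number {
  const n = typeof value === "string" ? Number(value) : value;
  if (typeof n !== "number" || Number.isNaN(n)) return 2;
  const r = Math.round(n);
  return r >= 1 && r <= 5 ? r : 2;
}

interface JudgeScore {
  clarity: number;
  completeness: number;
  actionability: number;
  reasoning: string;
}

function validateJudgeScore(raw: Record<string, unknown>): JudgeScore {
  return {
    clarity: toScore(raw.clarity),
    completeness: toScore(raw.completeness),
    actionability: toScore(raw.actionability),
    reasoning: typeof raw.reasoning === "string" && raw.reasoning.length > 0
      ? raw.reasoning
      : "No reasoning provided",
  };
}
```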
import { consensusEval, ReputationTracker, generatePersonas } from "@consensus-tools/evals";
import { createAnthropic } from "@ai-sdk/anthropic";
const anthropic = createAnthropic();
const model = anthropic("claude-sonnet-4-20250514");
const personas = await generatePersonas({ count: 5 });
const agents = personas.map((p) => ({ ...p, reputation: 100 }));
const result = await consensusEval(versionA, versionB, agents, model, (agent, a, b) => {
return `You are ${agent.name}. Score both versions on clarity, completeness, and actionability (1-5). Pick a winner.
Version A:
${a}
Version B:
${b}
Respond with JSON: { "a_scores": { "clarity": N, "completeness": N, "actionability": N }, "b_scores": { ... }, "winner": "A"|"B"|"TIE", "reasoning": "..." }`;
});
console.log(result.winner); // "A" | "B" | "TIE" | "UNKNOWN"
console.log(result.agreement); // 0.0 - 1.0
import { ReputationTracker } from "@consensus-tools/evals";
import type { ReputationStorage } from "@consensus-tools/evals";
const storage: ReputationStorage = {
async load() { return JSON.parse(await fs.readFile("rep.json", "utf-8")); },
async save(state) { await fs.writeFile("rep.json", JSON.stringify(state)); },
};
const tracker = new ReputationTracker(agents, storage);
await tracker.loadFromStorage();
// Settle after an A/B eval
const deltas = tracker.settleEval(
result.perAgent.map((a) => ({ agentId: a.agentId, winner: a.winner })),
result.winner,
);
tracker.syncToAgents(agents);
await tracker.saveToStorage();
ℹ Cost awareness
consensusEval() makes one LLM call per agent. With 5 agents using Claude Sonnet, a single eval round costs roughly $0.05-0.10. Set agentDelayMs to stay within rate limits.
- Schemas package -- shared types for telemetry and evaluation
- Guards package -- guard evaluation that evals can judge
- Verification and Scoring -- how scoring integrates with the engine