Evaluator Mechanism
Evaluator agents are responsible for benchmarking new agents along multiple axes of performance (a scoring sketch follows this list):
Accuracy & Coherence – Logical validity of outputs, correctness of reasoning, and factual alignment
Novelty – Behavioral diversity relative to prior agents and the archive
Resource Efficiency – Runtime metrics including latency, memory usage, token throughput, and energy-per-token
Task-Specific Performance – Custom metrics like pass@k for code, ROUGE for summaries, success rate for planning, or symbolic correctness for math problems
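To make the aggregation concrete, here is a minimal Python sketch of a weighted composite score over these four axes. The field names, weights, and the `composite_fitness` helper are illustrative assumptions, not Darwin's actual scoring API.

```python
from dataclasses import dataclass

# Hypothetical per-axis scores, each normalized to [0, 1]. The underlying
# metric implementations (pass@k, ROUGE, latency normalization, archive-based
# novelty distance) are assumed to exist elsewhere.
@dataclass
class AxisScores:
    accuracy: float       # logical validity, reasoning correctness, factual alignment
    novelty: float        # behavioral distance from prior agents and the archive
    efficiency: float     # normalized latency / memory / tokens / energy-per-token
    task_specific: float  # e.g. pass@k for code, ROUGE for summaries

def composite_fitness(scores: AxisScores, weights: dict[str, float]) -> float:
    """Weighted aggregate of the four evaluation axes.

    The weights are treated as part of the evaluator's own genome, so the
    meta-evolutionary cycle described below can mutate them between generations.
    """
    total = (
        weights["accuracy"] * scores.accuracy
        + weights["novelty"] * scores.novelty
        + weights["efficiency"] * scores.efficiency
        + weights["task_specific"] * scores.task_specific
    )
    return total / sum(weights.values())

# Example: an evaluator that weights correctness most heavily.
fitness = composite_fitness(
    AxisScores(accuracy=0.92, novelty=0.40, efficiency=0.75, task_specific=0.88),
    {"accuracy": 0.4, "novelty": 0.1, "efficiency": 0.2, "task_specific": 0.3},
)
```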
Evaluators are not static; they evolve themselves. Each generation includes a meta-evolutionary cycle that mutates components of the evaluator population (see the mutation sketch after this list):
Scoring logic (e.g., benchmark weightings, composite functions)
Adversarial probes (e.g., fuzz inputs, tool misuse patterns)
Refusal thresholds and safety sanity checks
Response diversity filters and alignment gates
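As a rough illustration of this meta-evolutionary step, the sketch below mutates a hypothetical evaluator genome: it perturbs one benchmark weighting, occasionally adds an adversarial probe, and jitters the refusal threshold. All field names and the probe pool are assumptions made for illustration.

```python
import copy
import random

# Illustrative pool of adversarial probe templates (fuzz inputs, tool-misuse patterns).
PROBE_POOL = [
    "fuzz:unicode_noise",
    "fuzz:pathological_long_context",
    "tool_misuse:recursive_self_call",
]

def mutate_evaluator(evaluator: dict, rng: random.Random) -> dict:
    """Return a mutated copy of a hypothetical evaluator genome."""
    child = copy.deepcopy(evaluator)

    # Scoring logic: perturb one benchmark weighting with Gaussian noise.
    axis = rng.choice(list(child["weights"]))
    child["weights"][axis] = max(0.0, child["weights"][axis] + rng.gauss(0, 0.05))

    # Adversarial probes: occasionally add a new probe template.
    if rng.random() < 0.3:
        child["probes"].append(rng.choice(PROBE_POOL))

    # Refusal threshold / safety sanity check: jitter within [0, 1].
    child["refusal_threshold"] = min(
        1.0, max(0.0, child["refusal_threshold"] + rng.gauss(0, 0.02))
    )
    return child

# Example genome; field names are assumptions, not Darwin's actual schema.
parent = {
    "weights": {"accuracy": 0.4, "novelty": 0.1, "efficiency": 0.2, "task_specific": 0.3},
    "probes": ["fuzz:unicode_noise"],
    "refusal_threshold": 0.5,
}
child = mutate_evaluator(parent, random.Random(0))
```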
This co-evolutionary approach produces a fitness landscape that shifts from generation to generation, preventing agents from overfitting to fixed benchmarks or exploiting brittle reward heuristics. Evaluators themselves are subject to selection pressure: the most informative, discriminative evaluators are retained, while redundant or overly permissive ones are pruned.
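One simple way to realize that selection pressure, assuming informativeness is proxied by the spread of scores an evaluator assigns across the current agent population, is sketched below. The `prune_evaluators` helper and its inputs are hypothetical, not part of Darwin's documented API.

```python
import statistics

def informativeness(scores_for_agents: list[float]) -> float:
    """Proxy for how discriminative an evaluator is: the spread of the scores
    it assigns across the current agent population. An evaluator that rates
    every agent almost identically (or maximally) carries little selection
    signal and becomes a pruning candidate."""
    return statistics.pstdev(scores_for_agents)

def prune_evaluators(evaluator_scores: dict[str, list[float]], keep: int) -> list[str]:
    """Retain the `keep` most informative evaluators, ranked by score spread."""
    ranked = sorted(
        evaluator_scores,
        key=lambda name: informativeness(evaluator_scores[name]),
        reverse=True,
    )
    return ranked[:keep]

# Example: evaluator "b" is overly permissive (near-identical high scores) and is pruned.
survivors = prune_evaluators(
    {
        "a": [0.20, 0.90, 0.50],
        "b": [0.98, 0.99, 0.97],
        "c": [0.10, 0.60, 0.80],
    },
    keep=2,
)
# survivors == ["c", "a"]
```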
As a result, Darwin’s evaluation pipeline functions like a living immune system—constantly adapting its criteria to stay aligned, robust, and adversarially hardened.