Evaluator Mechanism

Evaluator agents are responsible for benchmarking new agents on multiple axes of performance (a scoring sketch follows this list):

  • Accuracy & Coherence – Logical validity of outputs, correctness of reasoning, and factual alignment

  • Novelty – Behavioral diversity relative to prior agents and the archive

  • Resource Efficiency – Runtime metrics including latency, memory usage, token throughput, and energy-per-token

  • Task-Specific Performance – Custom metrics like pass@k for code, ROUGE for summaries, success rate for planning, or symbolic correctness for math problems
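Taken together, these axes can be folded into a single fitness value by the evaluator's scoring logic. The Python sketch below is illustrative only: the `EvaluationReport` fields, the `composite_score` helper, and the default weights are assumptions for exposition, not Darwin's actual API.

```python
# A minimal sketch of a multi-axis evaluation report and composite score.
# All names (EvaluationReport, composite_score) and the weighting scheme are
# illustrative assumptions, not Darwin's actual interface.
from dataclasses import dataclass

@dataclass
class EvaluationReport:
    accuracy: float     # logical validity / factual alignment, in [0, 1]
    novelty: float      # behavioral distance from prior agents and the archive, in [0, 1]
    efficiency: float   # normalized runtime / token / energy score, in [0, 1]
    task_score: float   # task-specific metric (pass@k, ROUGE, success rate, ...)

def composite_score(report: EvaluationReport,
                    weights=(0.4, 0.2, 0.1, 0.3)) -> float:
    """Weighted sum over the four axes; the weights are an evaluator parameter."""
    axes = (report.accuracy, report.novelty, report.efficiency, report.task_score)
    return sum(w * a for w, a in zip(weights, axes))

# Example: score one candidate agent's report.
report = EvaluationReport(accuracy=0.92, novelty=0.35, efficiency=0.70, task_score=0.81)
print(f"composite fitness: {composite_score(report):.3f}")
```

Because the weights are parameters of the evaluator itself, they are a natural target for the meta-evolutionary cycle described next.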

Evaluators are not static; they evolve themselves. Each generation includes a meta-evolutionary cycle that mutates components of the evaluator population (a mutation sketch appears after the list):

  • Scoring logic (e.g., benchmark weightings, composite functions)

  • Adversarial probes (e.g., fuzz inputs, tool misuse patterns)

  • Refusal thresholds and safety sanity checks

  • Response diversity filters and alignment gates
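As a rough illustration of a single mutation step, the sketch below perturbs an assumed `EvaluatorConfig` holding benchmark weightings, an adversarial probe set, and a refusal threshold. The field names, probe pool, and mutation magnitudes are hypothetical placeholders, not the system's actual operators.

```python
# Illustrative sketch of one meta-evolutionary mutation step over an evaluator
# configuration. Fields, probe names, and mutation rates are assumed for
# exposition; the real cycle may mutate richer scoring programs and probes.
import random
from dataclasses import dataclass, field

@dataclass
class EvaluatorConfig:
    weights: list[float]                               # benchmark weightings in the composite score
    probes: list[str] = field(default_factory=list)    # adversarial probe identifiers
    refusal_threshold: float = 0.5                     # minimum refusal rate required on unsafe prompts

PROBE_POOL = ["fuzzed_json_input", "tool_misuse_chain", "prompt_injection", "long_context_stress"]

def mutate(config: EvaluatorConfig, rng: random.Random) -> EvaluatorConfig:
    # Perturb and renormalize the benchmark weightings.
    weights = [max(1e-3, w + rng.gauss(0, 0.05)) for w in config.weights]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Occasionally add a new adversarial probe from the pool.
    probes = list(config.probes)
    if rng.random() < 0.3:
        probes.append(rng.choice(PROBE_POOL))

    # Nudge the refusal/safety threshold within bounded limits.
    threshold = min(0.95, max(0.05, config.refusal_threshold + rng.gauss(0, 0.02)))
    return EvaluatorConfig(weights=weights, probes=probes, refusal_threshold=threshold)

rng = random.Random(0)
parent = EvaluatorConfig(weights=[0.4, 0.2, 0.1, 0.3], probes=["prompt_injection"])
child = mutate(parent, rng)
print(child)
```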

This co-evolutionary approach produces a fitness landscape that shifts from generation to generation, preventing agents from overfitting to fixed benchmarks or exploiting brittle reward heuristics. Evaluators themselves are subject to selection pressure: the most informative and discriminative evaluators are retained, while redundant or overly permissive ones are pruned.
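One way to make that selection pressure concrete is to rank evaluators by how much their scores vary across candidate agents and to prune any evaluator whose scores are nearly identical to one already kept. The thresholds and correlation test in the sketch below are illustrative assumptions, not Darwin's documented criteria.

```python
# Minimal sketch of evaluator selection pressure: keep evaluators whose scores
# discriminate between agents, drop ones that are redundant with an evaluator
# already kept. Thresholds and the correlation criterion are assumptions.
import statistics

def select_evaluators(scores: dict[str, list[float]],
                      min_variance: float = 0.01,
                      max_correlation: float = 0.95) -> list[str]:
    """scores maps evaluator name -> per-agent scores (agents in the same order)."""
    def correlation(a, b):
        ma, mb = statistics.fmean(a), statistics.fmean(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        var_a = sum((x - ma) ** 2 for x in a)
        var_b = sum((y - mb) ** 2 for y in b)
        return cov / ((var_a * var_b) ** 0.5) if var_a and var_b else 0.0

    kept: list[str] = []
    # Consider the most discriminative evaluators first.
    for name in sorted(scores, key=lambda n: statistics.pvariance(scores[n]), reverse=True):
        if statistics.pvariance(scores[name]) < min_variance:
            continue  # overly permissive / uninformative: scores barely vary
        if any(correlation(scores[name], scores[k]) > max_correlation for k in kept):
            continue  # redundant with an evaluator already kept
        kept.append(name)
    return kept

scores = {
    "accuracy_judge": [0.9, 0.4, 0.7, 0.2],
    "accuracy_judge_copy": [0.91, 0.41, 0.69, 0.22],  # near-duplicate, pruned as redundant
    "lenient_judge": [0.99, 0.98, 0.99, 0.97],        # overly permissive, pruned as uninformative
}
print(select_evaluators(scores))  # -> ['accuracy_judge']
```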

As a result, Darwin’s evaluation pipeline functions like a living immune system—constantly adapting its criteria to stay aligned, robust, and adversarially hardened.
