# Evaluator Mechanism

Evaluator agents benchmark each new agent along multiple axes of performance (a composite scoring sketch follows the list):

* **Accuracy & Coherence** – Logical validity of outputs, correctness of reasoning, and factual alignment
* **Novelty** – Behavioral diversity relative to prior agents and the archive
* **Resource Efficiency** – Runtime metrics including latency, memory usage, token throughput, and energy-per-token
* **Task-Specific Performance** – Custom metrics such as pass@k for code, ROUGE for summaries, success rate for planning, or symbolic correctness for math problems

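The document does not fix how these axes are combined; one simple reading is an evaluator-specific weighted sum over normalized scores. A minimal sketch, assuming each axis score is normalized to [0, 1] and that the weights belong to the evaluator's own mutable scoring logic (the `AxisScores` class, field names, and example weights are illustrative, not Darwin's actual API):

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    """Per-axis scores for one candidate agent, each normalized to [0, 1]."""
    accuracy: float    # logical validity / factual alignment
    novelty: float     # behavioral distance from prior agents and the archive
    efficiency: float  # inverse of latency, memory, energy-per-token, etc.
    task_score: float  # task-specific metric (pass@k, ROUGE, success rate, ...)

def composite_fitness(scores: AxisScores, weights: dict[str, float]) -> float:
    """Weighted sum of axis scores, normalized by the total weight.
    The weights are part of the evaluator's scoring logic, so different
    evaluators may emphasize different axes."""
    total = (
        weights["accuracy"] * scores.accuracy
        + weights["novelty"] * scores.novelty
        + weights["efficiency"] * scores.efficiency
        + weights["task"] * scores.task_score
    )
    return total / sum(weights.values())

# Example: an evaluator that emphasizes task performance and novelty.
example_weights = {"accuracy": 1.0, "novelty": 1.5, "efficiency": 0.5, "task": 2.0}
print(composite_fitness(AxisScores(0.9, 0.4, 0.7, 0.8), example_weights))  # -> 0.69
```
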
Evaluators are not static; they evolve themselves. Each generation includes a meta-evolutionary cycle that mutates the evaluator population along several dimensions (see the sketch after this list):

* Scoring logic (e.g., benchmark weightings, composite functions)
* Adversarial probes (e.g., fuzz inputs, tool misuse patterns)
* Refusal thresholds and safety sanity checks
* Response diversity filters and alignment gates

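Concretely, each of these mutation targets can be treated as a field of an evaluator configuration that the meta-evolutionary cycle perturbs. A hedged sketch of one mutation step, assuming evaluators are parameterized by scoring weights, a set of adversarial probes drawn from a shared pool, and scalar refusal/diversity thresholds (`EvaluatorConfig`, `mutate_evaluator`, and the mutation magnitudes are assumptions for illustration):

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EvaluatorConfig:
    """Mutable parameters of a single evaluator (illustrative field set)."""
    weights: dict             # benchmark weightings used by the composite score
    probes: tuple             # adversarial probe ids (fuzz inputs, tool-misuse patterns)
    refusal_threshold: float  # minimum refusal rate required on unsafe prompts
    diversity_cutoff: float   # minimum behavioral-diversity score to pass the filter

def mutate_evaluator(cfg: EvaluatorConfig, probe_pool: list,
                     rng: random.Random) -> EvaluatorConfig:
    """One meta-evolutionary step: jitter the scoring weights, swap one
    adversarial probe for a fresh one, and nudge the safety/diversity gates."""
    def clamp(x: float) -> float:
        return min(1.0, max(0.0, x))

    new_weights = {k: max(0.0, v * rng.uniform(0.8, 1.25)) for k, v in cfg.weights.items()}
    new_probes = list(cfg.probes)
    if probe_pool and new_probes:
        new_probes[rng.randrange(len(new_probes))] = rng.choice(probe_pool)
    return replace(
        cfg,
        weights=new_weights,
        probes=tuple(new_probes),
        refusal_threshold=clamp(cfg.refusal_threshold + rng.gauss(0.0, 0.02)),
        diversity_cutoff=clamp(cfg.diversity_cutoff + rng.gauss(0.0, 0.02)),
    )
```
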
This **co-evolutionary approach** creates a fitness landscape that shifts over time, preventing agents from overfitting to fixed benchmarks or exploiting brittle reward heuristics. Evaluators themselves are subject to selection pressure: the most informative and discriminative evaluators are retained, while redundant or overly permissive ones are pruned.

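One way to realize this selection pressure is to score each evaluator by how well it separates the current agent population: an evaluator whose scores barely vary is uninformative, and one that scores nearly every agent near the top is overly permissive. The sketch below assumes such spread- and mean-based pruning criteria; the function name and thresholds are illustrative rather than Darwin's actual rule:

```python
import statistics

def select_evaluators(score_table: dict, max_pass_rate: float = 0.95,
                      min_score_std: float = 0.05) -> list:
    """Keep evaluators whose scores actually separate agents.

    score_table maps evaluator id -> list of scores it assigned to the
    current agent population (one score per agent, each in [0, 1]).
    An evaluator is pruned if it is overly permissive (nearly everything
    scores near the top) or uninformative (its scores barely vary).
    """
    survivors = []
    for evaluator_id, scores in score_table.items():
        too_permissive = statistics.mean(scores) > max_pass_rate
        uninformative = statistics.pstdev(scores) < min_score_std
        if not (too_permissive or uninformative):
            survivors.append(evaluator_id)
    return survivors

# Example: the second evaluator passes everything near 1.0 and is pruned.
table = {
    "eval_a": [0.2, 0.7, 0.9, 0.4],
    "eval_b": [0.99, 1.0, 0.98, 1.0],
}
print(select_evaluators(table))  # -> ["eval_a"]
```

A fuller version would also prune redundant evaluators, for example those whose scores correlate strongly with an evaluator that has already been retained.
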
As a result, Darwin’s evaluation pipeline functions like a living immune system—constantly adapting its criteria to stay aligned, robust, and adversarially hardened.
