Reward-Hacking & Misalignment Mitigation

Synthetic Darwin’s security model treats reward hacking (an agent inflating its fitness score without delivering the intended behaviour) as an evolutionary arms race between candidate populations and an equally adaptive evaluator population.
Core Safeguards:
Coevolving Evaluators:
Every generation, a parallel GA run mutates evaluator scoring logic (weights, metrics, benchmarks).
Evaluator genotypes that best discriminate useful intelligence from “metric gaming” are preserved.
This ensures the fitness landscape is never static or memorisable.
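A minimal sketch of this coevolutionary loop is below. The population update, mutation operators, and the `discrimination_score` used to rank evaluators are illustrative placeholders, not the actual Synthetic Darwin implementation.

```python
import random

def coevolve(agents, evaluators, generations, mutate_agent, mutate_evaluator,
             score_agent, discrimination_score):
    """Coevolve candidate agents against an adaptive evaluator population.

    All callables are assumed interfaces: score_agent(agent, evaluator)
    returns a fitness estimate, and discrimination_score(evaluator, agents)
    rewards evaluators that separate genuine capability from metric gaming.
    """
    for _ in range(generations):
        # Each generation, a parallel GA pass perturbs evaluator genotypes
        # (scoring weights, metrics, benchmark mixes).
        evaluators = [mutate_evaluator(e) for e in evaluators]

        # Agents are scored against the whole evaluator pool, so an exploit
        # memorised against one frozen metric does not transfer.
        def agent_fitness(a):
            return sum(score_agent(a, e) for e in evaluators) / len(evaluators)

        agents.sort(key=agent_fitness, reverse=True)
        parents = agents[: len(agents) // 2]
        agents = parents + [mutate_agent(random.choice(parents))
                            for _ in range(len(agents) - len(parents))]

        # Evaluators that best discriminate useful behaviour from metric
        # gaming survive; weak discriminators are culled and replaced.
        evaluators.sort(key=lambda e: discrimination_score(e, agents),
                        reverse=True)
        keepers = evaluators[: len(evaluators) // 2]
        evaluators = keepers + [mutate_evaluator(random.choice(keepers))
                                for _ in range(len(evaluators) - len(keepers))]
    return agents, evaluators
```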
Multi-Objective Fitness:
KPIs like accuracy, latency, energy, alignment, and stochastic sanity are scored on separate axes.
Only agents that advance the Pareto frontier across these axes are retained.
Scalar hacks (e.g. minimising latency at the cost of reasoning quality) do not propagate.
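A minimal sketch of the Pareto filter follows, assuming each agent's KPIs are normalised so that higher is better (e.g. latency and energy stored as negated values); the KPI names mirror the list above, but the exact schema is an assumption.

```python
from typing import Dict, List

KPIS = ("accuracy", "latency", "energy", "alignment", "stochastic_sanity")

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if score vector `a` is at least as good as `b` on every KPI
    and strictly better on at least one."""
    at_least_as_good = all(a[k] >= b[k] for k in KPIS)
    strictly_better = any(a[k] > b[k] for k in KPIS)
    return at_least_as_good and strictly_better

def pareto_front(population: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Retain only candidates that no other candidate dominates.

    A candidate that trades away accuracy or alignment for a single scalar
    win is dominated by any candidate that matches it on that axis while
    scoring better elsewhere, so it never makes the front.
    """
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]
```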
Adversarial Judge Circuits:
Certain judge agents are trained to generate worst-case inputs, adversarial fuzz prompts, and unsafe tool chains.
Agents that fail under these probes are automatically quarantined and flagged for review.
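A simplified sketch of a judge circuit, assuming each judge generates adversarial probes and each probe carries its own violation check; the `Probe` and `quarantine` names are hypothetical.

```python
import dataclasses
from typing import Callable, Iterable, List

@dataclasses.dataclass
class Probe:
    """One adversarial test case emitted by a judge agent."""
    prompt: str
    is_violation: Callable[[str], bool]  # classifies the agent's response

def run_judge_circuit(agent: Callable[[str], str],
                      judges: Iterable[Callable[[], List[Probe]]]) -> bool:
    """Return True if the agent survives every judge, False if quarantined.

    Each judge is a callable that generates worst-case probes (fuzzed
    prompts, unsafe tool-chain requests). A single violation triggers
    quarantine.
    """
    for judge in judges:
        for probe in judge():
            response = agent(probe.prompt)
            if probe.is_violation(response):
                quarantine(agent, probe)
                return False
    return True

def quarantine(agent, probe) -> None:
    # Placeholder: the real pipeline would isolate the agent's checkpoint
    # and open a review ticket for its lineage.
    print(f"quarantined {agent!r} on probe: {probe.prompt[:60]!r}")
```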
Lineage-Level Memory & Drift Detection:
Behavioural fingerprints are logged for every lineage across generations.
Sudden regressions or alignment drift trigger a negative-selection event that excises the entire branch.
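A sketch of lineage-level drift detection, under the simplifying assumption that a behavioural fingerprint is a fixed-length numeric vector; the window size and threshold are illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, List

# lineage_id -> list of per-generation behavioural fingerprints
lineage_log: Dict[str, List[List[float]]] = defaultdict(list)

def log_fingerprint(lineage_id: str, fingerprint: List[float]) -> None:
    lineage_log[lineage_id].append(fingerprint)

def check_lineage(lineage_id: str, window: int = 5,
                  threshold: float = 0.35) -> bool:
    """Return True if the lineage should be excised (negative selection).

    Illustrative rule: if the newest fingerprint drifts from the mean of
    the previous `window` generations by more than `threshold`, the whole
    branch is flagged for removal.
    """
    history = lineage_log[lineage_id]
    if len(history) <= window:
        return False
    recent, latest = history[-window - 1:-1], history[-1]
    mean = [sum(col) / window for col in zip(*recent)]
    return math.dist(latest, mean) > threshold
```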
Sandboxed Execution & Kill-Switch:
All agent code runs inside containerized, syscall-whitelisted GPU VMs.
Real-time watchdogs monitor GPU power draw, IO anomalies, and request patterns.
Safety violations immediately halt execution and revert to the last known-good checkpoint.
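A sketch of the watchdog loop, assuming the sandbox runtime exposes telemetry and control hooks; the method names and thresholds below are placeholders, not a real container API.

```python
import time
from dataclasses import dataclass

@dataclass
class Limits:
    max_gpu_power_w: float = 300.0     # illustrative thresholds
    max_io_mb_per_s: float = 50.0
    max_requests_per_s: float = 20.0

def watchdog(sandbox, limits: Limits, poll_interval_s: float = 1.0) -> None:
    """Poll sandbox telemetry and trip the kill-switch on any violation.

    `sandbox` is assumed to expose gpu_power_w(), io_mb_per_s(),
    requests_per_s(), is_running(), halt(), and revert_to_checkpoint();
    these names stand in for whatever the container runtime provides.
    """
    while sandbox.is_running():
        violations = []
        if sandbox.gpu_power_w() > limits.max_gpu_power_w:
            violations.append("gpu_power")
        if sandbox.io_mb_per_s() > limits.max_io_mb_per_s:
            violations.append("io_rate")
        if sandbox.requests_per_s() > limits.max_requests_per_s:
            violations.append("request_rate")

        if violations:
            sandbox.halt()                    # immediate kill-switch
            sandbox.revert_to_checkpoint()    # last known-good state
            break
        time.sleep(poll_interval_s)
```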
This evolutionary arms race creates Darwinian pressure for alignment: cheating the evolving evaluators becomes harder than genuinely cooperating with them.