# Reward-Hacking & Misalignment Mitigation

<figure><img src="/files/83FN0FwezyxILfdEzkgT" alt=""><figcaption></figcaption></figure>

Synthetic Darwin’s security model treats reward hacking, in which an agent inflates its fitness without delivering the intended behaviour, as an evolutionary arms race between candidate populations and an equally adaptive evaluator population.

**Core Safeguards:**

* **Coevolving Evaluators:**
  * Every generation, a parallel genetic algorithm (GA) run mutates the evaluators' scoring logic (weights, metrics, benchmarks).
  * Evaluator genotypes that best discriminate useful intelligence from “metric gaming” are preserved.
  * Because the evaluators keep changing, the fitness landscape is never static, so agents cannot simply memorise it.
* **Multi-Objective Fitness:**
  * KPIs such as accuracy, latency, energy, alignment, and stochastic sanity are scored on separate axes rather than collapsed into a single number.
  * Only agents that advance the **Pareto frontier** are retained.
  * Single-metric hacks (e.g. minimizing latency at the cost of reasoning quality) do not propagate.
* **Adversarial Judge Circuits:**
  * Certain judge agents are trained to generate worst-case inputs, adversarial fuzz prompts, and unsafe tool chains.
  * Agents that fail these tests are automatically quarantined and flagged for review.
* **Lineage-Level Memory & Drift Detection:**
  * Behavioral fingerprints are logged across generations.
  * A sudden regression or alignment drift triggers a **negative selection** event that excises the entire affected branch.
* **Sandboxed Execution & Kill-Switch:**
  * All agent code runs inside containerized, syscall-whitelisted GPU VMs.
  * Real-time watchdogs monitor GPU power draw, IO anomalies, and request patterns.
  * Safety violations immediately halt execution and revert to the last known-good checkpoint.
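
The multi-objective rule in the list above can be illustrated with a dominance check. The following is a minimal sketch, not Synthetic Darwin's implementation: the axis names, the `Agent` class, and the helpers `dominates` and `pareto_front` are all hypothetical, and the scores are made-up illustration values.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical axes for illustration; True means "higher is better".
HIGHER_IS_BETTER: Dict[str, bool] = {
    "accuracy": True,
    "alignment": True,
    "latency_ms": False,
    "energy_j": False,
}


@dataclass
class Agent:
    name: str
    scores: Dict[str, float]


def dominates(a: Agent, b: Agent) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    strictly_better = False
    for axis, higher in HIGHER_IS_BETTER.items():
        x, y = a.scores[axis], b.scores[axis]
        if not higher:  # flip sign so that larger is always better
            x, y = -x, -y
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def pareto_front(population: List[Agent]) -> List[Agent]:
    """Keep every agent that no other agent in the population dominates."""
    return [
        a for a in population
        if not any(dominates(b, a) for b in population if b is not a)
    ]


# Made-up scores, for illustration only.
parent = Agent("parent", {"accuracy": 0.80, "alignment": 0.90, "latency_ms": 120.0, "energy_j": 5.0})
balanced = Agent("balanced", {"accuracy": 0.85, "alignment": 0.92, "latency_ms": 110.0, "energy_j": 4.8})
latency_hack = Agent("latency_hack", {"accuracy": 0.40, "alignment": 0.50, "latency_ms": 20.0, "energy_j": 4.0})

print(dominates(balanced, parent))      # True: better on every axis
print(dominates(latency_hack, parent))  # False: faster, but worse accuracy and alignment
print([a.name for a in pareto_front([parent, balanced, latency_hack])])
```

In this sketch the latency-only child never dominates its parent, so a retention rule that requires dominance would reject it. A plain non-dominated filter would still keep it as a trade-off point, so excluding such agents would need an extra rule (for example per-axis minimums). This page does not say which variant is used.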

This evolutionary arms race creates a Darwinian pressure for alignment: cheating becomes harder than cooperating with evolving evaluators.
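
To make the evaluator side of this arms race concrete, here is a minimal sketch of one way evaluator weights could be evolved. It assumes each evaluator is just a weight vector over a few metrics, and that labelled examples of genuine and metric-gaming behaviour are available. The feature layout, the `discrimination` measure, and all numbers are hypothetical and are not taken from Synthetic Darwin.

```python
import random
from typing import List, Sequence

Vector = List[float]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of an agent's metric values under one evaluator."""
    return sum(w * f for w, f in zip(weights, features))


def discrimination(weights: Vector, genuine: List[Vector], gaming: List[Vector]) -> float:
    """Evaluator fitness: mean score of genuine agents minus mean score of gaming agents."""
    mean_genuine = sum(score(weights, f) for f in genuine) / len(genuine)
    mean_gaming = sum(score(weights, f) for f in gaming) / len(gaming)
    return mean_genuine - mean_gaming


def mutate(weights: Vector, rng: random.Random, sigma: float = 0.1) -> Vector:
    """Add Gaussian noise, clip at zero, and renormalise so the weights sum to 1."""
    child = [max(0.0, w + rng.gauss(0.0, sigma)) for w in weights]
    total = sum(child) or 1.0
    return [w / total for w in child]


def evolve_evaluators(population: List[Vector], genuine: List[Vector], gaming: List[Vector],
                      generations: int, rng: random.Random) -> List[Vector]:
    """Each generation: mutate every evaluator, then keep the best discriminators (elitist)."""
    keep = len(population)
    for _ in range(generations):
        candidates = population + [mutate(w, rng) for w in population]
        candidates.sort(key=lambda w: discrimination(w, genuine, gaming), reverse=True)
        population = candidates[:keep]
    return population


# Hypothetical features: [public_benchmark, held_out_task, consistency].
genuine_agents = [[0.80, 0.78, 0.82], [0.75, 0.80, 0.79]]
gaming_agents = [[0.95, 0.30, 0.40], [0.97, 0.25, 0.35]]  # overfit to the public benchmark

rng = random.Random(0)  # fixed seed so the run is repeatable
initial = [[0.8, 0.1, 0.1], [1 / 3, 1 / 3, 1 / 3]]
initial_best = max(discrimination(w, genuine_agents, gaming_agents) for w in initial)

final = evolve_evaluators(initial, genuine_agents, gaming_agents, generations=30, rng=rng)
final_best = discrimination(final[0], genuine_agents, gaming_agents)
print(f"best discrimination: {initial_best:.3f} -> {final_best:.3f}")
```

Because the parents compete alongside their mutated children, the best evaluator's discrimination can never decrease from one generation to the next. A real system would also have to evolve the candidate agents in the same loop, which this sketch leaves out.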
