top of page

By Anthony Aguirre

How can we ensure AI safety going forward?

Compute-based regulations won’t work forever. As costs lower and hardware improves, they will struggle to distinguish the development of Tool AI from work that might produce AGI. Ongoing research into AI safety will be needed. Current efforts include:

Alignment Techniques

We have ways of training AIs to share our values, such as:

  • RLHF (Reinforcement Learning from Human Feedback) where a human gives feedback to the model while it is in training.

  • CAI (Constitutional AI) where an explicit set of ethical principles is used to teach a model to evaluate and adjust its own outputs.

     

We will need ways of proving claims about a given system’s behaviors:

  • Formal Safety Guarantees - mathematical proofs that establish guaranteed behavioral limits based on a system’s structure.

Oversight Measures

We have ways to monitor the outputs of AI systems, such as:

  • Red-teaming, where a system is faced with simulated attacks

  • Third-party audits, where independent evaluators systematically review AI outputs

We will need:

  • Strong Interpretability - AI systems whose behaviors can be understood in a deep, mechanistic way that allows us to predict vulnerabilities and failure modes.

Security Systems

We have software security practices like:

  • Secure coding practices and vulnerability scans to identify risks during development.

  • Bug bounty programs that encourage users to find and report vulnerabilities.


We will need hardware-based security such as:

  • Hardware-software co-design (e.g. chips with built-in safety governors).

  • Geofencing, where chips can establish their locations and disable themselves if removed from a given country or set of countries.

bottom of page