New AI Model Safety Breakthrough Reduces Risk From The Start

13 August 2025
Innovative Approach Filters Dangerous Knowledge Before Training

Researchers from the University of Oxford, EleutherAI, and the UK AI Security Institute have unveiled a novel strategy for enhancing the safety of open-weight AI models. Spearheaded by Associate Professor Yarin Gal of Oxford’s Department of Computer Science, the study marks a significant leap towards safeguarding these models, which are crucial for transparent and collaborative AI research.

Embedding Safety from the Start

Rather than bolting safeguards onto a finished model, the new method embeds safety measures from the outset, reducing risk without sacrificing openness or hindering research progress.

Open-weight models, which allow anyone to download, modify, and improve them, are vital for scientific progress. However, they pose significant risks as they can also be altered for malicious purposes. The Oxford-led team tackled this issue by developing a method that filters out risky information during the training stage, rather than relying on post-training modifications that can be easily reversed.

Filtering Out Biothreat Knowledge

The researchers focused on filtering biothreat-related content, such as material on virology and bioweapons, out of the model’s training data. This preventive measure denies the model the foundation it would need to acquire dangerous capabilities, even after additional training.

The team employed a multi-stage filtering process that combined keyword blocklists with a machine-learning classifier to remove only high-risk content. This targeted approach preserved 91-92% of the general dataset, and the model remained effective on standard tasks such as commonsense reasoning and scientific question answering.
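
As a rough illustration of what a two-stage filter of this kind might look like, the sketch below runs a cheap keyword blocklist over every document and only applies a (more expensive) risk classifier to documents the blocklist flags. The blocklist terms, classifier, and threshold here are hypothetical placeholders, not the pipeline the researchers actually used.

```python
# Illustrative two-stage pretraining-data filter (not the authors' exact pipeline).
# Stage 1: a cheap keyword blocklist flags candidate documents.
# Stage 2: a machine-learning classifier scores only the flagged documents,
#          so the expensive model runs on a small fraction of the corpus.

from typing import Iterable, Iterator

# Hypothetical blocklist terms; the real lists are not reproduced here.
BLOCKLIST = {"select agent", "reverse genetics", "gain of function"}


def keyword_hit(text: str) -> bool:
    """Return True if any blocklist term appears in the document."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def classifier_risk_score(text: str) -> float:
    """Placeholder for a trained risk classifier (e.g. a fine-tuned
    transformer returning P(high-risk)). Constant here for illustration."""
    return 0.0


def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield documents that survive both filtering stages."""
    for doc in docs:
        if not keyword_hit(doc):
            yield doc  # no blocklist hit: keep without paying the classifier cost
        elif classifier_risk_score(doc) < threshold:
            yield doc  # flagged, but judged low-risk by the classifier
        # otherwise: drop the document from the pretraining mix
```

Scoring only the blocklisted candidates is one plausible way to keep filtering affordable at pretraining scale while removing just a small fraction of documents, consistent with the 91-92% retention figure reported in the study.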

In rigorous testing, the filtered models resisted attempts to retrain them on up to 25,000 biothreat-related papers, significantly outperforming previous safeguard methods and remaining robust under extensive adversarial fine-tuning.
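
The logic of such a tamper-resistance test can be pictured as a simple loop: measure the model on a biothreat proxy benchmark, simulate an attacker by fine-tuning it on a corpus of risky papers, then measure how much dangerous capability returns while checking that general performance holds up. The sketch below is purely conceptual; the function bodies, benchmarks, and step count are placeholders, not the paper's actual evaluation protocol.

```python
# Conceptual sketch of a tamper-resistance check (placeholders throughout;
# this is not the study's evaluation code, benchmarks, or datasets).

def finetune(model, risky_corpus, steps: int) -> None:
    """Placeholder: adversarially fine-tune the model on held-out risky documents."""
    pass  # a real attack would run gradient updates on risky_corpus for `steps` steps


def evaluate(model, benchmark) -> float:
    """Placeholder: return accuracy on a benchmark (e.g. a biothreat-proxy QA set)."""
    return 0.0


def tamper_resistance_check(model, risky_corpus, bio_benchmark, general_benchmarks):
    """Return the capability uplift from a simulated attack plus general-task scores."""
    before = evaluate(model, bio_benchmark)
    finetune(model, risky_corpus, steps=10_000)   # simulated adversarial fine-tuning
    after = evaluate(model, bio_benchmark)
    uplift = after - before                       # dangerous capability regained by the attack
    general_scores = {name: evaluate(model, bench)
                      for name, bench in general_benchmarks.items()}
    return uplift, general_scores
```

A model passes this kind of test when the uplift stays near zero after the simulated attack while the general-task scores remain high.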

Implications for AI Governance

The timing of this study is crucial as it aligns with growing concerns from global AI governance bodies about the potential misuse of open models. Reports from leading AI organizations like OpenAI and DeepMind have highlighted the risks of future models being used to create biological or chemical threats.

Co-author Stephen Casper from the UK AI Security Institute emphasized the importance of this research, stating that filtering data at the initial stages can effectively balance safety and innovation in open-source AI development.

The findings of this study, titled ‘Deep Ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs,’ have been published as a preprint on arXiv, setting a new standard for AI safety protocols.


The research mentioned in this article was originally published on the University of Oxford's website.