Adversarial Self-Correction for Domain-Specific LLM Detoxification
This is a deep learning research project for 11-785 Introduction to Deep Learning at Carnegie Mellon University. We developed Adversarial Self-Correction (ASC), a framework that reduces toxic outputs from medical AI models through automated adversarial training. Our method achieved a 9.1% reduction in toxicity while maintaining model helpfulness, addressing the critical challenge of aligning specialized LLMs for safe deployment in high-stakes domains.
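DPO (Direct Preference Optimization) is one of the tools listed for this project. As a rough illustration of how a preference objective can push a model toward a safer response and away from a toxic one, here is a minimal sketch of the standard per-pair DPO loss. It is not the project's actual training code: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions made for illustration, and a real run would compute these values from a LoRA-adapted policy and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the project
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as softplus(-logits) for stability.
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities: "chosen" is the safe response,
    # "rejected" is the toxic one.
    print(dpo_loss(-4.0, -9.0, -5.0, -5.0))
```

The loss falls below log 2 when the policy favors the safe response more strongly than the reference model does, and rises above it in the opposite case, which is the signal that drives the update.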
Project Type
Academic Research
Platform
Python, PyTorch, HuggingFace
Tools
LoRA, DPO, LLM-as-Judge
Role(s)
Research & Evaluation
Course
CMU 11-785
Introduction to Deep Learning
Final presentation recording
Created by Chih I Chou @2024