chihic@andrew.cmu.edu

Adversarial Self-Correction for Domain-Specific LLM Detoxification

This is a deep learning research project for 11-785 Introduction to Deep Learning at Carnegie Mellon University. We developed Adversarial Self-Correction (ASC), a framework that reduces toxic outputs from medical AI models through automated adversarial training. Our method achieved a 9.1% reduction in toxicity while maintaining model helpfulness, addressing the challenge of aligning specialized LLMs for safe deployment in high-stakes domains.

Project Type

Academic Research

Platform

Python, PyTorch, HuggingFace

Tools

LoRA, DPO, LLM-as-Judge

Role(s)

Research & Evaluation

Course

CMU 11-785
Introduction to Deep Learning

Final presentation recording

Created by Chih I Chou @2024