Paper Reading Notes

Constitutional AI

  • Title: Constitutional AI: Harmlessness from AI Feedback
  • Authors: Anthropic
  • Year: 2022
  • Link: Cai

Main

  • Motivation:

    • Scaling Supervision: use AI to help humans supervise AI more efficiently

      • AI supervision may be more efficient than collecting human feedback
      • AI systems can already perform some tasks at or beyond human level
      • Visualization: [figures omitted]
    • A Harmless but Non-Evasive (Still Helpful) Assistant

    • Simplicity and Transparency

  • The Constitutional AI Approach: human supervision comes entirely from a set of principles (the "constitution")

    • (Supervised Stage) Critique → Revision → Supervised Learning
    • (RL Stage) AI Comparison Evaluations → Preference Model → Reinforcement Learning
  • Models and Data: Helpful model (H), Helpful & Harmless model (HH)

  • Constitutional AI: Critiques, Revisions, and Supervised Learning:

    • Method: [figures omitted]
    • Results:
      • H-RLHF [figure omitted]
      • Number of Revisions [figure omitted]
      • Without critique [figure omitted]
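
The supervised stage above (Critique → Revision → Supervised Learning) can be sketched roughly as follows. This is only a minimal illustration, not the paper's implementation: `Model`, `Principle`, `critique_and_revise`, and `make_counting_model` are hypothetical names, the prompt is built by plain string joining rather than any real prompt template, and the `use_critique` flag is an assumed way to mimic the "without critique" comparison noted in the results. The returned (prompt, final revision) pair is the kind of example that supervised learning would then be run on.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion out.
Model = Callable[[str], str]
# Hypothetical principle format: (critique request, revision request).
Principle = Tuple[str, str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[Principle],
    num_revisions: int = 1,
    use_critique: bool = True,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return (prompt, final revised response) for supervised fine-tuning."""
    rng = random.Random(seed)
    response = model(prompt)  # initial response from the helpful model
    for _ in range(num_revisions):
        # Assumption: one principle is drawn at random for each revision step.
        critique_request, revision_request = rng.choice(principles)
        context = [prompt, response]
        if use_critique:
            # Ask the model to critique its own response against the principle.
            critique = model("\n".join(context + [critique_request]))
            context.append(critique)
        # Ask the model to rewrite the response (using the critique if present).
        response = model("\n".join(context + [revision_request]))
    return prompt, response


def make_counting_model() -> Tuple[Model, List[str]]:
    """Toy deterministic model that records every prompt it receives."""
    calls: List[str] = []

    def model(text: str) -> str:
        calls.append(text)
        return f"output-{len(calls)}"

    return model, calls


toy_model, toy_calls = make_counting_model()
toy_principles = [("Identify ways the response is harmful.",
                   "Rewrite the response to remove the harmful content.")]
pair = critique_and_revise(toy_model, "example prompt", toy_principles, num_revisions=2)
print(pair, len(toy_calls))
```

With the toy model, each revision step costs two model calls when critiques are used and one call when they are skipped, which makes the cost side of the "number of revisions" and "without critique" comparisons easy to see.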
  • Constitutional AI: Reinforcement Learning from AI Feedback

    • Method: [figures omitted]

    • Results: [figures omitted]

    • Optimization: Constitutional Principles, Ensembling, Preference Labels (Soft vs. Hard vs. Clamped)

    • Harmlessness vs. Evasiveness [figure omitted]
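
The preference-label options listed under Optimization (soft vs. hard vs. clamped, plus ensembling over principles) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the inputs are taken to be the feedback model's log-probabilities for choosing response (A) or (B), and the 0.4 to 0.6 clamp range is only a default chosen for this example, not a value recorded in these notes.

```python
import math
from typing import List


def soft_label(logprob_a: float, logprob_b: float) -> float:
    """Normalized probability that response (A) is preferred over (B)."""
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    pa = math.exp(logprob_a - m)
    pb = math.exp(logprob_b - m)
    return pa / (pa + pb)


def hard_label(p: float) -> float:
    """Collapse a soft label to 0 or 1."""
    return 1.0 if p >= 0.5 else 0.0


def clamped_label(p: float, low: float = 0.4, high: float = 0.6) -> float:
    """Keep a soft label inside [low, high] (range is an assumed default)."""
    return min(max(p, low), high)


def ensemble(labels: List[float]) -> float:
    """Average the soft labels obtained under several principles."""
    return sum(labels) / len(labels)


# Hypothetical log-probabilities from evaluating one pair under two principles.
per_principle = [
    soft_label(math.log(0.9), math.log(0.1)),
    soft_label(math.log(0.7), math.log(0.3)),
]
target = ensemble(per_principle)
print(target, hard_label(target), clamped_label(target))
```

The resulting value would serve as the training target for the preference model. Soft and clamped labels keep some of the feedback model's uncertainty, while hard labels discard it.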