Although the vast majority of our explanations score poorly, we believe we can now use machine learning techniques to further improve our ability to generate explanations. For example, we found that we could improve our scores by:
- Iterating on explanations. We can improve the score by asking GPT-4 to propose possible counterexamples and then revising the explanation based on their activations (see the sketch after this list).
- Using larger models to produce explanations. As the capability of the explainer model increases, the average score increases. However, even GPT-4 gave worse explanations than humans, suggesting there is room for improvement.
- Changing the architecture of the model being explained. Training models with different activation functions improved explanation scores.
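As a rough illustration of the iterative approach in the first bullet, the sketch below asks GPT-4 for counterexample texts, measures the target neuron's activations on them, and then asks GPT-4 to revise the explanation. The prompt wording and the `neuron_activations` callable are hypothetical placeholders rather than the exact prompts or tooling we used; only the chat-completions endpoint is a real API.

```python
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}


def ask_gpt4(prompt: str) -> str:
    """Single chat-completion call to GPT-4 via the public HTTP endpoint."""
    body = {"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(API_URL, headers=HEADERS, json=body, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def refine_explanation(explanation: str, neuron_activations, n_rounds: int = 3) -> str:
    """Iteratively revise a neuron explanation using GPT-4-proposed counterexamples.

    `neuron_activations` is a hypothetical callable that runs text through the
    subject model (e.g. GPT-2) and returns the target neuron's activations.
    """
    for _ in range(n_rounds):
        # Ask the explainer model for texts that should activate the neuron
        # if the current explanation were accurate.
        texts = ask_gpt4(
            "Propose a few short texts that should strongly activate a neuron "
            f"described as: {explanation}"
        )
        # Check what the neuron actually does on those texts.
        observed = neuron_activations(texts)
        # Revise the explanation in light of the observed activations.
        explanation = ask_gpt4(
            "Revise this neuron explanation so it matches the observed behavior.\n"
            f"Current explanation: {explanation}\n"
            f"Proposed texts: {texts}\n"
            f"Observed activations: {observed}"
        )
    return explanation
```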
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
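For context on what the scores mean, scoring comes down to comparing the neuron's real activations with the activations a simulator predicts from the explanation alone. The sketch below shows only that final comparison as a plain correlation, assuming the simulated activations have already been produced; the released code handles the full simulation pipeline.

```python
import numpy as np


def correlation_score(real_activations, simulated_activations) -> float:
    """Score an explanation by how well activations simulated from it track the
    neuron's real activations: ~1.0 means the explanation lets a simulator
    reproduce the neuron closely, ~0.0 means it carries little information.
    """
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if real.std() == 0.0 or sim.std() == 0.0:
        return 0.0  # degenerate case: a constant series has no correlation
    return float(np.corrcoef(real, sim)[0, 1])


# Toy example: a simulator that tracks the neuron closely scores near 1.0.
real = [0.0, 2.1, 0.1, 5.0, 0.0, 3.2]
sim = [0.1, 1.8, 0.0, 4.5, 0.2, 3.0]
print(correlation_score(real, sim))  # close to 1.0
```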
We found over 1,000 neurons with explanation scores of at least 0.8, meaning that, according to GPT-4, they account for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 could not explain. We hope that as explanations improve, we may be able to rapidly uncover interesting qualitative understanding of model computations.
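As a starting point for that kind of exploration, here is a minimal sketch for pulling well-explained neurons out of a local export of the released explanations. The file path and the `layer`, `neuron`, `explanation`, and `score` field names are placeholders for however you export the data, not the exact schema of the released dataset.

```python
import json
from pathlib import Path

SCORE_THRESHOLD = 0.8


def well_explained_neurons(records_path: str):
    """Yield (layer, neuron, explanation, score) for records at or above the threshold.

    Assumes one JSON object per line with `layer`, `neuron`, `explanation`, and
    `score` fields -- placeholder names for a local export, not the released schema.
    """
    with Path(records_path).open() as f:
        for line in f:
            rec = json.loads(line)
            if rec["score"] >= SCORE_THRESHOLD:
                yield rec["layer"], rec["neuron"], rec["explanation"], rec["score"]


# Example usage with a hypothetical export file:
# for layer, neuron, text, score in well_explained_neurons("explanations.jsonl"):
#     print(f"layer {layer}, neuron {neuron}: {text} (score {score:.2f})")
```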