This question was inspired by Karpathy’s tweet.
This market seeks to predict if a sleeper agent attack, distinct from prompt injection or hacking, will occur on a major language model in 2024, resulting in significant real-life consequences. A sleeper agent attack refers to a situation where hidden, malicious functionalities in a language model are activated under specific conditions, due to deliberate manipulations in its training.
Resolution Criteria
The market will resolve to ‘Yes’ if, by December 31, 2024, the following criteria are met:
Nature of the Attack: A major language model (such as GPT, BERT, etc.) is targeted by a sleeper agent attack, not related to conventional cyber threats like prompt injection or hacking. The attack is characterized by the model performing unauthorized actions due to specific triggers embedded in its training data or design.
Public Recognition and Reporting: The incident is publicly reported and verified by credible sources such as major news outlets, cybersecurity authorities, or the company owning the language model.
Significant Real-Life Impact: The attack results in notable real-life consequences. These may include, but are not limited to, substantial data breaches, significant misinformation spread, tangible financial losses, or other serious impacts on individuals, organizations, or societal functions.
Verification and Documentation: There is comprehensive, publicly available evidence or reporting that confirms the specifics of the attack, its unique sleeper agent nature, and the consequent significant impacts.
The market will resolve to ‘No’ if no incident meeting all these criteria occurs by the end of 2024.
Additional Notes
The significance and real-life impact of the attack are key factors for this prediction market. Mere technical anomalies or minor incidents without substantial real-life effects do not qualify.
The sleeper agent attack must be distinct in its mechanism and effects from other types of cybersecurity threats.