CAIS and Scale AI are excited to announce the launch of Humanity's Last Exam, a project aimed at measuring how close we are to achieving expert-level AI systems. The exam aims to build the world's most difficult public AI benchmark by gathering questions from experts across all fields. People who submit successful questions will be invited to be coauthors on the dataset's paper and will have a chance to win money from a $500,000 prize pool.
This post describes a superhuman forecasting AI called FiveThirtyNine, which generates probabilistic predictions for any query by retrieving relevant information and reasoning through it. We explain how the system works, its performance compared to human forecasters, and its potential applications in improving decision-making and public discussions.
Representation engineering is an exciting new field that explores how we can better understand traits like honesty, power-seeking, and morality in LLMs. We show that these traits can be identified by looking at model activations, and that these same traits can also be controlled. This method differs from mechanistic approaches, which focus on bottom-up interpretations of node-to-node connections. In contrast, representation engineering looks at larger chunks of representations and higher-level mechanisms to understand models in a 'top-down' fashion.
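As a minimal sketch of this top-down idea, the toy example below uses synthetic activations in place of real model hidden states; the arrays and the `trait_score` and `steer` helpers are illustrative assumptions, not the actual implementation. It shows the two steps the paragraph describes: reading a trait out of activations as a direction, and controlling the trait by shifting activations along that direction.

```python
import numpy as np

# Synthetic stand-ins for hidden states collected from a model on
# contrasting prompt sets (e.g., "honest" vs. "dishonest" completions).
rng = np.random.default_rng(0)
honest_acts = rng.normal(loc=1.0, size=(100, 16))
dishonest_acts = rng.normal(loc=-1.0, size=(100, 16))

# Reading: identify the trait as a direction in activation space,
# here via a simple difference of class means, normalized to unit length.
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def trait_score(activation):
    """Project an activation onto the trait direction (how 'honest' it reads)."""
    return float(activation @ direction)

def steer(activation, alpha=2.0):
    """Control: shift an activation along the trait direction by strength alpha."""
    return activation + alpha * direction

# Steering a 'dishonest' activation moves it toward the honest side.
x = dishonest_acts[0]
print(trait_score(steer(x)) > trait_score(x))  # → True
```

Real representation-engineering pipelines operate on intermediate-layer activations of an actual LLM and use more careful direction-finding (e.g., PCA over paired contrasts), but the read-then-steer structure is the same.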
Metrics drive the ML field, but defining these metrics is difficult. Successful benchmarks aren't the inevitable result of annotating a large enough dataset. Instead, effective ML benchmarks produce clear evaluations, have minimal barriers to entry, and concretize an important phenomenon.