MMLU:
Hendrycks et al. 2020 - Measuring Massive Multitask Language Understanding. This dataset is widely used for evaluating LLMs (for example, GPT-4). However, it has two serious issues: 1) the test set is publicly available on the web, so LLMs are likely contaminated with it; and 2) the dataset has no in-domain training data and can only be evaluated in a few-shot manner, which makes it impossible to compare fairly against prior fine-tuned methods.
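To make the few-shot constraint concrete, here is a minimal sketch of how MMLU-style evaluation prompts are typically assembled: a handful of answered dev-split questions are prepended as in-context demonstrations, since no training split exists for fine-tuning. The example questions below are hypothetical placeholders, not items from the actual dataset.

```python
# Sketch of MMLU-style k-shot prompt construction (illustrative only;
# the demonstration questions here are invented, not real MMLU items).

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one multiple-choice question; include the answer for demonstrations,
    leave it blank for the item the model must complete."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_fewshot_prompt(dev_examples, test_question, test_options, subject):
    """k-shot prompt: subject header, k answered dev examples, then the
    unanswered test question ending in 'Answer:' for the model to fill in."""
    header = f"The following are multiple choice questions about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, o, a) for q, o, a in dev_examples)
    return header + demos + "\n\n" + format_example(test_question, test_options)

dev = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]  # hypothetical demonstration
prompt = build_fewshot_prompt(dev, "3 * 3 = ?", ["6", "9", "12", "18"], "mathematics")
print(prompt)
```

Because the only labeled in-domain examples come from the tiny dev split, this prompt-based setup is the sole evaluation mode available, whereas earlier fine-tuned baselines trained on thousands of in-domain examples.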