💐 ClassEval Leaderboard

The first class-level code generation benchmark
for evaluating LLMs in more realistic software development scenarios.


📝 Notes

  1. We devise three distinct generation strategies for evaluating LLMs on class-level code generation (sketched below):
     (1) Holistic Generation: the model generates the entire class at once, with the class skeleton as input.
     (2) Incremental Generation: the model generates the class method by method; each iteration conditions on the method bodies generated in previous iterations, and the process repeats until all methods in the class are generated.
     (3) Compositional Generation: the model also generates the class method by method, but each iteration is independent of the other generated methods; finally, all generated methods are assembled to form the class.
  2. All samples are generated from scratch using our codebase; the raw generations are also available here.
  3. By default, models are ranked by pass@1 under greedy decoding. Results for other pass@k metrics are available here (the standard pass@k estimator is sketched below).
  4. The prompts for the three generation strategies can be found here; generations that do not follow the required format are considered incorrect.
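
A minimal sketch of how the three strategies differ in what context each model call sees, assuming a generic `generate` callable that stands in for one LLM call. The prompt shapes and helper names here are illustrative, not the actual ClassEval pipeline:

```python
from typing import Callable

Generate = Callable[[str], str]  # prompt -> generated code

def holistic(skeleton: str, generate: Generate) -> str:
    # One call: the model sees the full class skeleton and emits the whole class.
    return generate(skeleton)

def incremental(skeleton: str, methods: list[str], generate: Generate) -> str:
    # Method by method: each prompt carries the class built so far,
    # so later methods can depend on earlier generations.
    partial = skeleton
    for name in methods:
        partial += "\n" + generate(f"{partial}\n# Complete method: {name}")
    return partial

def compositional(skeleton: str, methods: list[str], generate: Generate) -> str:
    # Method by method, but each call sees only the original skeleton;
    # the independently generated bodies are assembled at the end.
    bodies = [generate(f"{skeleton}\n# Complete method: {name}") for name in methods]
    return skeleton + "\n" + "\n".join(bodies)
```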

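For reference, a sketch of the standard unbiased pass@k estimator (Chen et al., 2021) that code-generation leaderboards typically use; with greedy decoding there is one sample per task, so pass@1 reduces to the fraction of tasks solved:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: probability that at least one of k samples drawn
    # from n total (of which c are correct) passes all tests.
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.3
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```
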
🤗 Acknowledgement

Thanks to the authors of EvalPlus for sharing the template source code. Beyond the ClassEval leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:

  1. EvalPlus Leaderboard
  2. CRUXEval Leaderboard
  3. Chatbot Arena Leaderboard
  4. Big Code Models Leaderboard
  5. InfiCoder-Eval
  6. TabbyML Leaderboard