ProjectEval Leaderboard

ProjectEval is a multi-level benchmark for evaluating LLMs on complex project-level code generation tasks. It simulates realistic software engineering workflows by combining natural language prompts, structured checklists, and code skeletons.

📄 Paper  |  🚀 Project  |  ✉️ Contact Us  |  📤 Submit Your Model's Result

Execution Metrics (Pass@5)

The Level 1 input is one or a few sentences of natural language (English) describing the project. The Level 2 input is the project's checklist, and the Level 3 input is the project's skeleton.
In Cascade mode, the LLM agent generates the checklist, then the skeleton, then the code, in that order; in Direct mode, it generates the code directly.
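Pass@5 here presumably follows the standard unbiased pass@k estimator (Chen et al., 2021): given n generations per task of which c pass, it estimates the probability that at least one of k sampled generations passes. A minimal sketch, assuming that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes."""
    if n - c < k:
        # fewer than k incorrect generations: a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct, k = 5
print(round(pass_at_k(10, 3, 5), 4))  # → 0.9167
```

Scores are then averaged across all tasks at a given level.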
| Model | Report By | Report Date | Output Format | Cascade L1 | Cascade L2 | Cascade Avg. | Direct L1 | Direct L2 | Direct L3 | Direct Avg. | All Avg. |

Objective Metrics

The Checklists (CL) similarity scores are computed by sentence-level matching with Sentence Transformers (Reimers & Gurevych, 2020), optimally aligned against the canonical solutions via the Jonker–Volgenant algorithm (Jonker & Volgenant, 1987).
The Skeletons (SK) and Codes evaluation uses CodeBLEU (Ren et al., 2020), which incorporates both syntactic and semantic criteria. For short textual values such as parameter names or URLs in Parameter Values (PV), cosine similarity is applied directly.
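The alignment step above can be sketched as follows. This is a minimal illustration, not the benchmark's exact scoring code: the function name and the final averaging are assumptions, the embeddings would in practice come from a Sentence Transformers model, and SciPy's `linear_sum_assignment` is used as the assignment solver (it implements a Jonker–Volgenant-style algorithm):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def checklist_similarity(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Align generated checklist items to canonical ones one-to-one and
    average the matched cosine similarities.

    gen_emb: (n, d) embeddings of generated checklist sentences
    ref_emb: (m, d) embeddings of canonical-solution sentences
    (hypothetical inputs; real scores use Sentence Transformers embeddings)
    """
    # L2-normalize rows so dot products are cosine similarities
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = gen @ ref.T  # (n, m) pairwise cosine-similarity matrix

    # Optimal one-to-one matching maximizing total similarity
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())
```

The same cosine-similarity building block, applied directly to the two embedded strings without any alignment, would cover the short PV values.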
| Model | Report By | Report Date | Output Format | Cascade L1 CL | Cascade L1 SK | Cascade L1 Code | Cascade L1 PV | Cascade L2 SK | Cascade L2 Code | Cascade L2 PV | Direct L1 Code | Direct L1 PV | Direct L2 Code | Direct L2 PV | Direct L3 Code | Direct L3 PV |