ProjectEval Leaderboard

ProjectEval is a multi-level benchmark for evaluating LLMs on complex project-level code generation tasks. It simulates realistic software engineering workflows by combining natural language prompts, structured checklists, and code skeletons.

📄 Paper  |  🚀 Project  |  ✉️ Contact Us  |  📤 Submit Your Model's Result

Execution Metrics (Pass@5)

The Level 1 input is one or a few sentences of natural language (English) describing the project. The Level 2 input is the project's checklist, and the Level 3 input is the project's skeleton.
In Cascade mode, the LLM agent generates the checklist, then the skeleton, then the code, in that order; in Direct mode, it generates the code directly.
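Pass@5 here presumably follows the standard unbiased pass@k estimator (Chen et al., 2021): given n generations per task of which c pass, it estimates the probability that at least one of k sampled generations passes. A minimal sketch, assuming that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes."""
    if n - c < k:
        # fewer than k incorrect generations: a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct, k = 5
print(round(pass_at_k(10, 3, 5), 4))  # → 0.9167
```

Scores are then averaged across all tasks at a given level.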
| Model | Report By | Report Date | Output Format | Cascade L1 | Cascade L2 | Cascade Avg. | Direct L1 | Direct L2 | Direct L3 | Direct Avg. | All Avg. |

Objective Metrics

The Checklists (CL) similarity scores are computed by sentence-level matching with Sentence Transformers (Reimers & Gurevych, 2020), optimally aligned against the canonical solutions via the Jonker–Volgenant algorithm (Jonker & Volgenant, 1987).
The Skeletons (SK) and Codes evaluation uses CodeBLEU (Ren et al., 2020), which incorporates both syntactic and semantic criteria. For short textual values such as parameter names or URLs in Parameter Values (PV), cosine similarity is applied directly.
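The alignment step above can be sketched as follows. This is a minimal illustration, not the benchmark's exact scoring code: the function name and the final averaging are assumptions, the embeddings would in practice come from a Sentence Transformers model, and SciPy's `linear_sum_assignment` is used as the assignment solver (it implements a Jonker–Volgenant-style algorithm):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def checklist_similarity(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Align generated checklist items to canonical ones one-to-one and
    average the matched cosine similarities.

    gen_emb: (n, d) embeddings of generated checklist sentences
    ref_emb: (m, d) embeddings of canonical-solution sentences
    (hypothetical inputs; real scores use Sentence Transformers embeddings)
    """
    # L2-normalize rows so dot products are cosine similarities
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = gen @ ref.T  # (n, m) pairwise cosine-similarity matrix

    # Optimal one-to-one matching maximizing total similarity
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())
```

The same cosine-similarity building block, applied directly to the two embedded strings without any alignment, would cover the short PV values.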
| Model | Report By | Report Date | Output Format | Cascade L1 CL | Cascade L1 SK | Cascade L1 Code | Cascade L1 PV | Cascade L2 SK | Cascade L2 Code | Cascade L2 PV | Direct L1 Code | Direct L1 PV | Direct L2 Code | Direct L2 PV | Direct L3 Code | Direct L3 PV |