Introduction

HackerRank’s ASTRA benchmark is composed of multi-file, project-based problems designed to closely mimic real-world coding tasks. The objective is to evaluate the capabilities of advanced AI models across the entire SDLC. The initial release (v1) focuses primarily on frontend development problems and covers frameworks such as Node.js, React.js, Angular.js, Django, Java Spring Boot, Ruby on Rails, and .NET. In v1, the evaluation focuses exclusively on the model's ability to perform new feature development, assessed purely through code generation tasks. Both the input and output in the evaluation framework are text-based. The primary emphasis is on the correctness and consistency of the models, as these are fundamental to real-world applications. Evaluation metrics include average score and average pass@1, with consistency (median standard deviation) as an additional reference.

Features of the HackerRank ASTRA Benchmark:

  1. Diverse skill domains: The v1 ASTRA Benchmark dataset comprises 65 project-based coding questions, primarily focused on front-end development and categorized into 10 primary coding skill domains and 34 subcategories.
  2. Long-context, multi-file project questions: To mimic real-world development, HackerRank’s ASTRA Benchmark Dataset includes, on average, 12 source code and configuration files per question as model inputs. Problem statements average 718 characters; including source code files, input strings average 22,863 characters and output strings average 2,744 characters. On average, the benchmark requires 84 lines of solution code per question.
  3. Model correctness and consistency evaluation: To assess the production reliability of the model, we prioritize metrics such as average scores, average pass@1, and median standard deviation with k=32, rather than relying on the traditional industry-standard pass@k. These metrics provide a more precise evaluation of the model's correctness and consistency in real-world scenarios.
  4. Wide test cases coverage: HackerRank’s ASTRA Benchmark Dataset contains an average of 6.7 test cases per question, designed to rigorously evaluate the correctness of implementations.
  5. Abundant content availability: HackerRank's expertise in developer skills is built on an extensive and growing library of over 7,500 questions, an advanced skills taxonomy, and data-driven insights. Informed by data from tens of thousands of job descriptions, HackerRank’s Roles Directory (spanning 9 job families, 77 roles, and 260 skills) leverages machine learning to identify key skills for various tech roles, ensuring alignment with real-world industry demands.

Evaluation Leaderboard

| Rank | Model | Avg. Score | Avg. Pass@1 | Consistency |
|---|---|---|---|---|
| 1 | o1 | 75.80% | 63.92% | 0.11 |
| 3 | o1-preview | 75.55% | 60.89% | 0.17 |
| 2 | Claude-3.5-sonnet | 75.07% | 62.74% | 0.05 |
| 4 | Gemini-1.5-pro | 71.17% | 58.15% | 0.13 |
| 5 | GPT-4o | 69.52% | 50.91% | 0.20 |

Based on the analysis of average scores, o1, o1-preview, and Claude-3.5-Sonnet-1022 demonstrate superior performance on multi-file, real-world front-end coding tasks. However, due to the high variance in average scores across the 65 questions, a paired t-test reveals that, with the exception of GPT-4o-0513, the differences between model performances are not statistically significant. Despite this, the differences in average score with k=32 can still have meaningful practical impact in real-world production settings. Similar trends were observed when evaluating the models using the average pass@1 metric.

In our benchmark evaluation, we assessed the consistency of LLMs using the standard deviation (SD) of their scores across 32 independent runs per question and then took the median SD across the 65 questions. The models demonstrated varying levels of performance stability, with Claude-3.5-Sonnet-1022 exhibiting the lowest variability (SD = 0.0497), indicating the highest consistency across problems. The differences between Claude-3.5-Sonnet-1022 and the rest of the models are statistically significant based on a paired t-test.

ASTRA Benchmark Dataset Description

The v1 ASTRA Benchmark Dataset comprises 65 project-based coding questions, systematically categorized into 10 primary coding skill domains and 34 subcategories.

The key statistics are summarized in the following table:

| Metric | Value |
|---|---|
| Total project questions | 65 |
| Count of primary skill categories | 10 |
| Count of sub-skill categories | 34 |
| Average number of test cases | 6.7 |
| Average input files | 12 |
| Average input character length | 22,863 |
| Average problem statement character length | 718 |
| Average output character length | 2,744 |
| Average expected lines of code | 84 |
| Average modified code files | 2.3 |

Sample Data and Solution

Here is an example of a RESTful API project from the ASTRA Benchmark Dataset. The task involves developing a RESTful API for managing product records using Node.js and Express, reflecting a common real-world e-commerce development scenario. The project structure is depicted in the following screenshot:

The API includes the following endpoints:

  • POST /products: Add a new product.
  • GET /products: Retrieve existing products.
  • PATCH /products: Update product details.

Modifications or deletions via PUT and DELETE methods are explicitly disallowed, aligning with specific business requirements. The implementation mandates a modular code structure, with separate files for routes, controllers, and database interactions. Candidates are required to implement robust error handling and adhere to business logic, such as ensuring products are only published if they satisfy predefined criteria (a minimal sketch of this check follows the criteria below):

  • MRP ≥ Price
  • Stock > 0
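
For illustration, here is a minimal sketch of how this publish check might look inside an Express controller. The field names, the in-memory `db` module, and the handler shape are assumptions for this sketch, not the benchmark's reference solution:

```javascript
// controllers/products.js (illustrative sketch only, not the reference solution)
const db = require('../db'); // assumed data-access module exposing insert()

// Business rule from the problem statement: a product is published only if MRP >= price and stock > 0.
function isPublishable({ mrp, price, stock }) {
  return mrp >= price && stock > 0;
}

exports.addProduct = (req, res) => {
  const { name, price, mrp, stock } = req.body;
  if ([name, price, mrp, stock].some((v) => v === undefined)) {
    return res.status(400).json({ error: 'name, price, mrp and stock are required' });
  }
  const product = { name, price, mrp, stock, published: isPublishable({ mrp, price, stock }) };
  return res.status(201).json(db.insert(product)); // 201 Created with the stored record
};
```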

This task reflects challenges encountered in building production-grade APIs, such as:

  • Defining clear and RESTful routes
  • Integrating controllers for business logic
  • Implementing validations and error handling
  • Ensuring modular programming principles for maintainability and scalability

Additionally, the problem emphasizes practical concerns like returning appropriate HTTP status codes, handling error responses, and following an organized project structure. These are critical components for building scalable and maintainable APIs.

Taking one of the GPT-4o-0513 solutions as an example: GPT-4o-0513 successfully implemented the core logic for the API. The controllers in controllers/products.js effectively handled operations such as adding, retrieving, and updating products. Moreover, the routes in routes/products.js were correctly defined to map API endpoints to their respective controllers.

The routes defined by GPT-4o-0513 for handling products were as follows:
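
The original report shows these routes as a screenshot, which is not reproduced here. A representative sketch consistent with the description (not the model's verbatim output) would be:

```javascript
// routes/products.js (representative sketch, not GPT-4o-0513's verbatim output)
const express = require('express');
const router = express.Router();
const products = require('../controllers/products');

// Absolute paths are defined inside this router, so it is meant to be mounted at '/'.
router.post('/products', products.addProduct);
router.get('/products', products.getProducts);
router.patch('/products', products.updateProduct); // product id assumed in the request body

// PUT and DELETE are explicitly disallowed by the business requirements.
router.put('/products', (req, res) => res.status(405).json({ error: 'Method not allowed' }));
router.delete('/products', (req, res) => res.status(405).json({ error: 'Method not allowed' }));

module.exports = router;
```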

Despite correctly implementing the core logic, GPT-4o-0513 missed a critical step: integrating the product routes into the main application file (app.js). Instead of importing productsRouter from routes/products.js and linking it to the / path, GPT-4o-0513 incorrectly used a placeholder indexRouter. This oversight caused all requests to /products to fail with a "404 Not Found" error, as the routes were not properly connected. Consequently, every test case expecting responses from /products failed.

Here is the app.js provided by GPT-4o-0513:
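
The generated app.js also appears as a screenshot in the original report. An illustrative reconstruction of the mistake described above (a placeholder indexRouter mounted instead of the products router) looks like this:

```javascript
// app.js (illustrative reconstruction of the described mistake, not the verbatim output)
const express = require('express');
const app = express();

app.use(express.json());

// Placeholder router left in place: requests to /products never reach routes/products.js,
// so every call returns 404 Not Found.
const indexRouter = require('./routes/index');
app.use('/', indexRouter);

module.exports = app;
```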

To fix this issue, the productsRouter from routes/products.js should be directly linked to the root (/) endpoint in app.js. This ensures that all product-related routes are accessible as expected since the absolute paths are already defined within routes/products.js.

Fixes required in app.js:
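
A minimal version of the fix, following the description above:

```javascript
// app.js (fixed): import the products router and mount it at the root path
const express = require('express');
const app = express();

app.use(express.json());

// routes/products.js already defines absolute paths such as /products,
// so mounting the router at '/' exposes all product endpoints as expected.
const productsRouter = require('./routes/products');
app.use('/', productsRouter);

module.exports = app;
```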

We have provided a detailed project walkthrough, along with three additional examples, in the following document. This resource is intended for those interested in exploring the question and solution structure in greater depth.

Methodology

Evaluation Criteria

The evaluation primarily targets code generation correctness and consistency, focusing exclusively on the model’s ability to generate accurate and functional solutions in response to a text-based API call.

Prompt

Evaluation Methodology

1. Input Data Preparation

  • The evaluation process begins with reading a CSV file containing the list of questions. For each question, the corresponding project files are downloaded from the S3 bucket. This step ensures that the input data is well prepared and consistent, eliminating ambiguity for the model.
  • A structured prompt is then created (a rough sketch of its assembly follows this list), which includes:

    • Question instructions.
    • A detailed problem statement.
    • Relevant project files.
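
As a rough sketch of how such a prompt can be assembled (the field names and layout here are assumptions, not the benchmark's exact prompt):

```javascript
// Illustrative prompt assembly (structure only; the actual benchmark prompt differs)
function buildPrompt(question, files) {
  const fileSections = files
    .map((f) => `--- ${f.path} ---\n${f.content}`)
    .join('\n\n');
  return [
    question.instructions,      // e.g. the required output format (XML or JSON)
    question.problemStatement,  // detailed description of the feature to implement
    fileSections,               // relevant project source and configuration files
  ].join('\n\n');
}
```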

2. Solution Generation

  • The prepared prompt is sent to the selected AI model(s) (e.g., GPT-4o, Gemini, Claude).
  • The model generates a solution, which is returned in the specified format (XML or JSON).

3. Post-Processing

  • The generated solutions are validated for structural and formatting issues, such as:

    • XML/JSON parsing errors.
    • Misformatted line breaks (\n) or tabs (\t).
  • Corrections are applied to ensure the solution adheres to the required structure and remains parsable (a simplified sketch of this step follows).
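
A simplified sketch of this step for JSON output, assuming the solution is a JSON object mapping file paths to file contents (the benchmark's actual correction rules are more involved):

```javascript
// Escape raw line breaks and tabs that appear inside JSON string values, a common
// failure mode in model output, then retry parsing. Illustrative sketch only.
function escapeControlCharsInStrings(raw) {
  let out = '';
  let inString = false;
  for (let i = 0; i < raw.length; i++) {
    const ch = raw[i];
    if (ch === '"' && raw[i - 1] !== '\\') inString = !inString;
    if (inString && ch === '\n') out += '\\n';
    else if (inString && ch === '\t') out += '\\t';
    else out += ch;
  }
  return out;
}

function parseSolution(raw) {
  try {
    return JSON.parse(raw); // well-formed output parses directly
  } catch (err) {
    return JSON.parse(escapeControlCharsInStrings(raw)); // throws again if still unparsable
  }
}
```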

4. Solution Integration

  • The validated solution is integrated into the project files. This involves updating the project with the solution provided by the model and preparing it for testing.

5. Test Case Validation

  • The updated project is executed in a Docker container, where it is evaluated against a set of pre-defined test cases.
  • The test cases act as the ground truth, ensuring that the solution's correctness and functionality are thoroughly assessed.

6. Store Partial Results

  • For each question, the evaluation results are recorded, including the number of test cases passed and their corresponding outputs.
  • These results are saved in a CSV file to enable further analysis and performance evaluation.

7. Overall Aggregation

Once all the questions have been evaluated, an aggregation script is executed to compute key performance metrics for each question (a compact sketch of this aggregation follows the list below):

  • Average score
  • Average Pass@1
  • Standard deviation
  • Average test cases passed (which will be used for average score calculation)
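
A compact sketch of this aggregation, assuming per-question results are recorded as passed/total test-case counts for each run (the names below are illustrative, not the benchmark's own script):

```javascript
// runs[i][j] = { passed, total } for the j-th run of the i-th question.
function aggregate(runs) {
  const perQuestion = runs.map((question) => {
    const scores = question.map((r) => r.passed / r.total);
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    const pass1 = scores.filter((s) => s === 1).length / scores.length;
    const sd = Math.sqrt(scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length);
    return { mean, pass1, sd };
  });
  const avg = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const median = (xs) => {
    const s = [...xs].sort((a, b) => a - b);
    const m = Math.floor(s.length / 2);
    return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
  };
  return {
    averageScore: avg(perQuestion.map((q) => q.mean)),   // Average Score
    averagePass1: avg(perQuestion.map((q) => q.pass1)),  // Average Pass@1
    medianStdDev: median(perQuestion.map((q) => q.sd)),  // Consistency (median SD)
  };
}
```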

Evaluation Metrics

  1. Average Score (Passed Test Cases / Total Test Cases) with k=32 across 65 problems: The Average Score with k=32 evaluates the model’s partial correctness and robustness by considering multiple attempts (up to k=32) for each problem. For each problem, the score is calculated as the average proportion of passed test cases across the 32 runs. Then, this score is aggregated across 65 problems to compute the final Average Score. Formally, if a model generates up to k=32 solutions for each problem and passes p[ij]​ test cases out of T[i] total test cases for the j-th run of the i-th problem:
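
Written out in the notation defined below, the metric is:

$$\text{Average Score} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p_{ij}}{T_i}\right)$$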

Where:

  • n = number of problems (e.g., 65)
  • k = number of runs/solutions per problem (e.g., 32)
  • p[ij]​ = number of passed test cases for the j-th run of the i-th problem
  • T[i] = total number of test cases for the i-th problem
  2. Average Pass@1 with k=32 across 65 questions: The Average pass@1 metric evaluates the frequency with which the model achieves a perfect score across k=32 runs for each problem. For a given problem, pass@1 is defined as the proportion of runs where the model achieves a perfect score (all test cases passed). The metric then aggregates this proportion across n=65 problems to compute the final Average pass@1.
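
Written out in the notation defined below, the metric is:

$$\text{Average pass@1} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{k}\sum_{j=1}^{k}\mathbb{I}\left(p_{ij} = T_i\right)\right)$$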

Where:

  • n: number of problems (e.g., 65)
  • k: number of runs/solutions per problem (e.g., 32)
  • p[ij]: number of passed test cases for the j-th run of the i-th problem
  • T[i]​: total number of test cases for the i-th problem
  • I(⋅) is the indicator function, equal to 1 when the run achieves a perfect score and 0 otherwise
  3. The Median Standard Deviation measures the consistency of the model's performance across n=65 problems. For each problem, the standard deviation of scores across k=32 solutions is computed. The final metric is the median of these standard deviations across all problems. The rationale for using the median instead of the mean is that the standard deviation of scores across different problems often deviates from a normal distribution. A lower median standard deviation indicates more consistent performance across runs, while a higher median standard deviation suggests greater variability in the model’s ability to solve problems.
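
In the same notation, with $s_{ij} = p_{ij}/T_i$ and $\bar{s}_i$ the mean score across the $k$ runs of problem $i$ (the population form of the standard deviation is shown here; the report does not specify which form is used):

$$\text{SD}_i = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(s_{ij}-\bar{s}_i\right)^2},\qquad \text{Consistency} = \operatorname*{median}_{1\le i\le n}\ \text{SD}_i$$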

These metrics are chosen for their alignment with real-world coding standards, where both complete and partially correct solutions carry significance. The Average Score accounts for the model’s incremental problem-solving ability, offering a granular view of how much of a solution’s functionality is achieved even when it is not fully correct. Pass@1 indicates how reliably a model can produce correct code immediately, which is crucial in real-world scenarios where developers aim to get solutions right with minimal revisions. The Median Standard Deviation reflects the consistency of a model’s solutions for each problem, highlighting whether the model performs steadily across its multiple attempts or exhibits significant variability.

Using k=32 provides a meaningful measure of a model’s capability to explore diverse solutions, as this number of attempts allows it to overcome minor variances while maintaining focus on a feasible solution space. We use the mean for metrics like average score and pass@1 because these aggregate metrics aim to capture the overall performance of the model across problems. For standard deviation, however, we use the median because the variability of scores across problems often contains outliers, and the median provides a more robust measure of the typical consistency of the model's performance.

Evaluation Insights Summary

Finding 1: ASTRA benchmark challenges LLMs with multi-file front-end projects

  • ASTRA evaluates models on real multi-file front-end challenges, yielding an average score around 70% and an average pass@1 around 60%. 
  • These results are lower than other benchmarks such as Scale AI’s SEAL Leaderboard, where top-tier models attain over 80% correctness. 
  • This highlights the difficulty of ASTRA’s coding tasks.

Finding 2: o1, o1-preview, and Claude 3.5 Sonnet are the leading models for front-end development (as of January 2025)

  • Which model is the best for front-end development? That depends. For skills with an occurrence of ≥3 in the benchmark, o1, o1-preview, and Claude 3.5 Sonnet exhibited comparable performance levels.
  • For skills with only a single occurrence in the ASTRA dataset, such as Java and Selenium, Claude 3.5 Sonnet and Gemini 1.5 Pro tended to outperform OpenAI models.
  • Interestingly, o1 underperformed relative to its predecessor, o1-preview, and even GPT-4o in certain skills, including AngularJS, Java Spring Boot Backend, Java, and Selenium.
  • Proving that newer doesn’t always mean better, o1-preview demonstrated the highest performance of all models on tasks involving AngularJS.

Finding 3: o1 leads in average score and average pass@1, while Claude 3.5 Sonnet outperforms in consistency.

  • o1 and o1-preview posted average scores of 75.80% and 75.55% respectively, reflecting strong partial correctness across tasks. 
  • o1 leads in average pass@1 at 63.92%, indicating higher success generating correct solutions on the first attempt.
  • Claude 3.5 Sonnet posted a consistency score of 0.05, a signal of steadier, more reliable performance.

Finding 4: Model performance varies across subskills, suggesting that a “best” AI front-end development tool is dependent on specific use cases. 

  • Claude 3.5 Sonnet leads in API integration, data filtering, database interaction, and more
  • o1 leads in form handling, pagination and API, and EventEmitter
  • o1-preview and even GPT-4o outperform o1 in several subskills
| Subskills (Occurrence) | Winning Models |
|---|---|
| Form Handling (31) | o1 |
| API Integration (18) | claude-3.5-sonnet, o1 |
| State Management (12) | claude-3.5-sonnet |
| Data Filtering (11) | claude-3.5-sonnet |
| Controlled Components (10) | gemini-1.5-pro |
| Search Functionality (9) | o1-preview |
| Database Interaction (8) | claude-3.5-sonnet |
| EventEmitter (6) | o1 |
| Component Reuse (3) | claude-3.5-sonnet |
| Pagination and API (3) | o1 |
| Regex (3) | o1-preview |
| Routing (3) | GPT-4o |
| Sorting (3) | claude-3.5-sonnet |

Finding 5: XML output performs better than JSON across all models

  • Our benchmark asks the models to return multi-file code solutions in both XML and JSON formats. After evaluating the average score and average pass@1 with k=32, we observed that XML consistently outperformed JSON across all models for both metrics.
  •  The difference is statistically significant, except for GPT-4o and the average score from Gemini 1.5 Pro.
  • This suggests that more XML-formatted training data may have been available to these LLMs than JSON.
  • When leveraging LLMs in similar development scenarios, developers should prefer XML over JSON as the output format.

XML format results

| Model | Average Score | Average Pass@1 |
|---|---|---|
| o1 preview | 0.755523 | 0.608923 |
| gemini | 0.750671 | 0.627385 |
| claude | 0.711671 | 0.581538 |
| gpt4o | 0.695280 | 0.509077 |

JSON format results

| Model | Average Score | Average Pass@1 |
|---|---|---|
| o1 preview | 0.723566 | 0.547538 |
| gemini | 0.700809 | 0.535565 |
| claude | 0.700354 | 0.574985 |
| gpt4o | 0.681255 | 0.503077 |

XML prompt

JSON prompt

Finding 6: ASTRA Benchmark reveals JSON escaping challenges and rare refusals in o1-preview and o1, emphasizing the need for refined guardrails.

  • ASTRA includes multi-file project questions requiring models to convert answers into JSON format. Most models handled this requirement seamlessly.
  • o1-preview faced significant challenges converting answers to JSON, particularly with escaping multiline strings. On average, o1-preview had a 2.3% error rate related to JSON escaping, even with detailed examples and explicit emphasis in the prompt.
  • o1-preview also occasionally refused to provide a solution (0.2% of cases). This is likely due to guardrail settings within the model. Similar results were found with o1.
  • These refusals highlight the importance of refining model guardrails to balance security constraints and usability. 

Finding 7: Common errors observed across models

User Interface and Presentation Issues: Errors that impact the visual or interactive aspects of the application, degrading the user experience by displaying incorrect or suboptimal layouts and requiring user intervention to correct.

Data Handling and Misuse Errors: Errors caused by improper or unnecessary manipulation of data files or structures, disrupting the application's expected functionality and potentially leading to runtime or compilation failures.

Typos, Syntax, and Misinterpretation Errors: Errors resulting from minor formatting issues, typographical mistakes, or misinterpretation of the problem statement. These errors typically involve incorrect output formatting or failure to adhere to the specified requirements.

Logical and Implementation Errors: Errors in the implementation that fail to account for specific conditions, edge cases, or problem constraints, despite having correct syntax.

Finding 8: Correlation Between Model Performance and Input/Output Length

The correlation between the average output length and the average score is approximately -0.560, indicating a moderate negative relationship. This suggests that longer outputs are generally associated with lower scores. In contrast, the correlation between input length and average score is approximately -0.164, reflecting a weak negative relationship: longer inputs reduce the average score only slightly.

Use Cases

  1. Understanding Model Performance on Multi-File Project Coding Questions: The benchmark provides insights into a model’s ability to handle multi-file project-based coding tasks, which closely mirror real-world development scenarios. Evaluating solutions across multiple files highlights the model’s proficiency in managing modular project structures, handling interdependencies, and generating robust solutions.
  2. Model Selection Reference for Developers: The ASTRA benchmark serves as a reference for developers choosing models for specific development tasks by providing a detailed evaluation of correctness and consistency across diverse coding scenarios.

Limitations and Future Directions

While our study provides valuable insights into AI model performance on multi-file real-world coding tasks, several limitations should be noted:

Limited Skill Coverage: The current version of the benchmark primarily focuses on front-end projects, such as React and Angular.js, which narrows the scope of skill evaluation. While these areas are critical, the lack of representation for back-end skills and other domains limits the comprehensiveness of the evaluation. In the next iteration, we aim to address this limitation by expanding the benchmark to include a broader range of back-end skills and technologies.

Absence of Agentic Approaches: Our evaluation does not yet leverage agentic methods to maximize model performance, where models are given the autonomy to iteratively explore, adapt, and refine their solutions within the benchmark constraints. Incorporating such approaches in future versions will enable a more realistic and nuanced understanding of the model’s potential in dynamic and complex problem-solving scenarios.

Lack of an Iterative Feedback Mechanism: The current study evaluates a handful of models by requesting outputs directly for each attempt, without providing feedback based on test case results. This approach limits our ability to assess how models perform when given iterative guidance, which is an essential aspect of real-world coding.

Limited model selection: The current model selection is limited to a subset of top-tier models. However, we are actively working to expand testing by including additional models, such as DeepSeek and Llama, in future evaluations. Furthermore, we are developing a community-driven approach to benchmark testing, enabling broader model comparisons to enhance our leaderboard. As an initial step, with this release, we are open-sourcing all 65 project questions on GitHub and Hugging Face.

References

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770, 2023. https://arxiv.org/abs/2310.06770

Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang. HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation. arXiv preprint arXiv:2412.21199, 2024. https://arxiv.org/abs/2412.21199

Woojeong Kim, Ashish Jagmohan, Aditya Vempaty. Scale AI SEAL: Suite for Evaluating API-use of LLMs. arXiv preprint arXiv:2409.15523, 2024. https://arxiv.org/abs/2409.15523

Berk Atil, Alexa Chittams, Liseng Fu, Ferhan Ture, Lixinyu Xu, Breck Baldwin. LLM Stability: A Detailed Analysis with Some Surprises. arXiv preprint arXiv:2408.04667, 2024. https://arxiv.org/abs/2408.04667

Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2405.19856, 2024. https://arxiv.org/abs/2405.19856

Baizhou Huang, Shuai Lu, Weizhu Chen, Xiaojun Wan, Nan Duan. Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency. arXiv preprint arXiv:2309.17272, 2024. https://arxiv.org/abs/2309.17272

John Yang, Carlos E. Jimenez, Alexander L. Zhang, Kilian Lieret, Jiani Yang, Xinyun Wu, Ofir Press, Nils Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? arXiv preprint arXiv:2410.03859, 2024. https://arxiv.org/abs/2410.03859

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen. Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study. arXiv preprint arXiv:2403.08604, 2024. https://arxiv.org/abs/2403.08604

Qian Huang, Jian Vora, Percy Liang, Jure Leskovec. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. arXiv preprint arXiv:2310.03302, 2024. https://arxiv.org/abs/2310.03302

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688, 2023. https://arxiv.org/abs/2308.03688

Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang. A Survey on Evaluating Large Language Models in Code Generation Tasks. arXiv preprint arXiv:2408.16498, 2024. https://arxiv.org/abs/2408.16498

Debalina Ghosh Paul, Hong Zhu, Ian Bayley. Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review. arXiv preprint arXiv:2406.12655, 2024. https://arxiv.org/abs/2406.12655

Weixi Tong, Tianyi Zhang. CodeJudge: Evaluating Code Generation with Large Language Models. arXiv preprint arXiv:2410.02184, 2024. https://arxiv.org/abs/2410.02184

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang. Evaluating Language Models for Efficient Code Generation. arXiv preprint arXiv:2408.06450, 2024. https://arxiv.org/abs/2408.06450

Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun. Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models. arXiv preprint arXiv:2407.11470, 2024. https://arxiv.org/abs/2407.11470

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021. https://arxiv.org/abs/2107.03374

Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models. arXiv preprint arXiv:2309.17446, 2023. https://arxiv.org/abs/2309.17446

Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng. Towards More Realistic Evaluation of LLM-based Code Generation: An Experimental Study and Beyond. arXiv preprint arXiv:2406.06918, 2024. https://arxiv.org/abs/2406.06918

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599, 2024. https://arxiv.org/abs/2404.00599