From prompt to paste: evaluate AI / LLM output under a strict Python sandbox and get actionable scores across 7 categories, including security, correctness and upkeep.
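
To make the tagline concrete, here is a minimal, hypothetical sketch of the core idea, not this repository's actual API: write the LLM-generated Python to a temp file, execute it in a separate, time-limited subprocess, and derive per-category scores from the result. The function names `run_in_sandbox` and `score` are illustrative assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(code: str, timeout: float = 5.0) -> dict:
    """Run untrusted code in a separate, isolated Python process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site-packages and env vars
                capture_output=True,
                text=True,
                timeout=timeout,  # kill hanging or long-running code
                cwd=tmp,          # keep any file writes inside the temp dir
            )
        except subprocess.TimeoutExpired:
            return {"returncode": None, "stdout": "", "stderr": "timed out"}
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


def score(report: dict) -> dict:
    """Toy scorer: only correctness is derived here; the other categories
    (security, upkeep, ...) would need static analysis and richer checks."""
    return {"correctness": 1.0 if report["returncode"] == 0 else 0.0}


if __name__ == "__main__":
    report = run_in_sandbox("print(sum(range(10)))")
    print(report["stdout"].strip(), score(report))
```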

1 star · 1 fork · 1 watcher · Python · Apache License 2.0
Tags: ai, benchmark, benchmarking, benchmarking-suite, code-generation, code-security, evals, evaluation, generative-ai, llm, python, python3, reproducibility, risk-assessment, sandbox, security, static-analysis, testing
Last updated: Sep 14, 2025

Open Issues Need Help (2)

Test coverage · help wanted · about 2 months ago

bug · good first issue
