Open Issues Need Help
Test coverage · help wanted · about 2 months ago
scripts/validate_docs.py refactor · bug · good first issue · about 2 months ago
From prompt to paste: evaluate AI/LLM output under a strict Python sandbox and get actionable scores across 7 categories, including security, correctness, and upkeep.
Python
#ai #benchmark #benchmarking #benchmarking-suite #code-generation #code-security #evals #evaluation #generative-ai #llm #python #python3 #reproducibility #risk-assessment #sandbox #security #static-analysis #testing
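The project's actual scoring pipeline and category definitions are not shown on this page. As a rough illustration of the idea in the description, the sketch below runs a snippet of LLM-generated code in a separate, resource-isolated Python subprocess with a hard timeout and derives a single pass/fail score. All names, limits, and the scoring rule here are hypothetical and not taken from the repository.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical limit; the real project may enforce sandboxing very differently.
TIMEOUT_SECONDS = 5

def run_sandboxed(code: str) -> dict:
    """Run untrusted code in a separate Python process started in isolated
    mode (-I) with a hard timeout, capturing stdout/stderr for scoring."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I ignores env vars and user site-packages
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
        return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stdout": "", "stderr": "timed out"}
    finally:
        os.unlink(path)

def score_correctness(result: dict) -> float:
    """Toy 'correctness' score: 1.0 if the snippet ran cleanly, else 0.0.
    The project reports scores across 7 categories; this covers only one."""
    return 1.0 if result["returncode"] == 0 else 0.0

if __name__ == "__main__":
    generated = "print(sum(range(10)))"  # stand-in for pasted LLM output
    result = run_sandboxed(generated)
    print("correctness:", score_correctness(result))
```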