Open Issues Need Help
View All on GitHubAI Summary: The issue proposes adding a new LLM-as-a-judge evaluation metric called "LLM Sycophancy" (SycEval), based on a provided research paper. This new metric needs to be integrated into both the Python SDK and the frontend UI for online evaluation, along with corresponding documentation updates, following the pattern of existing metrics like "Hallucination".
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
AI Summary: Integrate Gretel datasets into Comet Opik. This involves creating a Python SDK-based solution (ideally a Jupyter Notebook) to import Gretel datasets and upload them as Opik datasets, handling authentication and any necessary data conversion. The solution should be demonstrable in the Opik UI.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
AI Summary: Develop a Comet Opik integration to import datasets from Hugging Face Datasets. This involves creating a Python SDK function and a Jupyter Notebook example (cookbook) demonstrating the import and upload of a Hugging Face dataset to Comet Opik's dataset management system. The solution should handle dataset conversion and authentication as needed.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
AI Summary: Implement a new 'Trajectory Accuracy' evaluation metric for Opik, an open-source LLM evaluation platform. This metric should assess the accuracy of action sequences in ReAct-style agents, ideally using an LLM-as-a-judge approach. The implementation should include additions to the frontend UI (Online Evaluation tab), Python SDK, and documentation, along with a demonstration video.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.