There’s a lot of activity these days around understanding the inner workings of software that integrates with large language models. Practitioners need help with this, and platform and tool providers are putting forward compelling offerings. Here are some quick notes from several recent conversations on this topic. I’ll update this from time to time when I come across new info.
You might ask…
Q: How do I track what my LLM software is doing, so that I can tell what’s expensive or time-consuming, or find the source of quality problems?
Q: How do I objectively measure the quality of my LLM-based solution?
Q: How do I detect “drift” in the quality of my LLM-based solution over time, e.g. quality regressions when a new version of a foundation model is released?
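To make the drift question concrete, here is a minimal sketch of one simple approach: keep a fixed evaluation set, score each model version against it, and flag a regression when the mean score drops by more than a tolerance. The score lists, the `detect_regression` name, and the 0.05 tolerance are all hypothetical choices for illustration, not something taken from any of the tools below.

```python
from statistics import mean


def detect_regression(baseline_scores, candidate_scores, tolerance=0.05):
    """Return True if the candidate's mean score is more than
    `tolerance` below the baseline's mean score."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline_scores) - mean(candidate_scores) > tolerance


# Hypothetical per-example quality scores in [0, 1] on the same evaluation set.
baseline = [0.90, 0.85, 0.95, 0.80]   # previous foundation model version
candidate = [0.70, 0.75, 0.80, 0.65]  # new foundation model version

if detect_regression(baseline, candidate):
    print("Quality regression detected; investigate before rolling out.")
```

A real setup would likely want more examples and a proper statistical test rather than a fixed threshold on the mean, but the shape of the check stays the same.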
Methodology
20240524 - LLM Problem Solvers - Discussion Page - LLM Observability and Monitoring - Google Docs. This helpful doc contains discussion notes from a May 2024 AISC LLM School session on LLM Observability and Monitoring.
Tools
In no particular order, …
openllmetry - Open-source observability for your LLM application
A set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions - Datadog, Honeycomb, and others.
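To show what this kind of tracing records, here is a standard-library-only sketch of a "span" around a model call. It is not the openllmetry or OpenTelemetry API; the `span` helper, the `SPANS` list, and the `fake_llm` function are hypothetical stand-ins. They only illustrate what a span captures: a name, some attributes, and a duration.

```python
import time
from contextlib import contextmanager

SPANS = []  # in-memory stand-in for an exporter to an observability backend


@contextmanager
def span(name, **attributes):
    """Record how long the wrapped block takes, plus any attributes."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Runs even if the wrapped block raises, so failed calls are recorded too.
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    return prompt.upper()


with span("llm.completion", model="hypothetical-model") as s:
    prompt = "hello"
    reply = fake_llm(prompt)
    # Sizes are a rough proxy for cost; a real tracer would record token counts.
    s["attributes"]["prompt_chars"] = len(prompt)
    s["attributes"]["completion_chars"] = len(reply)

for record in SPANS:
    print(record["name"], record["attributes"], f"{record['duration_s']:.6f}s")
```

With records like these you can sort by duration or size to find the expensive or slow steps, which is the first question above.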
deepeval - Open-source LLM Evaluation Framework from Confident AI
A simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which use LLMs and various other NLP models that run locally on your machine for evaluation.
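To illustrate the "unit test for LLM outputs" idea, here is a toy sketch in plain Python. It does not use deepeval's API, and the token-overlap metric is a deliberately crude assumption standing in for the LLM-based metrics mentioned above. `fake_model`, `token_overlap`, and the 0.8 threshold are hypothetical.

```python
def token_overlap(expected, actual):
    """Toy metric: fraction of expected tokens that appear in the actual output."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens)


def fake_model(question):
    """Hypothetical stand-in for the LLM application under test."""
    return "Paris is the capital of France"


def test_capital_question():
    # A pytest-style test: fail if the score falls below a chosen threshold.
    answer = fake_model("What is the capital of France?")
    score = token_overlap("capital of France is Paris", answer)
    assert score >= 0.8


test_capital_question()  # pytest would normally collect and run this
```

The structure is the point here: a test case, a metric that returns a score, and a threshold that turns the score into pass or fail.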
Confident AI - Evaluation Infrastructure For LLMs
An all-in-one platform that unlocks deepeval’s full potential by allowing you to:
- evaluate LLM applications continuously in production
- centralize and standardize evaluation datasets on the cloud
- trace and debug LLM applications during evaluation
- keep track of the evaluation history of your LLM application
- generate evaluation-based summary reports for relevant stakeholders
Arize Phoenix - Open-Source Tracing and Evaluation
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve. Phoenix is built by Arize AI, the company behind the industry-leading AI observability platform, and a set of core contributors.
Arize - The AI Observability & LLM Evaluation Platform
Monitor, troubleshoot, and evaluate your machine learning models
LangSmith
A platform for building production-grade LLM applications. It allows you to closely monitor and evaluate your application, so you can ship quickly and with confidence. Use of LangChain is not necessary - LangSmith works on its own.