There’s a lot of activity these days around understanding the inner workings of software solutions that integrate with large language models. Practitioners need help with this, and platform and tool providers are putting forward compelling offerings. Here are some quick notes from several recent conversations on this topic. I’ll update this from time to time when I come across new info.

You might ask… Link to heading

Q: How do I track what my LLM software is doing, so that I can tell what’s expensive or time-consuming, or find the source of quality problems?

Q: How do I objectively measure the quality of my LLM-based solution?

Q: How do I detect “drift” in the quality of my LLM-based solution over time, e.g. quality regressions when a new version of a foundation model is released?

Methodology Link to heading

20240524 - LLM Problem Solvers - Discussion Page - LLM Observability and Monitoring - Google Docs. This helpful doc contains discussion notes from a May 2024 AISC LLM School session on LLM Observability and Monitoring.

Tools Link to heading

In no particular order, …

openllmetry - Open-source observability for your LLM application

A set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions - Datadog, Honeycomb, and others.
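For a sense of the developer experience, here is a minimal sketch assuming the traceloop-sdk and openai Python packages; the app name, workflow name, and model are illustrative, and option names may vary by SDK version.

```python
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="notes-llm-app")  # wires up OpenTelemetry tracing and export

client = OpenAI()

@workflow(name="summarize")  # groups the LLM call below into a single traced workflow span
def summarize(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.choices[0].message.content
```

Because the spans are plain OpenTelemetry, the same traces can be routed to whichever backend you already use.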

deepeval - Open-source LLM Evaluation Framework from Confident AI

A simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which use LLMs and various other NLP models that run locally on your machine for evaluation.
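A minimal pytest-style sketch, assuming the deepeval package; the test inputs and the 0.7 threshold are illustrative:

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What does LLM observability cover?",
        actual_output="It covers tracing, cost and latency tracking, and output-quality evaluation.",
    )
    # The metric is scored by an LLM judge; threshold sets the pass/fail cutoff.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it like any other test file, e.g. with pytest.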

Confident AI - Evaluation Infrastructure For LLMs

An all-in-one platform that unlocks deepeval’s full potential by allowing you to:

  • evaluate LLM applications continuously in production
  • centralize and standardize evaluation datasets on the cloud
  • trace and debug LLM applications during evaluation
  • keep track of the evaluation history of your LLM application
  • generate evaluation-based summary reports for relevant stakeholders
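As a rough sketch of how the deepeval side of the workflow above looks (the test case here is made up, and pushing results to the Confident AI dashboard requires logging in with an API key first):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(
        input="How do I detect drift in my LLM app?",
        actual_output="Track evaluation scores over time and alert on regressions.",
    ),
]

# Runs the metrics over the test cases; when you are logged in to Confident AI,
# the results also appear in its evaluation history and reports.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```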

Arize Phoenix - Open-Source Tracing and Evaluation

Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues, and export data for improvement. Phoenix is built by Arize AI, the company behind the industry-leading AI observability platform, and a set of core contributors.
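A minimal local-setup sketch, assuming the arize-phoenix package (plus its OpenTelemetry helper); the project name is illustrative and exact helper names can differ between versions:

```python
import phoenix as px
from phoenix.otel import register

px.launch_app()  # starts the local Phoenix UI and trace collector

# Route OpenTelemetry spans from instrumented LLM code into Phoenix,
# then inspect latency, token usage, and errors in the UI.
tracer_provider = register(project_name="notes-llm-app")
```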

Arize - The AI Observability & LLM Evaluation Platform

Monitor, troubleshoot, and evaluate your machine learning models

LangSmith

A platform for building production-grade LLM applications. It allows you to closely monitor and evaluate your application, so you can ship quickly and with confidence. Use of LangChain is not necessary - LangSmith works on its own.
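A minimal sketch, assuming the langsmith and openai packages with LANGSMITH_API_KEY and LANGSMITH_TRACING=true set in the environment (no LangChain involved); the function and model names are illustrative:

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # each OpenAI call is logged as a traced run

@traceable(name="answer")  # the function itself becomes a parent run in LangSmith
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```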

Search for more Link to heading

  • on Google Search
  • on Bing