Instrumenting & Evaluating LLMs
This lesson covers instrumenting and evaluating LLMs. Guest speakers Bryan Bischof and Eugene Yan describe how they approach LLM evaluation in industry, and Shreya Shankar presents her research on LLM evaluation systems.
Slides, notes, and additional resources are available here: https://parlance-labs.com/education/fine_tuning_course/workshop_3.html
This is lesson 3 of 4 in the course on applied fine-tuning:
1. When & Why to Fine-Tune: https://youtu.be/cPn0nHFsvFg
2. Fine-Tuning w/Axolotl: https://youtu.be/mmsa4wDsiy0
3. Instrumenting & Evaluating LLMs: https://youtu.be/SnbGD677_u0
4. Deploying Fine-Tuned LLMs: https://youtu.be/GzEcyBykkdo
*00:00 Overview*
*02:05 Evaluations: The Core of the Development Cycle*
Frequent evaluations and rapid updates are central to applied AI. Evaluations can range from automated tests to more manual human reviews.
*06:07 Walkthrough of a Unit Test*
Dan demonstrates a unit test in Python designed to test a simple LLM pipeline.
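A minimal sketch of what such a test might look like (not the exact code from the walkthrough); the `generate_reply` stub stands in for a real pipeline entry point so the file is self-contained and runnable under pytest:
```python
# Sketch of simple unit tests for an LLM pipeline. Replace generate_reply
# with your actual pipeline call; the stub below is only for illustration.
import re


def generate_reply(user_message: str) -> str:
    # Stand-in for the real LLM pipeline call.
    return "We are open 9am-5pm, Monday through Friday."


def test_reply_is_nonempty_and_bounded():
    out = generate_reply("What are your store hours?")
    assert 0 < len(out) < 2000


def test_reply_contains_no_template_placeholders():
    # Catch prompts leaking unfilled {placeholders} into the output.
    out = generate_reply("What are your store hours?")
    assert "{" not in out and "}" not in out


def test_reply_has_no_ai_boilerplate():
    # Catch the classic "As an AI language model..." failure mode.
    out = generate_reply("What are your store hours?")
    assert not re.search(r"as an ai language model", out, re.IGNORECASE)
```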
*08:55 Unit Tests for LLMs*
Hamel explains the necessity of unit tests and their role in automating the validation of outputs.
To create effective unit tests, enumerate all features the AI should cover, define scenarios for each feature, and generate test data. Synthetic data can be created using LLMs to test various scenarios.
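One way to do this is a feature x scenario grid with an LLM filling in the inputs. A sketch using the OpenAI Python client; the feature and scenario names below are illustrative, not from the lesson:
```python
# Generate one synthetic test input per (feature, scenario) pair and save
# them as JSONL for use in unit tests. Requires OPENAI_API_KEY to be set.
import itertools, json
from openai import OpenAI

client = OpenAI()

features = ["order lookup", "refund request"]          # illustrative
scenarios = ["happy path", "missing order id", "angry customer"]

rows = []
for feature, scenario in itertools.product(features, scenarios):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Write one realistic customer message exercising the "
                f"'{feature}' feature in a '{scenario}' scenario. "
                "Return only the message."
            ),
        }],
    )
    rows.append({
        "feature": feature,
        "scenario": scenario,
        "input": resp.choices[0].message.content,
    })

with open("synthetic_test_inputs.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```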
*18:56 LLM as a Judge*
To trust an LLM as a judge, iterate on its outputs and measure how well they agree with a trusted human standard (a spreadsheet is enough to start). Gradually align the LLM judge with human critiques to build confidence in its judgments.
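A small sketch of that agreement check, assuming human and judge labels have been exported from the spreadsheet; the labels below are made up:
```python
# Compare an LLM judge's pass/fail labels against human labels on the same
# outputs before trusting the judge.
from sklearn.metrics import cohen_kappa_score

human = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)  # corrects for chance agreement

print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
# If agreement is low, inspect the disagreements, refine the judge prompt
# using human critiques, and re-measure.
```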
*21:18 Issues with Using LLMs as Judges*
Dan discusses potential issues with relying on LLMs as judges, primarily due to their inconsistency in results.
*23:00 Human Evaluations*
Ongoing human review, data examination, and regular updates are necessary to maintain accuracy and prevent overfitting.
*24:44 Rapid Evaluations Lead to Faster Iterations*
Fast, low-friction evaluations shorten the iteration loop, making it quicker to identify and fix failure cases.
*26:30 Issues with Human Evaluations*
Human evaluations can be subjective, potentially leading to varying scores for the same output at different times. A/B testing can help mitigate these issues to some extent.
*31:20 Analyzing Traces*
A trace is a sequence of events, such as a multi-turn conversation or a retrieval-augmented generation (RAG) interaction. Reviewing traces should be as frictionless as possible so that you actually understand your data.
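A tool-agnostic sketch of what one trace record might look like when logged as JSON; all field names are illustrative:
```python
# One trace = an ordered list of events for a single request, appended to a
# JSONL file so it can be inspected later.
import json, time, uuid

trace = {
    "trace_id": str(uuid.uuid4()),
    "events": [
        {"type": "user_message", "ts": time.time(), "text": "Where is my order?"},
        {"type": "retrieval", "ts": time.time(), "query": "order status",
         "doc_ids": ["faq-12", "policy-3"]},
        {"type": "llm_call", "ts": time.time(), "model": "gpt-4o-mini",
         "prompt_tokens": 512, "completion_tokens": 84},
        {"type": "assistant_message", "ts": time.time(),
         "text": "Your order shipped yesterday..."},
    ],
}

with open("traces.jsonl", "a") as f:
    f.write(json.dumps(trace) + "\n")
```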
*35:30 Logging Traces*
Several tools, such as LangSmith, can log and view traces; using an off-the-shelf tool is recommended to speed up data analysis.
*39:15 Langsmith Walkthrough*
Harrison demonstrates LangSmith, a tool for logging and testing LLM applications. LangSmith also supports trace visualization and features such as experiment filtering.
*43:12 Datasets and Testing on Langsmith*
LangSmith offers several ways to import, filter, and group datasets. Experiments can be set up to assess model performance across these datasets.
*51:35 Common Mistakes in Evaluating LLMs*
Bryan provides a brief overview of common pitfalls in LLM evaluation and how to avoid them.
*1:12:40 Code Walkthrough: Evaluating Summaries for Hallucinations*
Eugene covers natural language inference (NLI) and fine-tunes models to classify summaries against their source documents as entailment, neutral, or contradiction.
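A sketch of the underlying idea using an off-the-shelf MNLI checkpoint from Hugging Face (not Eugene's fine-tuned model); the document and summary below are made up:
```python
# Score a summary against its source document with an NLI model:
# premise = document, hypothesis = summary. A high contradiction probability
# flags the summary as likely hallucinated.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/deberta-large-mnli"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

document = "The company reported revenue of $10M in Q1, up 5% year over year."
summary = "Revenue fell sharply in the first quarter."

inputs = tokenizer(document, summary, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

for i, p in enumerate(probs):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```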
*1:33:03 Evaluating Agents*
Eugene details a step-by-step approach to evaluating agents, including breaking down tasks into classification and quality assessment metrics.
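For the classification part, tool selection can be scored like any classifier; a small sketch with made-up tool names and labels:
```python
# Treat "did the agent pick the right tool?" as a classification problem.
from sklearn.metrics import classification_report

expected_tool = ["search", "calculator", "search", "none", "calculator"]
chosen_tool   = ["search", "search",     "search", "none", "calculator"]

print(classification_report(expected_tool, chosen_tool, zero_division=0))
# The quality of the final answer can then be scored separately, e.g. via
# human review or an aligned LLM judge.
```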
*1:35:49 Evals, Rules, Guardrails, and Vibe Checks*
Effective AI evaluation requires a blend of general and task-specific metrics, along with tailored guardrails and validation to ensure accurate outputs.
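A minimal sketch of a task-specific guardrail that validates model output before returning it; the rules, and the assumption that the pipeline returns JSON, are illustrative:
```python
# Reject malformed or policy-violating outputs instead of passing them on.
import json


def validate_reply(raw: str) -> dict:
    try:
        reply = json.loads(raw)  # pipeline is assumed to return JSON
    except json.JSONDecodeError:
        raise ValueError("output is not valid JSON")
    if "answer" not in reply:
        raise ValueError("missing 'answer' field")
    if len(reply["answer"]) > 1000:
        raise ValueError("answer too long")
    banned = ["ssn", "credit card number"]
    if any(term in reply["answer"].lower() for term in banned):
        raise ValueError("answer mentions sensitive data")
    return reply
```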
*1:44:24 Auto-Generated Assertions*
Shreya introduces SPADE, a tool that generates and refines assertion criteria for LLM pipelines by analyzing prompt edits and failures.
*1:50:41 Interfaces for Evaluation Assistants*
Shreya discusses building more efficient UIs for evaluating and iterating on LLM outputs, emphasizing dynamic, human-in-the-loop interfaces that help refine evaluation criteria and processes.
*2:04:45 Q&A Session*
*2:05:58 Streamlining Unit Tests with Prompt History*
*2:09:52 Challenges in Unit Testing LLMs for Diverse Tasks*
*2:12:20 When to Build Evaluations*
*2:15:35 Fine-Tuning LLMs as Judges*
*2:17:00 Building Data Flywheels*
*2:17:59 Temperature Settings for LLM Calls*
*2:22:09 Metrics for Evaluating Retrieval Performance in RAG*
*2:26:13 Filtering Documents for Accuracy*
*2:28:14 Unit Tests during CI/CD*
*2:30:34 Checking for Contamination of Base Models with Evaluation Data*