LangSmith

Your Complete Guide to Production-Ready LLM Observability and Testing

Building reliable LLM applications is one of the biggest challenges developers face today. You've probably experienced the frustration – your model works beautifully in development, but once it hits production, you're dealing with mysterious failures, unexpected outputs, and users complaining about quality issues. That's where LangSmith comes in as your complete observability, evaluation, and monitoring platform specifically designed for LLM applications.

If you've been struggling with debugging complex AI workflows, testing prompt variations, or monitoring your models in production, LangSmith is about to become your secret weapon. This comprehensive guide will take you through everything you need to know to master LangSmith and build bulletproof LLM applications.

LangSmith dashboard overview showing traces, evaluations, and monitoring interface with real project data

What Exactly is LangSmith?

LangSmith is a comprehensive platform for building production-grade LLM applications. Think of it as your mission control center for everything related to LLM observability, evaluation, and improvement. Whether you're using LangChain, LangGraph, or building completely custom AI applications, LangSmith gives you the tools to understand what's happening under the hood of your AI systems.

Here's what makes LangSmith fundamentally different from traditional monitoring tools: it's LLM-native. Traditional application monitoring tools weren't built for the unique challenges of language models – the non-deterministic nature, complex reasoning chains, multi-modal inputs and outputs, and the need to evaluate qualitative responses alongside quantitative metrics.

Core Capabilities That Transform LLM Development

LangSmith covers six main areas that every production LLM application desperately needs:

  • Observability: Deep tracing and monitoring of your LLM workflows with step-by-step visibility 
  • Evaluation: Automated testing with both AI judges and human feedback loops 
  • Prompt Engineering: Version control, collaboration, and A/B testing for prompts 
  • Monitoring: Real-time alerts and dashboards for production systems 
  • Dataset Management: Curated test datasets with schema validation and synthetic data generation 
  • Human-in-the-Loop: Annotation queues and feedback collection from subject matter experts

The best part? LangSmith is framework-agnostic. You can use it with LangChain, LangGraph, raw OpenAI calls, Anthropic models, or any custom LLM application you've built. It doesn't lock you into any particular ecosystem.

Why Traditional Monitoring Falls Short for LLMs

Traditional application monitoring assumes deterministic behavior – if you send the same input, you get the same output. LLMs break this assumption completely. They're designed to be creative, contextual, and sometimes even contradictory. This creates unique challenges:

  • Non-deterministic outputs: The same prompt can yield different results 
  • Complex reasoning chains: Modern AI applications involve multiple LLM calls, tool usage, and decision trees
  • Qualitative evaluation: You can't just check if output equals expected result 
  • Context dependency: Performance varies dramatically based on input context and conversation history 
  • Multi-modal complexity: Handling text, images, audio, and structured data in single workflows

LangSmith was built from the ground up to handle these challenges.

Architecture diagram showing LangSmith integration with different LLM frameworks, cloud providers, and monitoring systems

Setting Up LangSmith: Getting Started the Right Way

Let's get you up and running with LangSmith. The setup is straightforward, but there are several optimization strategies you'll want to implement from the start.

Installation and Environment Setup

First, install the LangSmith SDK:

   Bash
pip install -U langsmith

For JavaScript/TypeScript projects:

   Bash
npm install langsmith
# or
yarn add langsmith

Next, you'll need to create an API key. Head to smith.langchain.com and navigate to your settings page. Create an API key – you'll see it only once, so store it securely in your password manager or secrets management system.

Now set up your environment variables:

   Bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key-here"
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_PROJECT="your-project-name"

Pro tip: You can also configure these programmatically if environment variables aren't an option in your deployment environment:

   Python
from langsmith import Client, tracing_context
langsmith_client = Client(
    api_key="YOUR_LANGSMITH_API_KEY",
    api_url="https://api.smith.langchain.com"
)
# Use within a context manager for specific operations
with tracing_context(enabled=True):
    # Your LLM calls here
    response = llm.invoke("Hello, how are you?")

Your First Trace

Let's start with something simple. If you're using LangChain, tracing happens automatically once you set the environment variables:

   Python
# LangChain OpenAI Example
import os
from langchain_openai import ChatOpenAI

# Environment variables already set
llm = ChatOpenAI(temperature=0.7, model="gpt-4")
response = llm.invoke("Hello, how are you?")
print(response.content)

That's it! Head to your LangSmith dashboard and you'll see the trace appear with complete details about the request and response.

Advanced Configuration Options

For production deployments, you'll want to configure sampling and filtering:
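
Tracing every request can be costly at scale, so two common levers are sampling a fraction of traces and disabling tracing for specific code paths. Here's a minimal sketch of both; the sampling environment variable name is an assumption, so confirm the exact setting for your SDK version in the LangSmith docs:

   Python
import os
from langsmith import tracing_context

# Sample only a fraction of traces in high-volume environments.
# NOTE: the variable name below is an assumption - check the LangSmith docs
# for the exact sampling setting supported by your SDK version.
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # trace roughly 10% of requests

# Selectively disable tracing for noisy or sensitive code paths
with tracing_context(enabled=False):
    response = llm.invoke("This call will not be traced")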

Screenshot of first LangSmith trace showing detailed input, output, timing, token usage, and cost breakdown

Deep Dive: Observability That Actually Works

This is where LangSmith really shines compared to traditional debugging approaches. Trying to debug LLM applications with print statements or basic logging is like trying to perform surgery with a butter knife.

Understanding Traces and Runs in Detail

Every interaction with your LLM application creates a trace – think of it as a complete execution record of what happened during a single user interaction. Within each trace, you have runs that represent individual steps like LLM calls, tool usage, retrieval operations, or chain execution.

Here's what you get with every trace:

  • Complete input/output history: Every prompt, response, and intermediate result
  • Token usage and costs: Detailed breakdown by model and operation type 
  • Latency measurements: End-to-end timing and step-by-step performance 
  • Error details and stack traces: Complete debugging information when things go wrong 
  • Intermediate steps in complex workflows: See inside multi-agent systems and complex chains 
  • Metadata and tags: Custom information for filtering and analysis
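
You can also pull these fields programmatically with the SDK's Client; here's a minimal sketch (the project name is a placeholder):

   Python
from langsmith import Client

client = Client()

# Fetch a few recent runs and print the fields described above
for run in client.list_runs(project_name="your-project-name", limit=5):
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, run.run_type)
    print("  inputs:", run.inputs)
    print("  outputs:", run.outputs)
    print("  tokens:", run.total_tokens, "| latency (s):", latency)
    print("  error:", run.error)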

Tracing Complex Multi-Agent Systems

If you're building sophisticated multi-agent systems, LangSmith's distributed tracing becomes invaluable:

   Python
from langsmith import traceable
import openai
client = openai.Client()

@traceable(name="Planning Agent", run_type="agent")
def planning_agent(task: str) -> dict:
    """Agent that creates execution plans"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a planning agent. Create detailed exec"},
            {"role": "user", "content": f"Create a plan for: {task}"}
        ]
    )
    plan = {"steps": response.choices.message.content.split("\n")}
    return plan

@traceable(name="Research Agent", run_type="agent")
def research_agent(topic: str) -> str:
    """Agent that conducts research"""
    # Simulate research operation
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a research agent. Provide detailed res"},
            {"role": "user", "content": f"Research: {topic}"}
        ]
    )
    return response.choices[0].message.content

@traceable(name="Execution Agent", run_type="agent")
def execution_agent(plan: dict) -> list:
    """Agent that executes plans"""
    results = []
    for step in plan["steps"][:3]:  # Execute first 3 steps
        if "research" in step.lower():
            result = research_agent(step)
        else:
            result = f"Executed: {step}"
        results.append(result)
    return results

@traceable(name="Multi-Agent Workflow", run_type="workflow")
def multi_agent_system(user_task: str) -> dict:
    """Orchestrator for multi-agent workflow"""
    plan = planning_agent(user_task)
    results = execution_agent(plan)
    return {
        "task": user_task,
        "plan": plan,
        "results": results,
        "status": "completed"
    }

# Execute the workflow
result = multi_agent_system("Create a comprehensive marketing strategy for a new AI product")

All agent interactions will be captured in a single hierarchical trace, making debugging and optimization much easier.

Advanced Tracing Without LangChain

Not using LangChain? No problem. LangSmith works with any Python code using decorators and context managers:

   Python
import openai
from langsmith import traceable, Client
from langsmith.wrappers import wrap_openai
import json

# Wrap your OpenAI client for automatic tracing
client = wrap_openai(openai.Client())
langsmith_client = Client()

@traceable(run_type="retriever", name="Vector Search")
def retrieve_context(query: str, top_k: int = 5) -> list:
    """Simulate vector database retrieval"""
    # Your vector search logic here
    contexts = [
        f"Context {i+1} for query: {query}"
        for i in range(top_k)
    ]
    return contexts

@traceable(run_type="tool", name="Web Search")
def web_search(query: str) -> dict:
    """Simulate web search tool"""
    return {
        "query": query,
        "results": [
            {"title": f"Result 1 for {query}", "url": "https://example.com/1"},
            {"title": f"Result 2 for {query}", "url": "https://example.com/2"}
        ]
    }

@traceable(name="RAG Response Generator", run_type="chain")
def generate_rag_response(query: str, use_web_search: bool = False) -> dict:
    """Generate response using RAG pattern"""
    # Retrieve context
    contexts = retrieve_context(query)

    # Optional web search
    web_data = None
    if use_web_search:
        web_data = web_search(query)

    # Prepare system prompt
    context_text = "\n".join(contexts)
    system_prompt = f"""You are a helpful AI assistant. Use the following context to answ
Context:
{context_text}"""
    if web_data:
        system_prompt += f"\n\nWeb search results: {json.dumps(web_data)}"

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        temperature=0.7
    )
    return {
        "query": query,
        "context_count": len(contexts),
        "used_web_search": use_web_search,
        "response": response.choices.message.content,
        "total_tokens": response.usage.total_tokens
    }

# This will create a detailed nested trace
result = generate_rag_response(
    "What are the best practices for LLM application monitoring?",
    use_web_search=True
)

Monitoring with Real-Time Dashboards

LangSmith automatically creates comprehensive dashboards for each project. You'll get insights into:

  • Trace count and error rates: Spot issues before they impact users 
  • LLM call statistics: Monitor usage patterns and identify bottlenecks 
  • Token usage and costs: Track expenses across different models and operations 
  • Latency trends: Identify performance bottlenecks and optimization opportunities 
  • Tool usage patterns: See which tools and services are being called most frequently 
  • User feedback distribution: Understand satisfaction trends over time

LangSmith monitoring dashboard showing comprehensive metrics, charts, and real-time performance indicators

You can also create custom dashboards to track metrics specific to your use case:

   Python
from langsmith import Client
client = Client()

# Query custom metrics programmatically
import pandas as pd

# Get cost breakdown by model
cost_data = client.list_runs(
    project_name="your-project",
    start_time="2024-01-01",
    filter="and(eq(run_type, 'llm'), gte(start_time, '2024-01-01T00:00:00'))"
)

# Analyze patterns in your traces
df = pd.DataFrame([{
    'model': run.extra.get('invocation_params', {}).get('model', 'unknown'),
    'tokens': run.total_tokens,
    'cost': run.total_cost,
    'latency': (run.end_time - run.start_time).total_seconds() if run.end_time else None,
    'error': run.error is not None
} for run in cost_data])

print(f"Total cost this month: ${df['cost'].sum():.2f}")
print(f"Average latency: {df['latency'].mean():.2f} seconds")
print(f"Error rate: {df['error'].mean():.2%}")

Comprehensive Evaluation: Beyond Simple Testing

Traditional software testing doesn't work for LLM applications. You can't just assert that the output equals some expected string. LangSmith's evaluation framework addresses this with sophisticated approaches to testing AI applications.

The Three-Component Evaluation System

LangSmith's evaluation system consists of three core components:

1. Datasets: Your test inputs and expected outputs with rich metadata 

2. Target function: What you're evaluating (single LLM call, part of your app, or end-to-end system) 

3. Evaluators: Functions that score your outputs using various criteria

Let's build a comprehensive evaluation setup:

   Python
from langsmith import Client
from langsmith.evaluation import evaluate
from openevals import Correctness, Helpfulness, Harmfulness
import openai

# Initialize clients
ls_client = Client()
openai_client = openai.Client()

def my_rag_application(inputs: dict) -> dict:
    """Your RAG application to evaluate"""
    question = inputs["question"]
    context = inputs.get("context", "")
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": f"Answer the question using the provided context. Context: {co"
            },
            {"role": "user", "content": question}
        ]
    )
    return {
        "answer": response.choices.message.content,
        "model": "gpt-4",
        "context_used": bool(context)
    }

# Run comprehensive evaluation
results = evaluate(
    my_rag_application,
    data="customer_support_dataset", # Your dataset name
    evaluators=[
        Correctness(), # Factual accuracy
        Helpfulness(), # Response helpfulness
        Harmfulness(), # Safety check
    ],
    experiment_prefix="rag_v2_evaluation",
    max_concurrency=5, # Parallel evaluation
    metadata={
        "model_version": "gpt-4",
        "evaluation_date": "2024-08-21",
        "evaluator": "production_team"
    }
)
print(f"Evaluation completed: {results}")

Creating Sophisticated Custom Evaluators

While built-in evaluators are helpful, you'll often need custom evaluators for your specific domain:

   Python
# LangSmith custom evaluators example
from langsmith.schemas import Run, Example
import re
import json

def domain_expertise_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator that checks domain-specific expertise"""
    prediction = run.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    # Define domain-specific criteria
    technical_terms = ["API", "authentication", "encryption", "database", "microservices"]
    explanation_quality_markers = ["because", "therefore", "however", "for example"]
    # Score technical accuracy
    technical_score = sum(1 for term in technical_terms if term.lower() in prediction.lower())
    technical_score = min(technical_score / 3, 1.0)
    # Score explanation quality
    explanation_score = sum(1 for marker in explanation_quality_markers if marker in prediction.lower())
    explanation_score = min(explanation_score / 2, 1.0)
    # Check response completeness
    word_count = len(prediction.split())
    completeness_score = min(word_count / 50, 1.0)
    # Overall score
    overall_score = (technical_score + explanation_score + completeness_score) / 3
    return {
        "key": "domain_expertise",
        "score": overall_score,
        "comment": f"Technical: {technical_score:.2f}, Explanation: {explanation_score:.2f}, Completeness: {completeness_score:.2f}",
        "metadata": {
            "word_count": word_count,
            "technical_terms_found": technical_score * 3,
            "explanation_markers_found": explanation_score * 2
        }
    }

def response_structure_evaluator(run: Run, example: Example) -> dict:
    """Evaluator that checks if responses follow expected structure"""
    prediction = run.outputs.get("answer", "")
    # Check for structured response elements
    has_introduction = bool(re.search(r'^[A-Z].*[.!?]', prediction))
    has_examples = "example" in prediction.lower() or "for instance" in prediction.lower()
    has_conclusion = any(phrase in prediction.lower() for phrase in ["in conclusion", "to summarize"])
    structure_score = sum([has_introduction, has_examples, has_conclusion]) / 3
    return {
        "key": "response_structure",
        "score": structure_score,
        "comment": f"Introduction: {has_introduction}, Examples: {has_examples}, Conclusion: {has_conclusion}"
    }

# Use custom evaluators in evaluation
results = evaluate(
    my_rag_application,
    data="technical_qa_dataset",
    evaluators=[
        domain_expertise_evaluator,
        response_structure_evaluator,
        Correctness() # Mix custom and built-in evaluators
    ],
    experiment_prefix="custom_evaluation_v1"
)

LLM-as-a-Judge Pattern for Advanced Evaluation

Sometimes you need an LLM to evaluate LLM outputs. Here's a sophisticated implementation:

   Python
# LLM Judge Evaluator Function
def llm_judge_evaluator(run: Run, example: Example) -> dict:
    """
    Use GPT-4 as a judge to evaluate response quality
    """
    prediction = run.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    reference = example.outputs.get("reference_answer", "") if example.outputs else ""
    evaluation_prompt = f"""
    You are an expert evaluator of AI assistant responses. Please evaluate the following response on these criteria:
    1. Accuracy: Is the information factually correct?
    2. Completeness: Does the response fully answer the question?
    3. Clarity: Is the response clear and well-structured?
    4. Relevance: Does the response stay focused on the question?
    Question: {question}
    Response to evaluate: {prediction}
    Reference answer (if available): {reference}
    Please provide your evaluation as a JSON object with the following structure:
    {{
        "accuracy": <score 1-10>,
        "completeness": <score 1-10>,
        "clarity": <score 1-10>,
        "relevance": <score 1-10>,
        "overall": <score 1-10>,
        "reasoning": "<brief explanation of your scores>"
    }}
    Only return the JSON, no additional text.
    """
    try:
        judge_response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.1
        )
        evaluation = json.loads(judge_response.choices[0].message.content)
        overall_score = evaluation.get("overall", 0) / 10 # Normalize to 0-1
        return {
            "key": "llm_judge_evaluation",
            "score": overall_score,
            "comment": evaluation.get("reasoning", ""),
            "metadata": evaluation
        }
    except Exception as e:
        return {
            "key": "llm_judge_evaluation",
            "score": 0.0,
            "comment": f"Evaluation failed: {str(e)}"
        }

Human Feedback Integration and Annotation Queues

Sometimes you need human judgment to validate your AI systems. LangSmith makes collecting and incorporating human feedback straightforward.

Setting Up Annotation Queues

Annotation queues provide a streamlined interface for human reviewers to evaluate AI outputs:

   Python
# Create an annotation queue programmatically
queue_config = {
    "name": "Customer Support Quality Review",
    "description": "Review customer support responses for quality and accuracy",
    "default_dataset": "customer_support_responses",
    "instructions": """
Please evaluate each response based on:
1. Accuracy of information provided
2. Helpfulness to the customer
3. Professional tone and clarity
4. Completeness of the answer
Rate each criterion on a scale of 1-5.
""",
    "feedback_keys": [
        {
            "key": "accuracy",
            "description": "Is the information factually correct?",
            "type": "categorical",
            "categories": ["1", "2", "3", "4", "5"]
        },
        {
            "key": "helpfulness",
            "description": "How helpful is this response to the customer?",
            "type": "categorical",
            "categories": ["1", "2", "3", "4", "5"]
        },
        {
            "key": "overall_quality",
            "description": "Overall response quality",
            "type": "categorical",
            "categories": ["Poor", "Fair", "Good", "Very Good", "Excellent"]
        }
    ],
    # Multiple reviewers for reliability
    "num_reviewers_per_run": 2,
    # Prevent conflicts
    "enable_reservations": True
}
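
The dictionary above describes the queue; to actually create one and route runs into it from code, the SDK exposes annotation-queue helpers. A rough sketch, assuming the create_annotation_queue and add_runs_to_annotation_queue methods available in recent SDK versions (names and parameters may differ, so verify against your SDK):

   Python
from langsmith import Client

client = Client()

# Create the queue (assumed SDK helper - verify against your SDK version)
queue = client.create_annotation_queue(
    name=queue_config["name"],
    description=queue_config["description"],
)

# Send some recent production runs to reviewers (assumed SDK helper)
run_ids = [run.id for run in client.list_runs(project_name="customer_support", limit=20)]
client.add_runs_to_annotation_queue(queue.id, run_ids=run_ids)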

Collecting and Using Human Feedback

   Python
# Add feedback to specific runs
ls_client.create_feedback(
    run_id="your-run-id",
    key="user_satisfaction",
    score=0.8, # 0-1 scale
    comment="Response was accurate but could be more detailed",
    metadata={
        "reviewer_id": "reviewer_123",
        "review_date": "2024-08-21",
        "criteria": {
            "accuracy": 5,
            "helpfulness": 4,
            "clarity": 4
        }
    }
)

# Query runs with specific feedback patterns
high_quality_runs = ls_client.list_runs(
    project_name="customer_support",
    filter='and(gte(feedback_avg_score, 0.8), has(feedback_key, "user_satisfaction"))'
)

# Use high-quality runs to create training datasets
for run in high_quality_runs:
    # Add to dataset for few-shot examples
    ls_client.create_example(
        dataset_id="high_quality_examples",
        inputs=run.inputs,
        outputs=run.outputs,
        metadata={
            "source": "human_validated",
            "quality_score": run.feedback_stats.get("user_satisfaction", {}).get("avg", 0)
        }
    )

LangSmith evaluation interface showing detailed test results, scores, and comparison between different prompt versions

Advanced Dataset Management and Curation

High-quality datasets are the foundation of reliable LLM applications. LangSmith provides sophisticated tools for creating, managing, and evolving your evaluation datasets.

Dataset Schema Validation and Management

LangSmith supports flexible dataset schemas that ensure consistency while allowing for iterative development:

   Python
# Langsmith dataset creation example
from langsmith import Client
import json

client = Client()

# Define a schema for your dataset
schema_definition = {
    "input_schema": {
        "question": {"type": "string", "required": True},
        "context": {"type": "string", "required": False},
        "user_id": {"type": "string", "required": False},
        "difficulty_level": {"type": "string", "enum": ["easy", "medium", "hard"]}
    },
    "output_schema": {
        "reference_answer": {"type": "string", "required": True},
        "expected_sources": {"type": "array", "items": {"type": "string"}},
        "quality_score": {"type": "number", "minimum": 0, "maximum": 1}
    }
}

# Create dataset with schema
dataset = client.create_dataset(
    dataset_name="customer_qa_v2",
    description="Customer Q&A with validation schema",
    metadata={
        "schema": schema_definition,
        "version": "2.0",
        "created_by": "data_team"
    }
)

# Add examples that conform to schema
examples = [
    {
        "inputs": {
            "question": "How do I reset my password?",
            "difficulty_level": "easy",
            "user_id": "user_123"
        },
        "outputs": {
            "reference_answer": "To reset your password, go to the login page and click ...",
            "expected_sources": ["help_documentation", "user_guide"],
            "quality_score": 0.95
        },
        "metadata": {
            "source": "customer_service_team",
            "validated": True
        }
    }
]

# Batch create examples
for example_data in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs=example_data["inputs"],
        outputs=example_data["outputs"],
        metadata=example_data["metadata"]
    )

Synthetic Data Generation

LangSmith can generate synthetic examples to enhance your datasets:

   Python
# Generate synthetic examples based on existing patterns
def generate_synthetic_examples(base_dataset: str, num_examples: int = 50):
    """Generate synthetic examples using LLM"""
    # Get existing examples for pattern learning
    existing_examples = list(client.list_examples(dataset_name=base_dataset, limit=10))
    
    # Create generation prompt based on patterns
    pattern_examples = "\n\n".join([
        f"Question: {ex.inputs['question']}\nAnswer: {ex.outputs['reference_answer']}"
        for ex in existing_examples[:5]
    ])
    
    generation_prompt = f"""
    Based on these example question-answer pairs, generate {num_examples} new, similar examples.
    {pattern_examples}
    Generate new examples that follow the same pattern and domain. For each example, provide:
    1. A unique question
    2. A comprehensive answer
    3. A difficulty level (easy/medium/hard)
    Format as JSON array with objects containing 'question', 'answer', and 'difficulty_level'
    """
    
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{'role': 'user', 'content': generation_prompt}],
        temperature=0.7
    )
    
    try:
        synthetic_examples = json.loads(response.choices[0].message.content)
        target_dataset = client.read_dataset(dataset_name=base_dataset)
        
        # Add synthetic examples to the base dataset
        for example in synthetic_examples:
            client.create_example(
                dataset_id=target_dataset.id,
                inputs={
                    'question': example['question'],
                    'difficulty_level': example['difficulty_level']
                },
                outputs={
                    'reference_answer': example['answer'],
                    'quality_score': 0.8 # Lower score for synthetic data
                },
                metadata={
                    'source': 'synthetic_generation',
                    'generated_at': '2024-08-21',
                    'needs_review': True
                }
            )
    except json.JSONDecodeError:
        print("Failed to parse synthetic examples")

# Generate synthetic data
generate_synthetic_examples("customer_qa_v2", num_examples=25)

Dataset Versioning and Lifecycle Management

   Python
# Create dataset versions for different stages of development
def create_dataset_version(base_dataset: str, version_name: str, filter_criteria: dict):
    """Create a new version of a dataset with specific criteria"""
    # Query examples based on criteria
    examples = list(client.list_examples(
        dataset_name=base_dataset,
        limit=1000 # Adjust based on your needs
    ))
    # Filter examples based on criteria
    filtered_examples = []
    for example in examples:
        if all(
            example.metadata.get(key) == value
            for key, value in filter_criteria.items()
        ):
            filtered_examples.append(example)
    # Create new dataset version
    new_dataset = client.create_dataset(
        dataset_name=f"{base_dataset}_{version_name}",
        description=f"Version {version_name} of {base_dataset} dataset",
        metadata={
            "base_dataset": base_dataset,
            "version": version_name,
            "filter_criteria": filter_criteria,
            "example_count": len(filtered_examples)
        }
    )
    # Copy filtered examples to new dataset
    for example in filtered_examples:
        client.create_example(
            dataset_id=new_dataset.id,
            inputs=example.inputs,
            outputs=example.outputs,
            metadata={**example.metadata, "version": version_name}
        )
    return new_dataset

# Create different dataset versions
prod_dataset = create_dataset_version(
    "customer_qa_v2",
    "production",
    {"validated": True, "quality_score": 0.9}
)
test_dataset = create_dataset_version(
    "customer_qa_v2",
    "regression_test",
    {"source": "customer_service_team"}
)
  

Dataset management interface showing schema validation, version control, and synthetic data generation features

Prompt Engineering and Version Control

Managing prompts across a team can quickly become chaotic without proper version control. LangSmith's prompt hub provides Git-like functionality specifically designed for prompt management.

Creating and Managing Prompts at Scale

You can create and manage prompts directly in the LangSmith UI or programmatically:

  Python
# LangSmith advanced prompt and chat templates
from langsmith import Client
import json

client = Client()

# Create a sophisticated prompt template
advanced_prompt_template = """
You are an expert {role} with {experience_years} years of experience. Your task is to {task_description}.
Context Information:
{context}
Specific Requirements:
{requirements}
Constraints:
- Keep response under {max_words} words
- Use {tone} tone
- Include {num_examples} specific examples
- Target audience: {audience_level}
Additional Instructions:
{additional_instructions}
Please provide your response following this structure:
1. Executive Summary
2. Detailed Analysis
3. Recommendations
4. Next Steps
Response:
"""

# Save advanced prompt to LangSmith
client.create_prompt(
    prompt_name="expert_consultant_template",
    object_type="prompt",
    template=advanced_prompt_template,
    metadata={
        "category": "consulting",
        "complexity": "advanced",
        "variables": [
            "role", "experience_years", "task_description",
            "context", "requirements", "max_words", "tone",
            "num_examples", "audience_level", "additional_instructions"
        ],
        "use_cases": ["business_consulting", "technical_analysis", "strategic_planning"]
    }
)

# Create chat-based prompt for conversational AI
chat_prompt_template = [
{
    "role": "system",
    "content": """You are a {personality_type} AI assistant specialized in {domain}.
Your communication style:
- {communication_style}
- Always acknowledge the user's context: {user_context}
- Adapt your expertise level to: {expertise_level}
Remember these key principles:
{key_principles}"""
},
{
    "role": "user",
    "content": "{user_input}"
}
]

client.create_prompt(
    prompt_name="adaptive_chat_assistant",
    object_type="chat",
    template=chat_prompt_template,
    metadata={
        "type": "conversational",
        "adaptivity": "high",
        "domains": ["customer_support", "technical_help", "sales"]
    }
)

Advanced Prompt Versioning and Collaboration

Here's how to implement sophisticated prompt version control:

   Python
# LangChain prompt deployment example
from langchain import hub
import hashlib
import datetime

def deploy_prompt_version(prompt_name: str, environment: str, test_results: dict):
    """Deploy a prompt version after validation"""
    # Pull latest version for testing
    latest_prompt = hub.pull(f"your-org/{prompt_name}")
    # Run validation tests
    validation_passed = all(
        score >= 0.8 for score in test_results.values()
    )
    if validation_passed:
        # Tag the version for deployment
        version_hash = hashlib.md5(str(latest_prompt).encode()).hexdigest()[:8]
        # Create deployment tag
        hub.push(
            f"your-org/{prompt_name}",
            latest_prompt,
            commit_message=f"Deploy to {environment} - validation scores: {test_results}",
            tags=[f"{environment}-v{version_hash}", f"deployed-{datetime.date.today()}"]
        )
        return version_hash
    else:
        raise ValueError(f"Validation failed: {test_results}")

# Example deployment workflow
test_results = {
    "accuracy": 0.92,
    "helpfulness": 0.87,
    "safety": 0.95
}
try:
    version = deploy_prompt_version("customer_support_agent", "production", test_results)
    print(f"Successfully deployed version {version} to production")
except ValueError as e:
    print(f"Deployment failed: {e}")

# Use specific version in production
production_prompt = hub.pull(f"your-org/customer_support_agent:production-v{version}")

A/B Testing Different Prompt Versions

Implement sophisticated A/B testing for prompt optimization:

   Python
import random
from langsmith import traceable
import hashlib

class PromptABTester:
    def __init__(self, prompt_variants: dict, traffic_split: dict):
        self.prompt_variants = prompt_variants
        self.traffic_split = traffic_split

    def get_prompt_version(self, user_id: str) -> tuple:
        """Deterministic assignment based on user_id"""
        user_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        user_bucket = user_hash % 100
        cumulative_split = 0
        for version, percentage in self.traffic_split.items():
            cumulative_split += percentage
            if user_bucket < cumulative_split:
                return version, self.prompt_variants[version]
        # Default fallback
        return "control", self.prompt_variants["control"]

# Define prompt variants
prompt_variants = {
    "control": """Answer the user's question directly and concisely.\nQuestion: {question}\nAnswer:""",
    "detailed": """Provide a comprehensive answer to the user's question. Include context\nQuestion: {question}\nAnswer:""",
    "conversational": """You're a friendly expert. Answer the user's question in a conver\nQuestion: {question}\nAnswer:"""
}

# Set up A/B test
ab_tester = PromptABTester(
    prompt_variants=prompt_variants,
    traffic_split={ "control": 40, "detailed": 30, "conversational": 30 }
)

@traceable(name="AB Test Response Generator")
def ab_test_response(user_id: str, question: str) -> dict:
    """Generate response using A/B tested prompts"""
    variant_name, prompt_template = ab_tester.get_prompt_version(user_id)
    formatted_prompt = prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": formatted_prompt}],
        temperature=0.7
    )
    # Tag the run with variant info for analysis
    return {
        "response": response.choices.message.content,
        "variant": variant_name,
        "user_id": user_id,
        "question": question,
        "prompt_used": formatted_prompt
    }

# Test the A/B system
for i in range(10):
    result = ab_test_response(f"user_{i}", "What is machine learning?")
    print(f"User {i}: Variant {result['variant']}")

Prompt engineering interface showing A/B testing results, version history, and collaborative editing features

Production Monitoring and Alerting

Once your LLM application is live, comprehensive monitoring becomes critical. LangSmith's alerting system helps you catch issues before your users do.

Setting Up Intelligent Alerts

LangSmith supports alerts on a wide range of metrics:

   Python
# Configure comprehensive alert system
alert_configurations = [
{
"name": "High Error Rate Alert",
"metric": "error_rate",
"threshold": 0.05, # Alert if error rate exceeds 5%
"aggregation_window": "15_minutes",
"notification_channels": [
    {
        "type": "webhook",
        "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": "Bearer your-token"
        },
        "body_template": {
            "text": "🚨 LangSmith Alert: Error rate exceeded {threshold}% in project",
            "channel": "#ai-alerts",
            "username": "LangSmith Bot"
        }
    },
    {
        "type": "pagerduty",
        "integration_key": "your-pagerduty-key",
        "severity": "warning"
    }
]
},
{
  "name": "Latency Spike Alert",
  "metric": "p95_latency",
  "threshold": 10.0, # Alert if P95 latency exceeds 10 seconds
  "aggregation_window": "10_minutes",
  "notification_channels": [
    {
      "type": "webhook",
      "url": "https://your-monitoring-system.com/alerts",
      "body_template": {
        "alert_type": "latency_spike",
        "threshold": "{threshold}",
        "current_value": "{current_value}",
        "project": "{project_name}",
        "timestamp": "{timestamp}"
      }
    }
  ],
},
{
  "name": "Cost Threshold Alert",
  "metric": "total_cost",
  "threshold": 100.0, # Alert if daily cost exceeds $100
  "aggregation_window": "1_day",
  "notification_channels": [
    {
      "type": "email",
      "recipients": ["team@company.com", "finance@company.com"],
      "subject": "LangSmith Daily Cost Alert - {project_name}",
      "body": "Daily LLM costs have exceeded ${threshold} for project {project_}"
    }
  ],
},
{
  "name": "Feedback Score Drop",
  "metric": "avg_feedback_score",
  "threshold": 0.7, # Alert if average feedback drops below 70%
  "comparison": "less_than",
  "aggregation_window": "1_hour",
  "notification_channels": [
    {
      "type": "webhook",
      "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
      "body_template": {
        "text": "📉 Quality Alert: User feedback scores dropped to {current_va}",
        "channel": "#quality-alerts"
      }
    }
  ]
  }
]      
  

Custom Monitoring Dashboards

Create custom monitoring solutions for specific business metrics:

   Python
# LangSmith Monitoring System Example
from langsmith import Client
import pandas as pd
import json
from datetime import datetime, timedelta

class LangSmithMonitor:
    def __init__(self, client: Client):
        self.client = client

    def get_business_metrics(self, project_name: str, days: int = 7) -> dict:
        """Get business-focused metrics"""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=days)
        # Get all runs in the time period
        runs = list(self.client.list_runs(
            project_name=project_name,
            start_time=start_time.isoformat(),
            end_time=end_time.isoformat(),
            limit=10000
        ))
        # Calculate business metrics
        total_requests = len(runs)
        successful_requests = len([r for r in runs if r.status == 'success'])
        failed_requests = total_requests - successful_requests
        # Cost analysis
        total_cost = sum(r.total_cost or 0 for r in runs)
        cost_by_model = {}
        for run in runs:
            model = run.extra.get('invocation_params', {}).get('model', 'unknown')
            cost_by_model[model] = cost_by_model.get(model, 0) + (run.total_cost or 0)
        # Performance analysis
        latencies = [(r.end_time - r.start_time).total_seconds() for r in runs if r.end_time and r.start_time]
        # User satisfaction analysis
        feedback_scores = []
        for run in runs:
            if hasattr(run, 'feedback_stats') and run.feedback_stats:
                for key, stats in run.feedback_stats.items():
                    if 'avg' in stats:
                        feedback_scores.append(stats['avg'])
        return {
            "period": f"{days} days",
            "total_requests": total_requests,
            "success_rate": successful_requests / total_requests if total_requests > 0 else 0,
            "error_rate": failed_requests / total_requests if total_requests > 0 else 0,
            "total_cost": total_cost,
            "cost_per_request": total_cost / total_requests if total_requests > 0 else 0,
            "cost_by_model": cost_by_model,
            "avg_latency": sum(latencies) / len(latencies) if latencies else 0,
            "p95_latency": pd.Series(latencies).quantile(0.95) if latencies else 0,
            "avg_satisfaction": sum(feedback_scores) / len(feedback_scores) if feedback_scores else 0,
            "satisfaction_responses": len(feedback_scores)
        }
    # LangSmith Daily Report Example
    def generate_daily_report(self, project_name: str) -> str:
        """Generate a comprehensive daily report"""
        metrics = self.get_business_metrics(project_name, days=1)
        week_metrics = self.get_business_metrics(project_name, days=7)
        report = f"""
# LangSmith Daily Report - {project_name}
Date: {datetime.now().strftime('%Y-%m-%d')}
## 📊 Performance Summary (Last 24 Hours)
- **Total Requests**: {metrics['total_requests']:,}
- **Success Rate**: {metrics['success_rate']:.2%}
- **Average Latency**: {metrics['avg_latency']:.2f} seconds
- **P95 Latency**: {metrics['p95_latency']:.2f} seconds
## 💰 Cost Analysis
- **Total Cost**: ${metrics['total_cost']:.2f}
- **Cost per Request**: ${metrics['cost_per_request']:.4f}
- **Top Models by Cost**:
"""
        # Add top 3 models by cost
        sorted_models = sorted(metrics['cost_by_model'].items(), key=lambda x: x[1], reverse=True)
        for model, cost in sorted_models[:3]:
            report += f" - {model}: ${cost:.2f}\n"
        # Add satisfaction metrics if available
        if metrics['satisfaction_responses'] > 0:
            report += f"\n## 😊 User Satisfaction\n"
            report += f"- **Average Score**: {metrics['avg_satisfaction']:.2f}\n"
            report += f"- **Total Responses**: {metrics['satisfaction_responses']}\n"
        # Weekly comparison: how the last 24 hours compare to the weekly average
        report += f"\n## 📈 Weekly Trends\n"
        report += f"- **Request Volume**: {(metrics['total_requests'] * 7) / max(week_metrics['total_requests'], 1):.2f}x the weekly average\n"
        report += f"- **Cost Trend**: {(metrics['total_cost'] * 7) / max(week_metrics['total_cost'], 1):.2f}x the weekly average\n"
        return report

# Use the monitoring system
monitor = LangSmithMonitor(ls_client)
daily_report = monitor.generate_daily_report("your_production_project")
print(daily_report)

# Set up automated reporting (example with email)
def send_daily_report(project_name: str):
    """Send daily report via email or Slack"""
    report = monitor.generate_daily_report(project_name)
    # Example: Send to Slack
    slack_webhook = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    payload = {
        "text": f"``````",
        "channel": "#ai-reports",
        "username": "LangSmith Reporter"
    }
    # In production, you'd actually send this request
    print(f"Would send report to Slack: {len(report)} characters")

Production monitoring dashboard showing real-time alerts, cost tracking, performance metrics, and business KPIs

Enterprise Features and Advanced Integrations

For large-scale deployments, LangSmith offers enterprise features including self-hosting, advanced security, and extensive integration capabilities.

Self-Hosted Deployment Architecture

LangSmith can be deployed in your own infrastructure for maximum security and control:

   YAML
# docker-compose.yml for self-hosted LangSmith
version: '3.8'
services:
  langsmith-frontend:
    image: langchain/langsmith-frontend:latest
    ports:
      - "3000:3000"
    environment:
      - BACKEND_URL=http://langsmith-backend:8000
    depends_on:
      - langsmith-backend
  langsmith-backend:
    image: langchain/langsmith-backend:latest
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:password@postgres:5432/langsmith
      - REDIS_URL=redis://redis:6379
      - CLICKHOUSE_URL=http://clickhouse:8123
    depends_on:
      - postgres
      - redis
      - clickhouse
  langsmith-platform-backend:
    image: langchain/langsmith-platform-backend:latest
    environment:
      - DATABASE_URL=postgresql://user:password@postgres:5432/langsmith
  langsmith-playground:
    image: langchain/langsmith-playground:latest
    environment:
      - BACKEND_URL=http://langsmith-backend:8000
  langsmith-queue:
    image: langchain/langsmith-queue:latest
    environment:
      - REDIS_URL=redis://redis:6379
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_DB=langsmith
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:6-alpine
    volumes:
      - redis_data:/data
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    environment:
      - CLICKHOUSE_DB=langsmith
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_PASSWORD=password
    volumes:
      - clickhouse_data:/var/lib/clickhouse
volumes:
  postgres_data:
  redis_data:
  clickhouse_data:

Advanced Security and Compliance

For enterprises with strict security requirements:

   Python
# Secure LangSmith Client Example
from langsmith import Client
import hashlib
from cryptography.fernet import Fernet

class SecureLangSmithClient:
    def __init__(self, api_key: str, encryption_key: bytes):
        self.client = Client(api_key=api_key)
        self.cipher = Fernet(encryption_key)

    def create_secure_example(self, dataset_id: str, inputs: dict, outputs: dict, metadata: dict):
        """Create example with PII encryption"""
        # Identify and encrypt PII fields
        secure_inputs = self._encrypt_pii_fields(inputs)
        secure_outputs = self._encrypt_pii_fields(outputs)
        # Hash user identifiers
        if 'user_id' in metadata:
            metadata['user_id_hash'] = hashlib.sha256(
                metadata['user_id'].encode()
            ).hexdigest()
            del metadata['user_id']
        return self.client.create_example(
            dataset_id=dataset_id,
            inputs=secure_inputs,
            outputs=secure_outputs,
            metadata=metadata
        )

    def _encrypt_pii_fields(self, data: dict) -> dict:
        """Encrypt potential PII fields"""
        pii_fields = ['email', 'phone', 'ssn', 'credit_card', 'name']
        secure_data = data.copy()
        for field in pii_fields:
            if field in secure_data and isinstance(secure_data[field], str):
                secure_data[field] = self.cipher.encrypt(
                    secure_data[field].encode()
                ).decode()
        return secure_data

    def decrypt_pii_fields(self, data: dict) -> dict:
        """Decrypt PII fields for authorized access"""
        # Implementation for decryption with proper authorization checks
        pass

# Usage with enterprise security
encryption_key = Fernet.generate_key()
secure_client = SecureLangSmithClient("your-api-key", encryption_key)

CI/CD Integration and Automated Workflows

Integrate LangSmith into your development pipeline:

   YAML
# .github/workflows/langsmith-evaluation.yml
name: LangSmith Evaluation Pipeline
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  evaluate-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install langsmith openevals openai
      - name: Run LangSmith Evaluations
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGSMITH_PROJECT: "ci-cd-evaluation-${{ github.run_id }}"
        run: |
          python scripts/run_evaluations.py
      - name: Post Results to PR
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('evaluation_results.json'));
            const comment = `## 🧪 LangSmith Evaluation Results
            - **Accuracy**: ${results.accuracy.toFixed(2)}
            - **Helpfulness**: ${results.helpfulness.toFixed(2)}
            - **Safety**: ${results.safety.toFixed(2)}
            ${results.accuracy >= 0.8 ? '✅' : '❌'} **Quality Gate**: ${results.accuracy >= 0.8 ? 'Passed' : 'Failed'}`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

  deploy-if-passed:
    needs: evaluate-changes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to Production
        run: |
          echo "Deploying to production..."
          # Your deployment script here
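
The workflow assumes a scripts/run_evaluations.py entry point that writes evaluation_results.json for the PR-comment step. That script isn't shown above, so here is a hypothetical sketch; the target function, evaluator, and score aggregation are placeholders to replace with your own, and the shape of the results object may differ across SDK versions:

   Python
# scripts/run_evaluations.py (hypothetical sketch referenced by the workflow above)
import json
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # Placeholder: call into your real application here
    return {"answer": f"Echo: {inputs['question']}"}

def accuracy_evaluator(run, example) -> dict:
    # Placeholder evaluator: just checks that an answer was produced
    return {"key": "accuracy", "score": 1.0 if run.outputs.get("answer") else 0.0}

def main():
    results = evaluate(
        target,
        data="customer_support_dataset",
        evaluators=[accuracy_evaluator],
        experiment_prefix="ci-cd",
    )

    # Aggregate evaluator scores into the JSON the PR-comment step reads.
    # The result shape below is an assumption - adapt to your SDK version.
    totals, counts = {}, {}
    for row in results:
        for er in row["evaluation_results"]["results"]:
            totals[er.key] = totals.get(er.key, 0.0) + (er.score or 0.0)
            counts[er.key] = counts.get(er.key, 0) + 1
    summary = {key: totals[key] / counts[key] for key in totals}

    with open("evaluation_results.json", "w") as f:
        json.dump(summary, f)

if __name__ == "__main__":
    main()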

Enterprise architecture diagram showing self-hosted deployment, security layers, and CI/CD integration

Advanced Use Cases and Real-World Examples

Let's explore some sophisticated real-world applications of LangSmith for complex AI systems.

Multi-Modal AI Application Monitoring

For applications handling text, images, and audio:

   Python
# Multi-modal AI analysis example
from langsmith import traceable
import openai
import base64
from PIL import Image
import io

@traceable(name="Multi-Modal Analysis", run_type="chain")
def analyze_customer_inquiry(text_input: str, image_data: bytes=None, audio_data: bytes=None) -> dict:
    """Analyze customer inquiry across multiple modalities"""
    
    analysis_results = {}
    
    # Process text input
    if text_input:
        text_analysis = analyze_text_sentiment(text_input)
        analysis_results['text_analysis'] = text_analysis
    
    # Process image if provided
    if image_data:
        image_analysis = analyze_product_image(image_data)
        analysis_results['image_analysis'] = image_analysis
    
    # Process audio if provided  
    if audio_data:
        audio_analysis = transcribe_and_analyze_audio(audio_data)
        analysis_results['audio_analysis'] = audio_analysis
    
    # Generate comprehensive response
    comprehensive_response = generate_multi_modal_response(analysis_results)
    
    return {
        "analysis_results": analysis_results,
        "response": comprehensive_response,
        "modalities_processed": len(analysis_results),
        "confidence_scores": extract_confidence_scores(analysis_results)
    }
# Text and Image Analysis with LLM and Vision API

@traceable(name="Text Sentiment Analysis", run_type="llm")
def analyze_text_sentiment(text: str) -> dict:
    """Analyze sentiment and extract key information from text"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Analyze the sentiment and extract key issues from customer text."},
            {"role": "user", "content": text}
        ]
    )
    return {
        "sentiment": "positive",  # Parse from response
        "key_issues": ["billing", "service_quality"],  # Extract from response
        "urgency": "medium",  # Determine urgency level
        "confidence": 0.85
    }

@traceable(name="Product Image Analysis", run_type="tool")
def analyze_product_image(image_data: bytes) -> dict:
    """Analyze product images for defects or issues"""
    # Convert image data for vision API
    base64_image = base64.b64encode(image_data).decode()
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this product image for any defects, damage, or q"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"}
                    }
                ]
            }
        ]
    )
    return {
        "defects_detected": False,
        "quality_score": 0.92,
        "identified_product": "smartphone_case",
        "analysis_confidence": 0.88
    }
# Evaluate multi-modal system
def evaluate_multi_modal_system():
    """Comprehensive evaluation for multi-modal AI system"""
    test_cases = [
        {
            "text": "I'm having issues with my recent order",
            "image_path": "sample_images/damaged_product.jpg",
            "expected_urgency": "high",
            "expected_sentiment": "negative"
        },
        {
            "text": "Love the new features in your app!",
            "expected_urgency": "low",
            "expected_sentiment": "positive"
        }
    ]
    results = []
    for case in test_cases:
        image_data = None
        if case.get("image_path"):
            with open(case["image_path"], "rb") as f:
                image_data = f.read()
        result = analyze_customer_inquiry(
            text_input=case["text"],
            image_data=image_data
        )
        results.append({
            "case": case,
            "result": result,
            "passed": validate_multi_modal_result(result, case)
        })
    return results

# Multi-modal specific evaluator
def multi_modal_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator for multi-modal AI systems"""
    outputs = run.outputs
    expected = example.outputs
    modalities_score = 1.0 if outputs.get("modalities_processed", 0) >= expected.get("expected_modalities", 0) else 0.0
    confidence_scores = outputs.get("confidence_scores", {})
    avg_confidence = sum(confidence_scores.values()) / len(confidence_scores) if confidence_scores else 0
    confidence_score = 1.0 if avg_confidence >= 0.8 else avg_confidence
    overall_score = (modalities_score + confidence_score) / 2
    return {
        "key": "multi_modal_quality",
        "score": overall_score,
        "comment": f< span class="token-string">"Modalities: {modalities_score}, Confidence: {confidence_score:.2f}",
        "metadata": {
            "modalities_processed": outputs.get("modalities_processed", 0),
            "avg_confidence": avg_confidence
        }
    }

Large-Scale RAG System Optimization

For enterprise RAG systems serving thousands of users:

   Python
# Enterprise RAG System with LangSmith tracing
from langsmith import traceable, Client
import numpy as np
from typing import List, Dict
import asyncio
import openai

# Async OpenAI client for the async methods below
async_openai_client = openai.AsyncOpenAI()

class EnterpriseRAGSystem:
    def __init__(self, client: Client):
        self.client = client
        self.vector_store = None # Your vector store
        self.reranker = None # Your reranking model

    @traceable(name="Hierarchical Retrieval", run_type="retriever")
    async def hierarchical_retrieve(self, query: str, user_context: dict) -> List[dict]:
        """Multi-stage retrieval with user context"""
        semantic_results = await self.semantic_search(query, top_k=50)
        filtered_results = self.filter_by_user_context(semantic_results, user_context)
        reranked_results = await self.rerank_results(query, filtered_results, top_k=10)
        return reranked_results

    @traceable(name="Adaptive Response Generation", run_type="llm")
    async def generate_adaptive_response(self, query: str, context: List[dict], user_profile: dict)->dict:
        """Generate response adapted to user profile and context"""
        style_prompt = self.get_style_prompt(user_profile)
        context_text = "\n".join([doc["content"] for doc in context])
        system_prompt = f"""
{style_prompt}
Use the following context to answer the user's question:
{context_text}
Remember to:
- Cite specific sources when making claims
- Adapt your explanation level to the user's expertise: {user_profile.get('expertise_level', 'general')}
- Include relevant examples for their industry: {user_profile.get('industry', 'general')}
"""
        response = await async_openai_client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query}
            ],
            temperature=0.7
        )
        return {
            "response": response.choices.message.content,
            "sources_used": [doc["source"] for doc in context],
            "adaptation_applied": True,
            "user_expertise_level": user_profile.get('expertise_level')
        }

    @traceable(name="Enterprise RAG Pipeline", run_type="chain")
    async def process_query(self, query: str, user_id: str) -> dict:
        """Complete RAG pipeline with enterprise features"""
        user_context = await self.get_user_context(user_id)
        user_profile = await self.get_user_profile(user_id)
        retrieved_docs = await self.hierarchical_retrieve(query, user_context)
        response_data = await self.generate_adaptive_response(query, retrieved_docs, user_profile)
        await self.log_user_interaction(user_id, query, response_data)
        return {
            **response_data,
            "query": query,
            "user_id": user_id,
            "retrieval_count": len(retrieved_docs),
            "personalized": True
        }

# Evaluation framework for the enterprise RAG pipeline
class RAGEvaluationFramework:
    def __init__(self, rag_system: EnterpriseRAGSystem):
        self.rag_system = rag_system

    async def evaluate_retrieval_quality(self, dataset_name: str) -> dict:
        """Measure retrieval precision, recall, and F1 against a labeled dataset"""
        examples = list(ls_client.list_examples(dataset_name=dataset_name))
        results = []
        for example in examples:
            query = example.inputs["query"]
            expected_sources = example.outputs.get("expected_sources", [])
            retrieved_docs = await self.rag_system.hierarchical_retrieve(
                query,
                user_context=example.inputs.get("user_context", {})
            )
            retrieved_sources = [doc["source"] for doc in retrieved_docs]
            precision = self.calculate_precision(retrieved_sources, expected_sources)
            recall = self.calculate_recall(retrieved_sources, expected_sources)
            f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
            results.append({
                "query": query,
                "precision": precision,
                "recall": recall,
                "f1_score": f1_score,
                "retrieved_count": len(retrieved_docs)
            })
        avg_precision = sum(r["precision"] for r in results) / len(results)
        avg_recall = sum(r["recall"] for r in results) / len(results)
        avg_f1 = sum(r["f1_score"] for r in results) / len(results)
        return {
            "avg_precision": avg_precision,
            "avg_recall": avg_recall,
            "avg_f1_score": avg_f1,
            "total_queries": len(results),
            "detailed_results": results
        }

    def calculate_precision(self, retrieved: List[str], expected: List[str]) -> float:
        """Calculate precision@k"""
        if not retrieved:
            return 0.0
        relevant_retrieved = set(retrieved) & set(expected)
        return len(relevant_retrieved) / len(retrieved)

    def calculate_recall(self, retrieved: List[str], expected: List[str]) -> float:
        """Calculate recall@k"""
        if not expected:
            return 1.0
        relevant_retrieved = set(retrieved) & set(expected)
        return len(relevant_retrieved) / len(expected)

# Comprehensive RAG evaluation
async def run_comprehensive_rag_evaluation():
    """Run complete evaluation of RAG system"""
    rag_system = EnterpriseRAGSystem(ls_client)
    evaluator = RAGEvaluationFramework(rag_system)

    # Evaluate retrieval quality directly against the labeled dataset
    retrieval_results = await evaluator.evaluate_retrieval_quality("rag_test_dataset")

    # End-to-end evaluation with LangSmith
    end_to_end_results = evaluate(
        rag_system.process_query,
        data="rag_test_dataset",
        evaluators=[
            Correctness(),
            Helpfulness(),
            custom_rag_evaluator
        ],
        experiment_prefix="rag_comprehensive_eval"
    )

    return {
        "retrieval_metrics": retrieval_results,
        "end_to_end_metrics": end_to_end_results
    }

def custom_rag_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator for RAG-specific metrics"""
    outputs = run.outputs
    response = outputs.get("response", "")
    sources_used = outputs.get("sources_used", [])
    # Check citation quality
    citation_score = 1.0 if len(sources_used) >= 2 else 0.5
    # Check response completeness
    word_count = len(response.split())
    completeness_score = min(word_count / 100, 1.0)
    # Check personalization effectiveness
    personalization_score = 1.0 if outputs.get("personalized", False) else 0.0
    overall_score = (citation_score + completeness_score + personalization_score) / 3
    return {
        "key": "rag_quality",
        "score": overall_score,
        "comment": f"Citations: {citation_score}, Completeness: {completeness_score}' Personalization: {personalization_score}",
        "metadata": {
            "sources_count": len(sources_used),
            "word_count": word_count,
            "personalized": outputs.get("personalized", False)
        }
    }

Complex RAG system architecture showing hierarchical retrieval, reranking, and personalized response generation

Cost Optimization and Performance Tuning

LLM applications can become expensive quickly. LangSmith provides detailed cost tracking and optimization tools:

from langsmith import Client, traceable
from openai import OpenAI
import pandas as pd
from datetime import datetime, timedelta

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

class CostOptimizationAnalyzer:
    def __init__(self, client: Client):
        self.client = client
    
    def analyze_cost_patterns(self, project_name: str, days: int = 30) -> dict:
        """Comprehensive cost analysis and optimization recommendations"""
        
        end_time = datetime.now()
        start_time = end_time - timedelta(days=days)
        
        # Get all runs with cost data
        runs = list(self.client.list_runs(
            project_name=project_name,
            start_time=start_time.isoformat(),
            end_time=end_time.isoformat(),
            limit=50000
        ))
        
        # Analyze cost by different dimensions
        cost_analysis = self._analyze_cost_dimensions(runs)
        
        # Generate optimization recommendations
        recommendations = self._generate_cost_recommendations(cost_analysis)
        
        return {
            "analysis_period": f"{days} days",
            "total_cost": cost_analysis["total_cost"],
            "cost_breakdown": cost_analysis,
            "optimization_recommendations": recommendations,
            "potential_savings": self._calculate_potential_savings(cost_analysis)
        }
    
    def _analyze_cost_dimensions(self, runs: list) -> dict:
        """Analyze costs across multiple dimensions"""
        
        df = pd.DataFrame([{
            'run_id': run.id,
            'model': run.extra.get('invocation_params', {}).get('model', 'unknown'),
            'total_tokens': run.total_tokens or 0,
            'prompt_tokens': getattr(run, 'prompt_tokens', 0),
            'completion_tokens': getattr(run, 'completion_tokens', 0),
            'total_cost': run.total_cost or 0,
            'latency': (run.end_time - run.start_time).total_seconds() if run.end_time and run.start_time else None,
            'run_type': run.run_type,
            'status': run.status,
            'timestamp': run.start_time,
            'user_id': run.inputs.get('user_id') if run.inputs else None,
            'session_id': run.inputs.get('session_id') if run.inputs else None
        } for run in runs])
        
        analysis = {
            "total_cost": df['total_cost'].sum(),
            "total_requests": len(df),
            "avg_cost_per_request": df['total_cost'].mean(),
            "cost_by_model": df.groupby('model')['total_cost'].agg(['sum', 'count', 'mean']).to_dict(),
            "cost_by_run_type": df.groupby('run_type')['total_cost'].agg(['sum', 'count', 'mean']).to_dict(),
            "daily_costs": df.groupby(df['timestamp'].dt.date)['total_cost'].sum().to_dict(),
            "high_cost_requests": df[df['total_cost'] > df['total_cost'].quantile(0.95)].to_dict('records'),
            "token_efficiency": {
                "avg_tokens_per_request": df['total_tokens'].mean(),
                "cost_per_token": df['total_cost'].sum() / df['total_tokens'].sum() if df['total_tokens'].sum() > 0 else 0,
                "prompt_to_completion_ratio": df['prompt_tokens'].sum() / df['completion_tokens'].sum() if df['completion_tokens'].sum() > 0 else 0
            }
        }
        
        return analysis
    
    def _generate_cost_recommendations(self, analysis: dict) -> list:
        """Generate specific cost optimization recommendations"""
        
        recommendations = []
        # Model optimization recommendations
        model_costs = analysis["cost_by_model"]["sum"]
        most_expensive_model = max(model_costs, key=model_costs.get)

        if "gpt-4" in most_expensive_model and model_costs[most_expensive_model] > analysis["total_cost"] * 0.6:
            recommendations.append({
                "type": "model_optimization",
                "priority": "high",
                "description": f"Consider using GPT-3.5-turbo for simpler tasks. {most_expensive_model} accounts for {model_costs[most_expensive_model]/analysis['total_cost']:.1%} of total costs.",
                "potential_savings": model_costs[most_expensive_model] * 0.7,
                "implementation": "Route simple queries to cheaper models using a classifier"
            })

        # Token efficiency recommendations
        token_stats = analysis["token_efficiency"]
        if token_stats["prompt_to_completion_ratio"] > 5:
            recommendations.append({
                "type": "prompt_optimization",
                "priority": "medium",
                "description": f"High prompt-to-completion ratio ({token_stats['prompt_to_completion_ratio']:.1f}:1) suggests prompts may be too long",
                "potential_savings": analysis["total_cost"] * 0.2,
                "implementation": "Optimize prompts to reduce length while maintaining quality"
            })

        # Request pattern optimization
        if analysis["avg_cost_per_request"] > 0.10:
            recommendations.append({
                "type": "request_optimization",
                "priority": "medium",
                "description": f"High average cost per request (${analysis['avg_cost_per_request']:.3f})",
                "potential_savings": analysis["total_cost"] * 0.15,
                "implementation": "Implement caching for common queries and batch similar requests"
            })

        return recommendations

    def implement_model_routing(self, query_complexity_threshold: float = 0.7) -> callable:
        """Implement intelligent model routing to optimize costs"""

        @traceable(name="Cost-Optimized Model Router", run_type="chain")
        def route_to_optimal_model(query: str, user_context: dict = None) -> dict:
            """Route queries to the most cost-effective model"""

            complexity_score = self.assess_query_complexity(query)

            if complexity_score < query_complexity_threshold:
                model = "gpt-3.5-turbo"
                max_tokens = 500
            elif complexity_score < 0.9:
                model = "gpt-4"
                max_tokens = 1000
            else:
                model = "gpt-4-32k"
                max_tokens = 2000

            response = openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": query}],
                max_tokens=max_tokens,
                temperature=0.7
            )

            return {
                "response": response.choices[0].message.content,
                "model_used": model,
                "complexity_score": complexity_score,
                "estimated_cost": self.estimate_cost(response.usage, model),
                "routing_decision": f"Routed to {model} based on complexity {complexity_score:.2f}"
            }

        return route_to_optimal_model

    def assess_query_complexity(self, query: str) -> float:
        """Assess query complexity to determine optimal model"""

        complexity_indicators = {
            "length": len(query.split()) / 100,
            "question_words": len([w for w in query.lower().split() if w in ["what", "how", "why", "when", "where", "which"]]) / 10,
            "technical_terms": len([w for w in query.lower().split() if w in ["algorithm", "implementation", "architecture", "optimization"]]) / 5,
            "reasoning_required": 1.0 if any(phrase in query.lower() for phrase in ["compare", "analyze", "explain", "evaluate"]) else 0.3
        }
        weights = {"length": 0.2, "question_words": 0.3, "technical_terms": 0.3, "reasoning_required": 0.4}
        complexity = sum(min(score, 1.0) * weights[factor] for factor, score in complexity_indicators.items())

        return min(complexity, 1.0)

    def setup_cost_alerts(self, project_name: str, daily_budget: float = 100.0):
        """Set up intelligent cost alerts"""

        alert_config = {
            "name": f"Smart Cost Alert - {project_name}",
            "conditions": [
                {"metric": "daily_cost", "threshold": daily_budget * 0.8, "type": "warning"},
                {"metric": "daily_cost", "threshold": daily_budget, "type": "critical"},
                {"metric": "cost_per_request", "threshold": 0.15, "type": "optimization_opportunity"}
            ],
            "actions": [
                {"type": "webhook",
                 "url": "https://your-cost-management-system.com/alert",
                 "body": {
                     "project": project_name,
                     "alert_type": "{alert_type}",
                     "current_cost": "{current_value}",
                     "threshold": "{threshold}",
                     "recommendations": "{optimization_recommendations}"
                 }}
            ]
        }
        return alert_config

# Usage example
cost_analyzer = CostOptimizationAnalyzer(ls_client)

# Analyze costs and get recommendations
cost_report = cost_analyzer.analyze_cost_patterns("production_app", days=30)
print(f"Total 30-day cost: ${cost_report['total_cost']:.2f}")
print(f"Potential monthly savings: ${cost_report['potential_savings']:.2f}")

# Implement cost-optimized routing
optimized_router = cost_analyzer.implement_model_routing(query_complexity_threshold=0.6)

# Test the router
test_queries = [
    "What is the weather today?",  # Simple - should use GPT-3.5
    "Explain the differences between REST and GraphQL APIs",  # Medium - GPT-4
    "Design a complete microservices architecture for an e-commerce platform"  # Complex - GPT-4
]

for query in test_queries:
    result = optimized_router(query)
    print(f"Query: {query[:50]}...")
    print(f"Model: {result['model_used']}, Complexity: {result['complexity_score']:.2f}")
    print(f"Estimated cost: ${result['estimated_cost']:.4f}\n")

Cost optimization dashboard showing model usage patterns, potential savings, and intelligent routing decisions

Future-Proofing Your LLM Applications

As the AI landscape evolves rapidly, LangSmith helps you build adaptable and maintainable LLM applications.

Building Evaluation-Driven Development Workflows

# Framework for evaluation-driven development of LLM applications
class EvaluationDrivenWorkflow:
    """Framework for evaluation-driven development of LLM applications"""
    
    def __init__(self, client: Client, base_project: str):
        self.client = client
        self.base_project = base_project
        self.evaluation_history = []
    
    def create_feature_branch_evaluation(self, feature_name: str, changes_description: str):
        """Create evaluation framework for new feature development"""
        
        project_name = f"{'{'}self.base_project{'}'}_feature_{'{'}feature_name{'}'}"
        
        # Define evaluation criteria specific to the feature
        feature_evaluators = self.get_feature_evaluators(feature_name)
        
        # Create baseline measurement
        baseline_results = self.run_baseline_evaluation(project_name, feature_evaluators)
        
        return {
            "project_name": project_name,
            "feature_name": feature_name,
            "baseline_results": baseline_results,
            "evaluators": feature_evaluators,
            "changes_description": changes_description
        }
    
    def continuous_evaluation_loop(self, project_name: str, evaluation_interval: int = 3600):
        """Run continuous evaluation loop for production monitoring"""
        
        import time
        import threading
        
        def evaluation_worker():
            while True:
                try:
                    # Run evaluation on recent data
                    results = self.run_production_evaluation(project_name)
                    
                    # Check for regressions
                    regressions = self.detect_regressions(results)
                    
                    if regressions:
                        self.trigger_regression_alert(project_name, regressions)
                    
                    # Store results for trending
                    self.store_evaluation_results(project_name, results)
                    
                except Exception as e:
                    print(f"Evaluation loop error: {e}")
                
                time.sleep(evaluation_interval)
        
        # Start evaluation worker thread
        worker_thread = threading.Thread(target=evaluation_worker, daemon=True)
        worker_thread.start()
        
        return worker_thread
    
    def automated_regression_testing(self, old_version: str, new_version: str) -> dict:
        """Automated regression testing between versions"""
        
        # Run parallel evaluations
        old_results = evaluate(
            target=self.get_model_version(old_version),
            data="regression_test_dataset",
            evaluators=[Correctness(), Helpfulness(), Safety()],
            experiment_prefix=f"regression_test_{'{'}old_version{'}'}"
        )
        
        new_results = evaluate(
            target=self.get_model_version(new_version),
            data="regression_test_dataset", 
            evaluators=[Correctness(), Helpfulness(), Safety()],
            experiment_prefix=f"regression_test_{'{'}new_version{'}'}"
        )
        
        # Compare results
        comparison = self.compare_evaluation_results(old_results, new_results)
        
        return {
            "old_version": old_version,
            "new_version": new_version,
            "comparison": comparison,
            "regression_detected": comparison["overall_change"] < -0.05,
            "recommendation": self.get_deployment_recommendation(comparison)
        }
# Advanced monitoring and alerting system
class IntelligentMonitoring:
    """Intelligent monitoring system that learns from patterns"""
    
    def __init__(self, client: Client):
        self.client = client
        self.anomaly_detector = self.setup_anomaly_detection()
    
    def setup_anomaly_detection(self):
        """Set up ML-based anomaly detection for LLM metrics"""
        
        from sklearn.ensemble import IsolationForest
        import numpy as np
        
        # This would be trained on historical data
        detector = IsolationForest(contamination=0.1, random_state=42)
        
        return detector
    
    def detect_production_anomalies(self, project_name: str) -> dict:
        """Detect anomalies in production LLM behavior"""
        
        # Get recent production data
        recent_runs = list(self.client.list_runs(
            project_name=project_name,
            start_time=(datetime.now() - timedelta(hours=24)).isoformat(),
            limit=1000
        ))
        
        if len(recent_runs) < 50:
            return {"status": "insufficient_data", "message": "Need more data for anomaly detection"}
        
        # Extract features for anomaly detection
        features = self.extract_anomaly_features(recent_runs)
        
        # Detect anomalies
        anomaly_scores = self.anomaly_detector.decision_function(features)
        anomalies = np.where(anomaly_scores < -0.1)[0]
        
        if len(anomalies) > 0:
            anomalous_runs = [recent_runs[i] for i in anomalies]
            
            return {
                "status": "anomalies_detected",
                "anomaly_count": len(anomalies),
                "total_runs": len(recent_runs),
                "anomaly_percentage": len(anomalies) / len(recent_runs),
                "anomalous_runs": [run.id for run in anomalous_runs],
                "patterns": self.analyze_anomaly_patterns(anomalous_runs)
            }
        
        return {"status": "normal", "message": "No anomalies detected"}
    
    def extract_anomaly_features(self, runs: list) -> np.ndarray:
        """Extract features for anomaly detection"""
        
        features = []
        for run in runs:
            feature_vector = [
                run.total_tokens or 0,
                run.total_cost or 0,
                (run.end_time - run.start_time).total_seconds() if run.end_time and run.start_time else 0,
                len(run.inputs.get("messages", [])) if run.inputs else 0,
                len(str(run.outputs)) if run.outputs else 0,
                1 if run.error else 0,
                len(run.events or [])
            ]
            features.append(feature_vector)
        
        return np.array(features)
    
    def analyze_anomaly_patterns(self, anomalous_runs: list) -> dict:
        """Analyze patterns in anomalous runs"""
        
        patterns = {
            "high_cost_runs": len([r for r in anomalous_runs if (r.total_cost or 0) > 1.0]),
            "high_latency_runs": len([r for r in anomalous_runs if r.end_time and r.start_time and (r.end_time - r.start_time).total_seconds() > 30]),
            "error_runs": len([r for r in anomalous_runs if r.error]),
            "common_inputs": self.find_common_input_patterns(anomalous_runs),
            "time_patterns": self.analyze_temporal_patterns(anomalous_runs)
        }
        
        return patterns

# Integration with external systems
class ExternalIntegrations:
    """Integrate LangSmith with external systems and tools"""
    
    def __init__(self, client: Client):
        self.client = client
    
    def setup_datadog_integration(self, datadog_api_key: str, datadog_app_key: str):
        """Send LangSmith metrics to Datadog"""

        import time
        from datadog import initialize, api

        initialize(api_key=datadog_api_key, app_key=datadog_app_key)

        def send_metrics_to_datadog(project_name: str):
            metrics = self.get_project_metrics(project_name)
            
            # Send custom metrics
            api.Metric.send([
                {
                    'metric': 'langsmith.requests.total',
                    'points': [(time.time(), metrics['total_requests'])],
                    'tags': [f'project:{project_name}']
                },
                {
                    'metric': 'langsmith.cost.total',
                    'points': [(time.time(), metrics['total_cost'])],
                    'tags': [f'project:{project_name}']
                },
                {
                    'metric': 'langsmith.latency.p95',
                    'points': [(time.time(), metrics['p95_latency'])],
                    'tags': [f'project:{project_name}']
                }
            ])
        
        return send_metrics_to_datadog
    
    def setup_slack_notifications(self, webhook_url: str):
        """Enhanced Slack notifications with rich formatting"""

        import time
        import requests
        
        def send_evaluation_report(results: dict, project_name: str):
            """Send formatted evaluation report to Slack"""
            
            color = "good" if results.get("overall_score", 0) > 0.8 else "warning" if results.get("overall_score", 0) > 0.6 else "danger"
            
            payload = {
                "attachments": [
                    {
                        "color": color,
                        "title": f"📊 LangSmith Evaluation Report - {project_name}",
                        "fields": [
                            {
                                "title": "Overall Score",
                                "value": f"{results.get('overall_score', 0):.2%}",
                                "short": True
                            },
                            {
                                "title": "Tests Run",
                                "value": str(results.get('total_tests', 0)),
                                "short": True
                            },
                            {
                                "title": "Accuracy",
                                "value": f"{results.get('accuracy', 0):.2%}",
                                "short": True
                            },
                            {
                                "title": "Helpfulness",
                                "value": f"{results.get('helpfulness', 0):.2%}",
                                "short": True
                            }
                        ],
                        "footer": "LangSmith Evaluation System",
                        "ts": int(time.time())
                    }
                ]
            }
            
            response = requests.post(webhook_url, json=payload)
            return response.status_code == 200
        
        return send_evaluation_report

Future-proofing architecture showing evaluation-driven development, anomaly detection, and external integrations

Best Practices and Common Pitfalls to Avoid

Based on real-world deployments and user experiences, here are the most important practices for LangSmith success.

Performance Optimization and Resource Management

# Performance optimization strategies
class LangSmithOptimizer:
    def __init__(self, client: Client):
        self.client = client
        
    def optimize_tracing_overhead(self):
        """Minimize tracing overhead in high-volume applications"""
        
        # Strategy 1: Intelligent sampling
        import os
        import random
        
        def should_trace_request(user_id: str, request_type: str) -> bool:
            """Intelligent sampling based on user and request characteristics"""
            
            # Always trace for specific users (VIP customers, internal testing)
            vip_users = os.environ.get("VIP_USERS", "").split(",")
            if user_id in vip_users:
                return True
            
            # Higher sampling rate for errors and edge cases
            if request_type in ["error_recovery", "complex_query"]:
                return random.random() < 0.5  # 50% sampling
            
            # Lower sampling for routine operations
            base_sampling_rate = float(os.environ.get("LANGSMITH_SAMPLING_RATE", "0.1"))
            return random.random() < base_sampling_rate
        
        return should_trace_request
    
    def optimize_data_retention(self, project_name: str):
        """Optimize data retention and storage costs"""
        
        retention_policy = {
            "production_traces": {
                "high_value": 365,  # Keep high-value traces for 1 year
                "standard": 90,     # Keep standard traces for 3 months
                "low_value": 30     # Keep low-value traces for 1 month
            },
            "evaluation_results": 365,  # Keep evaluation results for 1 year
            "feedback_data": 730        # Keep feedback data for 2 years
        }
        
        # Implement retention policy
        cutoff_date = datetime.now() - timedelta(days=retention_policy["production_traces"]["low_value"])
        
        # Archive or delete old low-value traces
        old_runs = list(self.client.list_runs(
            project_name=project_name,
            end_time=cutoff_date.isoformat(),
            limit=10000
        ))
        
        low_value_runs = [
            run for run in old_runs 
            if self.classify_trace_value(run) == "low_value"
        ]
        
        return {
            "total_old_runs": len(old_runs),
            "low_value_runs": len(low_value_runs),
            "retention_policy": retention_policy,
            "action": f"Archive {len(low_value_runs)} low-value traces"
        }

    def classify_trace_value(self, run) -> str:
        """Classify trace value for retention decisions"""

        # High value: traces with errors, high costs, or user feedback
        if run.error or (run.total_cost and run.total_cost > 0.5):
            return "high_value"

        if hasattr(run, 'feedback_stats') and run.feedback_stats:
            return "high_value"

        # Standard value: regular successful traces
        if run.status == "success":
            return "standard"

        # Low value: simple, successful traces with no special characteristics
        return "low_value"

Security and Privacy Best Practices

def implement_data_privacy_controls():
    """Implement comprehensive data privacy and security controls"""
    
    from cryptography.fernet import Fernet
    from datetime import datetime
    import hashlib
    import re
    
    class PrivacyAwareLangSmith:
        def __init__(self, client: Client, encryption_key: bytes):
            self.client = client
            self.cipher = Fernet(encryption_key)
            self.pii_patterns = {
                'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
                'phone': r'\b\d{3}-\d{3}-\d{4}\b',
                'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
                'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
            }
        
        def sanitize_data(self, data: dict, context: str = "trace") -> dict:
            """Sanitize sensitive data before sending to LangSmith"""
            sanitized = {}
            for key, value in data.items():
                if isinstance(value, str):
                    sanitized[key] = self.redact_pii(value)
                elif isinstance(value, dict):
                    sanitized[key] = self.sanitize_data(value, context)
                else:
                    sanitized[key] = value

            return sanitized

        def redact_pii(self, text: str) -> str:
            """Redact PII from text while preserving utility"""
            redacted_text = text
            for pii_type, pattern in self.pii_patterns.items():
                redacted_text = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", redacted_text)
            return redacted_text

        def secure_trace_creation(self, inputs: dict, outputs: dict, metadata: dict):
            """Create traces with privacy controls"""
            safe_inputs = self.sanitize_data(inputs, "input")
            safe_outputs = self.sanitize_data(outputs, "output")
            privacy_metadata = {
                **metadata,
                "privacy_applied": True,
                "sanitization_timestamp": datetime.now().isoformat(),
                "data_classification": self.classify_data_sensitivity(inputs)
            }
            return {
                "inputs": safe_inputs,
                "outputs": safe_outputs,
                "metadata": privacy_metadata
            }

        def classify_data_sensitivity(self, data: dict) -> str:
            """Classify data sensitivity level"""
            data_str = str(data).lower()
            if any(re.search(pattern, data_str) for pattern in self.pii_patterns.values()):
                return "high_sensitivity"
            elif any(word in data_str for word in ["confidential", "private", "internal"]):
                return "medium_sensitivity"
            else:
                return "low_sensitivity"

        def audit_data_access(self, user_id: str, action: str, resource_id: str):
            """Audit data access for compliance"""
            audit_entry = {
                "timestamp": datetime.now().isoformat(),
                "user_id": hashlib.sha256(user_id.encode()).hexdigest(),  # Hash user ID
                "action": action,
                "resource_id": resource_id,
                "ip_address": "[REDACTED]",  # Don't log IP addresses
                "user_agent": "[REDACTED]"   # Don't log user agents
            }
            # Log to secure audit system
            # self.log_to_audit_system(audit_entry)
            return audit_entry

    return PrivacyAwareLangSmith

Team Collaboration and Workflow Optimization

def setup_team_workflows():
    """Set up optimized team workflows for LangSmith"""
    
    class TeamWorkflowManager:
        def __init__(self, client: Client):
            self.client = client
            
        def setup_role_based_access(self):
            """Define role-based access patterns"""
            
            roles = {
                "data_scientist": {
                    "permissions": ["read_traces", "create_datasets", "run_evaluations"],
                    "projects": ["research_*", "experiments_*"],
                    "restrictions": ["no_production_access"]
                },
                "ml_engineer": {
                    "permissions": ["read_traces", "create_prompts", "run_evaluations", "deploy_models"],
                    "projects": ["staging_*", "development_*"],
                    "restrictions": ["limited_production_access"]
                },
                "product_manager": {
                    "permissions": ["read_traces", "view_dashboards", "create_annotation_queues"],
                    "projects": ["*"],
                    "restrictions": ["no_code_access"]
                },
                "admin": {
                    "permissions": ["all"],
                    "projects": ["*"],
                    "restrictions": []
                }
            }
            
            return roles
        
        def create_review_workflow(self, project_name: str):
            """Create a code review workflow for LLM changes"""
            
            workflow = {
                "stages": [
                    {
                        "name": "development",
                        "requirements": ["unit_tests_pass", "basic_evaluation_pass"],
                        "reviewers": ["ml_engineer"],
                        "auto_promotion": False
                    },
                    ...
                ],
                "evaluation_gates": {
                    "development": {"accuracy": 0.7, "safety": 0.9},
                    ...
                }
            }
            
            return workflow
        
        def automated_quality_gates(self, project_name: str, target_stage: str):
            """Implement automated quality gates"""
            
            def quality_gate_check(evaluation_results: dict, stage: str) -> dict:
                """Check if evaluation results meet quality gate requirements"""
                
                stage_requirements = {
                    "staging": {...},
                    "production": {...}
                }
                
                requirements = stage_requirements.get(stage, {})
                passed_checks = []
                failed_checks = []
                
                for metric, threshold in requirements.items():
                    actual_value = evaluation_results.get(metric, 0)
                    
                    if actual_value >= threshold:
                        passed_checks.append({...})
                    else:
                        failed_checks.append({...})
                
                return {
                    "stage": stage,
                    "overall_status": "passed" if not failed_checks else "failed",
                    "passed_checks": passed_checks,
                    "failed_checks": failed_checks,
                    "pass_rate": len(passed_checks) / (len(passed_checks) + len(failed_checks)) if (passed_checks or failed_checks) else 0
                }
            
            return quality_gate_check

    return TeamWorkflowManager

# Common pitfalls and how to avoid them
def avoid_common_pitfalls():
    """Guide for avoiding common LangSmith pitfalls"""
    
    pitfalls_and_solutions = {
        "evaluation_dataset_quality": {
            "problem": "Low-quality evaluation datasets lead to misleading results",
            "solution": "Implement dataset review processes, use schema validation, and regularly audit dataset quality",
            "code_example": """
# Good: Comprehensive dataset validation
def validate_dataset_quality(dataset_name: str) -> dict:
    examples = list(client.list_examples(dataset_name=dataset_name))
    
    quality_checks = {
        "sufficient_size": len(examples) >= 50,
        "input_diversity": len(set(str(ex.inputs) for ex in examples)) / len(examples) > 0.8,
        "output_quality": all(len(str(ex.outputs)) > 10 for ex in examples),
        "balanced_difficulty": check_difficulty_distribution(examples)
    }
    
    return quality_checks
"""
        },
        "over_reliance_on_automated_evaluation": {
            "problem": "Relying only on automated evaluators without human validation",
            "solution": "Combine automated evaluators with human feedback and regular spot checks",
            "code_example": """
# Good: Hybrid evaluation approach
def comprehensive_evaluation(app_function, dataset_name):
    # Automated evaluation
    auto_results = evaluate(
        app_function,
        data=dataset_name,
        evaluators=[Correctness(), Helpfulness()]
    )
    
    # Human evaluation on sample
    sample_for_human_review = create_annotation_queue(
        dataset_name, 
        sample_size=20,
        criteria=["accuracy", "tone", "completeness"]
    )
    
    return {"automated": auto_results, "human_review_queue": sample_for_human_review}
"""
        },
        "insufficient_error_handling": {
            "problem": "Not properly handling LangSmith API errors and failures",
            "solution": "Implement robust error handling and fallback mechanisms",
            "code_example": """
# Good: Robust error handling
@traceable(name="Resilient LLM Call")
def resilient_llm_call(prompt: str, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            response = llm.invoke(prompt)
            return {"response": response, "attempt": attempt + 1}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"error": str(e), "attempts": max_retries}
            time.sleep(2 ** attempt)  # Exponential backoff
"""
        }
    }
    
    return pitfalls_and_solutions

Team collaboration interface showing role-based access, review workflows, and quality gates

LangSmith represents a paradigm shift in how we build, test, and maintain LLM applications. By providing comprehensive observability, sophisticated evaluation frameworks, and production-ready monitoring capabilities, it transforms the chaotic process of LLM development into a systematic, data-driven discipline.

Key Takeaways

Observability is Non-Negotiable: In the world of non-deterministic AI systems, you need complete visibility into what your models are doing. LangSmith's tracing capabilities provide the deep insights necessary to debug, optimize, and maintain complex LLM applications.

Evaluation-Driven Development: Traditional testing approaches don't work for LLM applications. LangSmith's evaluation framework, combining automated assessments with human feedback, enables you to build reliable AI systems that improve over time. 

Production Monitoring Saves Businesses: The difference between a successful LLM application and a failed one often comes down to production monitoring. LangSmith's alerting and dashboard capabilities help you catch issues before they impact users. 

Cost Control is Critical: LLM applications can become expensive quickly. LangSmith's cost tracking and optimization features help you build sustainable, profitable AI applications. 

Team Collaboration Accelerates Innovation: LangSmith's collaboration features, from prompt versioning to annotation queues, enable teams to work together effectively on AI projects. 

Future-Proofing Through Standards: By building on LangSmith's framework-agnostic platform, you're investing in tools and practices that will adapt as the AI landscape evolves.

Getting Started Checklist

Ready to transform your LLM development process? Here's your action plan:

Week 1: Foundation

  • [ ] Set up LangSmith account and API keys 
  • [ ] Implement basic tracing in your existing application 
  • [ ] Create your first dataset with 20-50 examples 
  • [ ] Run your first evaluation (a minimal sketch of these first steps follows this list)
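
If you want a concrete starting point, here is a minimal sketch of those Week 1 steps. It assumes a recent LangSmith SDK, the LANGSMITH_TRACING and LANGSMITH_API_KEY environment variables, and a hypothetical answer_question function standing in for your application:

import os
from langsmith import Client, traceable, evaluate

# Enable tracing; LANGSMITH_API_KEY must also be set in the environment
os.environ["LANGSMITH_TRACING"] = "true"

client = Client()

@traceable(name="Answer Question")  # basic tracing on your existing app function
def answer_question(question: str) -> str:
    # Call your LLM here; a canned answer keeps the sketch self-contained
    return f"You asked about: {question}"

# First dataset (aim for 20-50 examples in practice)
dataset = client.create_dataset(dataset_name="getting_started_qa")
client.create_examples(
    inputs=[{"question": "What does LangSmith trace?"}],
    outputs=[{"answer": "LLM calls, tools, and chains"}],
    dataset_id=dataset.id,
)

def contains_answer(run, example) -> dict:
    """Toy evaluator: checks whether the reference answer appears in the output."""
    score = 1.0 if example.outputs["answer"].lower() in str(run.outputs).lower() else 0.0
    return {"key": "contains_answer", "score": score}

# First evaluation run
evaluate(
    lambda inputs: answer_question(inputs["question"]),
    data="getting_started_qa",
    evaluators=[contains_answer],
    experiment_prefix="week1_baseline",
)

Once this runs, the traces and the week1_baseline experiment should appear in your LangSmith project, giving you the baseline that the later weeks build on.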

Week 2: Expansion

  • [ ] Set up production monitoring and alerts 
  • [ ] Create annotation queues for human feedback (see the feedback sketch after this list)
  • [ ] Implement custom evaluators for your use case
  • [ ] Start tracking costs and optimization opportunities
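
For the human-feedback items, a small sketch of attaching reviewer judgments to traced runs with the client's create_feedback call; the production_app project name and the error-based selection rule are illustrative assumptions:

from langsmith import Client

client = Client()

def record_reviewer_feedback(run_id: str, helpful: bool, notes: str = "") -> None:
    """Attach a human reviewer's judgment to an individual traced run."""
    client.create_feedback(
        run_id,
        key="reviewer_helpfulness",  # custom feedback key, shown alongside the run
        score=1.0 if helpful else 0.0,
        comment=notes,
    )

# Example: flag recent errored runs for human review
for run in client.list_runs(project_name="production_app", limit=20):
    if run.error:
        record_reviewer_feedback(run.id, helpful=False, notes="Errored run flagged for review")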

Week 3: Optimization

  • [ ] Set up A/B testing for prompt variations (a prompt-comparison sketch follows this list)
  • [ ] Implement intelligent model routing for cost optimization 
  • [ ] Create automated evaluation pipelines 
  • [ ] Establish team workflows and review processes
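
For the A/B testing item, one lightweight pattern is to run the same dataset through two prompt variants as separate experiments and compare them side by side in the Experiments view. A sketch under that assumption, reusing the Week 1 dataset and evaluator, with illustrative prompt texts:

from langsmith import evaluate
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_VARIANTS = {
    "prompt_v1": "Answer the question concisely.",
    "prompt_v2": "Answer the question concisely and cite one source.",
}

def make_target(system_prompt: str):
    """Build an evaluation target bound to one prompt variant."""
    def target(inputs: dict) -> dict:
        response = openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": inputs["question"]},
            ],
        )
        return {"answer": response.choices[0].message.content}
    return target

# One experiment per variant; both run against the same dataset for comparison
for variant_name, system_prompt in PROMPT_VARIANTS.items():
    evaluate(
        make_target(system_prompt),
        data="getting_started_qa",
        evaluators=[contains_answer],  # evaluator from the Week 1 sketch
        experiment_prefix=variant_name,
    )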

Month 2 and Beyond: Mastery

  • [ ] Build comprehensive evaluation suites covering all use cases 
  • [ ] Implement advanced monitoring and anomaly detection 
  • [ ] Create synthetic datasets for edge case testing 
  • [ ] Integrate with your CI/CD pipeline for automated quality gates (see the gate sketch below)
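
For the CI/CD item, one hedged approach is a small gate script that records an evaluation in LangSmith and exits non-zero when quality drops below a threshold, so the pipeline blocks the deploy. The threshold, dataset name, and reuse of the earlier answer_question and contains_answer sketches are all illustrative:

import sys
from langsmith import Client, evaluate

client = Client()
MIN_AVG_SCORE = 0.8  # quality-gate threshold; tune per project

def run_quality_gate(dataset_name: str = "regression_test_dataset") -> float:
    """Record an experiment in LangSmith and return a locally computed gate score."""
    # Full experiment for later inspection in the LangSmith UI
    evaluate(
        lambda inputs: answer_question(inputs["question"]),  # app under test (Week 1 sketch)
        data=dataset_name,
        evaluators=[contains_answer],
        experiment_prefix="ci_quality_gate",
    )
    # Compute the pass/fail score locally so the CI decision stays simple and explicit
    scores = []
    for example in client.list_examples(dataset_name=dataset_name):
        answer = answer_question(example.inputs["question"])
        scores.append(1.0 if example.outputs["answer"].lower() in answer.lower() else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    avg_score = run_quality_gate()
    print(f"Quality gate score: {avg_score:.2f} (minimum {MIN_AVG_SCORE})")
    sys.exit(0 if avg_score >= MIN_AVG_SCORE else 1)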

Resources for Continued Learning

  • LangSmith Documentation: Comprehensive guides and API references 
  • LangChain Community: Active community for sharing best practices 
  • Evaluation Frameworks: Explore advanced evaluation methodologies 
  • Case Studies: Learn from real-world LangSmith implementations 
  • Training Programs: Consider formal training in LLM operations and evaluation

The future of AI applications belongs to those who can build reliable, observable, and maintainable systems. LangSmith provides the foundation for that future.

Ready to build production-ready LLM applications? Start with LangSmith's free tier today and experience the difference that proper observability and evaluation make. Your users – and your development team – will thank you for taking the time to build AI applications the right way.

Don't let your LLM applications remain black boxes. Make them observable, testable, and reliable with LangSmith.

Take action today: Transform your LLM development workflow with LangSmith. Start building applications that are not just impressive demos, but production-ready systems that deliver real value to your users and business.

SaratahKumar C

Founder & CEO, Psitron Technologies