Handit logo

🔥 The Open Source Engine that Auto-Improves Your AI 🔥

npm version pypi version license GitHub stars Discord

🚀 Quick Start • 📋 Core Features • 📚 Docs • 📅 Schedule a Call


🎯 What is Handit?

Handit evaluates every agent decision, auto-generates better prompts and datasets, A/B-tests the fix, and lets you control what goes live.

This isn't another wrapper.
This is the auto-improvement engine your agents have been missing.

Handit automatically improves AI agents through evaluation and optimization

🧱 The Auto-Improvement Philosophy

  • πŸ‘οΈ Complete Observability: Track every step of your AI agents with comprehensive tracing. Monitor LLM calls, tool usage, execution timelines, and performance metrics in real-time across your entire agent workflow.
  • πŸ” Evaluate Everything: Automatically assess every agent decisionβ€”inputs, outputs, tool calls, and adherence to instructionsβ€”across every node. Detect issues, hallucinations, and performance gaps in real-time.
  • πŸ€– Auto-Generate Improvements: Automatically get better prompts, based on detected issues. Let AI improve your AI with targeted fixes for specific failure patterns.
  • πŸ§ͺ A/B Test Automatically: Test improvements against your current setup with intelligent A/B testing. Compare performance, measure impact, and validate fixes before they go live.
  • πŸŽ›οΈ Control What Goes Live: You decide what improvements to deploy. Review auto-generated fixes, approve changes, and roll back if needed. Full control over your agent's evolution.
  • ✍️ Version Everything: Track every change, improvement, and rollback. Complete version control for prompts, datasets, and configurationsβ€”by node, model, or project.

🚧 The Problem

AI agents in production suffer from performance degradation, undetected failures, and manual optimization overhead. Teams struggle with fragmented monitoring, manual prompt engineering, and a lack of self-improvement processes.

When you implement AI in critical business processes, you quickly realize you need to understand what's happening with your AI and ensure it improves continuously. But here's the challenge:

Most teams don't see the problem until it's too late. They're still early in their AI adoption curve, and by the time performance issues hit, they've already:

  • βœ‹ Lost 6+ months troubleshooting mysterious failures
  • πŸ”₯ Shut down AI projects due to unpredictable behavior
  • πŸ“‰ Experienced performance degradation without knowing why

Instead of waiting for problems to surface, teams need a system that continuously monitors, evaluates, and optimizes their AIβ€”before issues become critical.


βœ… The Auto-Improvement System

Handit brings your entire AI improvement pipeline into a single platform: monitoring, evaluation, optimization, and prompt management.

| Before | After (Handit) |
| --- | --- |
| Manual performance monitoring | Automatic evaluation & insights |
| Fragmented optimization tools | End-to-end improvement pipeline |
| Manual prompt engineering | Auto-generated optimizations |
| No systematic A/B testing | Intelligent A/B testing |
| Performance degradation | Continuous Self-Improving AI |

πŸ”§ Supported Features

| Feature | Status | Use Case |
| --- | --- | --- |
| Real-Time Monitoring | ✅ Available | Track agent performance and detect issues |
| Evaluation Hub | ✅ Available | Run evaluations |
| Prompt Management | ✅ Available | Version control and A/B test prompts |
| Self-Improving AI | ✅ Available | AI-generated improvements |
| Token Usage Analytics | 🔄 Coming Soon | Monitor token consumption patterns |
| Complete Cost Tracing | 🔄 Coming Soon | Monitor all costs of your LLMs |

πŸ€” How it Works

Handit's architecture follows a proven 3-phase approach that transforms any AI system into a self-improving platform:

Phase 1: Complete Observability

  • Comprehensive Tracing: Track every LLM call, tool usage, and execution step with detailed timing and error tracking
  • Real-Time Monitoring: Instantly visualize all agent operations, inputs, outputs, and performance metrics

Phase 2: Continuous Evaluation

  • LLM-as-Judge: Automated quality assessment using custom evaluators across multiple dimensions (accuracy, completeness, coherence)
  • Quality Insights: Real-time evaluation scores and trends to identify improvement opportunities

Phase 3: Self-Improving AI

  • Automatic Optimization: AI generates better prompts based on evaluation data and performance patterns
  • Background A/B Testing: Improvements are tested automatically with statistical confidence before deployment
  • Release Hub: Review, compare, and deploy optimized prompts with full control and easy rollback

With this in place, you no longer need to manually monitor, debug, and optimize your AI agents.

⚑ Core Features

The following features are deeply integrated to help you build continuously improving AI systems:

πŸ”‘ Real-Time Monitoring

Continuously ingest logs from every model, prompt, and agent in your stack. Instantly visualize performance trends, detect anomalies, and set custom alerts for drift or failuresβ€”live.

Ready to monitor your AI? β†’ Observability Dashboard

πŸ“£ Evaluation Hub

Run evaluation pipelines on production traffic with custom LLM-as-Judge prompts, business KPI thresholds (accuracy, latency, etc.), and get automated quality scores in real time. Results feed directly into your optimization workflowsβ€”no manual grading required.

Ready to evaluate your AI? β†’ Evaluation Hub

πŸͺ Prompt Management & Optimization

Version control your prompts while Handit automatically improves your AI through self-improving optimization. Generate better prompts, test them with intelligent A/B testing, and deploy proven improvementsβ€”all with full control over what goes live.

Ready to optimize your AI Agents? β†’ Prompt Versions
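
Once an optimized prompt has been promoted in the Release Hub, your application pulls it with a single SDK call (covered in full in Phase 3 of the quickstart below):

# Python (same call shown in Phase 3 of the quickstart)
from handit import HanditTracker

tracker = HanditTracker(api_key="your-api-key")
optimized_prompt = tracker.fetch_optimized_prompt(model_id="response-generator")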

πŸ“Š Auto-Generated Insights

Handit automatically analyzes performance patterns and generates actionable insights for improvement.

πŸ“‘ Version Control: Complete Traceability

Track every change, improvement, and rollback. Complete version control for prompts, datasets, and configurationsβ€”by node, model, or project.

πŸ‘οΈ End-to-End Observability with Traces

Every execution generates a full trace, capturing step timelines, model interactions, tool calls, and performance metrics. Visualize everything in the dashboard and debug faster.


πŸš€ Complete Handit.ai Quickstart

The Open Source Engine that Auto-Improves Your AI.
Handit evaluates every agent decision, auto-generates better prompts and datasets, A/B-tests the fix, and lets you control what goes live.

What you'll build: A fully observable, continuously evaluated, and automatically optimizing AI system that improves itself based on real production data.

Overview: The Complete Journey

Here's what we'll accomplish in three phases:

  1. Phase 1: AI Observability ⏱️ 5 minutes - Set up comprehensive tracing to see inside your AI agents
  2. Phase 2: Quality Evaluation ⏱️ 10 minutes - Add automated evaluation to continuously assess performance
  3. Phase 3: Self-Improving AI ⏱️ 15 minutes - Enable automatic optimization with proven improvements

The Result: Complete visibility into performance with automated optimization recommendations based on real production data.

Prerequisites

  • A Handit.ai account and integration token (see Step 2 below)
  • Python or Node.js, depending on which SDK you use
  • An existing agent or LLM workflow to instrument


Phase 1: AI Observability

Let's add comprehensive tracing to see exactly what your AI is doing.


Step 1: Install the SDK

# Python
pip install -U "handit-sdk>=1.16.0"

# JavaScript
npm install @handit.ai/node
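
A quick way to confirm the Python install worked is to import the tracker class used throughout this quickstart (the JavaScript package exposes config, startTracing, trackNode, and endTracing):

# Sanity check: the SDK should import and initialize without errors
from handit import HanditTracker

tracker = HanditTracker()
print("handit SDK imported:", tracker is not None)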

Step 2: Get Your Integration Token

  1. Log into your Handit.ai Dashboard
  2. Go to Settings β†’ Integrations
  3. Copy your integration token

Handit token

Step 3: Add Basic Tracing

Set up your main agent function with four key components:

  1. Initialize Handit.ai service
  2. Set up start tracing
  3. Track LLM calls and tools in your workflow
  4. Set up end tracing
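
Before the full examples, here is the minimal shape of those four steps in Python. It is only a skeleton; the parameter names follow the complete example below:

from handit_service import tracker  # singleton tracker, configured in handit_service.py below

# 1–2. Initialize the service and start a trace for this agent run
tracing = tracker.start_tracing(agent_name="customer_service_agent")
execution_id = tracing.get("executionId")

# 3. Track each LLM call and tool call inside your workflow
tracker.track_node(
    input={"systemPrompt": "...", "userPrompt": "..."},
    output="...",
    node_name="response_generator",
    agent_name="customer_service_agent",
    node_type="llm",
    execution_id=execution_id,
)

# 4. End the trace when the run finishes
tracker.end_tracing(execution_id=execution_id, agent_name="customer_service_agent")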

Python Example

Create a handit_service.py file:

"""
Handit.ai service initialization and configuration.
"""
import os
from dotenv import load_dotenv
from handit import HanditTracker

load_dotenv()

# Create a singleton tracker instance
tracker = HanditTracker()
tracker.config(api_key=os.getenv("HANDIT_API_KEY"))

Main agent with comprehensive tracing:

"""
Customer service agent with Handit.ai tracing.
"""
from handit_service import tracker
from langchain.chat_models import ChatOpenAI

class CustomerServiceAgent:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")

    async def generate_response(self, user_message: str, context: dict, execution_id: str) -> str:
        """Generate response using LLM with context."""
        context_text = "\n".join([doc["content"] for doc in context["similar_documents"]])
        system_prompt = f"You are a helpful customer service agent. Use the provided context.\n\nContext: {context_text}"
        
        # Chat models expect message objects rather than plain strings
        response = await self.llm.agenerate(
            [[SystemMessage(content=system_prompt), HumanMessage(content=user_message)]]
        )
        generated_text = response.generations[0][0].text
        
        # Track the LLM call
        tracker.track_node(
            input={
                "systemPrompt": system_prompt,
                "userPrompt": user_message,
                "extraDetails": {
                    "model": "gpt-4",
                    "temperature": 0.7,
                    "context_documents": len(context["similar_documents"])
                }
            },
            output=generated_text,
            node_name="response_generator",
            agent_name="customer_service_agent",
            node_type="llm",
            execution_id=execution_id
        )
        
        return generated_text

    async def get_context_from_vector_db(self, query: str, execution_id: str) -> dict:
        """Tool function to extract context from vector database."""
        # Simulate semantic search results
        results = {
            "query": query,
            "similar_documents": [
                {
                    "content": "Our AI platform offers automated evaluation, optimization, and real-time monitoring.",
                    "similarity_score": 0.94,
                    "document_id": "features_001"
                }
            ]
        }
        
        # Track the tool usage
        tracker.track_node(
            input={
                "toolName": "get_context_from_vector_db",
                "parameters": {"query": query, "top_k": 2},
                "extraDetails": {"vector_db": "chroma", "collection": "company_knowledge"}
            },
            output=results,
            node_name="vector_context_retriever",
            agent_name="customer_service_agent",
            node_type="tool",
            execution_id=execution_id
        )
        
        return results

    async def process_customer_request(self, user_message: str, execution_id: str) -> dict:
        """Process customer request with full tracing."""
        context = await self.get_context_from_vector_db(user_message, execution_id)
        response = await self.generate_response(user_message, context, execution_id)
        return {"response": response}

# Usage example
async def main():
    agent = CustomerServiceAgent()
    
    # Start tracing
    tracing_response = tracker.start_tracing(agent_name="customer_service_agent")
    execution_id = tracing_response.get("executionId")
    
    try:
        result = await agent.process_customer_request(
            "What AI features does your platform offer?", execution_id
        )
        print(f"Response: {result['response']}")
    finally:
        # End tracing
        tracker.end_tracing(execution_id=execution_id, agent_name="customer_service_agent")

if __name__ == "__main__":
    asyncio.run(main())

JavaScript Example

/**
 * Customer service agent with Handit.ai tracing.
 */
import { config, startTracing, trackNode, endTracing } from '@handit.ai/node';
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { HumanMessage, SystemMessage } from 'langchain/schema';

// Configure Handit.ai
config({ apiKey: process.env.HANDIT_API_KEY });

class CustomerServiceAgent {
    constructor() {
        this.llm = new ChatOpenAI({ modelName: 'gpt-4' });
    }

    async generateResponse(userMessage, context, executionId) {
        const contextText = context.similarDocuments.map(doc => doc.content).join('\n');
        const systemPrompt = `You are a helpful customer service agent. Use the provided context.\n\nContext: ${contextText}`;
        
        // Chat models expect message objects rather than plain strings
        const response = await this.llm.generate([[
            new SystemMessage(systemPrompt),
            new HumanMessage(userMessage)
        ]]);
        const generatedText = response.generations[0][0].text;
        
        // Track the LLM call
        await trackNode({
            input: {
                systemPrompt,
                userPrompt: userMessage,
                extraDetails: {
                    model: "gpt-4",
                    temperature: 0.7,
                    context_documents: context.similarDocuments.length
                }
            },
            output: generatedText,
            nodeName: 'response_generator',
            agentName: 'customer_service_agent',
            nodeType: 'llm',
            executionId
        });
        
        return generatedText;
    }

    async getContextFromVectorDb(query, executionId) {
        // Simulate semantic search
        const results = {
            query,
            similarDocuments: [{
                content: "Our AI platform offers automated evaluation, optimization, and real-time monitoring.",
                similarityScore: 0.94,
                documentId: "features_001"
            }]
        };
        
        // Track the tool usage
        await trackNode({
            input: {
                toolName: "get_context_from_vector_db",
                parameters: { query, top_k: 2 },
                extraDetails: { vector_db: "chroma", collection: "company_knowledge" }
            },
            output: results,
            nodeName: 'vector_context_retriever',
            agentName: 'customer_service_agent',
            nodeType: 'tool',
            executionId
        });
        
        return results;
    }

    async processCustomerRequest(userMessage, executionId) {
        const context = await this.getContextFromVectorDb(userMessage, executionId);
        const response = await this.generateResponse(userMessage, context, executionId);
        return { response };
    }
}

// Usage example
async function main() {
    const agent = new CustomerServiceAgent();
    
    // Start tracing
    const tracingResponse = await startTracing({ agentName: 'customer_service_agent' });
    const executionId = tracingResponse.executionId;
    
    try {
        const result = await agent.processCustomerRequest(
            "What AI features does your platform offer?", executionId
        );
        console.log('Response:', result.response);
    } finally {
        // End tracing
        await endTracing({ executionId, agentName: 'customer_service_agent' });
    }
}

main().catch(console.error);

πŸŽ‰ Phase 1 Complete! You now have full observability with every operation, timing, input, output, and error visible in your dashboard.


Phase 2: Quality Evaluation

Add automated evaluation to continuously assess quality across multiple dimensions.

Step 1: Connect Evaluation Models

  1. Go to Settings β†’ Model Tokens
  2. Add your OpenAI or other model credentials
  3. These models will act as "judges" to evaluate responses

Evaluations

Step 2: Create Focused Evaluators

Create separate evaluators for each quality aspect in Evaluation β†’ Evaluation Suite:

Example Evaluator 1: Response Completeness

You are evaluating whether an AI response completely addresses the user's question.

Focus ONLY on completeness - ignore other quality aspects.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Missing major parts of the question
3-4 = Addresses some parts but incomplete
5-6 = Addresses most parts adequately  
7-8 = Addresses all parts well
9-10 = Thoroughly addresses every aspect

Output format:
Score: [1-10]
Reasoning: [Brief explanation]

Example Evaluator 2: Accuracy Check

You are checking if an AI response contains accurate information.

Focus ONLY on factual accuracy - ignore other aspects.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Contains obvious false information
3-4 = Contains questionable claims
5-6 = Mostly accurate with minor concerns
7-8 = Accurate information
9-10 = Completely accurate and verifiable

Output format:
Score: [1-10]
Reasoning: [Brief explanation]
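
Handit runs these judges for you once they are attached to a node, but if it helps to see the mechanics, here is a standalone sketch of the LLM-as-Judge pattern the evaluators above follow. The OpenAI client, the judge model name, and the regex that parses the score are illustrative choices, not Handit APIs:

"""
Illustrative LLM-as-Judge runner: fills the {input}/{output} placeholders of an
evaluator prompt, calls a judge model, and parses the "Score: N" line.
"""
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_judge(evaluator_prompt: str, user_input: str, ai_output: str) -> dict:
    prompt = evaluator_prompt.format(input=user_input, output=ai_output)
    result = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
    )
    text = result.choices[0].message.content
    match = re.search(r"Score:\s*(\d+)", text)
    return {"score": int(match.group(1)) if match else None, "reasoning": text}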

Handit automatically improves AI agents through evaluation and optimization

Step 3: Associate Evaluators to Your LLM Nodes

  1. Go to Agent Performance
  2. Select your LLM node (e.g., "response-generator")
  3. Click Manage Evaluators
  4. Add your evaluators

Handit automatically improves AI agents through evaluation and optimization

Step 4: Monitor Results

View real-time evaluation results in:

  • Tracing tab: Individual evaluation scores
  • Agent Performance: Quality trends over time

Handit automatically improves AI agents through evaluation and optimization

πŸŽ‰ Phase 2 Complete! Continuous evaluation is now running across multiple quality dimensions with real-time insights.


Phase 3: Self-Improving AI

Enable automatic optimization that generates better prompts and provides proven improvements.

Step 1: Connect Optimization Models

  1. Go to Settings β†’ Model Tokens
  2. Select optimization model tokens
  3. Self-improving AI automatically activates once configured

Handit automatically improves AI agents through evaluation and optimization

Automatic Activation: Once optimization tokens are configured, the system automatically begins analyzing evaluation data and generating optimizations.

Step 2: Monitor Optimization Results

The system is now automatically generating and testing improved prompts. Monitor results in two places:

Agent Performance Dashboard:

  • View agent performance metrics
  • Compare current vs optimized versions
  • See improvement percentages

Handit automatically improves AI agents through evaluation and optimization

Release Hub:

  • Go to Optimization β†’ Release Hub
  • View detailed prompt comparisons
  • See statistical confidence and recommendations

Handit automatically improves AI agents through evaluation and optimization

Step 3: Deploy Optimizations

  1. Review Recommendations in Release Hub
  2. Compare Performance between current and optimized prompts
  3. Mark as Production for prompts you want to deploy
  4. Fetch via SDK in your application

Handit automatically improves AI agents through evaluation and optimization

Fetch Optimized Prompts:

# Python
from handit import HanditTracker

tracker = HanditTracker(api_key="your-api-key")
optimized_prompt = tracker.fetch_optimized_prompt(model_id="response-generator")

# Use in your LLM calls
response = your_llm_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": optimized_prompt},
        {"role": "user", "content": user_query}
    ]
)

// JavaScript
import { HanditClient } from '@handit/sdk';

const handit = new HanditClient({ apiKey: 'your-api-key' });
const optimizedPrompt = await handit.fetchOptimizedPrompt({ modelId: 'response-generator' });

// Use in your LLM calls
const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
        { role: 'system', content: optimizedPrompt },
        { role: 'user', content: userQuery }
    ]
});
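
In production you will usually want a fallback in case the fetch fails (for example, before any optimization has been promoted). A minimal sketch reusing the Python tracker from above; the baseline prompt constant is illustrative:

# Python
DEFAULT_SYSTEM_PROMPT = "You are a helpful customer service agent. Use the provided context."

def get_system_prompt() -> str:
    """Prefer the optimized prompt from Handit; fall back to the baseline on any failure."""
    try:
        return tracker.fetch_optimized_prompt(model_id="response-generator")
    except Exception:
        return DEFAULT_SYSTEM_PROMPT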

πŸŽ‰ Phase 3 Complete! You now have a self-improving AI that automatically detects quality issues, generates better prompts, tests them, and provides proven improvements.


What You've Accomplished

πŸŽ‰ Congratulations! You now have a complete AI observability and optimization system:

  • βœ… Full Observability: Complete visibility into operations with real-time monitoring
  • βœ… Continuous Evaluation: Automated quality assessment across multiple dimensions
  • βœ… Self-Improving AI: Automatic optimization with AI-generated improvements and A/B testing

Troubleshooting

Tracing Not Working?

  • Verify your API key is correct and set as an environment variable (see the quick check below)
  • Ensure you're using the correct function parameters
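
A quick way to rule out a missing key, assuming you load it the same way as handit_service.py in the quickstart:

import os
from dotenv import load_dotenv

load_dotenv()
print("HANDIT_API_KEY set:", bool(os.getenv("HANDIT_API_KEY")))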

Evaluations Not Running?

  • Confirm model tokens are valid and have sufficient credits
  • Verify LLM nodes are receiving traffic

Optimizations Not Generating?

  • Ensure evaluation data shows quality issues (scores below threshold)
  • Verify optimization model tokens are configured

Need Help? Join our Discord community or check GitHub Issues


🌐 Language Support

Write your AI agents in your preferred language:

| Language | Status | SDK Package |
| --- | --- | --- |
| Python | ✅ Stable | handit-sdk>=1.16.0 |
| JavaScript | ✅ Stable | @handit.ai/node |
| TypeScript | ✅ Stable | @handit.ai/node |
| Go | ✅ Available | HTTP API integration |
| Any Stack/Framework | ✅ Available | HTTP API integration (n8n, Zapier, etc.) |
| Java, C#, Ruby, PHP | ✅ Available | REST API integration |
| LangChain & LangGraph | ✅ Available | Python/JS SDK |
| LlamaIndex, AutoGen | ✅ Available | Python/JS SDK + HTTP API |
| CrewAI, Swarm | ✅ Available | Python SDK + HTTP API |

πŸ’¬ Get Help

🤝 Contributing

We welcome contributions! Whether it's:

  • 🐛 Bug fixes and improvements
  • ✨ New features
  • 📚 Documentation and examples
  • 🌍 Language support additions
  • 🎨 Dashboard UI enhancements

🚀 Roadmap

We're building Handit in the open, and we'd love for you to be a part of the journey.

| Week | Focus | Status |
| --- | --- | --- |
| 1 | Backend foundation + infrastructure | ✔️ Done |
| 2 | Prompt versioning | ✔️ Done |
| 3 | Auto-evaluation + insight generation | ✔️ Done |
| 4 | Deployment setup + UI + public release | ✔️ Done |

πŸ‘₯ Contributors

Thanks to everyone helping bring Handit to life:

Want to appear here? Star the repo, follow along, and make your first PR πŸ™Œ


🌟 Ready to auto-improve your AI?

🚀 Get Started Now • 📖 Read the Docs • 💬 Join Discord • 📅 Schedule a Call


Star History Chart

Built with ❀️ by the Handit team β€’ Star us if you find Handit useful! ⭐


πŸ“„ License

MIT Β© 2025 – Built with πŸ’‘ by the Handit community
