🔥 The Open Source Engine that Auto-Improves Your AI 🔥
Quick Start • Core Features • Docs • Schedule a Call
Handit evaluates every agent decision, auto-generates better prompts and datasets, A/B-tests the fix, and lets you control what goes live.
This isn't another wrapper.
This is the auto-improvement engine your agents have been missing.
- Complete Observability: Track every step of your AI agents with comprehensive tracing. Monitor LLM calls, tool usage, execution timelines, and performance metrics in real time across your entire agent workflow.
- Evaluate Everything: Automatically assess every agent decision (inputs, outputs, tool calls, and adherence to instructions) across every node. Detect issues, hallucinations, and performance gaps in real time.
- Auto-Generate Improvements: Automatically get better prompts based on detected issues. Let AI improve your AI with targeted fixes for specific failure patterns.
- A/B Test Automatically: Test improvements against your current setup with intelligent A/B testing. Compare performance, measure impact, and validate fixes before they go live.
- Control What Goes Live: You decide which improvements to deploy. Review auto-generated fixes, approve changes, and roll back if needed. Full control over your agent's evolution.
- Version Everything: Track every change, improvement, and rollback. Complete version control for prompts, datasets, and configurations, by node, model, or project.
AI agents in production suffer from performance degradation, undetected failures, and manual optimization overhead. Teams struggle with fragmented monitoring, manual prompt engineering, and a lack of self-improvement processes.
When you implement AI in critical business processes, you quickly realize you need to understand what's happening with your AI and ensure it improves continuously. But here's the challenge:
Most teams don't see the problem until it's too late. They're still early in their AI adoption curve, and by the time performance issues hit, they've already:
- Lost 6+ months troubleshooting mysterious failures
- Shut down AI projects due to unpredictable behavior
- Experienced performance degradation without knowing why
Instead of waiting for problems to surface, teams need a system that continuously monitors, evaluates, and optimizes their AI before issues become critical.
Handit unifies your entire AI improvement pipeline into a single platform: monitoring, evaluation, optimization, and prompt management.
| Before | After (Handit) |
|---|---|
| Manual performance monitoring | Automatic evaluation & insights |
| Fragmented optimization tools | End-to-end improvement pipeline |
| Manual prompt engineering | Auto-generated optimizations |
| No systematic A/B testing | Intelligent A/B testing |
| Performance degradation | Continuous Self-Improving AI |
| Feature | Status | Use Case |
|---|---|---|
| Real-Time Monitoring | ✅ Available | Track agent performance and detect issues |
| Evaluation Hub | ✅ Available | Run evaluations |
| Prompt Management | ✅ Available | Version control and A/B test prompts |
| Self-Improving AI | ✅ Available | AI-generated improvements |
| Token Usage Analytics | Coming Soon | Monitor token consumption patterns |
| Complete Cost Tracing | Coming Soon | Monitor all costs of your LLMs |
Handit's architecture follows a proven 3-phase approach that transforms any AI system into a self-improving platform:
Phase 1: Complete Observability
- Comprehensive Tracing: Track every LLM call, tool usage, and execution step with detailed timing and error tracking
- Real-Time Monitoring: Instantly visualize all agent operations, inputs, outputs, and performance metrics
Phase 2: Continuous Evaluation
- LLM-as-Judge: Automated quality assessment using custom evaluators across multiple dimensions (accuracy, completeness, coherence)
- Quality Insights: Real-time evaluation scores and trends to identify improvement opportunities
Phase 3: Self-Improving AI
- Automatic Optimization: AI generates better prompts based on evaluation data and performance patterns
- Background A/B Testing: Improvements are tested automatically with statistical confidence before deployment
- Release Hub: Review, compare, and deploy optimized prompts with full control and easy rollback
This model means you no longer need to manually monitor, debug, and optimize your AI agents.
The following features are deeply integrated to help you build continuously improving AI systems:
Continuously ingest logs from every model, prompt, and agent in your stack. Instantly visualize performance trends, detect anomalies, and set custom alerts for drift or failures, all in real time.
Ready to monitor your AI? → Observability Dashboard
Run evaluation pipelines on production traffic with custom LLM-as-Judge prompts and business KPI thresholds (accuracy, latency, etc.), and get automated quality scores in real time. Results feed directly into your optimization workflows; no manual grading required. A conceptual sketch of a single LLM-as-Judge check follows.
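To make the idea concrete, here is a minimal sketch of one LLM-as-Judge check, using the OpenAI Python client as the judge. The evaluator prompt, the judge model, and the 1-10 scoring format are illustrative assumptions, not Handit internals; the Evaluation Hub runs this kind of check for you on production traffic.

```python
# Illustrative LLM-as-Judge check (not Handit internals): ask a judge model
# to score one production sample for completeness on a 1-10 scale.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVALUATOR_PROMPT = """You are evaluating whether an AI response completely addresses the user's question.
User Question: {input}
AI Response: {output}
Reply with a single integer score from 1 (incomplete) to 10 (thorough)."""

def judge_completeness(user_input: str, model_output: str) -> int:
    """Score one input/output pair with a judge model."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # any model you have credits for can act as the judge
        messages=[{
            "role": "user",
            "content": EVALUATOR_PROMPT.format(input=user_input, output=model_output),
        }],
    )
    return int(result.choices[0].message.content.strip())
```

In Handit you define evaluators like this in the dashboard (see Phase 2 of the quickstart below) rather than wiring them up by hand.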
Ready to evaluate your AI? → Evaluation Hub
Version control your prompts while Handit improves them automatically. Generate better prompts, test them with intelligent A/B testing, and deploy proven improvements, all with full control over what goes live.
Ready to optimize your AI Agents? → Prompt Versions
Handit automatically analyzes performance patterns and generates actionable insights for improvement.
Track every change, improvement, and rollback. Complete version control for prompts, datasets, and configurations, by node, model, or project.
Every execution generates a full trace, capturing step timelines, model interactions, tool calls, and performance metrics. Visualize everything in the dashboard and debug faster.
The Open Source Engine that Auto-Improves Your AI.
Handit evaluates every agent decision, auto-generates better prompts and datasets, A/B-tests the fix, and lets you control what goes live.
What you'll build: A fully observable, continuously evaluated, and automatically optimizing AI system that improves itself based on real production data.
Here's what we'll accomplish in three phases:
- Phase 1: AI Observability (⏱️ 5 minutes): Set up comprehensive tracing to see inside your AI agents
- Phase 2: Quality Evaluation (⏱️ 10 minutes): Add automated evaluation to continuously assess performance
- Phase 3: Self-Improving AI (⏱️ 15 minutes): Enable automatic optimization with proven improvements
The Result: Complete visibility into performance with automated optimization recommendations based on real production data.
- A Handit.ai Account (sign up if needed)
- 15-30 minutes to complete the setup
Let's add comprehensive tracing to see exactly what your AI is doing.
```bash
# Python
pip install -U "handit-sdk>=1.16.0"

# JavaScript
npm install @handit.ai/node
```
- Log into your Handit.ai Dashboard
- Go to Settings → Integrations
- Copy your integration token and expose it to your code as the HANDIT_API_KEY environment variable (see the snippet below)
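The quickstart code below reads the token from the HANDIT_API_KEY environment variable via python-dotenv. A minimal sanity check, assuming you keep the token in a local .env file:

```python
# Quick sanity check that the integration token is visible to your code.
# Assumes a .env file in the project root containing a line like:
#   HANDIT_API_KEY=your-integration-token
import os

from dotenv import load_dotenv

load_dotenv()  # load .env into the process environment

if not os.getenv("HANDIT_API_KEY"):
    raise RuntimeError("HANDIT_API_KEY is not set; tracing calls will fail")
```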
Set up your main agent function with four key components:
- Initialize Handit.ai service
- Set up start tracing
- Track LLM calls and tools in your workflow
- Set up end tracing
Create a `handit_service.py` file:
"""
Handit.ai service initialization and configuration.
"""
import os
from dotenv import load_dotenv
from handit import HanditTracker
load_dotenv()
# Create a singleton tracker instance
tracker = HanditTracker()
tracker.config(api_key=os.getenv("HANDIT_API_KEY"))
Main agent with comprehensive tracing:
"""
Customer service agent with Handit.ai tracing.
"""
from handit_service import tracker
from langchain.chat_models import ChatOpenAI
class CustomerServiceAgent:
def __init__(self):
self.llm = ChatOpenAI(model="gpt-4")
async def generate_response(self, user_message: str, context: dict, execution_id: str) -> str:
"""Generate response using LLM with context."""
context_text = "\n".join([doc["content"] for doc in context["similar_documents"]])
system_prompt = f"You are a helpful customer service agent. Use the provided context.\n\nContext: {context_text}"
response = await self.llm.agenerate([system_prompt + "\n\nUser Question: " + user_message])
generated_text = response.generations[0][0].text
# Track the LLM call
tracker.track_node(
input={
"systemPrompt": system_prompt,
"userPrompt": user_message,
"extraDetails": {
"model": "gpt-4",
"temperature": 0.7,
"context_documents": len(context["similar_documents"])
}
},
output=generated_text,
node_name="response_generator",
agent_name="customer_service_agent",
node_type="llm",
execution_id=execution_id
)
return generated_text
async def get_context_from_vector_db(self, query: str, execution_id: str) -> dict:
"""Tool function to extract context from vector database."""
# Simulate semantic search results
results = {
"query": query,
"similar_documents": [
{
"content": "Our AI platform offers automated evaluation, optimization, and real-time monitoring.",
"similarity_score": 0.94,
"document_id": "features_001"
}
]
}
# Track the tool usage
tracker.track_node(
input={
"toolName": "get_context_from_vector_db",
"parameters": {"query": query, "top_k": 2},
"extraDetails": {"vector_db": "chroma", "collection": "company_knowledge"}
},
output=results,
node_name="vector_context_retriever",
agent_name="customer_service_agent",
node_type="tool",
execution_id=execution_id
)
return results
async def process_customer_request(self, user_message: str, execution_id: str) -> dict:
"""Process customer request with full tracing."""
context = await self.get_context_from_vector_db(user_message, execution_id)
response = await self.generate_response(user_message, context, execution_id)
return {"response": response}
# Usage example
async def main():
agent = CustomerServiceAgent()
# Start tracing
tracing_response = tracker.start_tracing(agent_name="customer_service_agent")
execution_id = tracing_response.get("executionId")
try:
result = await agent.process_customer_request(
"What AI features does your platform offer?", execution_id
)
print(f"Response: {result['response']}")
finally:
# End tracing
tracker.end_tracing(execution_id=execution_id, agent_name="customer_service_agent")
```javascript
/**
 * Customer service agent with Handit.ai tracing.
 */
import { config, startTracing, trackNode, endTracing } from '@handit.ai/node';
import { ChatOpenAI } from 'langchain/chat_models';

// Configure Handit.ai
config({ apiKey: process.env.HANDIT_API_KEY });

class CustomerServiceAgent {
  constructor() {
    this.llm = new ChatOpenAI({ model: 'gpt-4' });
  }

  async generateResponse(userMessage, context, executionId) {
    const contextText = context.similarDocuments.map(doc => doc.content).join('\n');
    const systemPrompt = `You are a helpful customer service agent. Use the provided context.\n\nContext: ${contextText}`;
    const response = await this.llm.generate([systemPrompt + `\n\nUser Question: ${userMessage}`]);
    const generatedText = response.generations[0][0].text;

    // Track the LLM call
    await trackNode({
      input: {
        systemPrompt,
        userPrompt: userMessage,
        extraDetails: {
          model: "gpt-4",
          temperature: 0.7,
          context_documents: context.similarDocuments.length
        }
      },
      output: generatedText,
      nodeName: 'response_generator',
      agentName: 'customer_service_agent',
      nodeType: 'llm',
      executionId
    });
    return generatedText;
  }

  async getContextFromVectorDb(query, executionId) {
    // Simulate semantic search
    const results = {
      query,
      similarDocuments: [{
        content: "Our AI platform offers automated evaluation, optimization, and real-time monitoring.",
        similarityScore: 0.94,
        documentId: "features_001"
      }]
    };

    // Track the tool usage
    await trackNode({
      input: {
        toolName: "get_context_from_vector_db",
        parameters: { query, top_k: 2 },
        extraDetails: { vector_db: "chroma", collection: "company_knowledge" }
      },
      output: results,
      nodeName: 'vector_context_retriever',
      agentName: 'customer_service_agent',
      nodeType: 'tool',
      executionId
    });
    return results;
  }

  async processCustomerRequest(userMessage, executionId) {
    const context = await this.getContextFromVectorDb(userMessage, executionId);
    const response = await this.generateResponse(userMessage, context, executionId);
    return { response };
  }
}

// Usage example
async function main() {
  const agent = new CustomerServiceAgent();

  // Start tracing
  const tracingResponse = await startTracing({ agentName: 'customer_service_agent' });
  const executionId = tracingResponse.executionId;
  try {
    const result = await agent.processCustomerRequest(
      "What AI features does your platform offer?", executionId
    );
    console.log('Response:', result.response);
  } finally {
    // End tracing
    await endTracing({ executionId, agentName: 'customer_service_agent' });
  }
}

// Run the example
main().catch(console.error);
```
Phase 1 Complete! You now have full observability with every operation, timing, input, output, and error visible in your dashboard.
Add automated evaluation to continuously assess quality across multiple dimensions.
- Go to Settings → Model Tokens
- Add your OpenAI or other model credentials
- These models will act as "judges" to evaluate responses
Create separate evaluators for each quality aspect in Evaluation → Evaluation Suite:
Example Evaluator 1: Response Completeness
```
You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Missing major parts of the question
3-4 = Addresses some parts but incomplete
5-6 = Addresses most parts adequately
7-8 = Addresses all parts well
9-10 = Thoroughly addresses every aspect

Output format:
Score: [1-10]
Reasoning: [Brief explanation]
```
Example Evaluator 2: Accuracy Check
```
You are checking if an AI response contains accurate information.
Focus ONLY on factual accuracy - ignore other aspects.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Contains obvious false information
3-4 = Contains questionable claims
5-6 = Mostly accurate with minor concerns
7-8 = Accurate information
9-10 = Completely accurate and verifiable

Output format:
Score: [1-10]
Reasoning: [Brief explanation]
```
- Go to Agent Performance
- Select your LLM node (e.g., "response-generator")
- Click Manage Evaluators
- Add your evaluators
View real-time evaluation results in:
- Tracing tab: Individual evaluation scores
- Agent Performance: Quality trends over time
Phase 2 Complete! Continuous evaluation is now running across multiple quality dimensions with real-time insights.
Enable automatic optimization that generates better prompts and provides proven improvements.
- Go to Settings → Model Tokens
- Select optimization model tokens
- Self-improving AI automatically activates once configured
Automatic Activation: Once optimization tokens are configured, the system automatically begins analyzing evaluation data and generating optimizations.
The system is now automatically generating and testing improved prompts. Monitor results in two places:
Agent Performance Dashboard:
- View agent performance metrics
- Compare current vs optimized versions
- See improvement percentages
Release Hub:
- Go to Optimization → Release Hub
- View detailed prompt comparisons
- See statistical confidence and recommendations
- Review Recommendations in Release Hub
- Compare Performance between current and optimized prompts
- Mark as Production for prompts you want to deploy
- Fetch via SDK in your application
Fetch Optimized Prompts:
```python
# Python
from handit import HanditTracker

tracker = HanditTracker(api_key="your-api-key")
optimized_prompt = tracker.fetch_optimized_prompt(model_id="response-generator")

# Use in your LLM calls
response = your_llm_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": optimized_prompt},
        {"role": "user", "content": user_query}
    ]
)
```
```javascript
// JavaScript
import { HanditClient } from '@handit/sdk';

const handit = new HanditClient({ apiKey: 'your-api-key' });
const optimizedPrompt = await handit.fetchOptimizedPrompt({ modelId: 'response-generator' });

// Use in your LLM calls
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    { role: 'system', content: optimizedPrompt },
    { role: 'user', content: userQuery }
  ]
});
```
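A practical note: if you fetch the optimized prompt at request time, consider a fallback so an unreachable endpoint never blocks your agent. A minimal sketch in Python, assuming you keep a known-good baseline prompt as a constant in your own code (DEFAULT_SYSTEM_PROMPT below is your own value, not something the SDK provides):

```python
# Defensive wrapper around fetch_optimized_prompt: serve a baseline prompt
# if the optimized one cannot be fetched. DEFAULT_SYSTEM_PROMPT is your own
# constant, not part of the SDK.
from handit import HanditTracker

DEFAULT_SYSTEM_PROMPT = "You are a helpful customer service agent."

tracker = HanditTracker(api_key="your-api-key")

def get_system_prompt(model_id: str = "response-generator") -> str:
    """Return the optimized prompt, or the baseline if the fetch fails."""
    try:
        return tracker.fetch_optimized_prompt(model_id=model_id)
    except Exception:
        # Network or configuration issue: keep serving traffic with the baseline
        return DEFAULT_SYSTEM_PROMPT
```

You can also cache the fetched prompt and refresh it periodically instead of calling Handit on every request.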
Phase 3 Complete! You now have a self-improving AI that automatically detects quality issues, generates better prompts, tests them, and provides proven improvements.
Congratulations! You now have a complete AI observability and optimization system:
- ✅ Full Observability: Complete visibility into operations with real-time monitoring
- ✅ Continuous Evaluation: Automated quality assessment across multiple dimensions
- ✅ Self-Improving AI: Automatic optimization with AI-generated improvements and A/B testing
Tracing Not Working?
- Verify your API key is correct and set as environment variable
- Ensure you're using the correct function parameters
Evaluations Not Running?
- Confirm model tokens are valid and have sufficient credits
- Verify LLM nodes are receiving traffic
Optimizations Not Generating?
- Ensure evaluation data shows quality issues (scores below threshold)
- Verify optimization model tokens are configured
Need Help? Join our Discord community or check GitHub Issues
Write your AI agents in your preferred language:
| Language / Framework | Status | SDK / Integration |
|---|---|---|
| Python | ✅ Stable | handit-sdk>=1.16.0 |
| JavaScript | ✅ Stable | @handit.ai/node |
| TypeScript | ✅ Stable | @handit.ai/node |
| Go | ✅ Available | HTTP API integration |
| Any Stack/Framework | ✅ Available | HTTP API integration (n8n, Zapier, etc.) |
| Java, C#, Ruby, PHP | ✅ Available | REST API integration |
| LangChain & LangGraph | ✅ Available | Python/JS SDK |
| LlamaIndex, AutoGen | ✅ Available | Python/JS SDK + HTTP API |
| CrewAI, Swarm | ✅ Available | Python SDK + HTTP API |
- Questions: Use our Discord community
- Bug Reports: GitHub Issues
- Documentation: Official Docs
- Schedule a Call: Book a Demo
We're building Handit in the open, and we'd love for you to be a part of the journey.
| Week | Focus | Status |
|---|---|---|
| 1 | Backend foundation + infrastructure | ✔️ Done |
| 2 | Prompt versioning | ✔️ Done |
| 3 | Auto-evaluation + insight generation | ✔️ Done |
| 4 | Deployment setup + UI + public release | ✔️ Done |
We welcome contributions! Whether it's:
- Bug fixes and improvements
- New features
- Documentation and examples
- Language support additions
- Dashboard UI enhancements
Thanks to everyone helping bring Handit to life:
Want to appear here? Star the repo, follow along, and make your first PR.
Ready to auto-improve your AI?
Get Started Now • Read the Docs • Join Discord • Schedule a Call
MIT © 2025. Built with 💡 by the Handit community