# EnvTorch: Agentic Execution Environments

A unified framework for CodeAct environments that supports both agent execution and RL training, built on Gym/Gymnasium APIs with PyTorch/HuggingFace integration patterns.

## Overview

EnvTorch provides a standard for agentic execution environments following the CodeAct paradigm, where actions are arbitrary Python code that can chain multiple tool calls. The framework bridges traditional RL environments with modern agent capabilities.

### Key Features

- **CodeAct Execution**: Actions are Python code strings executed in persistent contexts
- **State Persistence**: Variables and functions persist across steps within episodes
- **Tool Integration**: MCP (Model Context Protocol) support for external capabilities
- **RL Compatibility**: Transform system for reward computation and training
- **Error Handling**: Exceptions become observations for agent learning (see the sketch after this list)
- **Clean APIs**: Minimal, opinionated design following KISS principles
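
The error-handling behavior described above can be sketched briefly. The `success`, `stdout`, and `return_value` fields appear in the examples later in this README; the `error` field holding the captured traceback is an assumption about `ExecutionResult`:

```python
from src import create_codeact_env, CodeAction

env = create_codeact_env()
obs = env.reset()

# Faulty code: the exception becomes part of the observation instead of raising
obs = env.step(CodeAction(code="1 / 0"))
print(obs.execution_result.success)  # False
print(obs.execution_result.error)    # assumed field: captured traceback text

# The persistent context survives, so the agent can retry within the same episode
obs = env.step(CodeAction(code="42"))
print(obs.execution_result.return_value)  # 42
```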
| 17 | + |
| 18 | +## Quick Start |
| 19 | + |
| 20 | +```python |
| 21 | +from src import create_codeact_env, CodeAction |
| 22 | + |
| 23 | +# Create environment |
| 24 | +env = create_codeact_env() |
| 25 | +obs = env.reset() |
| 26 | + |
| 27 | +# Execute Python code |
| 28 | +action = CodeAction(code=""" |
| 29 | +x = 10 |
| 30 | +y = 20 |
| 31 | +result = x * y |
| 32 | +print(f"Result: {result}") |
| 33 | +result # Return value |
| 34 | +""") |
| 35 | + |
| 36 | +obs = env.step(action) |
| 37 | +print(f"Output: {obs.execution_result.stdout}") |
| 38 | +print(f"Return: {obs.execution_result.return_value}") |
| 39 | +``` |
| 40 | + |
| 41 | +## Core Components |
| 42 | + |
| 43 | +### Actions and Observations |
| 44 | + |
| 45 | +```python |
| 46 | +# Actions contain arbitrary Python code |
| 47 | +action = CodeAction(code="math.sqrt(16)") |
| 48 | + |
| 49 | +# Observations include execution results |
| 50 | +obs = env.step(action) |
| 51 | +print(obs.execution_result.return_value) # 4.0 |
| 52 | +print(obs.execution_result.success) # True |
| 53 | +print(obs.execution_result.stdout) # Any print output |
| 54 | +``` |
| 55 | + |
| 56 | +### Tool Integration |
| 57 | + |
| 58 | +```python |
| 59 | +from src import create_mcp_environment |
| 60 | + |
| 61 | +# Environment with MCP tools |
| 62 | +env = create_mcp_environment() |
| 63 | +obs = env.reset() |
| 64 | + |
| 65 | +# Tools available as Python objects |
| 66 | +action = CodeAction(code=""" |
| 67 | +content = "Hello, world!" |
| 68 | +file_write("/tmp/hello.txt", content) |
| 69 | +result = file_read("/tmp/hello.txt") |
| 70 | +print(f"File contents: {result}") |
| 71 | +""") |
| 72 | + |
| 73 | +obs = env.step(action) |
| 74 | +``` |
| 75 | + |
| 76 | +### RL Training with Transforms |
| 77 | + |
| 78 | +```python |
| 79 | +from src import create_math_env_transform |
| 80 | + |
| 81 | +# Environment that rewards correct math solutions |
| 82 | +transform = create_math_env_transform(expected_answer=42) |
| 83 | +env = create_codeact_env() |
| 84 | +env.transform = transform |
| 85 | + |
| 86 | +# Agent gets rewarded for correct answers |
| 87 | +action = CodeAction(code="21 * 2") # Correct answer |
| 88 | +obs = env.step(action) |
| 89 | +print(obs.reward) # 1.0 (success) + quality bonuses |
| 90 | +``` |
| 91 | + |
| 92 | +## Architecture |
| 93 | + |
| 94 | +### Type System |
| 95 | +- `Action` / `CodeAction`: Base and concrete action types |
| 96 | +- `Observation` / `CodeObservation`: Base and concrete observation types |
| 97 | +- `State` / `CodeState`: Environment state with execution context |
| 98 | +- `ExecutionResult`: Detailed code execution results |
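
As a rough sketch of how these types relate (assuming the base types are exported from `src` alongside the factory functions):

```python
from src import create_codeact_env, Action, CodeAction

env = create_codeact_env()
env.reset()

action = CodeAction(code="2 ** 8")
obs = env.step(action)

# Concrete types specialize the base types; observations carry an ExecutionResult
assert isinstance(action, Action)
result = obs.execution_result
print(result.success)       # True
print(result.return_value)  # 256
```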

### Core Classes
- `Environment`: Base class following Gym API
- `CodeActEnvironment`: Main environment for code execution
- `Transform`: Base class for observation modification
- `ToolRegistry`: Manages available tools and functions
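
Custom rewards come from subclassing `Transform`. The hook name is not documented here, so the sketch below assumes a callable transform that receives the observation and returns it with a reward attached:

```python
from src import create_codeact_env, CodeAction, Transform

class SuccessRewardTransform(Transform):
    """Hypothetical transform: 1.0 for successful execution, 0.0 otherwise."""

    def __call__(self, observation):
        observation.reward = 1.0 if observation.execution_result.success else 0.0
        return observation

env = create_codeact_env()
env.transform = SuccessRewardTransform()
env.reset()

obs = env.step(CodeAction(code="print('ok')"))
print(obs.reward)  # 1.0
```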

### Transform Examples
- `CodeSafetyTransform`: Penalizes unsafe code patterns
- `MathProblemTransform`: Rewards correct numerical answers
- `CodeQualityTransform`: Evaluates code quality metrics
- `CompositeTransform`: Combines multiple transforms
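
These can be stacked. A sketch of combining them, assuming `CompositeTransform` accepts a list of transforms and that the classes are exported from `src` with the constructor arguments shown (check `src/transforms.py` for the real signatures):

```python
from src import (
    create_codeact_env,
    CodeAction,
    CodeSafetyTransform,
    MathProblemTransform,
    CompositeTransform,
)

transform = CompositeTransform([
    CodeSafetyTransform(),
    MathProblemTransform(expected_answer=42),
])

env = create_codeact_env()
env.transform = transform
env.reset()

obs = env.step(CodeAction(code="6 * 7"))
print(obs.reward)  # combined safety and correctness reward
```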

## File Structure

```
src/
├── types.py          # Core type definitions
├── interfaces.py     # Abstract base classes
├── environment.py    # Main CodeAct environment
├── transforms.py     # Transform implementations
├── mcp.py            # MCP integration
└── __init__.py       # Clean exports
```

## Usage Patterns

### Agent Exploration
```python
env = create_codeact_env()
obs = env.reset()

# Multi-step problem solving
action1 = CodeAction(code="data = [1, 2, 3, 4, 5]")
obs = env.step(action1)

action2 = CodeAction(code="mean = sum(data) / len(data); mean")
obs = env.step(action2)  # Uses persistent data from step 1
```

### RL Training Loop
```python
from src import create_codeact_env, create_safe_env_transform

# Create environment with reward function
transform = create_safe_env_transform()
env = create_codeact_env()
env.transform = transform

for episode in range(100):
    obs = env.reset()
    action = generate_action()  # From your policy
    obs = env.step(action)

    reward = obs.reward  # Computed by transforms
    # Update policy based on reward
```

### Hybrid Agent + RL
```python
# Phase 1: Agent exploration
env = create_codeact_env()
# Agent explores different solution approaches

# Phase 2: RL optimization
env.transform = optimization_transform
# Train to optimize based on exploration insights
```

## Design Principles

- **KISS Approach**: Minimal, opinionated design
- **Single Way**: One clear way to accomplish tasks
- **Pythonic**: Follows PyTorch/HuggingFace patterns
- **No Inline Comments**: Code should be self-explanatory
- **Functional Composition**: Private functions explain complex logic

## Testing

Run the test suite:
```bash
python test_unified.py
```

Run examples:
```bash
python example.py
```

## Requirements

See `requirements.txt` for dependencies. Core requirements:
- Python 3.9+
- PyTorch 2.0+
- HuggingFace datasets

## License

BSD 3-Clause License (see LICENSE file)