Conversation

@daphne-cornelisse commented Oct 2, 2025

Goal

Given a dataset of human "expert" trajectories $(o_t, a_t)$, we aim to compute the likelihood of the expert’s actions under the current policy and use this as an auxiliary learning signal.
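
In loss form, the auxiliary signal enters the overall objective roughly as follows (a sketch; the clipping applied via the human_clip_coef parameter is described under the training loop and omitted here):

$\mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta) - \lambda_{\text{human}} \cdot \frac{1}{N} \sum_{t=1}^{N} \log \pi^{\text{agent}}(a^{\text{expert}}_t \mid o^{\text{expert}}_t)$

where $\lambda_{\text{human}}$ corresponds to the human_ll_coef configuration parameter and $N$ is the number of sampled expert transitions.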

Implementation

Dataset

Use an inverse bicycle model to infer the continuous human actions (acceleration and steering) taken at every step.
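
A rough sketch of what this inversion can look like for a kinematic bicycle model, assuming logged headings and velocities at a fixed timestep (the function name, dt, and wheelbase values are illustrative; the actual infer_human_actions() implementation may use a different formulation):

import numpy as np

def infer_human_actions_sketch(yaw, vel, dt=0.1, wheelbase=2.8):
    """Illustrative inverse kinematic bicycle model (not the exact PR code).

    yaw: (T,) logged headings in radians, vel: (T, 2) logged velocities.
    Returns per-step (accel, steering) arrays of length T - 1.
    """
    speed = np.linalg.norm(vel, axis=-1)              # (T,)
    accel = (speed[1:] - speed[:-1]) / dt             # finite-difference acceleration

    # Per-step heading change, wrapped to (-pi, pi]
    dyaw = np.arctan2(np.sin(yaw[1:] - yaw[:-1]), np.cos(yaw[1:] - yaw[:-1]))
    yaw_rate = dyaw / dt

    # Kinematic bicycle: yaw_rate = (v / L) * tan(steering)  =>  invert for steering
    steering = np.arctan(yaw_rate * wheelbase / np.maximum(speed[:-1], 1e-6))
    return accel, steering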

At initialization

The method _save_expert_data() saves max_expert_samples expert sequences per env to disk. Actions and observations are stacked along bptt_horizon so that the resulting tensors have shape (max_expert_samples, bptt_horizon, action/obs_dim).

Under the hood, this method calls c_collect_expert_data() in drive.h, which steps the agents with their inferred continuous human actions and collects the corresponding observations. Discretized actions are also returned when needed.
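
An illustrative sketch of the stacking-and-saving step, assuming the C-side collector returns flat per-step tensors (the windowing scheme and output directory are assumptions; the file names follow the sequence diagram later on this page):

import torch

def save_expert_windows(continuous_actions, discrete_actions, observations,
                        bptt_horizon, max_expert_samples, out_dir="."):
    """Sketch only: slice flat (T, dim) expert data into bptt_horizon-length
    windows and save up to max_expert_samples of them per environment."""
    T = observations.shape[0]
    starts = torch.arange(0, T - bptt_horizon + 1, bptt_horizon)[:max_expert_samples]

    def windows(x):
        # -> (num_windows, bptt_horizon, dim)
        return torch.stack([x[s : s + bptt_horizon] for s in starts.tolist()])

    torch.save(windows(discrete_actions), f"{out_dir}/expert_actions_discrete.pt")
    torch.save(windows(continuous_actions), f"{out_dir}/expert_actions_continuous.pt")
    torch.save(windows(observations), f"{out_dir}/expert_observations.pt")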

Training inner loop

Do:

  1. Load all expert data onto the CPU (stored in two .pt files).
  2. Sample human_samples tuples $(o^{\text{expert}}_t, a^{\text{expert}}_t)$ from the saved expert dataset.
  3. Move this subset to the GPU.
  4. Evaluate the policy likelihood $\pi^{\text{agent}}(a^{\text{expert}}_t \mid o^{\text{expert}}_t)$ for each transition.
  5. Apply clipping and average the results.
  6. Incorporate the weighted log-likelihood signal into the loss objective (see the sketch after this list).
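
A minimal sketch of steps 5 and 6, reusing names from the training-loop excerpt quoted later on this page (human_log_prob, pg_loss, vf_loss, ent_loss, config); the exact clipping rule applied via human_clip_coef is an assumption:

import torch

# Steps 5-6 (illustrative): clip the per-sample log-likelihoods from below,
# average them, and fold the weighted term into the objective.
human_loss = torch.clamp(human_log_prob, min=-config["human_clip_coef"]).mean()

# Mirrors the loss line in the sequence diagram further down the page:
# loss = pg_loss + vf_loss - ent_loss - human_ll_coef * human_loss
loss = pg_loss + vf_loss - ent_loss - config["human_ll_coef"] * human_loss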

Wandb integration

  • Displays the mean log probability of human actions (negative values, closer to 0 is better)
  • Adds a new realism/ section with the human action distributions. This is done mostly as a sanity check. The section can be extended with other realism-related metrics.
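
A minimal sketch of how this logging could look, reusing names from the training-loop excerpt quoted later on this page; the key names beyond the realism/ prefix and the accel/steering channel order are assumptions:

import wandb

# Illustrative: log the mean human-action log probability plus histograms of
# the sampled human actions under a realism/ prefix.
wandb.log({
    "realism/human_log_prob": human_log_prob.mean().item(),
    "realism/human_accel": wandb.Histogram(continuous_human_actions[..., 0].flatten().cpu().numpy()),
    "realism/human_steering": wandb.Histogram(continuous_human_actions[..., 1].flatten().cpu().numpy()),
})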

@daphne-cornelisse marked this pull request as draft October 7, 2025 09:32
@daphne-cornelisse marked this pull request as ready for review October 8, 2025 09:53
@daphne-cornelisse changed the base branch from main to gsp_dev November 11, 2025 23:26
@eugenevinitsky

@greptile

@eugenevinitsky

Just testing greptile on things to develop a feel for it

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This PR adds human log likelihood regularization to encourage the policy to mimic human driving behavior from expert trajectories. The implementation uses an inverse bicycle model to infer continuous steering and acceleration actions from trajectory data, stores them in binary format, and samples them during training to compute an auxiliary loss term.

Key Changes:

  • Added expert_accel and expert_steering fields to the Entity struct in C++ to store inferred human actions
  • Implemented infer_human_actions() using inverse bicycle model to compute continuous control actions from position, velocity, and heading data
  • Created c_collect_expert_data() to step through environments with expert actions and collect observations
  • Added _save_expert_data() to generate and save bptt_horizon-length sequences of expert demonstrations at initialization
  • Integrated human log likelihood loss into the training loop by sampling expert data each minibatch and evaluating policy likelihood
  • Added configuration parameters (human_ll_coef, human_clip_coef, human_samples) with sweep support
  • Added WandB logging for human action distributions and log probabilities under the realism/ section

Issues Found:

  • Sequential sampling in _save_expert_data() creates bias toward early agents, reducing diversity when max_expert_samples is small relative to total available sequences
  • Repeated disk loading of expert data every minibatch is inefficient - could sample once per epoch
  • Minor: redundant .to(device) call on line 435 of pufferl.py
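
One illustrative way to address the sampling-bias point (not the merged fix; all_windows is an assumed name for the full set of candidate expert sequences):

import torch

# Draw a random subset instead of the first max_expert_samples windows in
# agent order, so all agents/scenes can be represented in the saved dataset.
num_windows = all_windows.shape[0]                    # (num_windows, bptt_horizon, dim)
idx = torch.randperm(num_windows)[:max_expert_samples]
selected_windows = all_windows[idx]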

Confidence Score: 3/5

  • This PR has one logic issue with sequential sampling bias that should be addressed before merging
  • The implementation is mostly sound with proper memory management, correct inverse bicycle model math, and appropriate integration into the training loop. However, the sequential sampling bias in _save_expert_data() (line 323 of drive.py) is a real issue that will cause the saved expert dataset to lack diversity when max_expert_samples=128 is smaller than the total available sequences. This could impact training quality by only learning from a subset of agents. The disk I/O inefficiency is a performance concern but not a correctness issue.
  • pufferlib/ocean/drive/drive.py - fix the sequential sampling bias in _save_expert_data() to ensure diverse agent representation

Important Files Changed

File Analysis

Filename | Score | Overview
pufferlib/ocean/drive/drive.h | 4/5 | Added expert action storage fields and collection function. Memory allocation/deallocation properly handled. Discretization logic looks correct.
pufferlib/ocean/drive/drive.py | 3/5 | Added inverse bicycle model for action inference, expert data saving/sampling. Contains sequential sampling bias issue that reduces agent diversity in saved data.
pufferlib/pufferl.py | 4/5 | Integrated human log likelihood loss into training loop. Implementation is correct but has minor inefficiency with repeated disk loading per minibatch.

Sequence Diagram

sequenceDiagram
    participant Init as Environment Init
    participant C as C/C++ (drive.h)
    participant Py as Python (drive.py)
    participant Train as Training Loop (pufferl.py)
    participant Disk as Disk Storage
    
    Note over Init,Disk: Initialization Phase
    Init->>Py: __init__(bptt_horizon, max_expert_samples)
    Py->>Py: _save_expert_data()
    Py->>C: vec_collect_expert_data()
    C->>C: c_reset() for each env
    loop t=0 to TRAJECTORY_LENGTH-1
        C->>C: Read expert_accel[t], expert_steering[t]
        C->>C: Discretize continuous actions
        C->>C: Store observations
        C->>C: c_step() to advance env
    end
    C->>C: c_reset() to restore state
    C-->>Py: Return discrete_actions, continuous_actions, observations
    Py->>Py: Create bptt_horizon length sequences
    Py->>Disk: Save expert_actions_discrete.pt
    Py->>Disk: Save expert_actions_continuous.pt
    Py->>Disk: Save expert_observations.pt
    
    Note over Train,Disk: Training Phase (per minibatch)
    Train->>Py: sample_expert_data(n_samples)
    Py->>Disk: Load expert data files (CPU)
    Py->>Py: Random sample n_samples
    Py-->>Train: Return actions, observations
    Train->>Train: Move to GPU
    Train->>Train: policy(human_obs, human_state)
    Train->>Train: sample_logits(logits, human_actions)
    Train->>Train: Compute human_log_prob
    Train->>Train: loss = pg_loss + vf_loss - ent_loss - human_ll_coef * human_loss
    Train->>Train: Backward pass & optimize

@greptile-apps bot left a comment

6 files reviewed, 3 comments

Comment on pufferlib/pufferl.py, lines 416 to 452
# Add log likelihood loss of human actions under current policy.
# 1: Sample a batch of human actions and observations from dataset
# Shape: [n_samples, bptt_horizon, feature_dim]
discrete_human_actions, continuous_human_actions, human_observations = (
    self.vecenv.driver_env.sample_expert_data(n_samples=config["human_samples"], return_both=True)
)
discrete_human_actions = discrete_human_actions.to(device)
continuous_human_actions = continuous_human_actions.to(device)
human_observations = human_observations.to(device)

# Use helper function to compute realism metrics
realism_metrics = self.vecenv.driver_env.compute_realism_metrics(
    discrete_human_actions, continuous_human_actions
)
self.realism.update(realism_metrics)

# Select appropriate action type for training
use_continuous = self.vecenv.driver_env._action_type_flag == 1
human_actions = continuous_human_actions if use_continuous else discrete_human_actions
human_observations = human_observations.to(device)

# 2: Compute the log-likelihood of human actions under the current policy,
# given the corresponding human observations. A higher likelihood indicates
# that the policy behaves more like a human under the same observations.
human_state = dict(
    action=human_actions,
    lstm_h=None,
    lstm_c=None,
)

human_logits, _ = self.policy(human_observations, human_state)

_, human_log_prob, human_entropy = pufferlib.pytorch.sample_logits(
    logits=human_logits, action=human_actions
)

self.realism["human_log_prob"] = human_log_prob.mean().item()

style: human log likelihood computation happens every minibatch but samples from the same full dataset each time - this creates unnecessary disk I/O overhead. Consider sampling once per epoch outside the minibatch loop and reusing those samples across all minibatches within that epoch

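A sketch of the suggested restructuring, reusing the names from the quoted snippet (the placement relative to the epoch loop is illustrative):

# Sample and move the expert batch to the GPU once per epoch, before entering
# the minibatch loop ...
discrete_human_actions, continuous_human_actions, human_observations = (
    self.vecenv.driver_env.sample_expert_data(n_samples=config["human_samples"], return_both=True)
)
discrete_human_actions = discrete_human_actions.to(device)
continuous_human_actions = continuous_human_actions.to(device)
human_observations = human_observations.to(device)

# ... then reuse these cached tensors inside every minibatch iteration instead
# of re-loading the expert .pt files from disk each time.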

@daphne-cornelisse merged commit b08ad27 into gsp_dev Nov 13, 2025
10 checks passed