Conversation

@daphne-cornelisse commented Oct 2, 2025

Goal

Given a dataset of human "expert" trajectories $(o_t, a_t)$, we aim to compute the likelihood of the expert’s actions under the current policy and use this as an auxiliary learning signal.
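
In loss form, the auxiliary signal enters the overall objective roughly as follows (a sketch; the clipping applied via the human_clip_coef parameter is described under the training loop and omitted here):

$\mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta) - \lambda_{\text{human}} \cdot \frac{1}{N} \sum_{t=1}^{N} \log \pi^{\text{agent}}(a^{\text{expert}}_t \mid o^{\text{expert}}_t)$

where $\lambda_{\text{human}}$ corresponds to the human_ll_coef configuration parameter and $N$ is the number of sampled expert transitions.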

Implementation

Dataset

Use an inverse bicycle model to infer the continuous human actions (acceleration and steering) taken at every step.
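
A rough sketch of what this inversion can look like for a kinematic bicycle model, assuming logged headings and velocities at a fixed timestep (the function name, dt, and wheelbase values are illustrative; the actual infer_human_actions() implementation may use a different formulation):

import numpy as np

def infer_human_actions_sketch(yaw, vel, dt=0.1, wheelbase=2.8):
    """Illustrative inverse kinematic bicycle model (not the exact PR code).

    yaw: (T,) logged headings in radians, vel: (T, 2) logged velocities.
    Returns per-step (accel, steering) arrays of length T - 1.
    """
    speed = np.linalg.norm(vel, axis=-1)              # (T,)
    accel = (speed[1:] - speed[:-1]) / dt             # finite-difference acceleration

    # Per-step heading change, wrapped to (-pi, pi]
    dyaw = np.arctan2(np.sin(yaw[1:] - yaw[:-1]), np.cos(yaw[1:] - yaw[:-1]))
    yaw_rate = dyaw / dt

    # Kinematic bicycle: yaw_rate = (v / L) * tan(steering)  =>  invert for steering
    steering = np.arctan(yaw_rate * wheelbase / np.maximum(speed[:-1], 1e-6))
    return accel, steering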

At initialization

The method _save_expert_data() saves max_expert_samples expert sequences per env to disk. Actions and observations are stacked along bptt_horizon so that the resulting tensors have shape (max_expert_samples, bptt_horizon, action/obs_dim).

Under the hood, this method calls c_collect_expert_data() in drive.h, which steps the agents with their inferred continuous human actions and collects the corresponding observations. Discretized actions are also returned when needed.
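
An illustrative sketch of the stacking-and-saving step, assuming the C-side collector returns flat per-step tensors (the windowing scheme and output directory are assumptions; the file names follow the sequence diagram later on this page):

import torch

def save_expert_windows(continuous_actions, discrete_actions, observations,
                        bptt_horizon, max_expert_samples, out_dir="."):
    """Sketch only: slice flat (T, dim) expert data into bptt_horizon-length
    windows and save up to max_expert_samples of them per environment."""
    T = observations.shape[0]
    starts = torch.arange(0, T - bptt_horizon + 1, bptt_horizon)[:max_expert_samples]

    def windows(x):
        # -> (num_windows, bptt_horizon, dim)
        return torch.stack([x[s : s + bptt_horizon] for s in starts.tolist()])

    torch.save(windows(discrete_actions), f"{out_dir}/expert_actions_discrete.pt")
    torch.save(windows(continuous_actions), f"{out_dir}/expert_actions_continuous.pt")
    torch.save(windows(observations), f"{out_dir}/expert_observations.pt")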

Training inner loop

Do:

  1. Load all expert data onto the CPU (stored in two .pt files).
  2. Sample human_samples tuples $(o^{\text{expert}}_t, a^{\text{expert}}_t)$ from the saved expert dataset.
  3. Move this subset to the GPU.
  4. Evaluate the policy likelihood $\pi^{\text{agent}}(a^{\text{expert}}_t \mid o^{\text{expert}}_t)$ for each transition.
  5. Apply clipping and average the results.
  6. Incorporate the weighted log-likelihood signal into the loss objective (see the sketch after this list).
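
A minimal sketch of steps 5 and 6, reusing names from the training-loop excerpt quoted later on this page (human_log_prob, pg_loss, vf_loss, ent_loss, config); the exact clipping rule applied via human_clip_coef is an assumption:

import torch

# Steps 5-6 (illustrative): clip the per-sample log-likelihoods from below,
# average them, and fold the weighted term into the objective.
human_loss = torch.clamp(human_log_prob, min=-config["human_clip_coef"]).mean()

# Mirrors the loss line in the sequence diagram further down the page:
# loss = pg_loss + vf_loss - ent_loss - human_ll_coef * human_loss
loss = pg_loss + vf_loss - ent_loss - config["human_ll_coef"] * human_loss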

Wandb integration

  • Displays the mean log probability of human actions (negative values, closer to 0 is better)
  • Adds a new realism/ section with the human action distributions. This is done mostly as a sanity check. The section can be extended with other realism-related metrics.
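
A minimal sketch of how this logging could look, reusing names from the training-loop excerpt quoted later on this page; the key names beyond the realism/ prefix and the accel/steering channel order are assumptions:

import wandb

# Illustrative: log the mean human-action log probability plus histograms of
# the sampled human actions under a realism/ prefix.
wandb.log({
    "realism/human_log_prob": human_log_prob.mean().item(),
    "realism/human_accel": wandb.Histogram(continuous_human_actions[..., 0].flatten().cpu().numpy()),
    "realism/human_steering": wandb.Histogram(continuous_human_actions[..., 1].flatten().cpu().numpy()),
})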

@daphne-cornelisse marked this pull request as draft October 7, 2025 09:32
@daphne-cornelisse marked this pull request as ready for review October 8, 2025 09:53
@daphne-cornelisse changed the base branch from main to gsp_dev November 11, 2025 23:26
@eugenevinitsky

@greptile

@eugenevinitsky

Just testing greptile on things to develop a feel for it

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This PR adds human log likelihood regularization to encourage the policy to mimic human driving behavior from expert trajectories. The implementation uses an inverse bicycle model to infer continuous steering and acceleration actions from trajectory data, stores them in binary format, and samples them during training to compute an auxiliary loss term.

Key Changes:

  • Added expert_accel and expert_steering fields to the Entity struct in C++ to store inferred human actions
  • Implemented infer_human_actions() using inverse bicycle model to compute continuous control actions from position, velocity, and heading data
  • Created c_collect_expert_data() to step through environments with expert actions and collect observations
  • Added _save_expert_data() to generate and save bptt_horizon-length sequences of expert demonstrations at initialization
  • Integrated human log likelihood loss into the training loop by sampling expert data each minibatch and evaluating policy likelihood
  • Added configuration parameters (human_ll_coef, human_clip_coef, human_samples) with sweep support
  • Added WandB logging for human action distributions and log probabilities under the realism/ section

Issues Found:

  • Sequential sampling in _save_expert_data() creates bias toward early agents, reducing diversity when max_expert_samples is small relative to total available sequences
  • Repeated disk loading of expert data every minibatch is inefficient - could sample once per epoch
  • Minor: redundant .to(device) call on line 435 of pufferl.py
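
One illustrative way to address the sampling-bias point (not the merged fix; all_windows is an assumed name for the full set of candidate expert sequences):

import torch

# Draw a random subset instead of the first max_expert_samples windows in
# agent order, so all agents/scenes can be represented in the saved dataset.
num_windows = all_windows.shape[0]                    # (num_windows, bptt_horizon, dim)
idx = torch.randperm(num_windows)[:max_expert_samples]
selected_windows = all_windows[idx]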

Confidence Score: 3/5

  • This PR has one logic issue with sequential sampling bias that should be addressed before merging
  • The implementation is mostly sound with proper memory management, correct inverse bicycle model math, and appropriate integration into the training loop. However, the sequential sampling bias in _save_expert_data() (line 323 of drive.py) is a real issue that will cause the saved expert dataset to lack diversity when max_expert_samples=128 is smaller than the total available sequences. This could impact training quality by only learning from a subset of agents. The disk I/O inefficiency is a performance concern but not a correctness issue.
  • pufferlib/ocean/drive/drive.py - fix the sequential sampling bias in _save_expert_data() to ensure diverse agent representation

Important Files Changed

File Analysis

Filename | Score | Overview
pufferlib/ocean/drive/drive.h | 4/5 | Added expert action storage fields and collection function. Memory allocation/deallocation properly handled. Discretization logic looks correct.
pufferlib/ocean/drive/drive.py | 3/5 | Added inverse bicycle model for action inference, expert data saving/sampling. Contains sequential sampling bias issue that reduces agent diversity in saved data.
pufferlib/pufferl.py | 4/5 | Integrated human log likelihood loss into training loop. Implementation is correct but has minor inefficiency with repeated disk loading per minibatch.

Sequence Diagram

sequenceDiagram
    participant Init as Environment Init
    participant C as C/C++ (drive.h)
    participant Py as Python (drive.py)
    participant Train as Training Loop (pufferl.py)
    participant Disk as Disk Storage
    
    Note over Init,Disk: Initialization Phase
    Init->>Py: __init__(bptt_horizon, max_expert_samples)
    Py->>Py: _save_expert_data()
    Py->>C: vec_collect_expert_data()
    C->>C: c_reset() for each env
    loop t=0 to TRAJECTORY_LENGTH-1
        C->>C: Read expert_accel[t], expert_steering[t]
        C->>C: Discretize continuous actions
        C->>C: Store observations
        C->>C: c_step() to advance env
    end
    C->>C: c_reset() to restore state
    C-->>Py: Return discrete_actions, continuous_actions, observations
    Py->>Py: Create bptt_horizon length sequences
    Py->>Disk: Save expert_actions_discrete.pt
    Py->>Disk: Save expert_actions_continuous.pt
    Py->>Disk: Save expert_observations.pt
    
    Note over Train,Disk: Training Phase (per minibatch)
    Train->>Py: sample_expert_data(n_samples)
    Py->>Disk: Load expert data files (CPU)
    Py->>Py: Random sample n_samples
    Py-->>Train: Return actions, observations
    Train->>Train: Move to GPU
    Train->>Train: policy(human_obs, human_state)
    Train->>Train: sample_logits(logits, human_actions)
    Train->>Train: Compute human_log_prob
    Train->>Train: loss = pg_loss + vf_loss - ent_loss - human_ll_coef * human_loss
    Train->>Train: Backward pass & optimize

@greptile-apps bot left a comment

6 files reviewed, 3 comments

Comment on pufferlib/pufferl.py, lines 416 to 452
# Add log likelihood loss of human actions under current policy.
# 1: Sample a batch of human actions and observations from dataset
# Shape: [n_samples, bptt_horizon, feature_dim]
discrete_human_actions, continuous_human_actions, human_observations = (
    self.vecenv.driver_env.sample_expert_data(n_samples=config["human_samples"], return_both=True)
)
discrete_human_actions = discrete_human_actions.to(device)
continuous_human_actions = continuous_human_actions.to(device)
human_observations = human_observations.to(device)

# Use helper function to compute realism metrics
realism_metrics = self.vecenv.driver_env.compute_realism_metrics(
    discrete_human_actions, continuous_human_actions
)
self.realism.update(realism_metrics)

# Select appropriate action type for training
use_continuous = self.vecenv.driver_env._action_type_flag == 1
human_actions = continuous_human_actions if use_continuous else discrete_human_actions
human_observations = human_observations.to(device)

# 2: Compute the log-likelihood of human actions under the current policy,
# given the corresponding human observations. A higher likelihood indicates
# that the policy behaves more like a human under the same observations.
human_state = dict(
    action=human_actions,
    lstm_h=None,
    lstm_c=None,
)

human_logits, _ = self.policy(human_observations, human_state)

_, human_log_prob, human_entropy = pufferlib.pytorch.sample_logits(
    logits=human_logits, action=human_actions
)

self.realism["human_log_prob"] = human_log_prob.mean().item()

style: human log likelihood computation happens every minibatch but samples from the same full dataset each time - this creates unnecessary disk I/O overhead. Consider sampling once per epoch outside the minibatch loop and reusing those samples across all minibatches within that epoch

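A sketch of the suggested restructuring, reusing the names from the quoted snippet (the placement relative to the epoch loop is illustrative):

# Sample and move the expert batch to the GPU once per epoch, before entering
# the minibatch loop ...
discrete_human_actions, continuous_human_actions, human_observations = (
    self.vecenv.driver_env.sample_expert_data(n_samples=config["human_samples"], return_both=True)
)
discrete_human_actions = discrete_human_actions.to(device)
continuous_human_actions = continuous_human_actions.to(device)
human_observations = human_observations.to(device)

# ... then reuse these cached tensors inside every minibatch iteration instead
# of re-loading the expert .pt files from disk each time.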

@daphne-cornelisse merged commit b08ad27 into gsp_dev Nov 13, 2025
10 checks passed