Reward/Entropy/Discount Conditioning #96
base: main
Conversation
Adds #23
Do you think it would be worth it to condition on the ADE reward as well, or should we leave it?
Interesting, I'm not sure what this accomplishes. I don't really care that much about the reward_ade of other drivers, right?
It's probably not that important tbh; it just might add an extra dimension of conditioning.
Changes
- Remove GIF generation code from drive.c
- Improved load_weights to auto-detect file size
- view flag
- Random maps if map-name is not passed
- policy-name flag to make videos for a particular policy, saved in the policy directory
dfd7d49 to fecb769
LGTM! Nice job
daphne-cornelisse left a comment
Did a high-level review. Will do an in-depth one later this week. Please remove output files
control_all_agents = False  # this should be set to False unless you specifically want to override and control expert-marked vehicles
num_policy_controlled_agents = -1  # note: if you add this you likely need to set num_agents to a smaller number
deterministic_agent_selection = False  # if this is True, it overrides vehicles marked as expert to be policy-controlled
condition_type = "none"  # Options: "none", "reward", "entropy", "discount", "all"
Can you add some docs in the readme in drive/ to explain these?
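For illustration only, here is a rough sketch of how condition_type could translate into extra ego observation dimensions. The counts and the helper below are assumptions inferred from the weight arrays in this PR (collision/offroad/goal, entropy, discount), not the actual drive.py code:

```python
# Hypothetical sketch: how condition_type could map to extra ego features.
# The exact counts in drive.py may differ; reward conditioning is assumed to
# add three weights (collision, offroad, goal), entropy and discount one each.
CONDITIONING_DIMS = {
    "none": 0,
    "reward": 3,     # collision_weight, offroad_weight, goal_weight
    "entropy": 1,    # entropy_weight
    "discount": 1,   # discount_weight
    "all": 5,        # reward (3) + entropy (1) + discount (1)
}

def ego_obs_dim(base_dim: int, condition_type: str) -> int:
    """Total ego observation size: base ego features plus conditioning weights."""
    return base_dim + CONDITIONING_DIMS[condition_type]
```

With the classic dynamics model's assumed base of 7 ego features (see the review discussion below), condition_type = "all" would then give 12 ego dimensions, and the jerk model's assumed base of 10 would give 15.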
pufferlib/pufferl.py
Outdated
losses["policy_loss"] += pg_loss.item() / self.total_minibatches
losses["value_loss"] += v_loss.item() / self.total_minibatches
losses["entropy"] += entropy_loss.item() / self.total_minibatches
losses["entropy"] += entropy.mean().item() / self.total_minibatches
Can we keep entropy_loss the same? No need to change this.
charliemolony left a comment
With the new dynamics model 'Jerk', there is an extra dimension in the observations. I think you need to account for this and shift the conditioning dimensions down.
Ahh, thanks for pointing that out. Will fix.
@charliemolony merged the new dynamics model. Can you take a look again?
@m2kulkarni I've opened a new pull request, #125, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull Request Overview
This PR adds conditioning capabilities to the ego agent, supporting "none", "reward", "entropy", "discount", and "all" condition types. The changes enable per-agent discount factors and variable ego dimensions in the neural network architecture.
Key changes:
- Modified compute_puff_advantage() to accept per-agent gamma tensors instead of scalar values (see the sketch after this list)
- Added variable ego dimension support (base dimension + conditioning dimensions)
- Implemented reward, entropy, and discount conditioning with configurable weight ranges
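As a minimal sketch of the per-agent gamma idea (the function below is illustrative and assumes a [T, num_agents] buffer layout; it is not the actual compute_puff_advantage() implementation):

```python
import torch

def gae_with_per_agent_gamma(rewards, values, dones, gammas, lam=0.95):
    """Illustrative GAE where gammas is a per-agent tensor, not a scalar.

    rewards, dones: float tensors of shape [T, num_agents] (dones are 0/1);
    values: [T + 1, num_agents] with a bootstrap row at the end;
    gammas: [num_agents], e.g. sampled per agent for discount conditioning.
    """
    T, num_agents = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(num_agents)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # Each agent applies its own discount factor via broadcasting.
        delta = rewards[t] + gammas * values[t + 1] * not_done - values[t]
        last_gae = delta + gammas * lam * not_done * last_gae
        advantages[t] = last_gae
    return advantages
```

With a constant gammas tensor this reduces to the usual scalar-gamma GAE, which matches the note in the description that condition_type="none" preserves the original behavior.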
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_drive_conditioning.py | New test suite validating conditioning behavior for both dynamics models |
| pufferlib/pufferl.py | Updated training logic to handle per-agent gammas and entropy-weighted losses |
| pufferlib/ocean/torch.py | Modified neural network to support variable ego dimensions based on conditioning |
| pufferlib/ocean/drive/visualize.c | Added conditioning parameters to visualization function signatures |
| pufferlib/ocean/drive/drivenet.h | Updated DriveNet structure and initialization with conditioning support |
| pufferlib/ocean/drive/drive.py | Added conditioning configuration and observation space calculation |
| pufferlib/ocean/drive/drive.h | Extended Drive struct with conditioning fields and memory management |
| pufferlib/ocean/drive/drive.c | Updated function calls with new conditioning parameters |
| pufferlib/ocean/drive/binding.c | Added Python binding support for conditioning parameters |
| pufferlib/extensions/pufferlib.cpp | Modified advantage computation to accept per-agent gamma tensors |
| pufferlib/extensions/cuda/pufferlib.cu | Updated CUDA kernel for per-agent gamma support |
| pufferlib/config/ocean/drive.ini | Added condition_type configuration option |
pufferlib/pufferl.py
Outdated
disc_idx = 7 # base ego obs
else:
disc_idx = 10 # base ego obs
Copilot AI, Nov 10, 2025
The disc_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 367 and 369 should swap their values.
Suggested change:
disc_idx = 10 # base ego obs
else:
disc_idx = 7 # base ego obs
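For reference, a hedged sketch of the mapping this comment describes (base ego dimensions of 10 for 'jerk' and 7 for 'classic' are taken from the review comment; the helper itself is illustrative, not code from the PR):

```python
def conditioning_start_idx(dynamics_model: str) -> int:
    """Index of the first conditioning weight appended after the base ego features.

    Per the review comment, 'jerk' has 10 base ego features and 'classic' has 7,
    so conditioning weights such as the discount weight start right after them.
    """
    return 10 if dynamics_model == "jerk" else 7
```

disc_idx and ent_idx would then both derive from this base, possibly offset further depending on how the reward, entropy, and discount weights are ordered in the observation.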
pufferlib/pufferl.py
Outdated
ent_idx = 7 # base ego obs
else:
ent_idx = 10 # base ego obs
Copilot AI, Nov 10, 2025
The ent_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 470 and 472 should swap their values.
Suggested change:
ent_idx = 10 # base ego obs
else:
ent_idx = 7 # base ego obs
if (env->use_rc) {
    env->collision_weights = (float*)calloc(env->active_agent_count, sizeof(float));
    env->offroad_weights = (float*)calloc(env->active_agent_count, sizeof(float));
    env->goal_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_ec) {
    env->entropy_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_dc) {
    env->discount_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
Copilot AI, Nov 10, 2025
Memory leak: conditioning weight arrays are allocated in init() and then allocated again in allocate() (lines 1568-1578) without freeing the first allocation. Since allocate() calls init() at line 1563, the second allocation overwrites the pointers from the first allocation. Remove these allocations from init() as they are redundantly allocated in allocate().
Suggested change:
// Conditioning weight arrays are allocated in allocate(), not here.
float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI, Nov 10, 2025
Orphaned comment that doesn't relate to the code on line 1908. This comment appears to be misplaced or leftover from editing. It should be removed.
Suggested change: remove the "// Add conditioning weights to observations" line.
Pull Request Overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
.collision_weight_lb = -0.0f,
.collision_weight_ub = -0.0f,
.offroad_weight_lb = -0.0f,
.offroad_weight_ub = -0.0f,
Copilot AI, Nov 11, 2025
Using -0.0f for collision and offroad weight bounds is semantically confusing. Should use 0.0f instead, as -0.0f and 0.0f are equivalent in floating-point but -0.0f suggests negative weights were intended.
Suggested change:
.collision_weight_lb = 0.0f,
.collision_weight_ub = 0.0f,
.offroad_weight_lb = 0.0f,
.offroad_weight_ub = 0.0f,
float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI, Nov 11, 2025
This comment is misplaced and misleading. It appears in the middle of processing partner observations (line 1909) but conditioning weights were already added earlier (lines 1872-1882). This comment should be removed.
Suggested change: remove the "// Add conditioning weights to observations" line.
memset(net->obs_road, 0, net->num_agents * 200 * 13 * sizeof(float));

// Reshape observations into 2D boards and additional features
float* obs_self = net->obs_self;
Copilot AI, Nov 11, 2025
[nitpick] Variable obs_self shadows the struct member net->obs_self and is only used once on line 157. Consider removing this local variable and using net->obs_self directly for clarity.
Adds conditioning to the ego agent. Currently supports "none", "reward", "entropy", "discount", "all".
Changes
- compute_puff_advantage() accepts a per-agent gammas tensor instead of a scalar
- condition_type="none" preserves original behavior