
Conversation

@m2kulkarni

Adds conditioning to the ego agent. Currently supports "none", "reward", "entropy", "discount", and "all".
Changes

  • compute_puff_advantage() accepts per-agent gammas tensor instead of scalar
  • Variable ego dimension in neural network (7 + conditioning_dims)

condition_type="none" preserves original behavior

@m2kulkarni
Author

Adds #23

@charliemolony Oct 28, 2025

Do you think it would be worth it to condition on the ADE reward as well, or should we leave it?

Author

Interesting, I'm not sure what this accomplishes. I don't really care about the reward_ade of other drivers that much, right?


It's probably not that important tbh, it just might add an extra dimension of conditioning.

Changes
- Remove GIF generation code from drive.c
- Improved load_weights to auto-detect file size
- view flag
- random maps if map-name not passed
- policy-name flag to make videos for a particular policy. Saved in the policy directory
@charliemolony left a comment

LGTM! Nice job

@daphne-cornelisse self-requested a review October 28, 2025 18:06
@daphne-cornelisse left a comment

Did a high-level review. Will do an in-depth one later this week. Please remove output files

control_all_agents = False # this should be set to false unless you specifically want to override and control expert-marked vehicles
num_policy_controlled_agents = -1 # note: if you add this you likely need to set num_agents to a smaller number
deterministic_agent_selection = False # if this is true it overrides vehicles marked as expert to be policy controlled
condition_type = "none" # Options: "none", "reward", "entropy", "discount", "all"


Can you add some docs in the readme in drive/ to explain these?

losses["policy_loss"] += pg_loss.item() / self.total_minibatches
losses["value_loss"] += v_loss.item() / self.total_minibatches
losses["entropy"] += entropy_loss.item() / self.total_minibatches
losses["entropy"] += entropy.mean().item() / self.total_minibatches


Can we keep entropy_loss the same? No need to change this.
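For reference, a sketch of what keeping entropy_loss unchanged would look like, assuming the same surrounding training-loop variables as in the snippet above:

```python
# Keep accumulating the existing entropy_loss term rather than switching
# to the raw per-sample entropy mean.
losses["policy_loss"] += pg_loss.item() / self.total_minibatches
losses["value_loss"] += v_loss.item() / self.total_minibatches
losses["entropy"] += entropy_loss.item() / self.total_minibatches
```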

@charliemolony left a comment

With the new dynamics model 'Jerk', there is an extra dimension in the observations. I think you need to account for this and shift the conditioning dimensions down.
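A minimal sketch of the adjustment being suggested (hypothetical helper name; it assumes a base ego dimension of 7 for the classic model and 10 for 'jerk', consistent with the review comments below):

```python
# Hypothetical sketch: conditioning features are appended after the base ego
# observation, so their start index must track the dynamics model in use.
def conditioning_start_idx(dynamics_model: str) -> int:
    base_ego_dim = 10 if dynamics_model == "jerk" else 7  # 'classic' dynamics
    return base_ego_dim  # first conditioning slot follows the base ego obs
```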

@m2kulkarni
Author

Ah, thanks for pointing that out. Will fix.

@m2kulkarni
Author

@charliemolony merged the new dynamics model. Can you take a look again?

@m2kulkarni
Author

@copilot


Copilot AI commented Nov 10, 2025

@m2kulkarni I've opened a new pull request, #125, to work on those changes. Once the pull request is ready, I'll request review from you.

@m2kulkarni m2kulkarni requested a review from Copilot November 10, 2025 22:58
Copilot finished reviewing on behalf of m2kulkarni November 10, 2025 23:01
Copilot AI left a comment

Pull Request Overview

This PR adds conditioning capabilities to the ego agent, supporting "none", "reward", "entropy", "discount", and "all" condition types. The changes enable per-agent discount factors and variable ego dimensions in the neural network architecture.

Key changes:

  • Modified compute_puff_advantage() to accept per-agent gamma tensors instead of scalar values
  • Added variable ego dimension support (base dimension + conditioning dimensions)
  • Implemented reward, entropy, and discount conditioning with configurable weight ranges
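To make the per-agent gamma change concrete, here is a minimal sketch of GAE-style advantages where gamma is a per-agent tensor broadcasting over the agent dimension. This is not the actual compute_puff_advantage implementation; the function name, signature, and lambda default are assumptions:

```python
import torch

def advantage_with_per_agent_gamma(rewards, values, dones, gammas, lam=0.95):
    """rewards, values, dones: float [T, N] tensors; gammas: [N] per-agent discounts."""
    T, N = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_adv = torch.zeros(N, dtype=rewards.dtype)
    for t in reversed(range(T - 1)):
        not_done = 1.0 - dones[t + 1]
        # gammas broadcasts elementwise across the N agents
        delta = rewards[t] + gammas * not_done * values[t + 1] - values[t]
        last_adv = delta + gammas * lam * not_done * last_adv
        advantages[t] = last_adv
    return advantages
```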

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_drive_conditioning.py: New test suite validating conditioning behavior for both dynamics models
  • pufferlib/pufferl.py: Updated training logic to handle per-agent gammas and entropy-weighted losses
  • pufferlib/ocean/torch.py: Modified neural network to support variable ego dimensions based on conditioning
  • pufferlib/ocean/drive/visualize.c: Added conditioning parameters to visualization function signatures
  • pufferlib/ocean/drive/drivenet.h: Updated DriveNet structure and initialization with conditioning support
  • pufferlib/ocean/drive/drive.py: Added conditioning configuration and observation space calculation
  • pufferlib/ocean/drive/drive.h: Extended Drive struct with conditioning fields and memory management
  • pufferlib/ocean/drive/drive.c: Updated function calls with new conditioning parameters
  • pufferlib/ocean/drive/binding.c: Added Python binding support for conditioning parameters
  • pufferlib/extensions/pufferlib.cpp: Modified advantage computation to accept per-agent gamma tensors
  • pufferlib/extensions/cuda/pufferlib.cu: Updated CUDA kernel for per-agent gamma support
  • pufferlib/config/ocean/drive.ini: Added condition_type configuration option


Comment on lines 367 to 369
disc_idx = 7 # base ego obs
else:
disc_idx = 10 # base ego obs
Copilot AI Nov 10, 2025

The disc_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 367 and 369 should swap their values.

Suggested change
-disc_idx = 7 # base ego obs
-else:
-disc_idx = 10 # base ego obs
+disc_idx = 10 # base ego obs
+else:
+disc_idx = 7 # base ego obs

Comment on lines 470 to 472
ent_idx = 7 # base ego obs
else:
ent_idx = 10 # base ego obs
Copilot AI Nov 10, 2025

The ent_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 470 and 472 should swap their values.

Suggested change
-ent_idx = 7 # base ego obs
-else:
-ent_idx = 10 # base ego obs
+ent_idx = 10 # base ego obs
+else:
+ent_idx = 7 # base ego obs

Comment on lines +1508 to +1518
if (env->use_rc) {
env->collision_weights = (float*)calloc(env->active_agent_count, sizeof(float));
env->offroad_weights = (float*)calloc(env->active_agent_count, sizeof(float));
env->goal_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_ec) {
env->entropy_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_dc) {
env->discount_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
Copilot AI Nov 10, 2025

Memory leak: conditioning weight arrays are allocated in init() and then allocated again in allocate() (lines 1568-1578) without freeing the first allocation. Since allocate() calls init() at line 1563, the second allocation overwrites the pointers from the first allocation. Remove these allocations from init() as they are redundantly allocated in allocate().

Suggested change
-if (env->use_rc) {
-env->collision_weights = (float*)calloc(env->active_agent_count, sizeof(float));
-env->offroad_weights = (float*)calloc(env->active_agent_count, sizeof(float));
-env->goal_weights = (float*)calloc(env->active_agent_count, sizeof(float));
-}
-if (env->use_ec) {
-env->entropy_weights = (float*)calloc(env->active_agent_count, sizeof(float));
-}
-if (env->use_dc) {
-env->discount_weights = (float*)calloc(env->active_agent_count, sizeof(float));
-}
+// Conditioning weight arrays are allocated in allocate(), not here.

float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI Nov 10, 2025

Orphaned comment that doesn't relate to the code on line 1908. This comment appears to be misplaced or leftover from editing. It should be removed.

Suggested change
-// Add conditioning weights to observations

@m2kulkarni m2kulkarni requested a review from Copilot November 11, 2025 00:55
Copilot finished reviewing on behalf of m2kulkarni November 11, 2025 00:57
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.



Comment on lines +258 to +261
.collision_weight_lb = -0.0f,
.collision_weight_ub = -0.0f,
.offroad_weight_lb = -0.0f,
.offroad_weight_ub = -0.0f,
Copilot AI Nov 11, 2025

Using -0.0f for collision and offroad weight bounds is semantically confusing. Should use 0.0f instead, as -0.0f and 0.0f are equivalent in floating-point but -0.0f suggests negative weights were intended.

Suggested change
-.collision_weight_lb = -0.0f,
-.collision_weight_ub = -0.0f,
-.offroad_weight_lb = -0.0f,
-.offroad_weight_ub = -0.0f,
+.collision_weight_lb = 0.0f,
+.collision_weight_ub = 0.0f,
+.offroad_weight_lb = 0.0f,
+.offroad_weight_ub = 0.0f,

float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI Nov 11, 2025

This comment is misplaced and misleading. It appears in the middle of processing partner observations (line 1909) but conditioning weights were already added earlier (lines 1872-1882). This comment should be removed.

Suggested change
-// Add conditioning weights to observations

memset(net->obs_road, 0, net->num_agents * 200 * 13 * sizeof(float));

// Reshape observations into 2D boards and additional features
float* obs_self = net->obs_self;
Copilot AI Nov 11, 2025

[nitpick] Variable obs_self shadows the struct member net->obs_self and is only used once on line 157. Consider removing this local variable and using net->obs_self directly for clarity.
