Reward/Entropy/Discount Conditioning #96
base: main
Conversation
Adds #23
Do you think it would be worth it to condition on the ADE reward as well, or should we leave it?
Interesting, I'm not sure what this accomplishes. I don't really care that much about the reward_ade of other drivers, right?
It's probably not that important tbh; it just might add an extra dimension of conditioning.
Changes
- Remove GIF generation code from drive.c
- Improved load_weights to auto-detect file size
- view flag
- Random maps if map-name is not passed
- policy-name flag to make videos for a particular policy, saved in the policy directory
dfd7d49 to fecb769
LGTM! Nice job
daphne-cornelisse left a comment
Did a high-level review. Will do an in-depth one later this week. Please remove output files
control_all_agents = False  # this should be set to False unless you specifically want to override and control expert-marked vehicles
num_policy_controlled_agents = -1  # note: if you add this you likely need to set num_agents to a smaller number
deterministic_agent_selection = False  # if this is True, it overrides vehicles marked as expert to be policy-controlled
condition_type = "none"  # Options: "none", "reward", "entropy", "discount", "all"
Can you add some docs in the readme in drive/ to explain these?
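For illustration only, here is a rough sketch of how condition_type could translate into extra ego observation dimensions. The counts and the helper below are assumptions inferred from the weight arrays in this PR (collision/offroad/goal, entropy, discount), not the actual drive.py code:

```python
# Hypothetical sketch: how condition_type could map to extra ego features.
# The exact counts in drive.py may differ; reward conditioning is assumed to
# add three weights (collision, offroad, goal), entropy and discount one each.
CONDITIONING_DIMS = {
    "none": 0,
    "reward": 3,     # collision_weight, offroad_weight, goal_weight
    "entropy": 1,    # entropy_weight
    "discount": 1,   # discount_weight
    "all": 5,        # reward (3) + entropy (1) + discount (1)
}

def ego_obs_dim(base_dim: int, condition_type: str) -> int:
    """Total ego observation size: base ego features plus conditioning weights."""
    return base_dim + CONDITIONING_DIMS[condition_type]
```

With the classic dynamics model's assumed base of 7 ego features (see the review discussion below), condition_type = "all" would then give 12 ego dimensions, and the jerk model's assumed base of 10 would give 15.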
pufferlib/pufferl.py
Outdated
losses["policy_loss"] += pg_loss.item() / self.total_minibatches
losses["value_loss"] += v_loss.item() / self.total_minibatches
losses["entropy"] += entropy_loss.item() / self.total_minibatches
losses["entropy"] += entropy.mean().item() / self.total_minibatches
Can we keep entropy_loss the same? No need to change this.
charliemolony left a comment
With the new dynamics model 'Jerk', there is an extra dimension in the observations. I think you need to account for this and shift the conditioning dimensions down.
Ahh, thanks for pointing that out. Will fix.
@charliemolony merged the new dynamics model. Can you take a look again?
@m2kulkarni I've opened a new pull request, #125, to work on those changes. Once the pull request is ready, I'll request review from you.
Pull Request Overview
This PR adds conditioning capabilities to the ego agent, supporting "none", "reward", "entropy", "discount", and "all" condition types. The changes enable per-agent discount factors and variable ego dimensions in the neural network architecture.
Key changes:
- Modified compute_puff_advantage() to accept per-agent gamma tensors instead of scalar values (see the sketch after this list)
- Added variable ego dimension support (base dimension + conditioning dimensions)
- Implemented reward, entropy, and discount conditioning with configurable weight ranges
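As a minimal sketch of the per-agent gamma idea (the function below is illustrative and assumes a [T, num_agents] buffer layout; it is not the actual compute_puff_advantage() implementation):

```python
import torch

def gae_with_per_agent_gamma(rewards, values, dones, gammas, lam=0.95):
    """Illustrative GAE where gammas is a per-agent tensor, not a scalar.

    rewards, dones: float tensors of shape [T, num_agents] (dones are 0/1);
    values: [T + 1, num_agents] with a bootstrap row at the end;
    gammas: [num_agents], e.g. sampled per agent for discount conditioning.
    """
    T, num_agents = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(num_agents)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # Each agent applies its own discount factor via broadcasting.
        delta = rewards[t] + gammas * values[t + 1] * not_done - values[t]
        last_gae = delta + gammas * lam * not_done * last_gae
        advantages[t] = last_gae
    return advantages
```

With a constant gammas tensor this reduces to the usual scalar-gamma GAE, which matches the note in the description that condition_type="none" preserves the original behavior.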
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_drive_conditioning.py | New test suite validating conditioning behavior for both dynamics models |
| pufferlib/pufferl.py | Updated training logic to handle per-agent gammas and entropy-weighted losses |
| pufferlib/ocean/torch.py | Modified neural network to support variable ego dimensions based on conditioning |
| pufferlib/ocean/drive/visualize.c | Added conditioning parameters to visualization function signatures |
| pufferlib/ocean/drive/drivenet.h | Updated DriveNet structure and initialization with conditioning support |
| pufferlib/ocean/drive/drive.py | Added conditioning configuration and observation space calculation |
| pufferlib/ocean/drive/drive.h | Extended Drive struct with conditioning fields and memory management |
| pufferlib/ocean/drive/drive.c | Updated function calls with new conditioning parameters |
| pufferlib/ocean/drive/binding.c | Added Python binding support for conditioning parameters |
| pufferlib/extensions/pufferlib.cpp | Modified advantage computation to accept per-agent gamma tensors |
| pufferlib/extensions/cuda/pufferlib.cu | Updated CUDA kernel for per-agent gamma support |
| pufferlib/config/ocean/drive.ini | Added condition_type configuration option |
pufferlib/pufferl.py
Outdated
disc_idx = 7 # base ego obs
else:
disc_idx = 10 # base ego obs
Copilot AI, Nov 10, 2025
The disc_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 367 and 369 should swap their values.
Suggested change:
disc_idx = 10 # base ego obs
else:
disc_idx = 7 # base ego obs
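For reference, a hedged sketch of the mapping this comment describes (base ego dimensions of 10 for 'jerk' and 7 for 'classic' are taken from the review comment; the helper itself is illustrative, not code from the PR):

```python
def conditioning_start_idx(dynamics_model: str) -> int:
    """Index of the first conditioning weight appended after the base ego features.

    Per the review comment, 'jerk' has 10 base ego features and 'classic' has 7,
    so conditioning weights such as the discount weight start right after them.
    """
    return 10 if dynamics_model == "jerk" else 7
```

disc_idx and ent_idx would then both derive from this base, possibly offset further depending on how the reward, entropy, and discount weights are ordered in the observation.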
pufferlib/pufferl.py
Outdated
ent_idx = 7 # base ego obs
else:
ent_idx = 10 # base ego obs
Copilot AI, Nov 10, 2025
The ent_idx values are swapped. When dynamics_model is 'jerk', the base ego dimension is 10, not 7. When it's 'classic' (else), the base ego dimension is 7, not 10. Lines 470 and 472 should swap their values.
Suggested change:
ent_idx = 10 # base ego obs
else:
ent_idx = 7 # base ego obs
if (env->use_rc) {
    env->collision_weights = (float*)calloc(env->active_agent_count, sizeof(float));
    env->offroad_weights = (float*)calloc(env->active_agent_count, sizeof(float));
    env->goal_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_ec) {
    env->entropy_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
if (env->use_dc) {
    env->discount_weights = (float*)calloc(env->active_agent_count, sizeof(float));
}
Copilot AI, Nov 10, 2025
Memory leak: conditioning weight arrays are allocated in init() and then allocated again in allocate() (lines 1568-1578) without freeing the first allocation. Since allocate() calls init() at line 1563, the second allocation overwrites the pointers from the first allocation. Remove these allocations from init() as they are redundantly allocated in allocate().
Suggested change:
// Conditioning weight arrays are allocated in allocate(), not here.
float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI, Nov 10, 2025
Orphaned comment that doesn't relate to the code on line 1908. This comment appears to be misplaced or leftover from editing. It should be removed.
Suggested change: remove the "// Add conditioning weights to observations" line.
Pull Request Overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
.collision_weight_lb = -0.0f,
.collision_weight_ub = -0.0f,
.offroad_weight_lb = -0.0f,
.offroad_weight_ub = -0.0f,
Copilot AI, Nov 11, 2025
Using -0.0f for collision and offroad weight bounds is semantically confusing. Should use 0.0f instead, as -0.0f and 0.0f are equivalent in floating-point but -0.0f suggests negative weights were intended.
Suggested change:
.collision_weight_lb = 0.0f,
.collision_weight_ub = 0.0f,
.offroad_weight_lb = 0.0f,
.offroad_weight_ub = 0.0f,
float rel_y = -dx*sin_heading + dy*cos_heading;
// Store observations with correct indexing
obs[obs_idx] = rel_x * 0.02f;
// Add conditioning weights to observations
Copilot AI, Nov 11, 2025
This comment is misplaced and misleading. It appears in the middle of processing partner observations (line 1909) but conditioning weights were already added earlier (lines 1872-1882). This comment should be removed.
Suggested change: remove the "// Add conditioning weights to observations" line.
memset(net->obs_road, 0, net->num_agents * 200 * 13 * sizeof(float));

// Reshape observations into 2D boards and additional features
float* obs_self = net->obs_self;
Copilot AI, Nov 11, 2025
[nitpick] Variable obs_self shadows the struct member net->obs_self and is only used once on line 157. Consider removing this local variable and using net->obs_self directly for clarity.
Adds conditioning to the ego agent. Currently supports "none", "reward", "entropy", "discount", "all".
Changes
- compute_puff_advantage() accepts a per-agent gammas tensor instead of a scalar
- condition_type="none" preserves original behavior