
Conversation

@daphne-cornelisse (Contributor) commented Apr 5, 2025

Description

Implements advantage filtering from "Robust Autonomy Emerges from Self-Play" (details in Appendix C, Algorithm 1).

The key idea is to discard transitions with low-magnitude advantages so that training focuses on the most informative samples.

Added config options:

  • apply_advantage_filter
  • beta (default: 0.25)
  • initial_th_factor (default: 0.01)
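
A minimal sketch of how these options could fit together, assuming beta is an exponential-moving-average coefficient for the threshold and initial_th_factor seeds the threshold from the advantage scale (everything beyond the two config names is illustrative, not the PR's actual implementation):

import numpy as np

beta = 0.25               # EMA coefficient for the adaptive threshold (assumed meaning)
initial_th_factor = 0.01  # seeds the threshold relative to the advantage scale (assumed meaning)

advantages = np.random.randn(4096).astype(np.float32)
threshold = initial_th_factor * np.abs(advantages).mean()

# Keep only transitions whose advantage magnitude clears the threshold.
keep = np.abs(advantages) >= threshold
kept_indices = np.flatnonzero(keep)

# Adapt the threshold toward the current advantage scale via an
# exponential moving average (assumed update rule).
threshold = (1 - beta) * threshold + beta * np.abs(advantages).mean()

print(f"Advantage filtering: kept {100 * keep.mean():.1f}% of transitions")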

Todo

  • Implement advantage filtering on top of the existing Experience buffer, preserving tensor shapes: zero out all transitions whose advantage magnitude falls below the threshold (see the sketch after this list).
  • Restructure the Experience buffer to actually drop filtered transitions, for memory and training efficiency.
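
A minimal sketch of the first item, assuming a torch-based Experience buffer with an advantages tensor (the helper name and field are hypothetical, not the PR's code):

import torch

def zero_out_filtered(advantages: torch.Tensor, threshold: float):
    # Hypothetical helper, not the PR's actual implementation.
    # Zeroes advantages whose magnitude is below the threshold while
    # preserving the tensor's shape, so the existing Experience buffer
    # and minibatching logic stay unchanged; filtered transitions then
    # contribute nothing to the policy-gradient loss.
    keep = advantages.abs() >= threshold      # bool mask, same shape
    filtered = advantages * keep              # zero the filtered entries
    kept_frac = keep.float().mean().item()    # fraction kept, for logging
    return filtered, kept_frac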

Logging

Adds a short message to the training dashboard:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│  🐡 PufferLib 1.0.0                                           CPU: 3.2%               GPU: 98.0%               DRAM: 0.3%                      VRAM: 0.0%  │
│                                                                                                                                                            │
│  Summary                                    Value    Evaluate                            42s       84%    Losses                                    Value  │
│  Environment                             gpudrive      Forward                            0s        1%    policy_loss                              -0.002  │
│  Agent Steps                               393.7k      Env                               41s       82%    value_loss                                3.634  │
│  SPS                                         8.0k      Misc                               0s        0%    entropy                                   4.487  │
│  Epoch                                          3    Train                                4s        9%    old_approx_kl                             0.004  │
│  Uptime                                       50s      Forward                            2s        5%    approx_kl                                 0.004  │
│  Remaining                             34h 50m 7s      Learn                              4s        8%    clipfrac                                  0.020  │
│                                                        Misc                               0s        0%    explained_variance                        0.297  │
│                                                                                                                                                            │
│  User Stats                                                           Value    User Stats                                                           Value  │
│                                                                                                                                                            │
│  Message: Advantage filtering: kept 87.6% of transitions                                                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# Use only as many kept transitions as divide evenly across the minibatches.
transitions_to_use = transitions_per_mb * self.num_minibatches

# Shuffle first so the truncation drops a random subset rather than a biased one.
np.random.shuffle(kept_indices)
kept_indices = kept_indices[:transitions_to_use]
Contributor

What is this line supposed to do?

Contributor Author

It ensures that the number of kept transitions is evenly divisible by the number of minibatches, so every minibatch receives the same number of transitions.
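
For concreteness, a toy run with hypothetical numbers:

import numpy as np

transitions_per_mb = 1024        # hypothetical values
num_minibatches = 4
kept_indices = np.arange(4500)   # say 4500 transitions survived the filter

transitions_to_use = transitions_per_mb * num_minibatches  # 4096

np.random.shuffle(kept_indices)                  # drop a random subset
kept_indices = kept_indices[:transitions_to_use]
assert len(kept_indices) % num_minibatches == 0  # splits evenly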
