
Conversation

lionelkusch (Collaborator)

This PR introduces 3 methods for computing marginal importance:

  • ANOVA
  • Univariate Linear Regression Tests (ULRT: the ANOVA analogue for regression problems)
  • Mutual Information

I based this PR on PR #220 for the API and PR #265 for the testing tools.

I have an issue with ULRT: in some cases, the important features get the lowest scores. I need your help to determine what the problem is.
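For reference, here is a minimal sketch of the three measures in scikit-learn terms (an assumption on my part: f_classif, f_regression, and mutual_info_regression as the backends; the actual API in this PR follows #220):

import numpy as np
from sklearn.feature_selection import f_classif, f_regression, mutual_info_regression

# Toy data: only feature 0 carries signal.
rng = np.random.default_rng(42)
X = rng.standard_normal((150, 20))
y_reg = X[:, 0] + 0.1 * rng.standard_normal(150)   # regression target
y_clf = (y_reg > 0).astype(int)                    # classification target

f_clf, _ = f_classif(X, y_clf)          # ANOVA (classification)
f_ulrt, _ = f_regression(X, y_reg)      # ULRT (ANOVA for regression)
mi = mutual_info_regression(X, y_reg)   # mutual information

# Each backend returns one marginal importance score per feature.
assert f_clf.shape == f_ulrt.shape == mi.shape == (20,)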

@lionelkusch added the method implementation (Question regarding methods implementations) label on Jul 16, 2025
@lionelkusch force-pushed the PR_marginal_scikit_learn branch from 3d8966e to d4caef5 on August 27, 2025 17:18
@lionelkusch marked this pull request as ready for review on August 27, 2025 17:24
("HiDim with noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.0),
("HiDim with correlated noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.5),
("HiDim with correlated features", 150, 200, 1, 0.8, 42, 1.0, np.inf, 0.0),
]
Collaborator

For univariate selection, it is not necessary to create such a high-dimensional example, especially if the support size is only one. n_features could be lowered to spare compute.

Collaborator Author

Do you mean that univariate selection can't detect features in high dimensions?

Collaborator

In the case of univariate selection, features are handled individually, so there is no methodological difference between low and high dimension. We can simplify the tests here.

Collaborator Author

See issue #375.
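To illustrate the point above (a sketch, not the test code of this PR): univariate scores are computed one feature at a time, so appending extra null features leaves the existing scores unchanged, and a high-dimensional case only costs more compute.

import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
y = rng.standard_normal(150)
X_small = rng.standard_normal((150, 20))
f_small, _ = f_regression(X_small, y)

# Append 180 extra null features: the first 20 scores are unchanged.
X_big = np.hstack([X_small, rng.standard_normal((150, 180))])
f_big, _ = f_regression(X_big, y)
assert np.allclose(f_small, f_big[:20])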

Comment on lines +187 to +190
importance[important_features].mean()
> importance[not_important_features][
    np.where(importance[not_important_features] != 0)
].mean()
Collaborator

Suggested change
- importance[important_features].mean()
- > importance[not_important_features][
-     np.where(importance[not_important_features] != 0)
- ].mean()
+ importance[important_features].mean()
+ > importance[not_important_features].mean()

Is there a specific reason to exclude features with importance=0?

Collaborator Author

Yes, there are a lot of features whose importance equals zero. This creates a large bias when the mean is computed.

Collaborator

That's not a bias; it is expected that the mutual information tends to zero under the null hypothesis (which is the case for non-important features).

Collaborator Author

This creates a bias in the mean. Alternatively, I should take the median for a better estimate.

Collaborator

The problem is that if the mutual information were perfectly estimated (as n tends to infinity), the test would fail precisely because the method works.
Indeed, the mutual information of a feature that is independent of y is 0, so the array importance[not_important_features][np.where(importance[not_important_features] != 0)] would be empty.

Collaborator Author

Yes, but that is not the case here.
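A small illustration of why the filtered array is non-empty in practice (an illustrative sketch: scikit-learn's k-NN estimator clips negative MI estimates to zero, so with finite n many, but not all, null features score exactly 0):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 200))   # every feature is independent of y
y = rng.standard_normal(150)

mi = mutual_info_regression(X, y, random_state=0)
# Some null features are clipped to exactly 0, the rest keep small
# positive estimates, which is what skews the mean.
print((mi == 0).sum(), (mi > 0).sum())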

    [ANOVA, MutualInformationClassification],
    ids=["ANOVA", "MutualInformation"],
)
class TestClass:
Collaborator

From #360, it seems that the guideline for tests is to avoid classes.

Collaborator Author

I can do it, but I already use this class-based organisation of tests for CFI.
I prefer to keep coherence with the initial structure of the tests rather than change it.
If this was an issue, PR #265 was the place to raise it.

Collaborator

It is the case, as we have said many times. Nevertheless, it's fine to handle this in a future PR.

assert (
    importance[important_features].mean()
    > importance[not_important_features][
        np.where(importance[not_important_features] != 0)
    ].mean()
)
Collaborator

Same comment as above.


importances = classvi.fit_importance(X, y)
assert len(importances) == 3
assert np.all(importances >= 0)
Collaborator

Can we additionally test that the importances of the variables in the support are larger than those of the non-important features?
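One possible shape for such a check (a hypothetical sketch, assuming the simulation exposes the support indices; medians sidestep the zero-inflation discussed above):

import numpy as np

def check_support_ranking(importance, support, n_features):
    # support: indices of the truly important features in the simulation.
    non_support = np.setdiff1d(np.arange(n_features), support)
    # Compare medians rather than means to avoid the zero-inflation issue.
    assert np.median(importance[support]) > np.median(importance[non_support])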

Collaborator Author

Can you give me a theoretical reason why the importance of the support features should be larger than that of the non-important features?

I don't see the reason for this difference. Moreover, where do you put the threshold between important and non-important features?

@lionelkusch added the API 2 (Refactoring following the second version of API) label and removed the method implementation (Question regarding methods implementations) label on Sep 9, 2025