Add marginal variable importance based on sklearn functions #317
Conversation
Force-pushed from 3d8966e to d4caef5.
("HiDim with noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.0), | ||
("HiDim with correlated noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.5), | ||
("HiDim with correlated features", 150, 200, 1, 0.8, 42, 1.0, np.inf, 0.0), | ||
] |
For univariate selection, it is not necessary to create such a high-dimensional example, especially if the support size is only one. n_features could be lowered to spare compute.
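For instance, a reduced variant of the first case above, assuming the third tuple field is n_features (a guess from the "HiDim" naming; the exact field order is defined in the test file):

("LowDim with noise", 150, 20, 1, 0.0, 42, 1.0, 10.0, 0.0),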
Do you mean that univariate selection can't detect features in high dimensions?
In the case of univariate selection, each feature is scored independently, so there is no methodological difference between low and high dimension. We can simplify the tests here.
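As a concrete check, sklearn's univariate scorers compute each feature's statistic from its column alone, so a feature's score does not depend on how many other columns are present (a minimal sketch using f_classif):

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

F_all, _ = f_classif(X, y)           # F statistic computed per column
F_first, _ = f_classif(X[:, :1], y)  # first column scored in isolation
assert np.isclose(F_all[0], F_first[0])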
See issue #375.
importance[important_features].mean()
> importance[not_important_features][
    np.where(importance[not_important_features] != 0)
].mean()
Suggested change:
- importance[important_features].mean()
- > importance[not_important_features][
-     np.where(importance[not_important_features] != 0)
- ].mean()
+ importance[important_features].mean()
+ > importance[not_important_features].mean()
Is there a specific reason to exclude features with importance=0?
Yes, there are many features whose importance equals zero. This creates a large bias when the mean is computed.
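A toy illustration, with made-up numbers, of how the zeros pull the mean down:

import numpy as np

# Hypothetical estimates for non-important features: many exact zeros plus
# a few small positive values from finite-sample noise.
scores = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.03])
print(scores.mean())               # ~0.0067: dominated by the zeros
print(scores[scores != 0].mean())  # 0.02: the filtered mean used in the test
print(np.median(scores))           # 0.0: median of the mostly-zero scores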
That's not a bias; it is expected that the mutual information tends to zero under the null hypothesis (which is the case for the non-important features).
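A quick empirical check of this, using sklearn's mutual_info_classif on a feature independent of y:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
y = rng.integers(0, 2, size=2000)
# The kNN-based estimate is clipped at 0, so an independent feature
# typically scores exactly 0 or very close to it.
print(mutual_info_classif(X, y, random_state=0))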
It still biases the mean. Alternatively, I could take the median for a better estimate.
The problem is that if the mutual information were perfectly estimated (n tends to infinity), the test would fail precisely because the method works. Indeed, the mutual information of a feature that is independent of y is 0, so the array importance[not_important_features][np.where(importance[not_important_features] != 0)]
would be empty.
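A minimal demonstration of that corner case:

import numpy as np

perfect = np.zeros(5)  # MI of independent features as n tends to infinity
filtered = perfect[np.where(perfect != 0)]
print(filtered.size)    # 0: the array is empty
print(filtered.mean())  # nan, with a RuntimeWarning about an empty mean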
Yes, but that is not the case here.
[ANOVA, MutualInformationClassification],
ids=["ANOVA", "MutualInformation"],
)
class TestClass:
From #360, it seems that the guideline for tests is to avoid classes.
I can do it, but I already use this class-based organisation of tests for CFI.
I prefer to keep coherence with the initial structure of the tests rather than change it.
If that was a problem, PR #265 was the place to criticise it.
It is the case, as we have said many times. Nevertheless, it's fine to handle this in a future PR.
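For reference, a minimal sketch of the function-based layout suggested in #360; the import path and the data fixture are assumptions, not the PR's actual code:

import numpy as np
import pytest

from hidimstat import ANOVA, MutualInformationClassification  # assumed import path

@pytest.mark.parametrize(
    "method_cls",
    [ANOVA, MutualInformationClassification],
    ids=["ANOVA", "MutualInformation"],
)
def test_importances_non_negative(method_cls, data):
    # `data` is an assumed fixture returning (X, y); constructor
    # arguments are omitted for brevity.
    X, y = data
    importances = method_cls().fit_importance(X, y)
    assert np.all(importances >= 0)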
assert (
    importance[important_features].mean()
    > importance[not_important_features][
        np.where(importance[not_important_features] != 0)
same comment as above
importances = classvi.fit_importance(X, y)
assert len(importances) == 3
assert np.all(importances >= 0)
Can we additionally test that the importances of the variables in the support are larger than those of the non-important features?
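Concretely, extending the snippet above, the requested check could look like this (important_features, denoting the simulated support, is a hypothetical name mirroring the earlier tests):

import numpy as np

not_important_features = np.setdiff1d(np.arange(len(importances)), important_features)
assert (
    importances[important_features].mean()
    > importances[not_important_features].mean()
)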
Can you give me a theoretical reason why the importance of the features in the support should be larger than that of the non-important features? I don't see the reason for this difference. Moreover, where do you put the threshold between important and non-important features?
This PR introduces 3 methods for computing marginal importance: ANOVA, mutual information, and ULRT.
I based this PR on PR #220 for the API and on PR #265 for the testing tools.
I have an issue with ULRT: in some cases, the important features get the lowest scores. I need your help to determine what the problem is.