Add marginal variable importance based on sklearn functions #317
Conversation
Force-pushed from 3d8966e to d4caef5.
("HiDim with noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.0), | ||
("HiDim with correlated noise", 150, 200, 1, 0.0, 42, 1.0, 10.0, 0.5), | ||
("HiDim with correlated features", 150, 200, 1, 0.8, 42, 1.0, np.inf, 0.0), | ||
] |
For univariate selection, it is not necessary to create such a high-dimensional example, especially if the support size is only one. n_features could be lowered to spare compute.
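For instance, a reduced variant of the first case above, assuming the third tuple field is n_features (a guess from the "HiDim" naming; the exact field order is defined in the test file):

("LowDim with noise", 150, 20, 1, 0.0, 42, 1.0, 10.0, 0.0),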
Do you mean that univariate selection can't detect features in high dimensions?
In the case of univariate selection, each feature is scored independently, so there is no methodological difference between low and high dimension. We can simplify the tests here.
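As a concrete check, sklearn's univariate scorers compute each feature's statistic from its column alone, so a feature's score does not depend on how many other columns are present (a minimal sketch using f_classif):

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

F_all, _ = f_classif(X, y)           # F statistic computed per column
F_first, _ = f_classif(X[:, :1], y)  # first column scored in isolation
assert np.isclose(F_all[0], F_first[0])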
See issue #375.
importance[important_features].mean()
> importance[not_important_features][
    np.where(importance[not_important_features] != 0)
].mean()
Suggested change:
- importance[important_features].mean()
- > importance[not_important_features][
-     np.where(importance[not_important_features] != 0)
- ].mean()
+ importance[important_features].mean()
+ > importance[not_important_features].mean()
Is there a specific reason to exclude features with importance=0?
Yes, there are many features whose importance equals zero. This creates a large bias when the mean is computed.
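A toy illustration, with made-up numbers, of how the zeros pull the mean down:

import numpy as np

# Hypothetical estimates for non-important features: many exact zeros plus
# a few small positive values from finite-sample noise.
scores = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.03])
print(scores.mean())               # ~0.0067: dominated by the zeros
print(scores[scores != 0].mean())  # 0.02: the filtered mean used in the test
print(np.median(scores))           # 0.0: median of the mostly-zero scores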
That's not a bias; it is expected that the mutual information tends to zero under the null hypothesis (which is the case for the non-important features).
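A quick empirical check of this, using sklearn's mutual_info_classif on a feature independent of y:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
y = rng.integers(0, 2, size=2000)
# The kNN-based estimate is clipped at 0, so an independent feature
# typically scores exactly 0 or very close to it.
print(mutual_info_classif(X, y, random_state=0))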
It still biases the mean. Alternatively, I could take the median for a better estimate.
The problem is that if the mutual information were perfectly estimated (n tends to infinity), the test would fail precisely because the method works. Indeed, the mutual information of a feature that is independent of y is 0, so the array importance[not_important_features][np.where(importance[not_important_features] != 0)]
would be empty.
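A minimal demonstration of that corner case:

import numpy as np

perfect = np.zeros(5)  # MI of independent features as n tends to infinity
filtered = perfect[np.where(perfect != 0)]
print(filtered.size)    # 0: the array is empty
print(filtered.mean())  # nan, with a RuntimeWarning about an empty mean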
Yes, but that is not the case here.
[ANOVA, MutualInformationClassification],
ids=["ANOVA", "MutualInformation"],
)
class TestClass:
From #360, it seems that the guideline for tests is to avoid classes.
I can do it, but I already use this class-based organisation of tests for CFI.
I prefer to keep coherence with the initial structure of the tests rather than change it.
If that was a problem, PR #265 was the place to criticise it.
It is the case, as we have said many times. Nevertheless, it's fine to handle this in a future PR.
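For reference, a minimal sketch of the function-based layout suggested in #360; the import path and the data fixture are assumptions, not the PR's actual code:

import numpy as np
import pytest

from hidimstat import ANOVA, MutualInformationClassification  # assumed import path

@pytest.mark.parametrize(
    "method_cls",
    [ANOVA, MutualInformationClassification],
    ids=["ANOVA", "MutualInformation"],
)
def test_importances_non_negative(method_cls, data):
    # `data` is an assumed fixture returning (X, y); constructor
    # arguments are omitted for brevity.
    X, y = data
    importances = method_cls().fit_importance(X, y)
    assert np.all(importances >= 0)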
assert (
    importance[important_features].mean()
    > importance[not_important_features][
        np.where(importance[not_important_features] != 0)
same comment as above
importances = classvi.fit_importance(X, y)
assert len(importances) == 3
assert np.all(importances >= 0)
Can we additionally test that the importances of the variables in the support are larger than those of the non-important features?
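Concretely, extending the snippet above, the requested check could look like this (important_features, denoting the simulated support, is a hypothetical name mirroring the earlier tests):

import numpy as np

not_important_features = np.setdiff1d(np.arange(len(importances)), important_features)
assert (
    importances[important_features].mean()
    > importances[not_important_features].mean()
)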
Can you give me a theoretical reason why the importance of the features in the support should be larger than that of the non-important features? I don't see the reason for this difference. Moreover, where do you put the threshold between important and non-important features?
This PR introduces 3 methods for computing marginal importance: ANOVA, mutual information, and ULRT.
I based this PR on PR #220 for the API and on PR #265 for the testing tools.
I have an issue with ULRT: in some cases, the important features get the lowest scores. I need your help to determine what the problem is.