Implement Display for Dataset. #78

tormeh · 2021-02-07T00:00:13Z

Rough first draft. Happy to get feedback

codecov-io · 2021-02-07T00:50:53Z

Codecov Report

Merging #78 (3c59df6) into development (68e7162) will decrease coverage by 0.32%.
The diff coverage is 0.00%.

@@               Coverage Diff               @@
##           development      #78      +/-   ##
===============================================
- Coverage        83.89%   83.56%   -0.33%     
===============================================
  Files               75       75              
  Lines             7853     7885      +32     
===============================================
+ Hits              6588     6589       +1     
- Misses            1265     1296      +31

Impacted Files	Coverage Δ
src/dataset/mod.rs	`45.23% <0.00%> (-27.84%)`	⬇️
src/optimization/first_order/lbfgs.rs	`92.85% <0.00%> (-1.59%)`	⬇️
src/svm/svc.rs	`89.62% <0.00%> (-0.32%)`	⬇️
src/optimization/line_search.rs	`90.00% <0.00%> (+8.00%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68e7162...3c59df6.

VolodymyrOrlov

Thanks, @tormeh! I agree we need a way to quickly display the content of a dataset. I like the proposed approach in general, but I'd like the visualization logic to cover these corner cases:

- Attributes self.target_names and self.feature_names are optional. We should be able to display datasets where both of these attributes are set to an empty list.
- How can we compactly display large datasets?

src/dataset/mod.rs

VolodymyrOrlov · 2021-02-10T02:40:41Z

src/dataset/mod.rs

+            features: Vec<Feature<X>>,
+        }
+        impl<X: Copy + std::fmt::Debug, Y: Copy + std::fmt::Debug> std::fmt::Display for DataPoint<X, Y> {
+            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {


Can you add a simple unit test that covers this function? For example, I am getting an error when I try to print the Iris dataset:

#[test]
fn display() {
    let dataset = iris::load_dataset();
    println!("{}", dataset);
}

tormeh · 2021-02-21

Hm... I see the problem. I assumed (favourite word right there) that you would have a vector of targets per sample, rather than different integer values representing different targets.
So, for a single sample example:
feature_vec * weight_matrix = target_vec = [0, 1, 0, 0]
not:
feature_vec * weight_vec = target_int = 2

This worked in my case since I only had one target.

Not sure what to do here, tbh. People can choose any mapping of numeric -> target value, so I don't think it's possible to print it out in a nice way. Maybe just print the raw target value? What do you think?

VolodymyrOrlov · 2021-02-25

Yes, data is a flattened matrix of size N x P, where N is the number of samples and P is the number of features per sample; target is a 1D array of size N that holds one target value per sample. I think printing the raw target values is a good solution.
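
To make the layout concrete, here is a minimal sketch of how one sample could be read out of such a dataset. The struct below is a hypothetical stand-in, not the library's definition: the field names follow the ones mentioned in this thread, and both row-major ordering of `data` and one target value per sample are assumptions.

```rust
/// Hypothetical stand-in for the dataset struct discussed in this thread;
/// only the fields needed for the illustration are included.
struct Dataset<X, Y> {
    /// Flattened N x P feature matrix (assumed row-major: sample after sample).
    data: Vec<X>,
    /// One raw target value per sample (assumed length N).
    target: Vec<Y>,
    num_samples: usize,
    num_features: usize,
}

impl<X: Copy, Y: Copy> Dataset<X, Y> {
    /// Returns the feature slice and the raw target of sample `i`,
    /// or `None` if `i` is out of range or the buffers are too short.
    fn sample(&self, i: usize) -> Option<(&[X], Y)> {
        if i >= self.num_samples {
            return None;
        }
        let start = i * self.num_features;
        let row = self.data.get(start..start + self.num_features)?;
        let target = *self.target.get(i)?;
        Some((row, target))
    }
}
```

Returning `Option` instead of indexing directly means a dataset whose `data` or `target` length disagrees with `num_samples` cannot cause a panic while it is being printed.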

VolodymyrOrlov · 2021-02-10T02:43:36Z

src/dataset/mod.rs

+                });
+            }
+            let mut targets = Vec::new();
+            for target_index in 0..self.target_names.len() {


What if a dataset does not have names for all target variables, that is, what if the self.target_names list does not cover all of them?

VolodymyrOrlov · 2021-02-10T02:44:20Z

src/dataset/mod.rs

+        let mut datapoints = Vec::new();
+        for sample_index in 0..self.num_samples {
+            let mut features = Vec::new();
+            for feature_index in 0..self.feature_names.len() {


How do we handle datasets with empty self.feature_names?

tormeh · 2021-02-21

I've made a change that just uses debug print in that case.
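
As an illustration of that kind of fallback (a sketch only, not the code in this PR), the formatter below prints `name: value` pairs when there is exactly one name per value and otherwise falls back to the `Debug` output of the raw values. `Sample` and its fields are hypothetical names introduced for this example.

```rust
use std::fmt;

/// Hypothetical view of one sample: its values plus the (possibly empty
/// or incomplete) list of names that go with them.
struct Sample<'a, X> {
    feature_names: &'a [String],
    values: &'a [X],
}

impl<'a, X: fmt::Debug> fmt::Display for Sample<'a, X> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Names are only usable when there is exactly one per value;
        // otherwise print the raw values with their Debug representation.
        if self.feature_names.len() != self.values.len() {
            return write!(f, "{:?}", self.values);
        }
        for (i, (name, value)) in self.feature_names.iter().zip(self.values).enumerate() {
            if i > 0 {
                write!(f, ", ")?;
            }
            write!(f, "{}: {:?}", name, value)?;
        }
        Ok(())
    }
}
```

The same length check would also cover a `target_names` list that does not match the targets, since it never indexes a name list by position.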

VolodymyrOrlov · 2021-02-10T02:48:23Z

src/dataset/mod.rs

+                        .collect::<String>(),
+                    self.features
+                        .iter()
+                        .map(|feature| format!("{}:{:?}, ", feature.name, feature.value))


I am wondering how the proposed message format will fit large datasets. Let's say we have 2K samples in a dataset. Do we need to repeat the feature name for every value we display? It seems redundant to me...
Another thought: should we truncate the message once a certain number of samples has been reached?

tormeh · 2021-02-21

You could imagine a CSV-like format, but as someone who always chooses expanded output in Postgres, I'm not really a fan. It works really well until rows no longer fit on a single line on the screen, and then it's awful. You could imagine that we choose intelligently depending on the number of features vs. the number of data points, which could be better, but I'm not sure how it would work in practice. Maybe we should detect screen width vs num_features and decide based on that?

Truncating the message seems like a straightforward thing to do. Like, say, after 200 lines or something? I bet someone would want it configurable, though.

How do alternatives like Pandas (or whatever, I'm not familiar with data science tools) do it?

VolodymyrOrlov · 2021-02-25

I see your point here, Tormod. It is always frustrating when you have more attributes per sample than your screen can fit and the data wraps onto another line, making it harder to skim through. But I don't think that printing the feature name along with every value makes the data easier to read either. I've played with your code a little and made it print the Iris dataset. This is what I see; please correct me if this is not how you intended the format to be in this PR:

Iris dataset: https://archive.ics.uci.edu/ml/datasets/iris
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):5.1, sepal width (cm):3.5, petal length (cm):1.4, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.9, sepal width (cm):3.0, petal length (cm):1.4, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.7, sepal width (cm):3.2, petal length (cm):1.3, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.6, sepal width (cm):3.1, petal length (cm):1.5, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):5.0, sepal width (cm):3.6, petal length (cm):1.4, petal width (cm):0.2,

One problem with this way of displaying data is that it is hard to focus on the numbers because of the repeated feature names. Also, if we have more attributes per sample, each sample will still span several lines, which makes it harder to see where one sample ends and the next one starts.

Pandas displays data as a table that is truncated to max_rows and max_columns.
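
A rough sketch of what a truncated table could look like in Rust is shown below. It is only an illustration under several assumptions: `TableView`, its fields, and the `x0`, `x1`, ... placeholder names are hypothetical, the data is assumed to be row-major, columns are separated by tabs without any width alignment, and only the first rows are shown (Pandas also shows the last ones).

```rust
use std::fmt;

/// Hypothetical view over a flattened, row-major N x P matrix with display limits.
struct TableView<'a, X> {
    data: &'a [X],
    num_features: usize,
    feature_names: &'a [String],
    max_rows: usize,
    max_columns: usize,
}

impl<'a, X: fmt::Debug> fmt::Display for TableView<'a, X> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.num_features == 0 {
            return writeln!(f, "(no features)");
        }
        let num_samples = self.data.len() / self.num_features;
        let cols = self.num_features.min(self.max_columns);
        let rows = num_samples.min(self.max_rows);

        // Header row: names are printed once instead of next to every value.
        // Missing or empty names fall back to a positional placeholder.
        let header: Vec<String> = (0..cols)
            .map(|c| match self.feature_names.get(c) {
                Some(name) if !name.is_empty() => name.clone(),
                _ => format!("x{}", c),
            })
            .collect();
        write!(f, "{}", header.join("\t"))?;
        if cols < self.num_features {
            write!(f, "\t...")?;
        }
        writeln!(f)?;

        // One line per sample, limited to the first `cols` columns.
        for r in 0..rows {
            let start = r * self.num_features;
            let cells: Vec<String> = self.data[start..start + cols]
                .iter()
                .map(|v| format!("{:?}", v))
                .collect();
            write!(f, "{}", cells.join("\t"))?;
            if cols < self.num_features {
                write!(f, "\t...")?;
            }
            writeln!(f)?;
        }
        if rows < num_samples {
            writeln!(f, "... ({} more rows)", num_samples - rows)?;
        }
        Ok(())
    }
}
```

Keeping the limits as fields rather than constants leaves room for the configurability mentioned earlier in this thread, at the cost of a wrapper type around the dataset.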

tormeh force-pushed the implement-display-for-dataset branch 3 times, most recently from c622d15 to 3c59df6 Compare February 7, 2021 00:27

tormeh mentioned this pull request Feb 9, 2021

No implementation of Display for Dataset #77

Open

VolodymyrOrlov requested changes Feb 10, 2021


tormeh force-pushed the implement-display-for-dataset branch from 3c59df6 to 5a7143a Compare February 21, 2021 14:41

Implement Display for Dataset.

8a53f2f

tormeh force-pushed the implement-display-for-dataset branch from 5a7143a to 8a53f2f Compare February 21, 2021 15:27

tormeh closed this Mar 15, 2022

Mec-iS reopened this Jun 1, 2025

Mec-iS self-requested a review as a code owner June 1, 2025 13:31

Mec-iS added 2 commits June 1, 2025 22:31

Merge branch 'development' into implement-display-for-dataset

0894865

Merge branch 'development' into implement-display-for-dataset

01f2281
