Skip to content

Implement Display for Dataset. #78

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: development
Choose a base branch
from

Conversation

tormeh
Copy link

@tormeh tormeh commented Feb 7, 2021

Rough first draft. Happy to get feedback

@tormeh tormeh force-pushed the implement-display-for-dataset branch 3 times, most recently from c622d15 to 3c59df6 Compare February 7, 2021 00:27
@codecov-io
Copy link

Codecov Report

Merging #78 (3c59df6) into development (68e7162) will decrease coverage by 0.32%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development      #78      +/-   ##
===============================================
- Coverage        83.89%   83.56%   -0.33%     
===============================================
  Files               75       75              
  Lines             7853     7885      +32     
===============================================
+ Hits              6588     6589       +1     
- Misses            1265     1296      +31     
Impacted Files Coverage Δ
src/dataset/mod.rs 45.23% <0.00%> (-27.84%) ⬇️
src/optimization/first_order/lbfgs.rs 92.85% <0.00%> (-1.59%) ⬇️
src/svm/svc.rs 89.62% <0.00%> (-0.32%) ⬇️
src/optimization/line_search.rs 90.00% <0.00%> (+8.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68e7162...3c59df6. Read the comment docs.

Copy link
Collaborator

@VolodymyrOrlov VolodymyrOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @tormeh! I agree we need a way to quickly display content of a dataset. And I like proposed approach in general, but I'd like visualization logic to cover these corner cases:

  1. Attributes self.target_names and self.feature_names are optional. We should be able to display datasets where both these attributes are set to an empty list
  2. How can we compactly display large datasets?

features: Vec<Feature<X>>,
}
impl<X: Copy + std::fmt::Debug, Y: Copy + std::fmt::Debug> std::fmt::Display for DataPoint<X, Y> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a simple unit test that covers this function? For example, I am getting an error when I try to print Iris Dataset:

  #[test]
  fn display() {
      let dataset = iris::load_dataset();

      println!("{}", dataset);
  }

Copy link
Author

@tormeh tormeh Feb 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I see the problem. I assumed (favourite word right there) that you would have a vector of targets per sample, and not use different integer values to represent different targets.
So, for a single sample example:
feature_vec * weight_matrix = target_vec = [0, 1, 0, 0]
not:
feature_vec * weight_vec = target_int = 2

This worked in my case since I only had one target.

Not sure what to do here, tbh. People can choose any mapping of numeric -> target value, so I don't think it's possible to print it out in a nice way. Maybe just print the raw target value? What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, data is a flattened matrix of size N x P, where N is number of samples and P is number of features per sample, target is a 1d array with target values of size P. I think printing raw values of target is a good solution.

});
}
let mut targets = Vec::new();
for target_index in 0..self.target_names.len() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a dataset does not has names for all target variables? What if the self.target_names list does not cover all target variables?

let mut datapoints = Vec::new();
for sample_index in 0..self.num_samples {
let mut features = Vec::new();
for feature_index in 0..self.feature_names.len() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we handle datasets with empty self.feature_names?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a change that just uses debug print in that case

.collect::<String>(),
self.features
.iter()
.map(|feature| format!("{}:{:?}, ", feature.name, feature.value))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, how will the proposed message format fits large datasets? Let's say we have 2K samples in a dataset. Do we need to repeat feature name for every value we display? It seems redundant to me...
Another thought, should we truncate the message when certain number of samples has been reached?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could imagine a CSV-like format, But as someone who always chooses expanded output in Postgres, I'm not really a fan. It works really well until rows won't fit on a single line on the screen, and then it's awful. You could imagine that we intelligently choose depending on number of features vs. number of data points, which could be better, but not sure how it would work in practice. Maybe we should detect screen width vs num_features and decide based on that?

Truncating the message seems like a straightforward thing to do. Like, say, after 200 lines or something? I bet someone would want it configurable, though.

How do alternatives like Pandas (or whatever, I'm not familiar with data science tools) do it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point here, Tormod. It is always frustrating when you have more attributes per sample than your screen can fit and data shifts by one line making it harder to skim through the data. But I don't think that printing feature name along with every value makes it easier to read the data either. I've played with your code a little and made it print Iris dataset. This is what I see, please correct me it this is not how you intended the format to be in this PR:

Iris dataset: https://archive.ics.uci.edu/ml/datasets/iris
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0,  : sepal length (cm):5.1, sepal width (cm):3.5, petal length (cm):1.4, petal width (cm):0.2, 
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0,  : sepal length (cm):4.9, sepal width (cm):3.0, petal length (cm):1.4, petal width (cm):0.2, 
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0,  : sepal length (cm):4.7, sepal width (cm):3.2, petal length (cm):1.3, petal width (cm):0.2, 
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0,  : sepal length (cm):4.6, sepal width (cm):3.1, petal length (cm):1.5, petal width (cm):0.2, 
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0,  : sepal length (cm):5.0, sepal width (cm):3.6, petal length (cm):1.4, petal width (cm):0.2, 

One problem with this way of displaying data is that it is hard to focus on numbers due to repeated feature names. Also, if we have more attributes per sample data will still take more lines per sample thus making it harder to see where one sample ends and another starts.

Pandas displays data as a table, that is truncated to max_rows and max_columns:

image

@tormeh tormeh force-pushed the implement-display-for-dataset branch from 3c59df6 to 5a7143a Compare February 21, 2021 14:41
@tormeh tormeh force-pushed the implement-display-for-dataset branch from 5a7143a to 8a53f2f Compare February 21, 2021 15:27
@tormeh tormeh closed this Mar 15, 2022
@Mec-iS Mec-iS reopened this Jun 1, 2025
@Mec-iS Mec-iS self-requested a review as a code owner June 1, 2025 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants