-
Notifications
You must be signed in to change notification settings - Fork 86
Implement Display for Dataset. #78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
Implement Display for Dataset. #78
Conversation
c622d15
to
3c59df6
Compare
Codecov Report
@@ Coverage Diff @@
## development #78 +/- ##
===============================================
- Coverage 83.89% 83.56% -0.33%
===============================================
Files 75 75
Lines 7853 7885 +32
===============================================
+ Hits 6588 6589 +1
- Misses 1265 1296 +31
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @tormeh! I agree we need a way to quickly display content of a dataset. And I like proposed approach in general, but I'd like visualization logic to cover these corner cases:
- Attributes
self.target_names
andself.feature_names
are optional. We should be able to display datasets where both these attributes are set to an empty list - How can we compactly display large datasets?
src/dataset/mod.rs
Outdated
features: Vec<Feature<X>>, | ||
} | ||
impl<X: Copy + std::fmt::Debug, Y: Copy + std::fmt::Debug> std::fmt::Display for DataPoint<X, Y> { | ||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a simple unit test that covers this function? For example, I am getting an error when I try to print Iris Dataset:
#[test]
fn display() {
let dataset = iris::load_dataset();
println!("{}", dataset);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... I see the problem. I assumed (favourite word right there) that you would have a vector of targets per sample, and not use different integer values to represent different targets.
So, for a single sample example:
feature_vec * weight_matrix = target_vec = [0, 1, 0, 0]
not:
feature_vec * weight_vec = target_int = 2
This worked in my case since I only had one target.
Not sure what to do here, tbh. People can choose any mapping of numeric -> target value, so I don't think it's possible to print it out in a nice way. Maybe just print the raw target value? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, data
is a flattened matrix of size N x P, where N is number of samples and P is number of features per sample, target
is a 1d array with target values of size P. I think printing raw values of target
is a good solution.
src/dataset/mod.rs
Outdated
}); | ||
} | ||
let mut targets = Vec::new(); | ||
for target_index in 0..self.target_names.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if a dataset does not has names for all target variables? What if the self.target_names
list does not cover all target variables?
src/dataset/mod.rs
Outdated
let mut datapoints = Vec::new(); | ||
for sample_index in 0..self.num_samples { | ||
let mut features = Vec::new(); | ||
for feature_index in 0..self.feature_names.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we handle datasets with empty self.feature_names
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a change that just uses debug print in that case
src/dataset/mod.rs
Outdated
.collect::<String>(), | ||
self.features | ||
.iter() | ||
.map(|feature| format!("{}:{:?}, ", feature.name, feature.value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering, how will the proposed message format fits large datasets? Let's say we have 2K samples in a dataset. Do we need to repeat feature name for every value we display? It seems redundant to me...
Another thought, should we truncate the message when certain number of samples has been reached?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could imagine a CSV-like format, But as someone who always chooses expanded output in Postgres, I'm not really a fan. It works really well until rows won't fit on a single line on the screen, and then it's awful. You could imagine that we intelligently choose depending on number of features vs. number of data points, which could be better, but not sure how it would work in practice. Maybe we should detect screen width vs num_features and decide based on that?
Truncating the message seems like a straightforward thing to do. Like, say, after 200 lines or something? I bet someone would want it configurable, though.
How do alternatives like Pandas (or whatever, I'm not familiar with data science tools) do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point here, Tormod. It is always frustrating when you have more attributes per sample than your screen can fit and data shifts by one line making it harder to skim through the data. But I don't think that printing feature name along with every value makes it easier to read the data either. I've played with your code a little and made it print Iris dataset. This is what I see, please correct me it this is not how you intended the format to be in this PR:
Iris dataset: https://archive.ics.uci.edu/ml/datasets/iris
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):5.1, sepal width (cm):3.5, petal length (cm):1.4, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.9, sepal width (cm):3.0, petal length (cm):1.4, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.7, sepal width (cm):3.2, petal length (cm):1.3, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):4.6, sepal width (cm):3.1, petal length (cm):1.5, petal width (cm):0.2,
setosa:0.0, versicolor:0.0, virginica:0.0, :0.0, : sepal length (cm):5.0, sepal width (cm):3.6, petal length (cm):1.4, petal width (cm):0.2,
One problem with this way of displaying data is that it is hard to focus on numbers due to repeated feature names. Also, if we have more attributes per sample data will still take more lines per sample thus making it harder to see where one sample ends and another starts.
Pandas displays data as a table, that is truncated to max_rows
and max_columns
:
3c59df6
to
5a7143a
Compare
5a7143a
to
8a53f2f
Compare
Rough first draft. Happy to get feedback