Should / can we improve the handling of tabular data? #19586
-
Storing metadata on datasets is expensive in terms of database size, and fetching it on demand from the datasets might not be feasible either. Ideally, everything should be based on column numbers or more precise datatypes. If a user needs to provide the right value to pick a column from a dataset, I would say we have already failed from a UX perspective. It might be preferable to use specific datatypes that dictate the column layout instead.
-
The way tables are handled in Galaxy has multiple issues, from the perspective of both users and tool developers.
The question is, how do we improve this? I think the idea of setting metadata for a table via a tool is not bad, but it doesn't help during tool development, where you cannot make assumptions about tabular inputs.
-
We determine quite a bit of metadata for all sorts of datasets and make basically no use of it (at least on the tool side, and I do not know of any other use). Here I see an example where metadata really could help and the extra storage might really help. But I probably lack a bit of background knowledge... :)
If you work with tabular data, you need to be able to select columns / rows. Numbers might be good for some, but I think many users would benefit from a column-name-based selection.
We could have a mini tool that has a single select (somehow filled from the first line of a dataset - split by some delimiter ... maybe a |
-
Maybe just |
-
Problem: The `tabular` datatype sets barely any metadata (except for `columns` and `data_lines`/`comment_lines`, which are set only for not-too-big datasets), but `tabular` is our main datatype for tabular data, because most tab-delimited data is sniffed as `tabular` and we "push" tool authors toward this datatype.

Many tools for processing tabular data could make good use of metadata: `column_names` could be used in `data_column` parameters, and `comment_lines` could be used to automatically treat the header differently from the data lines. Also, the display is nicer and more user-friendly if there are column names.

The reason for the abundant use of `tabular` is that it is very generic and makes only few assumptions (e.g. on the presence of a header).

Some possible improvements:
- […] `tsv` which sets more metadata (i.e. if the assumptions for `tsv` are fulfilled):
- […] `tsv` more often as output type (for tab-delimited data with a header and a constant number of columns)
- […] `tabular` as `format` for input parameters, the more specialized `tsv` might always be added.
- `tsv` could be automatically accepted for `tabular` (since it's a more specialized format).
- […] `tabular` output (e.g. by adding IUC guidelines)
- […] `<action name="METADATA_NAME" type="metadata" default="METADATA_VALUE"/>` or `tabular` (without duplicating the data).
- […] `#` `Tabular.set_meta` function would be that we also get metadata for large data

Some facts:
Datatype hierarchy:

- `Tabular`: `tabular`
- `BaseCSV`
  - `CSV`: `csv`
  - `TSV`: `tsv`
Sniffers exist only for `csv` and `tsv`. But if files are sniffed as `tsv`, it is overwritten as `tabular` (in most cases).

The sniffers for `csv` and `tsv` use Python's `csv` module:

- […] `csv`/`tsv`
- […] `tsv`: consistent number of columns

Possible metadata:
- `comment_lines`
- `data_lines`
- `columns`
- `column_names`
- `delimiter`
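To make the sniffing part a bit more concrete: a minimal sketch of delimiter detection with Python's `csv` module, plus the "consistent number of columns" check mentioned for `tsv`. This is not Galaxy's actual sniffer code; the function names `sniff_delimiter` and `has_consistent_columns` are made up for illustration.

```python
import csv
import io


def sniff_delimiter(sample, candidates=",\t"):
    """Guess the delimiter of a text sample; return None if csv.Sniffer cannot decide.

    Hypothetical helper, only restricts the guess to comma and tab.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None


def has_consistent_columns(sample, delimiter):
    """Return True if every non-empty row has the same number of fields."""
    reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delimiter)
    widths = {len(row) for row in reader if row}
    return len(widths) == 1


sample = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
delimiter = sniff_delimiter(sample)
consistent = has_consistent_columns(sample, delimiter)
```

A file that passes both checks with a tab delimiter would be a candidate for `tsv`; anything else tab-delimited would fall back to the more generic `tabular`.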
Automatically set metadata for `tabular`:

- `columns` is the maximum number of columns over all considered lines (max 100k)
- `delimiter`: tab
- […] `'#'`) are counted (files with more than 100k lines will not have the number of data/comment lines)
- `column_names` is never set

Automatically set metadata for `csv` and `tsv`:

- […] `column_names`
- `column_types` are derived from the 2nd line
- `comment_lines` is 1 iff 2 lines can be read
- `data_lines` is number of lines - 1
- `columns` is max of number of columns of 1st (header) and 2nd line

An important difference between `csv` and `tsv` is that `csv` allows for an inconsistent number of columns.
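To compare the two behaviours side by side, here is a rough sketch of the rules listed above. It is not the actual `set_meta` code from Galaxy. The function names, the `guess_type` helper, taking `column_names` from the first line, and counting only non-comment lines for `columns` are my assumptions.

```python
import csv
import io

MAX_LINES = 100_000  # scan limit mentioned above


def guess_type(cell):
    """Hypothetical type guesser: 'int', 'float' or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(cell)
            return name
        except ValueError:
            pass
    return "str"


def tabular_style_meta(text, max_lines=MAX_LINES):
    """tabular-like rules: widest line gives `columns`, '#' lines are comments."""
    columns = comment_lines = data_lines = 0
    complete = True
    for i, line in enumerate(text.splitlines()):
        if i >= max_lines:
            complete = False  # too large: line counts stay unknown
            break
        if line.startswith("#"):
            comment_lines += 1
        else:
            data_lines += 1
            # assumption: only non-comment lines contribute to the width
            columns = max(columns, len(line.split("\t")))
    return {
        "columns": columns,
        "delimiter": "\t",
        "comment_lines": comment_lines if complete else None,
        "data_lines": data_lines if complete else None,
    }


def csv_style_meta(text, delimiter="\t"):
    """csv/tsv-like rules: the first two lines decide names, types and width."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    header = rows[0] if rows else []
    second = rows[1] if len(rows) > 1 else []
    return {
        "column_names": header,  # assumption: taken from the 1st line
        "column_types": [guess_type(cell) for cell in second],
        "comment_lines": 1 if len(rows) >= 2 else 0,
        "data_lines": len(rows) - 1 if rows else 0,
        "columns": max(len(header), len(second)),
        "delimiter": delimiter,
    }


table = "name\tscore\nalice\t3.5\nbob\t4\n"
tab_meta = tabular_style_meta(table)
tsv_meta = csv_style_meta(table)
```

For the same small table, the `tabular`-style result has no column names and counts the header as a data line, while the `tsv`-style result exposes names and types that a `data_column` parameter could use.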