Document combinations of format="auto", profile, discover_datasets, from_metadata #9945

Open
@mvdbeek

This was a little confusing. Here is what @nuwang and I tried on Gitter:

Nuwan Goonasekera @nuwang 12:44
I’ve been trying to get manual sniffing working, but have now opened a completely different can of worms - for some reason, despite specifying format="auto" and writing out a galaxy.json file, it simply shows me the tool metadata file as output instead. Just to make it harder for me, debugger breakpoints suddenly stopped working for Galaxy, although it works fine for the tool :-)

Marius van den Beek @mvdbeek 12:45
did you check out the test tools ?
There are a whole bunch of examples in
tool_provided_metadata_5.xml
tool_provided_metadata_4.xml
tool_provided_metadata_6.xml
tool_provided_metadata_7.xml
tool_provided_metadata_3.xml
tool_provided_metadata_2.xml
tool_provided_metadata_1.xml
tool_provided_metadata_12.xml
tool_provided_metadata_10.xml
collection_creates_dynamic_nested_from_json_elements.xml
tool_provided_metadata_11.xml
collection_creates_dynamic_nested_from_json.xml
tool_provided_metadata_9.xml
empty_datasets.xml
tool_provided_metadata_8.xml
tool_provided_metadata_6 looks similar to what you want to do ?

Nuwan Goonasekera @nuwang 12:49
Yes… I also looked at the old genomespace code which does the same, but it’s not working in this case.. weird

Marius van den Beek @mvdbeek 12:50
what did you try ?

Nuwan Goonasekera @nuwang 12:50
I tried this:
<data name="output_file1" format="auto" />

Marius van den Beek @mvdbeek 12:51
and the rest ?
format="auto" seems only needed for old profile versions

Nuwan Goonasekera @nuwang 12:51
Is there something else needed? I’m now trying:
  <outputs>
    <data name="output_file1" format="auto">
      <discover_datasets pattern="__desgination_and_ext__" recurse="true" visible="true" assign_primary_output="true" />
    </data>
  </outputs>

Marius van den Beek @mvdbeek 12:51
yes
you need to write a galaxy.json file!
   <command>
    echo "1" > sample1.report.tsv;
    echo "2" > sample2.report.tsv;
    cp $c1 galaxy.json;
  </command>
  <configfiles>
      <configfile name="c1">{"sample": {
"datasets": [
{"filename": "sample1.report.tsv", "name": "cool name 1", "ext": "txt", "info": "cool 1 info", "dbkey": "hg19"},
{"filename": "sample2.report.tsv", "name": "cool name 2", "ext": "txt", "info": "cool 2 info", "dbkey": "hg19"}
]
}}
</configfile>
  </configfiles>
  <inputs>
    <param name="input" type="data" />
  </inputs>
  <outputs>
    <data name="sample">
      <discover_datasets pattern="(?P&lt;designation&gt;.+)\.report\.tsv" visible="true" />
    </data>
  </outputs>

Nuwan Goonasekera @nuwang 12:52
That looks like:
$ cat galaxy.json
{"type": "dataset", "dataset_id": 41, "filename": "testowncloud.txt", "ext": "txt"}
{"type": "new_primary_dataset", "base_dataset_id": 41, "filename": "f.txt", "ext": "txt"}

Marius van den Beek @mvdbeek 12:52
that is the example from tool_provided_metadata_6

Nuwan Goonasekera @nuwang 12:52
huh, so I’m missing a level of nesting by the looks of it
Did this format change by chance? The outer “datasets” entry was not written in genomespace as I recall

Marius van den Beek @mvdbeek 12:53
maybe an old profile ?

Nuwan Goonasekera @nuwang 12:54
Yes. That was an old profile. Thanks, will try this out and report back

Marius van den Beek @mvdbeek 12:54
This has been the format since 17.09

Nuwan Goonasekera @nuwang 12:54
Makes sense. GenomeSpace was using 16.04
05

Marius van den Beek @mvdbeek 12:55
<tool id="tool_provided_metadata_4" name="tool_provided_metadata_4" version="1.0.0">
    <!-- Demonstrate indexing dataset metadata by dataset basename instead of dataset id in galaxy.json for legacy tools (galaxy.json structure changes for profile >= 17.09 tools). -->
    <command>
      echo "This is a line of text." > '$out1';
      cp $c1 galaxy.json;
    </command>
    <configfiles>
      <configfile name="c1">#import os
{"type": "dataset", "dataset": "${os.path.basename(str($out1))}", "name": "my dynamic name", "ext": "txt", "info": "my dynamic info", "dbkey": "cust1"}</configfile>
    </configfiles>
    <inputs>
        <param name="input1" type="data" label="Input Dataset"/>
    </inputs>
    <outputs>
        <!-- Set format="auto" to read from galaxy.json, use auto_format="true"
             to sniff. -->
        <data name="out1" format="auto" />
    </outputs>
</tool>
that’s for older profiles
which I suppose also only works if there is only 1 output

Nuwan Goonasekera @nuwang 12:56
I see. So I’ll need the discover_datasets too by the looks of it?

Marius van den Beek @mvdbeek 12:56
yes

Nuwan Goonasekera @nuwang 13:26
So even with the thing configured exactly as in example 6, it still ignores the datasets read from galaxy.json. It’s clearly reading it, because if I return an incorrect format, it’ll print an error like:
galaxy.tool_util.provided_metadata ERROR 2020-07-01 16:49:36,427 [p:30694,w:1,m:0] [LocalRunner.work_thread-1] (57) Got JSON data from tool, but data is improperly formatted or no "type" key in data
Traceback (most recent call last):
  File "lib/galaxy/tool_util/provided_metadata.py", line 111, in __init__
    assert 'type' in line
AssertionError
galaxy.tool_util.provided_metadata DEBUG 2020-07-01 16:49:36,427 [p:30694,w:1,m:0] [LocalRunner.work_thread-1] Offending data was: {'output_file1': {'datasets': [{'type': 'dataset', 'dataset_id': 56, 'filename': 'testowncloud.txt', 'ext': 'txt'}, {'type': 'new_primary_dataset', 'base_dataset_id': 56, 'filename': 'f.txt', 'ext': 'txt'}]}}

Marius van den Beek @mvdbeek 13:27
Do you have the tool somewhere ?

Nuwan Goonasekera @nuwang 13:29
Yes. Just pushed the latest here: https://github.com/usegalaxy-au/Galaxy-Owncloud-Integration
One difference is that I set: tool_type="output_parameter_json" so that I can get access to the TOOL_PROVIDED_JOB_METADATA_FILE
This to be exact: https://github.com/usegalaxy-au/Galaxy-Owncloud-Integration/blob/master/Galaxy_OwncloudImportExport/owncloud_import.xml

Marius van den Beek @mvdbeek 13:31
you don’t need format="auto"

Nuwan Goonasekera @nuwang 13:32
No effect.

Marius van den Beek @mvdbeek 13:32
Can you express what you want to do in a simpler tool ?
Maybe just template out galaxy.json in a configfile as in the test ?
also why do you need tool_type="output_parameter_json" ?

Nuwan Goonasekera @nuwang 13:35
To gain access to the tool_provided_job_metadata_file: https://github.com/usegalaxy-au/Galaxy-Owncloud-Integration/blob/master/Galaxy_OwncloudImportExport/owncloud_importer.py#L69

Marius van den Beek @mvdbeek 13:35
yeah, but you always know the content anyway, so that doesn’t seem necessary ?

Nuwan Goonasekera @nuwang 13:36
Sorry this line: https://github.com/usegalaxy-au/Galaxy-Owncloud-Integration/blob/1316a0b2e888c7a5694a435e894dfffc5df8834e/Galaxy_OwncloudImportExport/owncloud_importer.py#L31

Marius van den Beek @mvdbeek 13:37
ok, that shouldn’t have an impact anyway, metadata setting should be the same anyway

Nuwan Goonasekera @nuwang 13:37
Well, the reason is to be able to manually create the data types registry, so that we can then do: sniff.handle_uploaded_dataset_file(file_path, datatypes_registry)

Marius van den Beek @mvdbeek 13:37
...

Nuwan Goonasekera @nuwang 13:38
What the tool itself is doing is fairly simple. It just creates all the files it downloads from owncloud into the output folder. Since this is a mishmash of files, we need to sniff the datatypes manually

Marius van den Beek @mvdbeek 13:39
right, but you might be better off handing this over to an existing upload tool
anyway, we can try to get this to work
the problem is that I am not sure this will work, and it is a potential waste of time if we’re going to do this differently in 20.09 anyway
so what you can try to do is set the metadata file to read from explicitly
there might be a clash between the tool provided metadata file and galaxy.json

Marius van den Beek @mvdbeek 13:44
for that you can add provided_metadata_file="not_galaxy.json" on the output tag

Nuwan Goonasekera @nuwang 13:46
It is potentially a waste of time, and I was hoping to be done with this tool - before the datatype not being sniffed issue came up. I thought this would be a quick transplant from the old genomespace code, but it has turned into something else… Still, this may not be a total waste since this should work properly for other tools also?

Marius van den Beek @mvdbeek 13:46
maybe
It’s just that building an alternative upload tool is probably not the most common task

Nuwan Goonasekera @nuwang 13:49
No luck with that either. It’s clearly reading the file as I mentioned, but it’s then just ignoring it anyway
which doesn’t make much sense

Marius van den Beek @mvdbeek 13:49
can you push an update ?

Nuwan Goonasekera @nuwang 13:50
done in: usegalaxy-au/Galaxy-Owncloud-Integration@657e4e2

Nuwan Goonasekera @nuwang 14:00
So this tool is discovering untyped data in a folder. And it’s writing out a galaxy.json file. Neither of which seems unusual individually? Perhaps the only unusual thing is that it’s doing both together?

Marius van den Beek @mvdbeek 14:00
yes
well, untyped data is very unusual actually, and galaxy.json is too
but yeah, both of these work in isolation
alright, so I’ve turned the tool into the test tool, more or less, and it works

Marius van den Beek @mvdbeek 14:06
https://gist.github.com/mvdbeek/e6130c36bd06cf7305419ad9f0f2cd12
looks like you don’t want to use pattern="__desgination_and_ext__"

Nuwan Goonasekera @nuwang 14:07
name_and_ext perhaps?

Marius van den Beek @mvdbeek 14:07
no
I don’t think you can use _ext
but easy to test now
and you don’t need it anyway, since you’re managing this in galaxy.json now

Nuwan Goonasekera @nuwang 14:08
true
interesting that without discover_datasets, it doesn’t work. The upload tool seems to work fine though? And some of the examples omit that block too

Marius van den Beek @mvdbeek 14:09
but you are discovering datasets ?
the legacy examples don’t need it, that’s right

Nuwan Goonasekera @nuwang 14:10
I thought that writing a galaxy.json implies discovery?

Marius van den Beek @mvdbeek 14:10
in legacy tools, yes
but you’re already producing galaxy.json by the fact that you’re using tool_type="output_parameter_json"
also Exception: Cannot specify attribute [pattern] if from_provided_metadata is True … that is a good warning, but it’s too late if you don’t know that you want to set from_provided_metadata="true"
which doesn’t seem to be required ..
and it doesn’t help that we have <discover_datasets pattern="(?P&lt;designation&gt;.+)\.report\.tsv" visible="true" /> in the example, where pattern doesn’t actually do anything
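
For reference, the two discovery modes discussed above can be put side by side. This is a minimal sketch, not taken from the Galaxy test tools: the output names `by_pattern` and `by_metadata` and the `tabular` format are illustrative assumptions. The first output discovers files purely by regex; the second takes its datasets from galaxy.json and therefore carries no `pattern` (combining the two is what raises the exception quoted above).

```xml
<outputs>
    <!-- Pattern-based discovery: files in the working directory are matched
         by the regex. galaxy.json is not involved. -->
    <data name="by_pattern">
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.report\.tsv" format="tabular" visible="true" />
    </data>
    <!-- Metadata-based discovery: datasets are listed in galaxy.json.
         Do not add a pattern attribute here. -->
    <data name="by_metadata">
        <discover_datasets from_provided_metadata="true" visible="true" />
    </data>
</outputs>
```
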

Marius van den Beek @mvdbeek 14:16
that is something we can fix in the docs
and the tests
so does the tool work with
    <data name="output_file1">
      <discover_datasets from_provided_metadata="true"/>
    </data>
?
I think that should be the right way to use galaxy.json

Nuwan Goonasekera @nuwang 14:19
This is super stubborn. Still ignoring it
I’m now cloning and trying a fresh installation of Galaxy, because this makes no sense

Marius van den Beek @mvdbeek 14:20
try my gist too
with planemo test owncloud_import.xml

Nuwan Goonasekera @nuwang 14:21
gist works
I just replaced the existing wrapper and reran the tool

Nuwan Goonasekera @nuwang 14:28
It’s working now
I removed provided_metadata_file="not_galaxy.json" and added from_provided_metadata="true"
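
To summarize the combination that ended up working in this thread, here is a minimal sketch assembled from the snippets quoted above. It is not the content of the gist. The tool id, file names, and metadata values are illustrative, and `profile="17.09"` is an assumption based on the statement that the nested galaxy.json format applies from 17.09 onward.

```xml
<!-- Sketch only: id, file names, and metadata values are illustrative. -->
<tool id="provided_metadata_sketch" name="provided_metadata_sketch" version="0.1.0" profile="17.09">
    <command><![CDATA[
        echo "1" > sample1.report.tsv &&
        echo "2" > sample2.report.tsv &&
        cp '$c1' galaxy.json
    ]]></command>
    <configfiles>
        <!-- Nested galaxy.json: keyed by the output name, then a "datasets" list. -->
        <configfile name="c1">{"output_file1": {"datasets": [
  {"filename": "sample1.report.tsv", "name": "sample 1", "ext": "txt"},
  {"filename": "sample2.report.tsv", "name": "sample 2", "ext": "txt"}
]}}</configfile>
    </configfiles>
    <inputs>
        <param name="input" type="data" />
    </inputs>
    <outputs>
        <!-- No format="auto", no pattern, no provided_metadata_file:
             the datasets are taken from galaxy.json. -->
        <data name="output_file1">
            <discover_datasets from_provided_metadata="true" visible="true" />
        </data>
    </outputs>
</tool>
```

The legacy one-record-per-line format (`{"type": "dataset", ...}`) together with `format="auto"` is the alternative for tools on older profiles, as in the tool_provided_metadata_4 example earlier in the conversation.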
