Feedback requested on microbiome model #11

mmrmas · 2022-03-03T03:22:20Z

Good morning, I’d like to raise a question regarding my own project here. Hence the show and tell category.

As you all may know, my goal is to work with genomic data from microbes in the soil. The model I’d like to produce should be useful and applicable for any landowner who needs to understand their soil better, in order to prevent its decline or improve production from it. After many hours of useless thinking and browsing, I see the need to raise some issues with you, peers and teachers, to get some guidance.

Possible directions
To understand my current issues, let me first explain the data, how it could be transformed into insights (+), and what I see as the main challenge at each step (-).

1. Genomic data start with some sort of DNA sequencing data.
   (+) These data could be vectorized and used in NLP to classify into organisms, gene groups, diseases
   (-) The data are not standardized. There are many protocols and labs that all produce data in their own manner
2. Genomic data that I intend to use can be clustered into groups
   (+) Their composition can be used as features to train a model in a supervised manner, for instance to detect pollution
   (-) These clusters have no intrinsic info and clustering may vary on an experiment-to-experiment basis
3. Clusters can be mapped to known biological microbes
   (+) As in (2), without any further knowledge, the composition of microbes can be used for training
   (-) We lose the information from non-mapped clusters in (2)
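
To make the vectorization idea in step 1 a bit more concrete, here is a minimal sketch that turns a DNA read into a fixed-length vector of k-mer frequencies, one common way to treat sequences as "words" for NLP-style models. The function names, the value of `k`, and the example read are all made up for illustration and do not come from any particular pipeline.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence: str, k: int = 3) -> list[float]:
    """Return k-mer frequencies over all 4**k possible k-mers, in a fixed order."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Hypothetical read; "N" marks an ambiguous base and is skipped.
example_read = "ACGTACGTNACG"
vec = kmer_vector(example_read, k=2)
```

Because the vocabulary is fixed, reads from different labs end up in the same feature space, although this does nothing about the protocol differences mentioned under (-).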

In all of the previous steps, I need a large, clean dataset covering many different assays that together may lead to successfully training a model. But perhaps I can also focus on individual datasets:

4. Known biological microbes can be investigated for functions (are they pathogens, decomposers, …)
   (+) Such data can be collected from the public domain to train an NLP-based model (QA, or Open Domain QA)
   (-) The data are unstructured, not standardized, difficult to clean and (for open domain QA) it will take enormous work to build a training set

Some initial thoughts
Steps 1 to 3 may be the most straightforward, but they also follow a classic approach. In a way, that approach can easily be repeated, and it can be done better by others who have direct access to data or better ML skills.
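
As a rough sketch of what the classic approach in steps 2 and 3 boils down to, the snippet below converts per-sample cluster counts into relative-abundance feature rows over a shared set of clusters. The sample names, cluster names, and counts are invented; the resulting rows could then be fed to any supervised classifier together with labels (for instance polluted or not), which are not shown here.

```python
def composition_features(
    samples: dict[str, dict[str, int]],
) -> tuple[list[str], dict[str, list[float]]]:
    """Align samples on the union of clusters and convert counts to fractions."""
    clusters = sorted({c for counts in samples.values() for c in counts})
    features = {}
    for name, counts in samples.items():
        total = sum(counts.values())
        # Clusters missing from a sample get an abundance of 0.
        features[name] = [
            counts.get(c, 0) / total if total else 0.0 for c in clusters
        ]
    return clusters, features


# Toy counts, purely illustrative.
toy_samples = {
    "field_a": {"cluster_1": 30, "cluster_2": 10},
    "field_b": {"cluster_2": 5, "cluster_3": 15},
}
clusters, X = composition_features(toy_samples)
```

Note that the columns only line up across experiments if the clusters themselves are comparable, which is exactly the concern under (-) in step 2.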

Step 4 is basically building a chatbot. There are huge challenges with respect to accuracy. And, I have seen no examples yet where NLP can deal with information that is either redundant (getting meaning from overlapping pieces of information) or, more importantly in the real-life world, contrasting (a pathogen in context 1 can be a beneficial microbe in context 2).
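
To illustrate the contrasting-information problem, here is a small sketch of a store that keys each role by both organism and context, so that "pathogen in context 1" and "beneficial in context 2" can coexist instead of overwriting each other. The class, the placeholder organism `microbe_x`, and the context names are hypothetical; a real QA system would need far more than this lookup.

```python
from collections import defaultdict


class RoleStore:
    """Keeps roles per (organism, context) so contrasting claims are preserved."""

    def __init__(self) -> None:
        self._roles = defaultdict(lambda: defaultdict(set))

    def add(self, organism: str, context: str, role: str) -> None:
        self._roles[organism][context].add(role)

    def roles(self, organism: str, context: str) -> set:
        """Roles recorded for this organism in this specific context."""
        return set(self._roles.get(organism, {}).get(context, set()))

    def is_contrasting(self, organism: str) -> bool:
        """True if the organism has different role sets in different contexts."""
        by_context = self._roles.get(organism, {})
        return len({frozenset(r) for r in by_context.values()}) > 1


store = RoleStore()
store.add("microbe_x", "context_1", "pathogen")
store.add("microbe_x", "context_2", "beneficial")
```

With this layout, a question about `microbe_x` can be answered per context, and `is_contrasting` flags organisms where a single global answer would be misleading.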

What excites me, though, is that step 4 would allow users to interact with their data. Instead of giving them one outcome, in which they may or may not be interested, we let them ask the questions and find out what is most relevant for them.

So, where to start? I would appreciate your feedback. Thanks.

jamescavanagh · 2022-03-09T01:15:23Z

Maintainer

Hi Sam,

I found two approaches that people are using for pipelines. I sent them to you on WeChat, but this is a better home for them.

PyFeat library
https://github.com/mrzResearchArena/PyFeat
MathFeature
https://github.com/Bonidia/MathFeature

One interesting idea would be transduction, where you convert the genomic signal from one form into another.
It is commonly used in sequencing problems.

https://machinelearningmastery.com/transduction-in-machine-learning/

This one might get your gears turning

1 reply

mmrmas Mar 15, 2022
Author

Thanks a lot @jamescavanagh, I'm looking into it!
