Feedback requested on microbiome model #11
mmrmas
started this conversation in
Show and tell
Hi Sam, I found two approaches that people are using for pipelines. I sent them to you on WeChat, but this is a better home for them.
One interesting idea is transduction: you could convert the genomic signal from one form into another. https://machinelearningmastery.com/transduction-in-machine-learning/ This one might get your gears turning.
-
Good morning, I’d like to raise a question regarding my own project here. Hence the show and tell category.
As you all may know, my goal is to work with genomic data from microbes in the soil. The model I’d like to produce should be useful and applicable for any landowner who needs to understand the soil better, in order to prevent its decline or improve its production. After many hours of unproductive thinking and browsing, I see the need to raise some issues with you, peers and teachers, to get some guidance.
Possible directions
To explain my current issues, let me first describe the data, how it could be transformed into insights (+), and what I see as the main challenge at each step (-).
1. Genomic data start with some form of DNA sequencing data.
(+) These data could be vectorized and used with NLP methods to classify sequences into organisms, gene groups, or diseases.
(-) The data are not standardized: many protocols and labs each produce data in their own way.
2. The genomic data that I intend to use can be clustered into groups.
(+) Their composition can be used as features to train a model in a supervised manner, for instance to detect pollution.
(-) These clusters carry no intrinsic information, and the clustering may vary from experiment to experiment.
3. Clusters can be mapped to known microbes.
(+) As in (2), without any further knowledge, the composition of microbes can be used for training.
(-) We lose the information from the non-mapped clusters in (2).
4. In all previous steps, I need a large and clean dataset of many different assays that may collectively lead to successfully training a model. But perhaps I can also focus on individual datasets:
(+) Such data can be collected from the public domain to train an NLP-based model (QA, or open-domain QA).
(-) The data are unstructured, not standardized, and difficult to clean, and (for open-domain QA) it will take an enormous amount of work to build a training set.
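To make the "vectorized" idea in step 1 concrete, here is a minimal sketch that assumes overlapping k-mers are used as the "words" of a sequence. The function names, the choice of k=4, and the toy read are illustrative assumptions, not part of any existing pipeline.

```python
from collections import Counter


def kmer_tokens(sequence, k=4):
    """Split a DNA sequence into overlapping k-mers (the "words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vector(sequence, vocabulary, k=4):
    """Count k-mers and return the counts in the fixed order of `vocabulary`."""
    counts = Counter(kmer_tokens(sequence, k))
    return [counts[kmer] for kmer in vocabulary]


# Toy read for illustration only; not real sequencing data.
read = "ACGTACGTGA"
tokens = kmer_tokens(read, k=4)
vocabulary = sorted(set(tokens))  # in practice the vocabulary would be shared across all reads
vector = kmer_vector(read, vocabulary, k=4)
```

A bag-of-k-mers vector like this ignores the order of the k-mers. It is only one possible starting point, and it does nothing about the standardization problem listed under (-).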
Some initial thoughts
Steps 1 to 3 may be the most straightforward, but they also follow a classic approach. In a way, this can easily be replicated, and it can be done better by others who have direct access to data or better ML skills.
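The classic route of steps 2 and 3 could be sketched roughly as follows: per-sample cluster counts are collapsed onto known microbes and turned into relative abundances that a supervised model could use as features. All names and counts here are made up for illustration, and the explicit "unmapped" bucket is an assumption added to show how much signal would be lost in step 3.

```python
from collections import defaultdict


def collapse_to_taxa(cluster_counts, cluster_to_taxon, unmapped_label="unmapped"):
    """Sum cluster counts per taxon; clusters without a mapping go to one bucket."""
    taxa = defaultdict(int)
    for cluster, n in cluster_counts.items():
        taxa[cluster_to_taxon.get(cluster, unmapped_label)] += n
    return dict(taxa)


def relative_abundance(counts):
    """Convert raw counts into fractions that sum to 1 (or all zeros if empty)."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Hypothetical sample: three clusters, only two of which map to a known taxon.
sample = {"cluster_1": 50, "cluster_2": 30, "cluster_3": 20}
mapping = {"cluster_1": "taxon_A", "cluster_2": "taxon_A"}

taxa = collapse_to_taxa(sample, mapping)
features = relative_abundance(taxa)
```

In this toy sample, a fifth of the reads end up in the unmapped bucket, which is the information that would disappear if only mapped clusters were kept.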
Step 4 is basically building a chatbot. There are huge challenges with respect to accuracy. I have also seen no examples yet where NLP can deal with information that is either redundant (getting meaning from overlapping pieces of information) or, more importantly in real life, conflicting (a pathogen in context 1 can be a beneficial microbe in context 2).
What excites me, though, is that step 4 would allow users to interact with their data. Instead of giving them one outcome, in which they may or may not be interested, we let them ask the questions and find out what is most relevant for them.
So, where to start? I would appreciate your feedback. Thanks.