This is a sample data parser for Biothings Studio. This repo does not contain any information about Biothings Studio itself; please refer to the original link if you need more information on Biothings Studio. It is highly recommended that you go through the tutorials and developer guide on the Biothings Studio page first.
This Python project uses pipenv to manage its virtual environment.
To install pipenv:
pip install pipenv
To create the project's virtual environment and install the dependencies:
pipenv install
Hint: make sure you have Python 3.6 installed.
Once you have set up the virtual environment, you are ready to go. You can tailor the code to your needs. Refer to the next section for how to do that.
We defined a function, load_data(), in parser.py, which is specified in the manifest.json file as the parser for Biothings Studio. parser.load_data() returns a generator that yields one record at a time, which will be used by Biothings Studio.
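To illustrate what "a generator that yields one record at a time" means for a caller, here is a minimal sketch. The load_data() below is a stand-in written for this example, not the function in parser.py; its data_folder parameter and the record fields are assumptions.

```python
from itertools import islice


def load_data(data_folder):
    """Stand-in for parser.load_data() (hypothetical): yields one dict per record."""
    for i in range(1, 4):
        yield {"_id": f"id{i}", "my_data_source": {"value": i}}


# Calling the function only creates the generator; no record is produced yet.
records = load_data("/path/to/data")  # hypothetical folder, unused by the stand-in

# Records are produced lazily, one at a time, as the caller asks for them.
first_two = list(islice(records, 2))

# The generator remembers its position, so only the unread records remain.
remaining = list(records)
```

Because records are produced on demand, the whole data file never has to be held in memory at once.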
These are the files you need to walk through if you want to customize your own parser.
Defines data download (dumper) and parsing (uploader) logic as well as metadata. More details are covered in the Biothings Studio tutorials.
Below is the workflow. Customize it to your needs.
- Define data file name, delimiter and source name:
  - FILENAME: the file name, not including the path.
  - DELIMITER: what you used to separate fields in the data file. For example, a comma (,) in a .csv file or a tab (\t) in a .tsv file.
  - SOURCE_NAME: the key name to be shown in the API response for your data. For example:
{ _id: ..., my_data_source: { ..., # some data } }
- Check if the file exists in the path.
- Inspect the file to get the total number of lines. (Optional, but recommended for logging purposes so that progress can be reported in the following steps.)
- Read the file:
  - Skip commented lines and empty lines.
  - Split each line according to the schema. Skip the line when the split fails and record it in the skipped list.
  - Format and enforce the data type of each field. Skip the line when a cast fails and record it in the skipped list.
  - Construct an entry and yield it.
  - Output all skipped lines (the skipped list) to the log when finished.
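The workflow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in parser.py: the file name, the three-column schema (id, name, score), the "#" comment prefix and the data_folder parameter are all hypothetical.

```python
import logging
import os

# Hypothetical values; replace them with your own.
FILENAME = "sample_data.tsv"
DELIMITER = "\t"
SOURCE_NAME = "my_data_source"
NUM_FIELDS = 3  # assumed schema: id, name, score

logger = logging.getLogger(__name__)


def count_lines(path):
    """Return the total number of lines in the file, for progress logging."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def load_data(data_folder):
    """Yield one record per valid line; log the skipped lines at the end."""
    path = os.path.join(data_folder, FILENAME)
    if not os.path.exists(path):
        raise FileNotFoundError(f"Data file not found: {path}")

    total = count_lines(path)
    skipped = []
    with open(path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            # Skip empty lines and commented lines ("#" prefix is an assumption).
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split(DELIMITER)
            if len(fields) != NUM_FIELDS:
                skipped.append((line_no, line))  # split does not match the schema
                continue
            try:
                score = float(fields[2])  # enforce the data type
            except ValueError:
                skipped.append((line_no, line))  # cast failed
                continue
            logger.debug("Parsed line %d of %d", line_no, total)
            yield {"_id": fields[0], SOURCE_NAME: {"name": fields[1], "score": score}}

    for line_no, line in skipped:
        logger.warning("Skipped line %d: %r", line_no, line)
```

Note that the file check and the skipped-line log only run while the generator is being consumed, since the body of a generator function does not execute until the first record is requested.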
A sample file is provided for testing purposes.