An AI approach to detect Semantic Data Types using NLP and Deep Learning.
- Install Python. Make sure you are using Python version 3.7 or above.
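You can quickly check which version your `python` command points to:

```bash
python --version   # should report 3.7 or above
```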
- Once the repository is cloned, go to the project's folder and create a virtual environment using an environment manager such as virtualenv, conda, or pipenv.
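For example, a minimal setup with virtualenv on a Unix-like shell could look like the following (the environment name `venv` is just a placeholder):

```bash
# Example using virtualenv; conda or pipenv work just as well
pip install virtualenv
virtualenv venv
source venv/bin/activate
```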
- Activate the environment and install the required Python packages using:
  `pip install -r requirements.txt`
- The main dataset is extracted from http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/. Go to the website, download the first archive, and extract it; the extracted folder will be called `0`. Place it in the directory named `resources/data`.
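A rough sketch of this step from the command line, assuming a Unix-like shell and that the first archive in the listing is named `0.tar.gz` (check the listing for the exact file name and format):

```bash
# The archive name below is an assumption; verify it on the corpus listing page
wget http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/0.tar.gz
tar -xzf 0.tar.gz            # should produce a folder named 0
mkdir -p resources/data
mv 0 resources/data/
```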
There are two parts to the code: the data conversion part and the complete process execution part.
Data Conversion:

After the packages are installed and the dataset is in place, let's look at the data conversion part. It uses command line arguments to run the code in specific ways. The command line arguments are:

- `--sample` or `-s`, default=False: choose if you want to use the sample or not
- `--no_of_tables` or `-num`, default=False: choose the number of tables that are needed for Mycroft
Using the above command line arguments, we can run the code with the following commands after going to the `src/data_conversion` folder:

To run the data extractor code and extract 50000 web tables:

`python data_extractor.py -num 50000`
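Similarly, the `-s` flag can be used to run the extractor on the sample data. The invocation below is only an illustrative sketch based on the flag's description; the exact value format it expects is an assumption:

```bash
# Illustrative only: use the sample data (value format for -s is assumed)
python data_extractor.py -s True
```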
Process Execution:
This part also uses command line arguments to run the code in specific ways. The command line arguments are:
- `--input_data` or `-i`, default=sherlock: choose the type of data (options: sherlock, mycroft)
- `--extract` or `-e`, default=False: choose if you want to generate features or use the saved features
- `--split` or `-spt`, default=False: choose if you want to split the data or not
- `--train_split` or `-ts`, default=0.7: choose the percentage of the train data split (example: 0.7 -> 70% train)
- `--no_of_tables` or `-num`, default=20000: choose the file with the number of tables required for processing (options: 40000, 50000, 100000)
- `--sample` or `-smp`, default=False: choose if you want to use the sample or not
Using the above command line arguments, we can run the code with the following commands after going to the `src` folder:

To run Mycroft on the data with 50000 web tables, using the saved features:

`python main.py -i mycroft -spt True -ts 0.8 -num 50000`

To run Mycroft on the data with 50000 web tables, with the features being generated:

`python main.py -i mycroft -e True -spt True -ts 0.8 -num 50000`

Note: The feature extraction part will take a lot of time (about 8 seconds per data column).
In the same way, you can explore other values of the command line arguments and get the results, for example:
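A run on the Sherlock data with generated features and a 70/30 split might look like this; the combination of flags is assembled from the descriptions above and is not an example given by the project:

```bash
# Illustrative only: Sherlock data, generate features, 70% train split (flag combination assumed)
python main.py -i sherlock -e True -spt True -ts 0.7
```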
The dataset is available at: http://webdatacommons.org/webtables/2015/downloadInstructions.html. Go to http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/, download the first archive, and extract it; the extracted folder will be called `0`. Place it in `resources/data`.
- M. Hulsebos, K. Hu, M. Bakker, E. Zgraggen, A. Satyanarayan, T. Kraska, Ç. Demiralp, and C. Hidalgo, "Sherlock: A deep learning approach to semantic data type detection," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019.
- O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer, "A large public corpus of web tables containing time and context metadata," in Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76, 2016.

