This repository provides a software pipeline that uses embeddings to explain drift between two sets of documents.
First experiments indicate that BERT document embeddings outperform Doc2Vec document embeddings.
- How to configure file storage and the default directory to read data
- Amazon movie reviews
- Data overview
- How to read with Amazon Pickle_Reader and access texts, embeddings, and metadata
- How to read with Amazon Pickle_Splitter and get items split into equal parts
- Data is currently stored on Google Drive
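
The read-and-split idea behind the two reader guides can be sketched with the standard library alone. This is only an illustration under stated assumptions: `read_pickle`, `split_equally`, and the `texts`/`stars` keys are hypothetical names made up for this sketch, not the API of `Pickle_Reader` or `Pickle_Splitter`, and the actual file layout may differ.

```python
import pickle
import tempfile
from pathlib import Path


def read_pickle(path):
    """Load one pickled object from disk (hypothetical helper, not the repository API)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def split_equally(items, parts):
    """Split a list into `parts` chunks of equal size, dropping any remainder."""
    size = len(items) // parts
    return [items[i * size:(i + 1) * size] for i in range(parts)]


# Write a tiny, made-up data set so the sketch is self-contained.
records = {
    "texts": ["good film", "bad film", "fine film", "great film", "dull film"],
    "stars": [4, 1, 3, 5, 2],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reviews.pickle"
    with open(path, "wb") as handle:
        pickle.dump(records, handle)
    loaded = read_pickle(path)

# Five texts split into two parts gives two chunks of two texts each.
chunks = split_equally(loaded["texts"], 2)
print(chunks)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.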
 
- How to store interim results
- How to reduce dimensions
- How to create word clouds
- Goal: Reusable, complete and documented code (good for developers, reviewers, everyone)
- If you add new classes, please provide minimal code examples, put them into the doc directory and add a link above.
- Directories
- doc: Documentation (e.g. how to read data)
- experiments: Jupyter notebooks (e.g. combine class instances into a process generating explanations)
- transformation: Classes for data transformation (e.g. create embeddings, reduce dimensions)
- access: Classes for data access (e.g. read or split embeddings)
- explanations: Classes for the explanation process (e.g. handling ml models, generate explanations)
- scripts: Small sets of commands (e.g. to synchronize repositories)
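
As an illustration of the word cloud guide listed above, a word cloud is typically drawn from word frequencies. The sketch below covers only that counting step with the standard library. The function name `word_frequencies`, the stop word list, and the sample reviews are assumptions made for this example and are not taken from the repository's classes.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "the", "this", "was"}


def word_frequencies(texts, top_n=2):
    """Count lowercase word occurrences across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample reviews for illustration only.
reviews = [
    "This movie was great and the cast was great",
    "The plot is dull and the movie is long",
]
top_words = word_frequencies(reviews)
print(top_words)
```

The resulting word and count pairs could then be handed to a rendering library to draw the cloud.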
 
- How to name your code: PEP 8 - Style Guide for Python Code
This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.