A library of reusable AVF functionality and data structures.
It includes the primary data structure in which all runs are stored, helper functions for IO/hashing/time/logging, a cleaning library, and standardised analysis functions.
CoreDataModules is not available on PyPI, so must instead be installed from this GitHub repository.
Do so with the following command, setting <version> to the tag to install e.g. v0.0.1:
$ pipenv install -e git+https://www.github.com/AfricasVoices/CoreDataModules@<version>#egg=CoreDataModulesTo use the core_data_modules.analysis.mapping module, include the extra dependencies for mapping:
$ pipenv install -e git+https://www.github.com/AfricasVoices/CoreDataModules@<version>#egg=CoreDataModules[mapping]The data format for a batch of runs in the data pipeline is always an iterable of TracedData.
A TracedData object can be thought of as a "dictionary with data provenance".
It provides the same read interface as a Python dict, but writes must be performed using the append_data method,
which requires some additional metadata about each update when it is called. TracedData objects maintain a complete
record of all previous updates.
To construct a TracedData object, read a value, and update value(s):
import time
from core_data_modules.traced_data import TracedData, Metadata
USER = "test_user"
td = TracedData({"age": "20", "language": "english"}, Metadata(USER, Metadata.get_call_location(), time.time()))
print(td["age"]) # 20
td.append_data({"age": "21"}, Metadata(USER, Metadata.get_call_location(), time.time()))
print(td["age"]) # 21CoreDataModules also provides utilites for converting between in-memory iterables of TracedData and JSON, CSV, Coda, Coding CSV, and The Interface files.
For example, to save and load a list of TracedData objects by serializing to and from JSON:
from core_data_modules.traced_data.io import TracedDataJsonIO
data = # TODO: assign iterable of TracedData objects
with open("output.json", "w") as f:
TracedDataJsonIO.export_traced_data_iterable_to_json(data, f)
with open("input.json", "r") as f:
data = TracedDataJsonIO.import_json_to_traced_data_iterable(f)All reusable cleaning functions may be used by importing functionality from core_data_modules.cleaners.
For example:
from core_data_modules.cleaners.english.demographic_cleaner import DemographicCleaner
DemographicCleaner.clean_gender("woman") # 'female'
DemographicCleaner.clean_gender("aoeu") # Codes.NOT_CODEDNote that cleaners which fail to assign a cleaned value return Codes.NOT_CODED.
All tests are to be written using unittest,
and to be placed in the tests/ directory.
The structure of the tests/ directory should match that of core_data_modules, with all test scripts prefixed
with test_.
For example, if there is a file at core_data_modules/package/file.py, there should be an associated test file
at tests/package/test_file.py.
To run all tests: $ pipenv run python setup.py test --addopts doctest.
To run a specific test or test suite:
$ pipenv run python setup.py test --addopts path/to/test_file.py::TestClass::test_method
e.g. $ pipenv run python setup.py test --addopts tests/cleaners/english/test_demographic_cleaner.py.
Circle CI automatically runs all tests in Python 3.6 whenever a commit is pushed to this repository.