Corpora

A Dataset Studio for the Digital Humanities

What is Corpora?

Corpora is Digital Humanities (DH) infrastructure—it's intended to be run by academic centers, libraries, or project teams, and its purpose is to greatly accelerate the development of DH projects by serving as a database, a REST API, a data collection/curation interface, and a Python-powered asynchronous task queue all wrapped into one.

Corpora's name is the plural for "corpus" because it happily manages multiple, disparate datasets for a variety of different projects. On the same instance of Corpora, you can host one project that contains millions of bibliographic metadata records and another than contains hundreds of thousands of geolocated annotations for images. The limitations for what you can host with Corpora are determined by your hardware, not the affordances of the software.

Among Corpora's affordances is a web-based interface for defining a data schema that is expected to evolve over time. This is because it is extremely rare to define a perfect schema for appropriately keeping track of project data on the first attempt. Projects happen iteratively, and so data schemas should be as effortlessly maleable as possible.

Once a schema has been defined for a project, Corpora dynamically generates web forms for collaborators to enter the data. It also dynamically generates a read-only REST API for third party applications to query in order to provide, for instance, a public, front-facing website for that project. Corpora's web interface makes use of this REST API to allow scholars to search, sort, and explore the connections between their datapoints as they go about building their project.

Finally, Corpora is built to cater to a wide range of users, from scholars with little technical expertise to data wranglers who want to leverage the power of Corpora's built-in iPython notebooks. Corpora also integrates a fully asynchronous job queue allowing power users to run long-running jobs (such as optical character recognition, natural language processing, TEI XML ingestion, etc.) on thousands of records at once.

Who built it?

Corpora was built by Bryan Tarpley, Associate Research Scientist at the Center of Digital Humanities Research at Texas A&M University. Bryan very much hopes it will gain traction and become a robust, community developed open-source project.

Documentation

Corpora's documentation is a work in progress, but you may find it here :)

License

Corpora uses the MIT license

Name		Name	Last commit message	Last commit date
Latest commit History 663 Commits
app		app
docs/src		docs/src
elasticsearch		elasticsearch
iiif		iiif
nginx		nginx
.dockerignore		.dockerignore
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml.example		docker-compose.yml.example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Corpora

What is Corpora?

Who built it?

Documentation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

bptarpley/corpora

Folders and files

Latest commit

History

Repository files navigation

Corpora

What is Corpora?

Who built it?

Documentation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages