Omesa

A small framework for reproducible Text Mining research that largely builds on top of scikit-learn. Its goal is to make common research procedures quick to set up, structured according to best practices, optimized, well recorded, and easily interpretable. To this end it features:

Web front-end and stand-alone database to overview experiments and interpret their performance.
Flexible wrappers to plug in your tools and features of choice.
Sparse and multi-threaded feature extraction.
Optional exhaustive search over best features, pipeline options, and classifier parameters.
Record of all settings and fitted components of the entire experiment, promoting reproducibility.
Dump an easily deployable version of the final model for plug-and-play demos.

Important Note

This repository is currently in development, stable functionality is not guaranteed as long as this message is showing.

Getting Started

We offer three quick examples to demonstrate the functionality:

2 minutes: using Omesa only for simple text classification.
5 minutes: integrating Omesa for data-to-features.
5 minutes: viewing experiment performance via the web app.

Dependencies

Omesa currently heavily relies on numpy, scipy and sklearn. When using the web app, bottle, blitzdb, plotly and lime are added dependencies. Currently these need to be installed by hand, later they will become standards. To use the Frog wrapper as a Dutch back-end, we strongly recommend using LaMachine. For English, there is a spaCy wrapper available.

Acknowledgements

Part of the work on Omesa was carried out in the context of the AMiCA (IWT SBO-project 120007) project, funded by the government agency for Innovation by Science and Technology (IWT).