Omesa + Your Pipeline - Data To Features in 5 Minutes

The package was originally developed to be used as an easy data-to-features wrapper, with as few dependencies as possible. For this purpose, the Vectorizer class was built, which allows minimal use of Omesa within an existing framework. An example of its use can be seen below.

Preparing Settings

Say that we are starting session in which we would like to train on some data. We need a config name, a list of data, and what kind of features we wish to extract from for this. First we import Omesa, and the featurizer classes we want to use. After, the feature classes can be initialized with the relevant parameters, and we provide the directory and info to open our data

from omesa.containers import CSV
from omesa.featurizer import Ngrams


features = [Ngrams(level='char', n_list=[3]),
            Ngrams(level='token', n_list=[1, 2])]

data = CSV('/dir/to/data/data.csv', data=1, label=0, header=True)

Data To Features

Now we can transform our data to X, y:

from omesa.pipes import Vectorizer

vec = Vectorizer(features)
X, y = vec.transform(data)

X is returned as a sparse matrix, and y a list of labels.

Own Pipeline

From there on, you can do whatever you wish with X, such as a common sklearn classification operation.

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, y)

Saving for Deployment

To save your model, you can do:

from omesa.containers import Pipeline

pl = Pipeline(name='my_experiment', source='json')
pl.save(vectorizer=vec, classifier=clf)

In a demo, this could be loaded in again by using:

from omesa.containers import Pipeline
from sklearn.naive_bayes import GaussianNB

pl = Pipeline(name='my_experiment', source='json')
pl.load()

pl.classify('raw text')
... [label], (0.12231, 0.87769)