Text feature extraction module

This module contains several helper classes for extracting textual features used in Text Mining applications, partly based on instances parsed with parse. It also includes a wrapper class that handles these helpers within the Omesa framework.

Featurizer

 class Featurizer(features, preprocessor=False, parser=False)

Wrapper for looping feature extractors in fit and transform operations.

Calls helper classes that extract different features from text data. Given a list of initialized feature extractor classes, it streams or dumps instances through each of them. It also provides an interface to their fit and transform methods.

Parameters	Type	Doc
features	list	List of initialized feature extractor classes. The classes can be found within this module.

Attributes	Type	Doc
helper	list of classes	Store for the provided features.
Y	list of labels	Labels for X.

Examples

Note: this is for local use only.

During training with a full space and a generator:


>>> loader = reader.load  # assumes that this is a generator
>>> features = [Ngrams(level='char', n_list=[1,2])]
>>> ftr = Featurizer(features)
>>> ftr.fit(loader())
>>> X, Y = ftr.transform(loader()), ftr.labels

During testing with only one instance:


>>> new_data = 'this is some string to test'
>>> tex, tey = ftr.transform(new_data), ftr.labels

Methods

Function	Doc
transform	Call all the helpers to extract features.

transform

    transform(instance)

Call all the helpers to extract features.

Parameters	Type	Doc
instance	tuple	Containing at least (raw) and optionally (parse, meta).

Returns	Type	Doc
v	dict	Feature vector where key, value = feature, value.
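As a rough illustration of how a wrapper might merge the output of several helpers into one feature vector, consider the sketch below. The helper classes `TokenCount` and `CharCount` and the function `transform_instance` are hypothetical and are not part of this module; the sketch only assumes that each helper exposes `transform(raw, parse)` and returns a dict.

```python
class TokenCount:
    """Hypothetical helper: counts whitespace-separated tokens."""

    def transform(self, raw, parse=None):
        return {'n_tokens': len(raw.split())}


class CharCount:
    """Hypothetical helper: counts characters."""

    def transform(self, raw, parse=None):
        return {'n_chars': len(raw)}


def transform_instance(helpers, instance):
    """Merge the feature dicts of all helpers for one instance.

    `instance` is a tuple holding at least the raw text and optionally a parse.
    """
    raw = instance[0]
    parse = instance[1] if len(instance) > 1 else None
    v = {}
    for helper in helpers:
        # Each helper contributes its own feature -> value pairs.
        v.update(helper.transform(raw, parse))
    return v


v = transform_instance([TokenCount(), CharCount()], ('this is text',))
print(v)
```

Note that with a plain `update`, helpers that emit the same feature name overwrite each other, so a real wrapper would probably need to prefix feature names per helper.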

Ngrams

 class Ngrams(object)

Calculate n-gram frequencies.

Can either be applied on token, POS or character level. The transform method dumps a feature dictionary that can be used for feature hashing.

Parameters	Type	Doc
n_list	list of integers	Values of n for which n-grams are extracted; several can be given. To extract unigrams and bigrams, set n_list to [1, 2].

Examples

Token-level uni and bigrams with a maximum of 2000 feats per n:


>>> ng = Ngrams(level='token', n_list=[1, 2], max_feats=2000)
>>> ng.transform('this is text')
... {'this': 1, 'is': 1, 'text': 1, 'this is': 1, 'is text': 1}

Methods

Function	Doc
str	Report on feature settings.
find_ngrams	Magic n-gram function.
transform	Given a document, return level-grams as Counter dict.

str

    __str__()

Report on feature settings.

find_ngrams

    find_ngrams(input_list, n)

Magic n-gram function.

Calculate n-grams from a list of tokens/characters with added begin and end items. Based on the implementation by Scott Triglia.
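The zip-based approach can be sketched as follows. This is a minimal version under assumptions: the padding symbol `'_'` and the choice of `n - 1` padding items per side are illustrative, and the actual begin and end items used by this module may differ.

```python
def find_ngrams(input_list, n, pad='_'):
    """Return the n-grams of input_list as tuples, padded at both ends."""
    # The pad symbol and the amount of padding (n - 1 per side) are
    # assumptions for illustration, not taken from the module.
    padded = [pad] * (n - 1) + list(input_list) + [pad] * (n - 1)
    # Zip n shifted copies of the list: the i-th copy starts at offset i,
    # so each zipped tuple is a window of n consecutive items.
    return list(zip(*[padded[i:] for i in range(n)]))


print(find_ngrams(['this', 'is', 'text'], 2))
```

For `n=1` no padding is added, so the result is simply one single-item tuple per input item.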

transform

    transform(raw, parse=None)

Given a document, return level-grams as Counter dict.

FuncWords

 class FuncWords(object)

Extract function word frequencies.

Computes relative frequencies of function words according to parse data, and adds the respective frequencies as a feature.
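A relative frequency computation of this kind might look like the sketch below. It assumes that the parse is a list of (word, pos) tuples and that function words are identified by a small set of closed-class POS tags; both the tag set `FUNC_TAGS` and the function name `func_word_freqs` are hypothetical.

```python
from collections import Counter

# Hypothetical closed-class POS tags treated as function-word categories.
FUNC_TAGS = {'DET', 'PREP', 'PRON', 'CONJ'}


def func_word_freqs(parse):
    """Relative frequency of each function word in a parsed instance.

    `parse` is assumed to be a list of (word, pos) tuples.
    """
    total = len(parse)
    if not total:
        return {}
    counts = Counter(word.lower() for word, pos in parse if pos in FUNC_TAGS)
    # Divide by the total number of tokens to get relative frequencies.
    return {word: count / total for word, count in counts.items()}


parse = [('The', 'DET'), ('cat', 'N'), ('sat', 'V'),
         ('on', 'PREP'), ('the', 'DET'), ('mat', 'N')]
print(func_word_freqs(parse))
```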

Methods

Function	Doc
transform	Extract frequencies for fitted function word possibilities.

transform

    transform(_, parse)

Extract frequencies for fitted function word possibilities.

APISent

 class APISent(object)

Sentiment features using API tools.

Interacts with the web and therefore needs urllib3. Might be very slow; use with caution and preferably store the extracted features.
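One way to store features from a slow call is a small on-disk cache, sketched below. The function `cached_transform` and the stand-in `slow_sentiment` are hypothetical and do not touch any web API; they only illustrate the idea of computing each value once.

```python
import json
import os
import tempfile


def cached_transform(transform, raw, cache_path):
    """Return transform(raw), reusing a JSON cache on disk when possible."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if raw not in cache:
        cache[raw] = transform(raw)  # the slow call happens only on a miss
        with open(cache_path, 'w') as f:
            json.dump(cache, f)
    return cache[raw]


calls = []


def slow_sentiment(raw):
    """Stand-in for a web API call; records each invocation."""
    calls.append(raw)
    return 0.5


path = os.path.join(tempfile.mkdtemp(), 'sent_cache.json')
first = cached_transform(slow_sentiment, 'some text', path)
second = cached_transform(slow_sentiment, 'some text', path)
print(first, second, len(calls))
```

The second call is served from the cache file, so the stand-in is invoked only once.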

Parameters	Type	Doc
mode	string, optional, default 'deep'	Can be either 'deep' for Twitter-based neural sentiment (py2, boots a local server instance), or 'nltk' for the text-processing.com API.

Examples


>>> sent = APISent()
>>> sent.transform("you're gonna have a bad time")
... 0.030120761495050809
>>> sent = APISent(mode='nltk')
>>> sent.transform("you're gonna have a bad time")
...

Methods

Function	Doc
str	String representation for APISent.
transform	Return a dictionary of feature values.

str

    __str__()

String representation for APISent.

transform

    transform(raw, _)

Return a dictionary of feature values.

DuSent

 class DuSent(object)

Lexicon based sentiment features.

Calculates four features related to sentiment: the average polarity and the number of positive, negative, and neutral words. Counts are based on the Duoman and Pattern sentiment lexicons.

Methods

Function	Doc
str	Class string representation.
calculate_sentiment	Calculate four features for the input instance.
transform	Get the sentiment belonging to the words in the parse string.

str

    __str__()

Class string representation.

calculate_sentiment

    calculate_sentiment(instance)

Calculate four features for the input instance.

Instance is a list of word-pos-lemma tuples that represent a token.
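The four features could be computed along the lines of the sketch below. The toy `LEXICON`, the lemma-based lookup, and the feature names are assumptions for illustration; the real class relies on the Duoman and Pattern lexicons, whose format is not shown here.

```python
# Hypothetical toy lexicon mapping lemmas to polarity scores in [-1, 1].
LEXICON = {'good': 0.7, 'bad': -0.7, 'time': 0.0}


def calculate_sentiment(instance, lexicon=LEXICON):
    """Return average polarity and positive/negative/neutral word counts.

    `instance` is a list of (word, pos, lemma) tuples.
    """
    # Only tokens whose lemma occurs in the lexicon contribute a score.
    scores = [lexicon[lemma] for _, _, lemma in instance if lemma in lexicon]
    return {
        'polarity': sum(scores) / len(scores) if scores else 0.0,
        'positive': sum(1 for s in scores if s > 0),
        'negative': sum(1 for s in scores if s < 0),
        'neutral': sum(1 for s in scores if s == 0),
    }


instance = [('bad', 'ADJ', 'bad'), ('times', 'N', 'time'), ('ahead', 'ADV', 'ahead')]
print(calculate_sentiment(instance))
```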

transform

    transform(_, parse)

Get the sentiment belonging to the words in the parse string.

SimpleStats

 class SimpleStats(object)

Parameters	Type	Doc
text	boolean, optional, default True	Extract text-based features: (1) the total amount of flooding and, individually, punctuation and alphanumeric stats; (2) the frequency of punctuation and number sequences; (3) emoticon frequencies.
sentence_length	boolean, optional, default True	Add the sentence length as a feature.

Examples

All features:


>>> SimpleStats()

Only text features:


>>> SimpleStats(token=False, sentence_length=False)

Methods

Function	Doc
avg	Average length of iter.
text_based_feats	Include features that are based on the raw text.
token_based_feats	Include features that are based on certain tokens.
avg_sent_length	Calculate average sentence length.
transform	Transform given instance into simple text features.

avg

    avg(iterb)

Average length of iter.

text_based_feats

    text_based_feats(raw)

Include features that are based on the raw text.
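Raw-text features of this kind can be approximated with a few regular expressions, as in the sketch below. The definition of flooding (a character repeated three or more times), the punctuation set, the emoticon pattern, and all feature names are assumptions for illustration and may not match the module.

```python
import re


def text_based_feats(raw):
    """Count flooding, punctuation runs, digit runs, and emoticons in raw text."""
    # Flooding: the same character repeated three or more times in a row.
    flooding = [m.group(0) for m in re.finditer(r'(.)\1{2,}', raw)]
    return {
        'flood_total': len(flooding),
        'flood_punct': sum(1 for f in flooding if not f[0].isalnum()),
        'flood_alnum': sum(1 for f in flooding if f[0].isalnum()),
        # Runs of one or more punctuation marks from a small fixed set.
        'punct_seqs': len(re.findall(r'[!?.,;:]+', raw)),
        # Runs of one or more digits.
        'num_seqs': len(re.findall(r'\d+', raw)),
        # A deliberately small emoticon pattern such as :) ;-( :D
        'emoticons': len(re.findall(r'[:;]-?[)(DP]', raw)),
    }


feats = text_based_feats('Sooo good!!! Call 911 now :)')
print(feats)
```

In this simple version the colon of an emoticon is also counted as a punctuation sequence; a stricter extractor would strip emoticons before counting punctuation.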

token_based_feats

    token_based_feats(tokens)

Include features that are based on certain tokens.

avg_sent_length

    avg_sent_length(sentence_indices)

Calculate average sentence length.
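A minimal sketch of this calculation follows. It assumes that `sentence_indices` holds one sentence index per token, which is a guess at the input format rather than something stated in this documentation.

```python
from collections import Counter


def avg_sent_length(sentence_indices):
    """Average number of tokens per sentence.

    `sentence_indices` is assumed to hold one sentence index per token.
    """
    per_sentence = Counter(sentence_indices)  # sentence index -> token count
    if not per_sentence:
        return 0.0
    return sum(per_sentence.values()) / len(per_sentence)


print(avg_sent_length([0, 0, 0, 1, 1]))  # 5 tokens over 2 sentences
```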

transform

    transform(raw, parse)

Transform given instance into simple text features.

Readability

 class Readability(object)

Get readability-related features.

Methods

Function	Doc
transform	Add each metric to the feature vector.

transform

    transform(raw, _)

Add each metric to the feature vector.
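This documentation does not say which metrics the class computes. As one illustrative possibility, the sketch below adds the Automated Readability Index (ARI) to a feature vector; the function name, the tokenization, and the sentence splitting are simplifying assumptions.

```python
import re


def automated_readability_index(raw):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    # Words are runs of letters and digits; characters are counted per word.
    words = re.findall(r'[A-Za-z0-9]+', raw)
    # Sentences are split naively on terminal punctuation.
    sentences = [s for s in re.split(r'[.!?]+', raw) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


v = {}
v['ari'] = automated_readability_index('This is text. It is short.')
print(v)
```

Very short and simple texts can yield a negative score, since the formula was fitted on longer prose.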