Text feature extraction module
This module contains several helper classes for extracting textual features used in text mining applications, partly based on instances parsed with parse. It also includes a wrapper class that handles these extractors within the Omesa framework.
Featurizer
class Featurizer(features, preprocessor=False, parser=False)
Wrapper for looping feature extractors in fit and transform operations.
Calls helper classes that extract different features from text data. Given a list of initialized feature extractor classes, it streams or dumps instances through each of them. It also provides an interface to the fit and transform methods.
Parameters | Type | Doc |
---|---|---|
features | list | List of initialized feature extractor classes. The classes can be found within this module. |
Attributes | Type | Doc |
---|---|---|
helper | list of classes | Store for the provided features. |
Y | list of labels | Labels for X. |
Examples
Note: this is for local use only.
During training with a full space and a generator:
>>> loader = reader.load # assumes that this is a generator
>>> features = [Ngrams(level='char', n_list=[1,2])]
>>> ftr = Featurizer(features)
>>> ftr.fit(loader())
>>> X, Y = ftr.transform(loader()), ftr.labels
During testing with only one instance:
>>> new_data = 'this is some string to test'
>>> tex, tey = ftr.transform(new_data), ftr.labels
Methods
Function | Doc |
---|---|
transform | Call all the helpers to extract features. |
transform
transform(instance)
Call all the helpers to extract features.
Parameters | Type | Doc |
---|---|---|
instance | tuple | Containing at least (raw) and optionally (parse, meta). |
Returns | Type | Doc |
---|---|---|
v | dict | Feature vector where key, value = feature, value. |
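The looping behaviour described for the wrapper can be pictured with a minimal sketch. It assumes that every helper exposes `transform(raw, parse)` and returns a dict, and that the wrapper merges those dicts into one vector. The helper classes `ToyLength` and `ToyWords` and the function `transform_instance` are hypothetical names made up for illustration; they are not part of this module.

```python
class ToyLength:
    """Hypothetical helper: counts characters in the raw text."""

    def transform(self, raw, parse=None):
        return {'n_chars': len(raw)}


class ToyWords:
    """Hypothetical helper: counts whitespace-separated tokens."""

    def transform(self, raw, parse=None):
        return {'n_tokens': len(raw.split())}


def transform_instance(helpers, instance):
    """Merge the feature dicts of all helpers into one vector (assumed behaviour)."""
    raw = instance[0]
    # The parse is optional in the instance tuple, so fall back to None.
    parse = instance[1] if len(instance) > 1 else None
    v = {}
    for helper in helpers:
        v.update(helper.transform(raw, parse))
    return v


vector = transform_instance([ToyLength(), ToyWords()], ('this is text',))
print(vector)
```

Merging with `dict.update` means that helpers emitting the same key would overwrite each other, so distinct feature names per helper are assumed here.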
Ngrams
class Ngrams(object)
Calculate n-gram frequencies.
Can either be applied on token, POS or character level. The transform method dumps a feature dictionary that can be used for feature hashing.
Parameters | Type | Doc |
---|---|---|
n_list | list of integers | Sizes of the n-grams to extract; multiple values are allowed. To extract unigrams and bigrams, set n_list to [1, 2]. |
Examples
Token-level uni and bigrams with a maximum of 2000 feats per n:
>>> ng = Ngrams(level='token', n_list=[1, 2], max_feats=2000)
>>> ng.transform('this is text')
... {'this': 1, 'is': 1, 'text': 1, 'this is': 1, 'is text': 1}
Methods
Function | Doc |
---|---|
str | Report on feature settings. |
find_ngrams | Magic n-gram function. |
transform | Given a document, return level-grams as Counter dict. |
str
__str__()
Report on feature settings.
find_ngrams
find_ngrams(input_list, n)
Magic n-gram function.
Calculate n-grams from a list of tokens/characters with added begin and end items. Based on the implementation by Scott Triglia.
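The zip-based idiom that this description refers to can be sketched as follows. The boundary markers `<s>` and `</s>`, the `pad` flag, and the name `find_ngrams_sketch` are assumptions for illustration; the begin and end items actually used by the module are not specified here.

```python
def find_ngrams_sketch(input_list, n, pad=True):
    """Return all n-grams of input_list as tuples."""
    items = list(input_list)
    if pad:
        # Hypothetical begin/end items; the module's real markers may differ.
        items = ['<s>'] + items + ['</s>']
    # Zipping n shifted copies of the list yields every window of size n.
    return list(zip(*[items[i:] for i in range(n)]))


bigrams = find_ngrams_sketch(['this', 'is', 'text'], 2)
print(bigrams)
```

Because `zip` stops at the shortest input, no explicit bounds checking is needed for the sliding window.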
transform
transform(raw, parse=None)
Given a document, return level-grams as Counter dict.
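As a rough illustration of what a Counter of level-grams looks like, the sketch below counts token-level or character-level n-grams. It assumes whitespace tokenization and no boundary padding, and it leaves out the POS level because that requires parse data. The function name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(raw, level='token', n_list=(1, 2)):
    """Count n-grams of the given sizes on token or character level."""
    # Assumption: tokens are obtained by splitting on whitespace.
    items = raw.split() if level == 'token' else list(raw)
    sep = ' ' if level == 'token' else ''
    counts = Counter()
    for n in n_list:
        for gram in zip(*[items[i:] for i in range(n)]):
            counts[sep.join(gram)] += 1
    return counts


counts = ngram_counts('this is text')
print(dict(counts))
```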
FuncWords
class FuncWords(object)
Extract function word frequencies.
Computes relative frequencies of function words according to parse data, and adds the respective frequencies as a feature.
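A relative-frequency computation of this kind might look like the sketch below. It assumes the parse is a list of (word, pos) tuples, that function words are identified by a small set of POS tags, and that frequencies are relative to the total token count. The tag set `FUNCTION_TAGS` and the function name are hypothetical; the parse format and tag inventory used by the module may differ.

```python
from collections import Counter

# Hypothetical tag set; the tags in real parse data may be named differently.
FUNCTION_TAGS = {'DET', 'PREP', 'CONJ', 'PRON'}


def function_word_freqs(parse):
    """Relative frequency of each function word over all tokens."""
    total = len(parse)
    if total == 0:
        return {}
    counts = Counter(word.lower() for word, pos in parse
                     if pos in FUNCTION_TAGS)
    return {word: count / total for word, count in counts.items()}


parse = [('The', 'DET'), ('cat', 'N'), ('sat', 'V'),
         ('on', 'PREP'), ('the', 'DET'), ('mat', 'N')]
freqs = function_word_freqs(parse)
print(freqs)
```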
Methods
Function | Doc |
---|---|
transform | Extract frequencies for fitted function word possibilities. |
transform
transform(_, parse)
Extract frequencies for fitted function word possibilities.
APISent
class APISent(object)
Sentiment features using API tools.
Interacts with the web and therefore needs urllib3. Might be very slow; use with caution and preferably store the extracted features.
Parameters | Type | Doc |
---|---|---|
mode | string, optional, default 'deep' | Can be either 'deep' for Twitter-based neural sentiment (py2, boots a local server instance), or 'nltk' for the text-processing.com API. |
Examples
>>> sent = APISent()
>>> sent.transform("you're gonna have a bad time")
... 0.030120761495050809
>>> sent = APISent(mode='nltk')
>>> sent.transform("you're gonna have a bad time")
...
Methods
Function | Doc |
---|---|
str | String representation for APISent. |
transform | Return a dictionary of feature values. |
str
__str__()
String representation for APISent.
transform
transform(raw, _)
Return a dictionary of feature values.
DuSent
class DuSent(object)
Lexicon based sentiment features.
Calculates four features related to sentiment: the average polarity, and the number of positive, negative and neutral words. Counts are based on the Duoman and Pattern sentiment lexicons.
Methods
Function | Doc |
---|---|
str | Class string representation. |
calculate_sentiment | Calculate four features for the input instance. |
transform | Get the sentiment belonging to the words in the parse string. |
str
__str__()
Class string representation.
calculate_sentiment
calculate_sentiment(instance)
Calculate four features for the input instance.
Instance is a list of word-pos-lemma tuples, each of which represents a token.
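The four features can be illustrated with a toy lexicon lookup. This sketch assumes a lexicon that maps lemmas to polarity scores and skips tokens that are not in the lexicon; the lexicon contents, the feature names, and the function name are hypothetical and do not reflect the Duoman or Pattern data.

```python
# Hypothetical lemma -> polarity lexicon, for illustration only.
TOY_LEXICON = {'good': 0.8, 'bad': -0.7, 'time': 0.0}


def calculate_sentiment_sketch(instance, lexicon=TOY_LEXICON):
    """Average polarity plus positive, negative and neutral word counts."""
    # Each token is a (word, pos, lemma) tuple; look up the lemma.
    polarities = [lexicon[lemma] for _, _, lemma in instance
                  if lemma in lexicon]
    avg = sum(polarities) / len(polarities) if polarities else 0.0
    return {
        'avg_polarity': avg,
        'n_positive': sum(p > 0 for p in polarities),
        'n_negative': sum(p < 0 for p in polarities),
        'n_neutral': sum(p == 0 for p in polarities),
    }


instance = [('a', 'DET', 'a'), ('bad', 'ADJ', 'bad'), ('time', 'N', 'time')]
feats = calculate_sentiment_sketch(instance)
print(feats)
```

Whether out-of-lexicon words should be skipped or counted as neutral is a design choice; this sketch skips them.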
transform
transform(_, parse)
Get the sentiment belonging to the words in the parse string.
SimpleStats
class SimpleStats(object)
Parameters | Type | Doc |
---|---|---|
text | boolean, optional, default True | Text-based features to be extracted. Includes: (1) total amount of flooding, plus individual punctuation and alphanumeric stats; (2) frequency of punctuation and number sequences; (3) emoticon frequencies. |
sentence_length | boolean, optional, default True | Add the sentence length as a feature. |
Examples
All features:
>>> SimpleStats()
Only text features:
>>> SimpleStats(token=False, sentence_length=False)
Methods
Function | Doc |
---|---|
avg | Average length of iter. |
text_based_feats | Include features that are based on the raw text. |
token_based_feats | Include features that are based on certain tokens. |
avg_sent_length | Calculate average sentence length. |
transform | Transform given instance into simple text features. |
avg
avg(iterb)
Average length of iter.
text_based_feats
text_based_feats(raw)
Include features that are based on the raw text.
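To make the text-based statistics more concrete, here is a small regex-based sketch covering flooding and punctuation or number sequences. It assumes that flooding means a character repeated three or more times in a row, and it leaves out emoticon frequencies. The thresholds, the punctuation set, the feature names, and the function name are all assumptions rather than the module's actual definitions.

```python
import re


def text_stats_sketch(raw):
    """Count flooding, punctuation sequences and number sequences."""
    # Assumption: flooding is any character repeated three or more times.
    # findall returns the repeated character of each flooded run.
    flooding = re.findall(r'(.)\1{2,}', raw)
    return {
        'flood_total': len(flooding),
        'flood_punct': sum(not c.isalnum() for c in flooding),
        'flood_alnum': sum(c.isalnum() for c in flooding),
        'punct_seqs': len(re.findall(r'[!?.,;:]+', raw)),
        'num_seqs': len(re.findall(r'\d+', raw)),
    }


stats = text_stats_sketch('Sooo good!!! Call 911 now...')
print(stats)
```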
token_based_feats
token_based_feats(tokens)
Include features that are based on certain tokens.
avg_sent_length
avg_sent_length(sentence_indices)
Calculate average sentence length.
transform
transform(raw, parse)
Transform given instance into simple text features.
Readability
class Readability(object)
Get readability-related features.
Methods
Function | Doc |
---|---|
transform | Add each metric to the feature vector. |
transform
transform(raw, _)
Add each metric to the feature vector.