Processor methods

SimpleCleaner

 class SimpleCleaner() 

Very simple data cleaner.


Methods

Function    Doc
clean       Lowercases text and removes special stuff.

clean

    clean(text) 

Lowercases text and removes special stuff.
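The description above can be illustrated with a minimal sketch. The class name `SimpleCleanerSketch` and the choice of what counts as "special" (anything that is not a word character or whitespace) are assumptions made for illustration; the actual Omesa implementation is not shown in this document and may differ.

```python
import re


class SimpleCleanerSketch:
    """Hypothetical stand-in, not Omesa's SimpleCleaner."""

    def clean(self, text):
        # Lowercase the input first.
        lowered = text.lower()
        # Assumption: "special stuff" means characters that are neither
        # word characters nor whitespace; they are replaced by a space.
        stripped = re.sub(r"[^\w\s]", " ", lowered)
        # Collapse the whitespace runs left behind by the substitution.
        return " ".join(stripped.split())


cleaned = SimpleCleanerSketch().clean("Hello, World!!  It's 9am.")
print(cleaned)
```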

SocialCleaner

 class SocialCleaner(object) 

Preprocess raw social media data.

This class should handle all preprocessing for both the labels and the texts that are provided. The current implementation covers the standard replacement of BBCode tags, emoticons and URLs with their own tokens. These replacements are part of the basic preprocessing.


Methods

Function             Doc
clean                Clean according to ALL the preprocessors.
replace_bbcode_tags  Replace BBCode tags.
replace_url_email    Replace URLs and e-mail addresses.
find_emoticons       Replace or find emoticons in given text.

clean

    clean(text) 

Clean according to ALL the preprocessors.

replace_bbcode_tags

    replace_bbcode_tags(text) 

Replace BBCode tags.

Replace all square-bracket tags found in the Netlog data with a tag consisting of capital letters surrounded by underscores. Typography tags such as [b], [/b], [u], [/u], [i], [/i] (for bold, underlined and italics) are removed. The new tags are:

- _PHOTO_, [photo]116157181[/photo]
- _VIDEO_, [video]nl-9159440[/video]
- _URL_, [url=http://www.adres.be/]Adres[/url] or [/url]
- _EMOTICON_, [love], [@hug], [#clap_anim]

Parameters  Type    Doc
text        string  Input text.

Returns     Type    Doc
text        string  The text in which the Netlog tags with [] have been replaced.
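The replacement scheme described above can be sketched with a handful of regular expressions. All patterns and the function name `replace_bbcode_tags_sketch` are hypothetical; in particular, treating every remaining bracketed word as an emoticon tag is an assumption, and the real Omesa patterns may be stricter.

```python
import re

# Hypothetical patterns; the actual Omesa expressions are not shown here.
TYPOGRAPHY = re.compile(r"\[/?[bui]\]", re.IGNORECASE)
PHOTO = re.compile(r"\[photo\].*?\[/photo\]", re.IGNORECASE)
VIDEO = re.compile(r"\[video\].*?\[/video\]", re.IGNORECASE)
# A full [url]...[/url] pair, or a stray closing [/url].
URL = re.compile(r"\[url(?:=[^\]]*)?\].*?\[/url\]|\[/url\]", re.IGNORECASE)
# Assumption: any bracketed word left over is an emoticon tag.
EMOTICON = re.compile(r"\[[@#]?\w+\]")


def replace_bbcode_tags_sketch(text):
    # Typography tags are removed outright, keeping the enclosed text.
    text = TYPOGRAPHY.sub("", text)
    # Content tags are replaced, including whatever they enclose.
    text = PHOTO.sub("_PHOTO_", text)
    text = VIDEO.sub("_VIDEO_", text)
    text = URL.sub("_URL_", text)
    # Emoticons go last so they cannot swallow the tags handled above.
    return EMOTICON.sub("_EMOTICON_", text)


raw = ("[b]Hi[/b] [photo]116157181[/photo] [video]nl-9159440[/video] "
       "[url=http://www.adres.be/]Adres[/url] [love] [@hug] [#clap_anim]")
replaced = replace_bbcode_tags_sketch(raw)
print(replaced)
```

The order of substitutions matters in this sketch: the catch-all emoticon pattern runs last, after the more specific tags have already been rewritten.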

replace_url_email

    replace_url_email(text, repl=('_URL_', '_EMAIL_')) 

Replace URLs and e-mail addresses.

Replace URLs with the tag _URL_ and e-mail addresses with the tag _EMAIL_.

Parameters  Type    Doc
text        string  Input text.

Returns     Type    Doc
text        string  The text in which the URLs and e-mail addresses have been replaced.
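As a rough illustration of this method, the sketch below uses two deliberately simple regular expressions. The patterns and the name `replace_url_email_sketch` are assumptions; only the signature with its `repl` default is taken from the documentation above.

```python
import re

# Hypothetical, intentionally loose patterns for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def replace_url_email_sketch(text, repl=("_URL_", "_EMAIL_")):
    url_tag, email_tag = repl
    # E-mail addresses go first, so that an address such as
    # jan@www.adres.be is not split by the URL pattern.
    text = EMAIL_RE.sub(email_tag, text)
    return URL_RE.sub(url_tag, text)


replaced = replace_url_email_sketch(
    "Mail me at jan@adres.be or visit http://www.adres.be/ now")
print(replaced)
```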

find_emoticons

    find_emoticons(text, repl="_EMOTICON_") 

Replace or find emoticons in given text.

Replace emoticons with a replacement string (default "_EMOTICON_"). Emoticons can be western, including flipped variants (:), :p, :(, o:, x:), or eastern (^_^).

Parameters  Type    Doc
text        string  Input text.

Returns     Type    Doc
re.sub      string  The text with the emoticons replaced by repl.
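A minimal sketch of how western, flipped and eastern emoticons might be matched is given below. The character classes are assumptions chosen to cover the examples listed above; the pattern actually used by Omesa is not shown in this document and is likely more extensive.

```python
import re

# Hypothetical pattern. The lookarounds keep the match from firing
# inside ordinary words (for example the "o:" in "to:").
EMOTICON_RE = re.compile(
    r"(?<!\w)(?:"
    r"[:;=]-?[()pPdDoOxX]"    # western: eyes, optional nose, mouth
    r"|[()pPdDoOxX]-?[:;=]"   # flipped: mouth, optional nose, eyes
    r"|\^_+\^"                # eastern: ^_^
    r")(?!\w)"
)


def find_emoticons_sketch(text, repl="_EMOTICON_"):
    # Replace every matched emoticon with the replacement string.
    return EMOTICON_RE.sub(repl, text)


replaced = find_emoticons_sketch("I am happy :) and o: and ^_^ !")
found = EMOTICON_RE.findall("nice :p ^_^")
print(replaced)
print(found)
```

The same compiled pattern serves both uses named in the method summary: `sub` replaces emoticons, while `findall` only locates them.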

Spacy

 class Spacy(object) 

Wrapper to spaCy.io. From their docs @ http://spacy.io/docs/:

"spaCy consists of a vocabulary table that stores lexical types, a pipeline that produce annotations, and three classes to manipulate document, span and token data. The annotations are predicted using statistical models, according to specifications that follow common practice in the research community."

spaCy is currently used in Omesa to provide the English part of the backbone. It is faster than CoreNLP, and it is a Python library. Although spaCy can also extract other annotations, such as named entities (it lacks sentiment and co-reference), these are currently not enabled for Omesa.


Methods

Function    Doc
parse       Extract spaCy tags.

parse

    parse(text) 

Extract spaCy tags.

Convert raw text instance into spaCy format. Currently only returns token, lemma, POS.

Parameters  Type    Doc
text        string  A raw string of characters.

Returns     Type    Doc
instance    list    The token, lemma, POS list that can be used in featurizers.
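The conversion described above can be sketched as follows. The attribute names `text`, `lemma_` and `pos_` are the ones spaCy exposes on its tokens; the helper name `to_instance`, the stand-in `FakeToken`, and the choice to return a list of (token, lemma, POS) triples are assumptions, since the exact layout of the returned list is not specified here. The stand-in lets the sketch run without spaCy installed.

```python
from collections import namedtuple


def to_instance(tokens):
    """Reduce token objects to (token, lemma, POS) triples."""
    return [(tok.text, tok.lemma_, tok.pos_) for tok in tokens]


# Hypothetical stand-in carrying the same attribute names as a spaCy token.
FakeToken = namedtuple("FakeToken", ["text", "lemma_", "pos_"])

doc = [FakeToken("Cats", "cat", "NOUN"), FakeToken("sleep", "sleep", "VERB")]
instance = to_instance(doc)
print(instance)

# With spaCy and an English model installed, the same helper could be fed
# a parsed document instead, e.g. to_instance(nlp(text)) where nlp is the
# pipeline returned by spacy.load(...).
```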

Frog

 class Frog(object) 

Wrapper to python-frog, loaded from LaMachine.

Excerpt from the documentation @ http://ilk.uvt.nl/frog/: Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Recently, a dependency parser, a base phrase chunker, and a named-entity recognizer module were added. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.

Frog is currently used in Omesa to provide the Dutch part of the backbone. Like the other backbones, it currently uses only a subset of features. The full list of potential extractions (not enabled for Omesa) is:

- Morphological segmentation (according to MBMA).
- Confidence in the POS tag, a number between 0 and 1, representing the probability mass assigned to the best guess tag in the tag distribution.
- Named entity type, identifying person (PER), organization (ORG), location (LOC), product (PRO), event (EVE), and miscellaneous (MISC), using a BIO (or IOB2) encoding.
- Base (non-embedded) phrase chunk in BIO encoding.
- Token number of head word in dependency graph (according to CSI-DP).
- Type of dependency relation with head word.
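To make the BIO (IOB2) encoding mentioned above concrete, the sketch below groups tokens into entity spans. It assumes tags of the conventional form `B-PER`, `I-PER` and `O`; the exact tag strings Frog emits are not shown in this document, and the function name `decode_bio` is hypothetical.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_type, text) spans from BIO/IOB2 tags."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A new entity begins: store the open one, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    # Store an entity that runs up to the end of the sentence.
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Jan", "Jansen", "werkt", "bij", "Philips", "in", "Eindhoven"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
entities = decode_bio(tokens, tags)
print(entities)
```

In IOB2, every entity starts with a `B-` tag, so two adjacent entities of the same type remain distinguishable; the sketch also tolerates a stray `I-` tag by treating it as the start of a new entity.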


Methods

Function    Doc
parse       Extract frog tags.

parse

    parse(text) 

Extract frog tags.

Convert raw text instance into Frog format. Currently only returns token, lemma, POS.

Parameters  Type    Doc
text        string  A raw string of characters.

Returns     Type    Doc
instance    list    The token, lemma, POS list that can be used in featurizers.