Processor methods

SimpleCleaner

 class SimpleCleaner def __init__(self)

Very simple data cleaner.

Methods

Function	Doc
clean	Lower text and removes special stuff.

clean

    clean(text)

Lower text and removes special stuff.

SocialCleaner

 class SocialCleaner(object)

Preprocess raw social media data.

This class should handle all the preprocessing done for both the labels as well as the texts that are provided. Current implementations are the standard replacement of bbcode, emoticons and urls with their own TOKENS. These are placed in the basic preprocessing.

Methods

Function	Doc
clean	Clean according to ALL the preprocessors.
replace_bbcode_tags	Replace BBCode tags.
replace_url_email	Replace URLs and e-mail addresses.
find_emoticons	Replace or find emoticons in given text.

clean

    clean(text)

Clean according to ALL the preprocessors.

replace_bbcode_tags

    replace_bbcode_tags(text)

Replace BBCode tags.

Replace all tags with [], which are included in the Netlog data, with a tag consisting of capital letters surrounded by underscores. Typography tags such as [b], [/b], [u], [/u], [i], [/i] (for bold, underlined and italics) are removed. The new tags are: - PHOTO, [photo]116157181[/photo] - VIDEO, [video]nl-9159440[/video] - URL, [url=http://www.adres.be/]Adres[/url] or [/url] - EMOTICON, [love], [@hug], [#clap_anim]

Parameters	Type	Doc
text	string	Input text.

Returns	Type	Doc
text	string	The text in which the Netlog tags with [] have been replaced.

replace_url_email

    replace_url_email(text, repl=('_URL_', '_EMAIL_'))

Replace URLs and e-mail addresses.

Replace URLs with the tag URL Replace e-mail addresses with the tag EMAIL

Parameters	Type	Doc
text	string	Input text.

Returns	Type	Doc
text	string	The text in which the URLs and e-mail addresses have been replaced.

find_emoticons

    find_emoticons(text, repl="_EMOTICON_")

Replace or find emoticons in given text.

Replace emoticons with a replacement string (default="EMOTICON"). Emoticons can be western (and flipped) -- :), :p, :(, o:, x: -- or eastern ^_^.

Parameters	Type	Doc
text	string	Input text.

Returns	Type	Doc
re.sub	string	The text with the emoticons replaced by repl.

Spacy

 class Spacy(object)

Wrapper to spaCy.io. From their docs @ http://http://spacy.io/docs/.

"spaCy consists of a vocabulary table that stores lexical types, a pipeline that produce annotations, and three classes to manipulate document, span and token data. The annotations are predicted using statistical models, according to specifications that follow common practice in the research community." spaCy is currently used in Omesa to provide the English part of the backbone. It's faster than CoreNLP, and Python <3. While spaCy can also extract things such as NER (it lacks sentiment and co-reference), this is currently not enabled for Omesa.

Methods

Function	Doc
parse	Extract spaCy tags.

parse

    parse(text)

Extract spaCy tags.

Convert raw text instance into spaCy format. Currently only returns token, lemma, POS.

Parameters	Type	Doc
text	string	A raw string of characters.

Returns	Type	Doc
instance	list	The token, lemma, POS list that can be used in featurizers.

Frog

 class Frog(object)

Wrapper to python-frog, loaded from LaMachine.

Excerpt from the documentation @ http://ilk.uvt.nl/frog/: Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Recently, a dependency parser, a base phrase chunker, and a named-entity recognizer module were added. Where possible, Frog makes use of multi-processor support to run subtasks in parallel. Frog is currently used in Omesa to provide the Dutch part of the backbone. As the other backbones, it currently only uses a subset of features. Full list of potential extractions (not enabled for Omesa) are: - Morphological segmentation (according to MBMA). - Confidence in the POS tag, a number between 0 and 1, representing the probability mass assigned to the best guess tag in the tag distribution. - Named entity type, identifying person (PER), organization (ORG), location (LOC), product (PRO), event (EVE), and miscellaneous (MISC), using a BIO (or IOB2) encoding. - Base (non-embedded) phrase chunk in BIO encoding. - Token number of head word in dependency graph (according to CSI-DP). - Type of dependency relation with head word.

Methods

Function	Doc
parse	Extract frog tags.

parse

    parse(text)

Extract frog tags.

Convert raw text instance into Frog format. Currently only returns token, lemma, POS.

Parameters	Type	Doc
text	string	A raw string of characters.

Returns	Type	Doc
instance	list	The token, lemma, POS list that can be used in featurizers.