basic
preprocess
- normalize_whitespace(text)
 - Given text (str), replace one or more spaces with a single space, and one or more line breaks with a single newline. Also strip leading/trailing whitespace, e.g. “ foo bar “ -> “foo bar”.
- Parameters:
- text (string) – 
- Return type:
- string 
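The whitespace normalization described above can be approximated with two regular expressions. This is a minimal sketch, not the library's implementation; the name normalize_whitespace_sketch is hypothetical and only the standard library is assumed.

```python
import re

def normalize_whitespace_sketch(text: str) -> str:
    """Collapse runs of spaces and runs of line breaks, then strip the ends."""
    # Collapse any run of line breaks (\r\n, \r or \n) into a single newline.
    text = re.sub(r"(?:\r\n|\r|\n)+", "\n", text)
    # Collapse any run of whitespace other than a newline into a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()

print(repr(normalize_whitespace_sketch(" foo   bar ")))
```

Handling line breaks first keeps them from being swallowed by the generic whitespace pattern.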
 
- remove_whitespace(text)
- Given text (str), remove all spaces and line breaks. Also strip leading/trailing whitespace, e.g. “ foo bar “ -> “foobar”.
- Parameters:
- text (string) – 
- Return type:
- string 
 
- lower_text(text)
- Given text (str), transform it into lowercase.
- Parameters:
- text (string) – 
- Return type:
- string 
 
- filter_groups(token, ignored_stopwords=None)
- Given token (str) and a list of groups of words that were concatenated into tokens, revert the token to its ungrouped state.
- Parameters:
- token (string) – 
- ignored_stopwords (list of strings) – 
 
- Return type:
- string 
 
- ungroup_ignored_stopwords(tokens, ignored_stopwords=None)
- Given tokens (list of str) and a list of groups of words that are concatenated in tokens, revert the tokens to their ungrouped state.
- Parameters:
- tokens (list of strings) – 
- ignored_stopwords (list of strings) – 
 
- Return type:
- list of strings 
 
- remove_stopwords(text, lang, custom_stopwords=None, ignored_stopwords=None)
- Given text (str), remove the classic stopwords for a given language, plus any custom stopwords given as a list. Words and groups of words in the ignored_stopwords list are kept during stopword removal.
- Parameters:
- text (string) – 
- lang (string) – 
- custom_stopwords (list of strings) – 
- ignored_stopwords (list of strings) – 
 
- Return type:
- string 
- Raises:
- ValueError – if custom_stopwords and ignored_stopwords have common elements.
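The interaction between the stopword lists, including the ValueError on overlap, can be sketched as below. This is an illustration under assumptions, not the library's code: the function name is hypothetical, the language stopwords are passed in explicitly instead of being looked up from lang, tokens are split on whitespace, and multi-word groups in ignored_stopwords are not handled.

```python
def remove_stopwords_sketch(text, stopwords, custom_stopwords=None, ignored_stopwords=None):
    """Drop stopwords from text, honouring custom and ignored lists."""
    custom = set(custom_stopwords or [])
    ignored = set(ignored_stopwords or [])
    # A word cannot be both forced in (custom) and forced out (ignored).
    overlap = custom & ignored
    if overlap:
        raise ValueError(f"custom and ignored stopwords overlap: {sorted(overlap)}")
    # Final removal set: language stopwords plus custom ones, minus ignored ones.
    to_remove = (set(stopwords) | custom) - ignored
    return " ".join(word for word in text.split() if word.lower() not in to_remove)

english = {"the", "on", "a"}  # tiny stand-in for a real stopword list
print(remove_stopwords_sketch("the cat sat on the mat", english, ignored_stopwords=["on"]))
```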
 
- remove_eol_characters(text)
- Remove end-of-line (\n) characters.
- Parameters:
- text (str) – 
- Return type:
- str 
 
- fix_bad_unicode(text, normalization='NFC')
 - Fix unicode text that’s “broken” using ftfy; this includes mojibake, HTML entities and other code cruft, and non-standard forms for display purposes.
- Parameters:
- text (string) – 
- normalization ({'NFC', 'NFKC', 'NFD', 'NFKD'}) – if ‘NFC’, combines characters and diacritics written using separate code points, e.g. converting “e” plus an acute accent modifier into “é”; unicode can be converted to NFC form without any change in its meaning. If ‘NFKC’, additional normalizations are applied that can change the meanings of characters, e.g. ellipsis characters will be replaced with three periods.
 
- Return type:
- string 
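The difference between the ‘NFC’ and ‘NFKC’ forms mentioned above can be seen with Python's standard unicodedata module. This sketch shows only the normalization step; it does not reproduce the mojibake or HTML-entity repair that the entry attributes to ftfy.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by a combining acute accent
ellipsis = "\u2026"      # single-character horizontal ellipsis

# NFC composes the two code points into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # → 2 1

# NFC leaves the ellipsis alone; NFKC replaces it with three periods.
nfc_ellipsis = unicodedata.normalize("NFC", ellipsis)
nfkc_ellipsis = unicodedata.normalize("NFKC", ellipsis)
print(nfc_ellipsis == ellipsis, nfkc_ellipsis)  # → True ...
```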
 
- unpack_english_contractions(text)
 - Replace English contractions in text (str) with their unshortened forms. N.B. The “‘d” and “‘s” forms are ambiguous (had/would, is/has/possessive), so are left as-is, e.g. “You’re fired. She’s nice.” -> “You are fired. She’s nice.”
- Parameters:
- text (string) – 
- Return type:
- string 
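One way to expand contractions while leaving the ambiguous “’d” and “’s” forms alone is an ordered list of regular-expression rules. The rule list and the name unpack_contractions_sketch below are illustrative assumptions, not the library's actual rules; irregular forms such as “can’t” and “won’t” are listed before the generic “n’t” rule.

```python
import re

# Ordered rules: irregular forms first, then the generic suffixes.
# Both straight (') and curly (’) apostrophes are accepted.
_RULES = [
    (re.compile(r"\b([Cc]a)n['’]t\b"), r"\1nnot"),   # can't -> cannot
    (re.compile(r"\b([Ww])on['’]t\b"), r"\1ill not"),  # won't -> will not
    (re.compile(r"\b(\w+)n['’]t\b"), r"\1 not"),     # don't -> do not
    (re.compile(r"\b(\w+)['’]re\b"), r"\1 are"),     # you're -> you are
    (re.compile(r"\b(\w+)['’]ll\b"), r"\1 will"),    # she'll -> she will
    (re.compile(r"\b(\w+)['’]ve\b"), r"\1 have"),    # we've -> we have
    (re.compile(r"\b([Ii])['’]m\b"), r"\1 am"),      # I'm -> I am
]

def unpack_contractions_sketch(text: str) -> str:
    """Apply each rule in order; 'd and 's are deliberately not touched."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text

print(unpack_contractions_sketch("You’re fired. She’s nice."))
# → You are fired. She’s nice.
```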
 
- replace_urls(text, replace_with='*URL*')
 - Replace all URLs in text (str) with replace_with (str).
- Parameters:
- text (string) – 
- replace_with (string) – the string you want the URL to be replaced with. 
 
- Return type:
- string 
 
- replace_emails(text, replace_with='*EMAIL*')
 - Replace all emails in text (str) with replace_with (str).
- Parameters:
- text (string) – 
- replace_with (string) – the string you want the email address to be replaced with. 
 
- Return type:
- string 
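Placeholder replacement of URLs and email addresses, as in the two entries above, can be sketched with deliberately simplified patterns. The regular expressions and function names here are assumptions for illustration; the library's own patterns are likely more thorough (for example, this URL pattern also swallows trailing punctuation).

```python
import re

# Simplified patterns, for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def replace_urls_sketch(text: str, replace_with: str = "*URL*") -> str:
    return URL_RE.sub(replace_with, text)

def replace_emails_sketch(text: str, replace_with: str = "*EMAIL*") -> str:
    return EMAIL_RE.sub(replace_with, text)

sample = "Visit https://example.com/docs or mail jane.doe@example.org today"
print(replace_emails_sketch(replace_urls_sketch(sample)))
# → Visit *URL* or mail *EMAIL* today
```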
 
- replace_phone_numbers(text, country_to_detect, replace_with='*PHONE*', method='regex')
 - Replace all phone numbers in text (str) with replace_with (str).
- Parameters:
- text (string) – 
- replace_with (string) – the string you want the phone number to be replaced with. 
- method (['regex','detection']) – ‘regex’ is faster but will miss many numbers, while ‘detection’ will catch every number but is slower. 
- country_to_detect (list) – if a list of country codes is specified, every number formatted for those countries is caught. Only used when method = ‘detection’. 
 
- Return type:
- string 
 
- replace_numbers(text, replace_with='*NUMBER*')
 - Replace all numbers in text (str) with replace_with (str).
- Parameters:
- text (string) – 
- replace_with (string) – the string you want the number to be replaced with. 
 
- Return type:
- string 
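A rough idea of number replacement, assuming that a "number" means a standalone run of digits with optional sign and optional "." or "," separators. The pattern and the name replace_numbers_sketch are hypothetical; the library may define numbers differently.

```python
import re

# A standalone number: optional sign, digits, optional ./, separated groups.
# The lookarounds keep digits embedded in words (e.g. "abc123") untouched.
NUMBER_RE = re.compile(r"(?<!\w)[-+]?\d+(?:[.,]\d+)*(?!\w)")

def replace_numbers_sketch(text: str, replace_with: str = "*NUMBER*") -> str:
    return NUMBER_RE.sub(replace_with, text)

print(replace_numbers_sketch("I paid 1,000 for 3 items"))
# → I paid *NUMBER* for *NUMBER* items
```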
 
- replace_currency_symbols(text, replace_with=None)
 - Replace all currency symbols in text (str) with the string specified by replace_with (str).
- Parameters:
- text (str) – raw text 
- replace_with (None or string) – if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”) 
 
 
- Return type:
- string 
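The two modes of replace_with described above can be sketched as follows. The three-entry symbol table and the function name are assumptions for illustration; a real implementation would cover many more symbols.

```python
import re

# Small illustrative mapping; not the library's full table.
CURRENCIES = {"$": "USD", "£": "GBP", "€": "EUR"}
_SYMBOLS_RE = re.compile("[" + re.escape("".join(CURRENCIES)) + "]")

def replace_currency_symbols_sketch(text, replace_with=None):
    if replace_with is None:
        # Default mode: each symbol becomes its own 3-letter abbreviation.
        return _SYMBOLS_RE.sub(lambda m: CURRENCIES[m.group(0)], text)
    # Otherwise every symbol is replaced by the same string.
    return _SYMBOLS_RE.sub(replace_with, text)

print(replace_currency_symbols_sketch("$5 and £3"))              # → USD5 and GBP3
print(replace_currency_symbols_sketch("$5 and £3", "CURRENCY"))  # → CURRENCY5 and CURRENCY3
```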
 
- remove_punct(text, marks=None)
- Remove punctuation from text by replacing all instances of marks with whitespace.
- Parameters:
- text (str) – raw text 
- marks (str or None) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
 
- Return type:
- string 
 - Note: When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
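The two code paths in the note above (str.translate() when marks is None, a regular expression otherwise) can be sketched like this. The name remove_punct_sketch is hypothetical, and the translation table covers only ASCII punctuation from string.punctuation, which is an assumption; the library may cover all Unicode punctuation.

```python
import re
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def remove_punct_sketch(text, marks=None):
    if marks:
        # Regex path: replace only the requested characters.
        return re.sub("[" + re.escape(marks) + "]", " ", text)
    # Fast path: a single pass with a precomputed translation table.
    return text.translate(_PUNCT_TABLE)

print(remove_punct_sketch("a,b;c", marks=","))  # → a b;c
```

Building the table once at import time is what makes the translate path cheap per call.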
- remove_accents(text, method='unicode')
- Remove accents from any accented unicode characters in text (str), either by transforming them into ascii equivalents or removing them entirely.
- Parameters:
- text (str) – raw text 
- method ({'unicode', 'ascii'}) – if ‘unicode’, remove accented char for any unicode symbol with a direct ASCII equivalent; if ‘ascii’, remove accented char for any unicode symbol. NB: the ‘ascii’ method is notably faster than ‘unicode’, but gives lower-quality results. 
 
- Return type:
- string 
- Raises:
- ValueError – if method is not in {‘unicode’, ‘ascii’}
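A plausible reading of the two methods, sketched with the standard unicodedata module: ‘unicode’ decomposes characters and drops only the combining marks, while ‘ascii’ decomposes and then discards everything outside ASCII. The name remove_accents_sketch is hypothetical, and this is an assumption about the behaviour, not the library's source.

```python
import unicodedata

def remove_accents_sketch(text, method="unicode"):
    if method == "unicode":
        # Decompose, then drop combining marks; other non-ASCII characters survive.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    if method == "ascii":
        # Decompose, then throw away every non-ASCII code point.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    raise ValueError(f"method must be 'unicode' or 'ascii', got {method!r}")

print(remove_accents_sketch("na\u00efve \u03a9mega"))                  # → naive Ωmega
print(remove_accents_sketch("na\u00efve \u03a9mega", method="ascii"))  # → naive mega
```

The second call shows why ‘ascii’ is the lossier option: the Greek capital omega has no ASCII form and is dropped entirely.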