token
preprocess
- remove_stopwords(tokens, lang, custom_stopwords=None)
- Remove stopwords from a list of tokens. e.g. ‘I like when you move your body !’ -> ‘I move body !’ (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- lang (str) – language ISO code (e.g. “en”) 
- custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default 
 
- Returns:
- tokens without stopwords 
- Return type:
- list 
- Raises:
- ValueError – When tokens is not a list 
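
A minimal usage sketch for remove_stopwords. The import path is an assumption (adjust it to wherever the preprocess module lives in your install); the expected output follows the example above.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_stopwords

tokens = ["I", "like", "when", "you", "move", "your", "body", "!"]

# English stopword removal, reproducing the example above.
print(remove_stopwords(tokens, lang="en"))
# -> ['I', 'move', 'body', '!']

# Domain-specific stopwords can be added on top of the built-in list.
print(remove_stopwords(tokens, lang="en", custom_stopwords=["body"]))

# Passing anything other than a list raises ValueError.
```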
 
- remove_tokens_with_nonletters(tokens)
- Inputs a list of tokens, outputs a list of tokens with any token that includes numbers or special characters removed. e.g. ['foo','bar','124','34euros'] -> ['foo','bar'] (see the sketch below). - Parameters:
- tokens (list) – list of tokens to be cleaned 
- Returns:
- list of tokens without tokens containing numbers 
- Return type:
- list 
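
A short sketch of remove_tokens_with_nonletters; the import path is again an assumption, and the input/output pair is the one documented above.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_tokens_with_nonletters

# Tokens containing digits or other non-letter characters are dropped.
print(remove_tokens_with_nonletters(["foo", "bar", "124", "34euros"]))
# -> ['foo', 'bar']
```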
 
- remove_special_caracters_from_tokenslist(tokens)
- Remove tokens that don’t contain any number or letter. e.g. ['foo','bar','—',"'s",'#'] -> ['foo','bar',"'s"] (see the sketch below). - Parameters:
- tokens (list) – list of tokens to be cleaned 
- Returns:
- list of tokens without tokens that contain only special characters 
- Return type:
- list 
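
A sketch of remove_special_caracters_from_tokenslist, with an assumed import path. Note that "'s" is kept because it contains a letter; only tokens made entirely of special characters are removed.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_special_caracters_from_tokenslist

# Tokens made only of special characters ("—", "#") are removed;
# "'s" survives because it contains a letter.
print(remove_special_caracters_from_tokenslist(["foo", "bar", "—", "'s", "#"]))
# -> ['foo', 'bar', "'s"]
```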
 
- remove_smallwords(tokens, smallwords_threshold)
- Function that removes words whose length is below a threshold. e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"] (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- smallwords_threshold (int) – word length below which tokens are removed 
 
- Return type:
- list 
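
A sketch of remove_smallwords. Both the import path and the threshold value are assumptions; 3 is chosen here because it reproduces the documented example (the two-letter tokens are dropped, the three-letter "Doe" is kept).

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_smallwords

tokens = ["hello", "my", "name", "is", "John", "Doe"]

# Assumed threshold of 3: "my" and "is" (length 2) fall below it and
# are removed; "Doe" (length 3) is kept.
print(remove_smallwords(tokens, smallwords_threshold=3))
# -> ['hello', 'name', 'John', 'Doe']
```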
 
tokenizer
- tokenize(text, lang_module='en_spacy')
- Convert text to a list of tokens (see the sketch below). - Parameters:
- text (str) – text to tokenize 
- lang_module (str {'en_spacy', 'en_nltk', 'fr_spacy', 'fr_moses', 'ko_spacy', 'ja_spacy'}) – choose the tokenization module according to the language and the implementation. Recommended: spaCy (faster, better results). To process other languages import models.Spacy_models 
 
- Returns:
- list of strings 
- Return type:
- list 
- Raises:
- ValueError – If lang_module is not a valid module name 
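
A usage sketch for tokenize with an assumed import path; the exact token boundaries depend on the backend, so the outputs shown are illustrative.

```python
# Hypothetical import path - adjust to your package layout.
from tokenizer import tokenize

# Default English spaCy tokenization.
print(tokenize("I like when you move your body !"))
# e.g. ['I', 'like', 'when', 'you', 'move', 'your', 'body', '!']

# French text with the Moses backend instead of spaCy.
print(tokenize("J'ai faim", lang_module="fr_moses"))

# An unsupported module name raises ValueError.
```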
 
- untokenize(tokens, lang='fr')
- Inputs a list of tokens, outputs a string. e.g. [“J’”, ‘ai’] -> “J’ ai” (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- lang (str) – language code 
 
- Returns:
- text 
- Return type:
- string
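
Finally, a sketch of untokenize under the same assumed import path, reproducing the documented example.

```python
# Hypothetical import path - adjust to your package layout.
from tokenizer import untokenize

# Reassemble tokens into a string, per the example above.
print(untokenize(["J'", "ai"], lang="fr"))
# -> "J' ai"
```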