token

preprocess

remove_stopwords(tokens, lang, custom_stopwords=None)[source]

Remove stopwords from a tokenized text, e.g. 'I like when you move your body !' -> 'I move body !'.

Parameters:
  • tokens (list(str)) – list of tokens

  • lang (str) – language ISO code (e.g. "en")

  • custom_stopwords (list(str)|None) – list of custom stopwords to add; None by default

Returns:

tokens without stopwords

Return type:

list

Raises:

ValueError – When tokens is not a list
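
A minimal sketch of the documented behavior (not the library's implementation; the name remove_stopwords_sketch and the explicit stopword set are hypothetical stand-ins for the language-specific stopword lists the library loads from lang):

```python
def remove_stopwords_sketch(tokens, stopwords, custom_stopwords=None):
    # Raise ValueError on non-list input, as the documented function does.
    if not isinstance(tokens, list):
        raise ValueError("tokens must be a list of strings")
    # Merge the base stopword list with any custom additions.
    all_stopwords = set(stopwords) | set(custom_stopwords or [])
    return [t for t in tokens if t not in all_stopwords]

tokens = ["I", "like", "when", "you", "move", "your", "body", "!"]
print(remove_stopwords_sketch(tokens, {"like", "when", "you", "your"}))
# -> ['I', 'move', 'body', '!']
```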

remove_tokens_with_nonletters(tokens)[source]

Takes a list of tokens and returns a list without the tokens that include numbers or special characters, e.g. ['foo', 'bar', '124', '34euros'] -> ['foo', 'bar'].

Parameters:

tokens (list) – list of tokens to be cleaned

Returns:

list of tokens without tokens containing numbers or special characters

Return type:

list
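
A sketch of the assumed behavior (keep only purely alphabetic tokens); the name remove_tokens_with_nonletters_sketch is hypothetical:

```python
def remove_tokens_with_nonletters_sketch(tokens):
    # str.isalpha() is True only when every character is a letter,
    # so tokens with digits or special characters are dropped.
    return [t for t in tokens if t.isalpha()]

print(remove_tokens_with_nonletters_sketch(["foo", "bar", "124", "34euros"]))
# -> ['foo', 'bar']
```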

remove_special_caracters_from_tokenslist(tokens)[source]

Remove tokens that don't contain any number or letter, e.g. ['foo', 'bar', '—', "'s", '#'] -> ['foo', 'bar', "'s"].

Parameters:

tokens (list) – list of tokens to be cleaned

Returns:

list of tokens, excluding the tokens that contain only special characters

Return type:

list
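
A sketch of the assumed behavior (keep any token with at least one letter or digit); the name remove_special_characters_sketch is hypothetical:

```python
def remove_special_characters_sketch(tokens):
    # Keep tokens that contain at least one alphanumeric character;
    # tokens made entirely of punctuation/symbols are dropped.
    return [t for t in tokens if any(c.isalnum() for c in t)]

print(remove_special_characters_sketch(["foo", "bar", "--", "'s", "#"]))
# -> ['foo', 'bar', "'s"]
```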

remove_smallwords(tokens, smallwords_threshold)[source]

Remove words whose length is below a threshold, e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"].

Parameters:
  • tokens (list(str)) – list of tokens

  • smallwords_threshold (int) – minimum length for a token to be kept

Return type:

list
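
A sketch of the documented behavior, matching the example above (tokens shorter than the threshold are dropped); the name remove_smallwords_sketch is hypothetical:

```python
def remove_smallwords_sketch(tokens, smallwords_threshold):
    # Keep only tokens whose length is at least the threshold.
    return [t for t in tokens if len(t) >= smallwords_threshold]

print(remove_smallwords_sketch(["hello", "my", "name", "is", "John", "Doe"], 3))
# -> ['hello', 'name', 'John', 'Doe']
```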

tokenizer

exception LanguageNotHandled[source]

Bases: Exception

exception LanguageNotInstalledError[source]

Bases: Exception

class SpacyModel(lang)[source]

Bases: object

class SingletonSpacyModel(lang)[source]

Bases: object

Parameters:

lang (str) –

model: Optional[Language] = None
get_lang_model()[source]
Return type:

Optional[str]
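
SingletonSpacyModel caches one loaded model per process so the (expensive) spaCy pipeline load happens only once. A minimal sketch of the singleton pattern, with a plain attribute standing in for the spaCy pipeline (SingletonModelSketch is a hypothetical name, not the library's class):

```python
class SingletonModelSketch:
    _instance = None  # the single cached instance

    def __new__(cls, lang):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.lang = lang  # stand-in for loading a spaCy pipeline
        return cls._instance

a = SingletonModelSketch("en")
b = SingletonModelSketch("fr")
print(a is b)  # True: the second call reuses the cached instance
```

Note the consequence of the pattern: once the model is loaded for one language, later instantiations return that same object regardless of the lang argument.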

tokenize(text, lang_module='en_spacy')[source]

Convert text to a list of tokens.

Parameters:
  • lang_module (str {'en_spacy', 'en_nltk', 'fr_spacy', 'fr_moses', 'ko_spacy', 'ja_spacy'}) – choose the tokenization module according to the language and the implementation. Recommended: spaCy (faster, better results). To process other languages, import models.Spacy_models

  • text (str) –

Returns:

list of string

Return type:

list

Raises:

ValueError – If lang_module is not a valid module name
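
A naive stand-in for the spaCy/NLTK tokenizers, to illustrate the input/output contract only (a real tokenizer handles contractions, abbreviations, etc.; tokenize_sketch is a hypothetical name):

```python
import re

def tokenize_sketch(text):
    # Split into runs of word characters, keeping punctuation
    # as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_sketch("Hello, world!"))
# -> ['Hello', ',', 'world', '!']
```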

untokenize(tokens, lang='fr')[source]

Takes a list of tokens and outputs a string, e.g. ["J'", "ai"] -> "J' ai".

Parameters:
  • lang (string) – language code

  • tokens (List[str]) –

Returns:

text

Return type:

string
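
A minimal sketch of the round trip, matching the example above; the real function applies language-specific detokenization rules (e.g. around French apostrophes), which this space-join deliberately ignores (untokenize_sketch is a hypothetical name):

```python
def untokenize_sketch(tokens):
    # Naive detokenization: join tokens with single spaces.
    return " ".join(tokens)

print(untokenize_sketch(["J'", "ai"]))
# -> "J' ai"
```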

convert_tokens_to_string(tokens_or_str)[source]
Parameters:

tokens_or_str (Union[str, List[str], None]) –

Return type:

str

convert_string_to_tokens(tokens_or_str, lang_module='en_spacy')[source]
Parameters:
  • tokens_or_str (Union[str, List[str], None]) –

  • lang_module (str) –

Return type:

List[str]
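
The two converters accept either form (a string, a token list, or None) and normalize to one of them. A sketch of that assumed contract, using whitespace splitting in place of the library's tokenizer (both *_sketch names are hypothetical):

```python
from typing import List, Optional, Union

def convert_tokens_to_string_sketch(tokens_or_str: Union[str, List[str], None]) -> str:
    # Pass strings through; join token lists; map None to the empty string.
    if tokens_or_str is None:
        return ""
    if isinstance(tokens_or_str, str):
        return tokens_or_str
    return " ".join(tokens_or_str)

def convert_string_to_tokens_sketch(tokens_or_str: Union[str, List[str], None]) -> List[str]:
    # Pass token lists through; split strings; map None to an empty list.
    if tokens_or_str is None:
        return []
    if isinstance(tokens_or_str, list):
        return tokens_or_str
    return tokens_or_str.split()

print(convert_tokens_to_string_sketch(["a", "b"]))  # -> "a b"
print(convert_string_to_tokens_sketch("a b"))       # -> ['a', 'b']
```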