token
preprocess
- remove_stopwords(tokens, lang, custom_stopwords=None)[source]
Remove stopwords from a text, e.g. 'I like when you move your body !' -> 'I move body !'.
- Parameters:
tokens (list(str)) – list of tokens
lang (str) – language iso code (e.g : “en”)
custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default
- Returns:
tokens without stopwords
- Return type:
list
- Raises:
ValueError – When the input is not a list
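The behavior described above can be sketched as follows. This is a simplified stand-in, not the library's implementation: the stopword set is purely illustrative, whereas the real function looks up a per-language stopword list from the ISO code.

```python
# Illustrative sketch of remove_stopwords, not the library's implementation.
# EN_STOPWORDS is a hypothetical stand-in for a real per-language stopword list.
EN_STOPWORDS = {"like", "when", "you", "your"}

def remove_stopwords(tokens, lang, custom_stopwords=None):
    if not isinstance(tokens, list):
        raise ValueError("tokens must be a list")
    stopwords = set(EN_STOPWORDS) if lang == "en" else set()
    if custom_stopwords:
        stopwords.update(custom_stopwords)
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["I", "like", "when", "you", "move", "your", "body", "!"], "en"))
# ['I', 'move', 'body', '!']
```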
- remove_tokens_with_nonletters(tokens)[source]
Takes a list of tokens and returns the list without tokens that include numbers or special characters, e.g. ['foo', 'bar', '124', '34euros'] -> ['foo', 'bar'].
- Parameters:
tokens (list) – list of tokens to be cleaned
- Returns:
list of purely alphabetic tokens
- Return type:
list
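A one-line sketch of this filter, assuming that "tokens with numbers or special characters" means anything failing str.isalpha (an assumption, not confirmed by the docs):

```python
# Sketch: keep only purely alphabetic tokens (assumes str.isalpha semantics).
def remove_tokens_with_nonletters(tokens):
    return [t for t in tokens if t.isalpha()]

print(remove_tokens_with_nonletters(["foo", "bar", "124", "34euros"]))
# ['foo', 'bar']
```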
- remove_special_caracters_from_tokenslist(tokens)[source]
Remove tokens that don't contain any number or letter, e.g. ['foo', 'bar', '—', "'s", '#'] -> ['foo', 'bar', "'s"].
- Parameters:
tokens (list) – list of tokens to be cleaned
- Returns:
list of tokens, with tokens made only of special characters removed
- Return type:
list
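A sketch of the filter described above (a simplified reimplementation for illustration, not the library's code): a token survives if at least one of its characters is a letter or digit.

```python
# Sketch: drop tokens that contain no letter and no digit.
def remove_special_caracters_from_tokenslist(tokens):
    return [t for t in tokens if any(c.isalnum() for c in t)]

print(remove_special_caracters_from_tokenslist(["foo", "bar", "—", "'s", "#"]))
# ['foo', 'bar', "'s"]
```

Note that "'s" is kept, matching the example in the docstring: it contains the letter "s" alongside the apostrophe.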
- remove_smallwords(tokens, smallwords_threshold)[source]
Remove words whose length is below a threshold, e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"].
- Parameters:
tokens (list(str)) – list of tokens
smallwords_threshold (int) – minimum length for a token to be kept
- Return type:
list
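A minimal sketch of the thresholding described above (assuming, per the docstring example, that a token is kept when its length is at least the threshold; not the library's implementation):

```python
# Sketch: keep tokens whose length is at least smallwords_threshold.
def remove_smallwords(tokens, smallwords_threshold):
    return [t for t in tokens if len(t) >= smallwords_threshold]

print(remove_smallwords(["hello", "my", "name", "is", "John", "Doe"], 3))
# ['hello', 'name', 'John', 'Doe']
```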
tokenizer
- tokenize(text, lang_module='en_spacy')[source]
Convert text to a list of tokens.
- Parameters:
lang_module (str {'en_spacy', 'en_nltk', 'fr_spacy', 'fr_moses', 'ko_spacy', 'ja_spacy'}) – tokenization module to use, chosen according to the language and the implementation. Recommended: spaCy (faster, better results). To process other languages, import models.Spacy_models
text (str) – text to tokenize
- Returns:
list of strings
- Return type:
list
- Raises:
ValueError – If lang_module is not a valid module name
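The interface above can be sketched as follows. The real function dispatches to spaCy, NLTK, or Moses backends according to lang_module; here a naive regex tokenizer stands in, so the token boundaries shown are illustrative, not the spaCy output.

```python
import re

# Modules listed in the documented signature.
VALID_MODULES = {"en_spacy", "en_nltk", "fr_spacy", "fr_moses", "ko_spacy", "ja_spacy"}

# Sketch: a naive regex tokenizer stands in for the spaCy/NLTK/Moses backends.
def tokenize(text, lang_module="en_spacy"):
    if lang_module not in VALID_MODULES:
        raise ValueError(f"invalid lang_module: {lang_module}")
    # split into runs of word characters or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I like apples."))
# ['I', 'like', 'apples', '.']
```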
- untokenize(tokens, lang='fr')[source]
Takes a list of tokens and returns a string, e.g. ["J’", "ai"] -> "J’ ai".
- Parameters:
tokens (list(str)) – list of tokens
lang (str) – language code
- Returns:
text
- Return type:
str
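The docstring example suggests plain space-joining; the sketch below assumes exactly that. The real function may apply language-aware detokenization rules (hence the lang parameter, defaulting to 'fr'), which this stand-in ignores.

```python
# Sketch: plain space-joining; the real function may apply language-aware
# detokenization rules via the `lang` parameter, which this stand-in ignores.
def untokenize(tokens, lang="fr"):
    return " ".join(tokens)

print(untokenize(["J’", "ai"]))
# J’ ai
```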