token
preprocess
- remove_stopwords(tokens, lang, custom_stopwords=None)
- Remove stopwords from a list of tokens. e.g. ‘I like when you move your body !’ -> ‘I move body !’ (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- lang (str) – language ISO code (e.g. “en”) 
- custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default 
 
- Returns:
- tokens without stopwords 
- Return type:
- list 
- Raises:
- ValueError – When tokens is not a list 
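
A minimal usage sketch for remove_stopwords. The import path is an assumption (adjust it to wherever the preprocess module lives in your install); the expected output follows the example above.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_stopwords

tokens = ["I", "like", "when", "you", "move", "your", "body", "!"]

# English stopword removal, reproducing the example above.
print(remove_stopwords(tokens, lang="en"))
# -> ['I', 'move', 'body', '!']

# Domain-specific stopwords can be added on top of the built-in list.
print(remove_stopwords(tokens, lang="en", custom_stopwords=["body"]))

# Passing anything other than a list raises ValueError.
```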
 
- remove_tokens_with_nonletters(tokens)
- Inputs a list of tokens, outputs a list of tokens with any token that includes numbers or special characters removed. e.g. ['foo','bar','124','34euros'] -> ['foo','bar'] (see the sketch below). - Parameters:
- tokens (list) – list of tokens to be cleaned 
- Returns:
- list of tokens without tokens containing numbers 
- Return type:
- list 
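
A short sketch of remove_tokens_with_nonletters; the import path is again an assumption, and the input/output pair is the one documented above.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_tokens_with_nonletters

# Tokens containing digits or other non-letter characters are dropped.
print(remove_tokens_with_nonletters(["foo", "bar", "124", "34euros"]))
# -> ['foo', 'bar']
```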
 
- remove_special_caracters_from_tokenslist(tokens)
- Remove tokens that don’t contain any number or letter. e.g. ['foo','bar','—',"'s",'#'] -> ['foo','bar',"'s"] (see the sketch below). - Parameters:
- tokens (list) – list of tokens to be cleaned 
- Returns:
- list of tokens without tokens that contain only special characters 
- Return type:
- list 
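
A sketch of remove_special_caracters_from_tokenslist, with an assumed import path. Note that "'s" is kept because it contains a letter; only tokens made entirely of special characters are removed.

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_special_caracters_from_tokenslist

# Tokens made only of special characters ("—", "#") are removed;
# "'s" survives because it contains a letter.
print(remove_special_caracters_from_tokenslist(["foo", "bar", "—", "'s", "#"]))
# -> ['foo', 'bar', "'s"]
```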
 
- remove_smallwords(tokens, smallwords_threshold)
- Function that removes words whose length is below a threshold. e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"] (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- smallwords_threshold (int) – word length below which tokens are removed 
 
- Return type:
- list 
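
A sketch of remove_smallwords. Both the import path and the threshold value are assumptions; 3 is chosen here because it reproduces the documented example (the two-letter tokens are dropped, the three-letter "Doe" is kept).

```python
# Hypothetical import path - adjust to your package layout.
from preprocess import remove_smallwords

tokens = ["hello", "my", "name", "is", "John", "Doe"]

# Assumed threshold of 3: "my" and "is" (length 2) fall below it and
# are removed; "Doe" (length 3) is kept.
print(remove_smallwords(tokens, smallwords_threshold=3))
# -> ['hello', 'name', 'John', 'Doe']
```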
 
tokenizer
- tokenize(text, lang_module='en_spacy')
- Convert text to a list of tokens (see the sketch below). - Parameters:
- text (str) – text to tokenize 
- lang_module (str {'en_spacy', 'en_nltk', 'fr_spacy', 'fr_moses', 'ko_spacy', 'ja_spacy'}) – choose the tokenization module according to the language and the implementation. Recommended: spaCy (faster, better results). To process other languages import models.Spacy_models 
 
- Returns:
- list of strings 
- Return type:
- list 
- Raises:
- ValueError – If lang_module is not a valid module name 
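
A usage sketch for tokenize with an assumed import path; the exact token boundaries depend on the backend, so the outputs shown are illustrative.

```python
# Hypothetical import path - adjust to your package layout.
from tokenizer import tokenize

# Default English spaCy tokenization.
print(tokenize("I like when you move your body !"))
# e.g. ['I', 'like', 'when', 'you', 'move', 'your', 'body', '!']

# French text with the Moses backend instead of spaCy.
print(tokenize("J'ai faim", lang_module="fr_moses"))

# An unsupported module name raises ValueError.
```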
 
- untokenize(tokens, lang='fr')
- Inputs a list of tokens, outputs a string. e.g. [“J’”, ‘ai’] -> “J’ ai” (see the sketch below). - Parameters:
- tokens (list(str)) – list of tokens 
- lang (str) – language code 
 
- Returns:
- text 
- Return type:
- string
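
Finally, a sketch of untokenize under the same assumed import path, reproducing the documented example.

```python
# Hypothetical import path - adjust to your package layout.
from tokenizer import untokenize

# Reassemble tokens into a string, per the example above.
print(untokenize(["J'", "ai"], lang="fr"))
# -> "J' ai"
```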