nlpretext
All the go-to functions you need to handle NLP use cases, integrated in NLPretext.
- augmentation
- basic
  - preprocess
    - normalize_whitespace()
    - remove_whitespace()
    - lower_text()
    - filter_groups()
    - ungroup_ignored_stopwords()
    - remove_stopwords()
    - remove_eol_characters()
    - fix_bad_unicode()
    - unpack_english_contractions()
    - replace_urls()
    - replace_emails()
    - replace_phone_numbers()
    - replace_numbers()
    - replace_currency_symbols()
    - remove_punct()
    - remove_accents()
    - remove_multiple_spaces_and_strip_text()
    - filter_non_latin_characters()
 
 
- preprocess
- cli
- social
- token
preprocessor
- class Preprocessor
  - Bases: object
  - pipe(operation, args=None)
    - Add an operation and its arguments to the preprocessor's pipeline.
    - Parameters:
      - operation (callable) – text preprocessing function
      - args (dict of arguments) – arguments for the operation
    - Return type:
      - None

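The pipe(operation, args=None) pattern described above can be illustrated with a small stand-in class. This is only a sketch under stated assumptions: SketchPreprocessor, its run method, and the helper functions to_lower, collapse_spaces and replace_word are hypothetical names written for this example, not NLPretext's own implementation. The sketch also assumes that args is unpacked as keyword arguments when the operation is called.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class SketchPreprocessor:
    """Hypothetical stand-in mirroring the documented pipe() signature."""

    def __init__(self) -> None:
        # Each entry pairs an operation with the arguments it was registered with.
        self._operations: List[Tuple[Callable[..., str], Dict[str, Any]]] = []

    def pipe(self, operation: Callable[..., str],
             args: Optional[Dict[str, Any]] = None) -> None:
        # Register the operation; a missing args dict means "no extra arguments".
        self._operations.append((operation, args or {}))

    def run(self, text: str) -> str:
        # Assumed helper (not documented above): apply operations in the order added.
        for operation, args in self._operations:
            text = operation(text, **args)
        return text


# Hypothetical text functions used only for this illustration.
def to_lower(text: str) -> str:
    return text.lower()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def replace_word(text: str, old: str, new: str) -> str:
    return text.replace(old, new)


sketch = SketchPreprocessor()
sketch.pipe(to_lower)
sketch.pipe(collapse_spaces)
sketch.pipe(replace_word, args={"old": "clean", "new": "tidy"})

result = sketch.run("  NLPretext   makes  Text CLEAN  ")
print(result)
```

Registering operations first and applying them later keeps the order of the pipeline explicit, and passing args as a dict lets each operation carry its own configuration.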
textloader
- class TextLoader(text_column='text', encoding='utf-8', file_format=None, use_dask=True)
  - Bases: object
  - read_text(files_path, file_format=None, encoding=None, compute_to_pandas=True, preprocessor=None)
    - Read the text files stored in files_path.
    - Parameters:
      - files_path (string | list[string]) – path to a single file, or a list of file paths
      - file_format (string) – format of the files to be loaded, one of csv, json, parquet or txt
      - encoding (Optional[str]) – encoding of the text to be loaded, for example utf-8 or latin-1
      - compute_to_pandas (bool) – True to compute the Dask DataFrame into a pandas DataFrame, False otherwise
      - preprocessor (nlpretext.preprocessor.Preprocessor) – optional NLPretext preprocessor used to pre-process the text after loading
    - Return type:
      - dask.dataframe | pandas.DataFrame
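
To make the read_text parameters more concrete, here is a rough standard-library illustration for txt files only. It is not the library's implementation, which returns a Dask or pandas DataFrame. The function read_text_sketch is a hypothetical name, and the sketch assumes one row per line of text and a plain callable in place of a Preprocessor object.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union


def read_text_sketch(
    files_path: Union[str, List[str]],
    encoding: str = "utf-8",
    text_column: str = "text",
    preprocess: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, str]]:
    """Hypothetical helper: load txt files into rows keyed by text_column."""
    # Accept either a single path or a list of paths, as files_path does.
    paths = [files_path] if isinstance(files_path, str) else list(files_path)
    rows: List[Dict[str, str]] = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            for line in handle:
                text = line.rstrip("\n")
                # Optional pre-processing step, applied after loading.
                if preprocess is not None:
                    text = preprocess(text)
                rows.append({text_column: text})
    return rows


with tempfile.TemporaryDirectory() as tmp_dir:
    first = Path(tmp_dir) / "a.txt"
    second = Path(tmp_dir) / "b.txt"
    # Files written as latin-1 must be read back with the same encoding.
    first.write_text("Café OUVERT\nSecond line\n", encoding="latin-1")
    second.write_text("Third LINE\n", encoding="latin-1")
    rows = read_text_sketch(
        [str(first), str(second)], encoding="latin-1", preprocess=str.lower
    )

print(rows)
```

The encoding argument matters here because the files were written as latin-1; reading them as utf-8 would fail on the accented character.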