nlpretext
All the go-to functions you need to handle NLP use cases, integrated in NLPretext.
- class Preprocessor[source]
Bases:
object
- pipe(operation, args=None)[source]
Add an operation and its arguments to the preprocessor's pipe.
- Parameters:
operation (callable) – text preprocessing function
args (dict of arguments) – arguments to pass to the operation; defaults to None
- Return type:
None
- augmentation
- basic
- preprocess
normalize_whitespace()
remove_whitespace()
lower_text()
filter_groups()
ungroup_ignored_stopwords()
remove_stopwords()
remove_eol_characters()
fix_bad_unicode()
unpack_english_contractions()
replace_urls()
replace_emails()
replace_phone_numbers()
replace_numbers()
replace_currency_symbols()
remove_punct()
remove_accents()
remove_multiple_spaces_and_strip_text()
filter_non_latin_characters()
- preprocess
- cli
- social
- token
preprocessor
- class Preprocessor[source]
Bases:
object
- pipe(operation, args=None)[source]
Add an operation and its arguments to the preprocessor's pipe.
- Parameters:
operation (callable) – text preprocessing function
args (dict of arguments) – arguments to pass to the operation; defaults to None
- Return type:
None
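The pipe pattern described above can be sketched in a few lines. This is an illustrative stand-in and not the nlpretext implementation: the class name SketchPreprocessor, the apply method, and the helper functions to_lower and replace_token are all hypothetical. Only the pipe(operation, args=None) signature and its None return value come from the documentation above. The sketch assumes each operation takes the text as its first argument and that args holds keyword arguments.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class SketchPreprocessor:
    """Illustrative stand-in, NOT the nlpretext Preprocessor."""

    def __init__(self) -> None:
        # Each entry pairs a text function with the keyword arguments for it.
        self._steps: List[Tuple[Callable[..., str], Dict[str, Any]]] = []

    def pipe(
        self,
        operation: Callable[..., str],
        args: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Same signature as documented: register the step, return None.
        self._steps.append((operation, dict(args or {})))

    def apply(self, text: str) -> str:
        # Hypothetical helper: run the registered steps in insertion order.
        for operation, kwargs in self._steps:
            text = operation(text, **kwargs)
        return text


def to_lower(text: str) -> str:
    # Hypothetical operation without extra arguments.
    return text.lower()


def replace_token(text: str, old: str, new: str) -> str:
    # Hypothetical operation that needs arguments supplied through args.
    return text.replace(old, new)


sketch = SketchPreprocessor()
sketch.pipe(to_lower)
sketch.pipe(replace_token, args={"old": "nlp", "new": "text"})
result = sketch.apply("NLP Pipelines")
```

Steps run in the order they were piped, so lowercasing happens before the replacement. Swapping the two pipe calls would leave the uppercase "NLP" unmatched.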
textloader
- class TextLoader(text_column='text', encoding='utf-8', file_format=None, use_dask=True)[source]
Bases:
object
- read_text(files_path, file_format=None, encoding=None, compute_to_pandas=True, preprocessor=None)[source]
Read the text files stored in files_path.
- Parameters:
files_path (string | list[string]) – path to a single file, or a list of paths to multiple files
file_format (string) – format of the files to be loaded: one of csv, json, parquet or txt
encoding (Optional[str]) – encoding of the text to be loaded, for example utf-8 or latin-1
compute_to_pandas (bool) – True if the Dask DataFrame should be computed as a pandas DataFrame, False otherwise
preprocessor (nlpretext.preprocessor.Preprocessor) – optional NLPretext preprocessor used to pre-process the text after loading
- Return type:
dask.dataframe | pandas.DataFrame
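To make the roles of files_path, encoding, text_column and preprocessor concrete, here is a minimal standard-library sketch of the "load, then pre-process" flow for txt files. It is not how TextLoader works internally (TextLoader returns Dask or pandas DataFrames, while this returns a plain list of dicts). The function read_txt_rows, its preprocess parameter, its glob-pattern handling and its one-row-per-line layout are all assumptions made for illustration.

```python
import glob
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union


def read_txt_rows(
    files_path: Union[str, List[str]],
    encoding: str = "utf-8",
    text_column: str = "text",
    preprocess: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, str]]:
    """Hypothetical helper: one row per line of every matching txt file."""
    # Accept either a single path/pattern or a list of them.
    patterns = [files_path] if isinstance(files_path, str) else list(files_path)
    rows: List[Dict[str, str]] = []
    for pattern in patterns:
        for name in sorted(glob.glob(pattern)):
            with open(name, encoding=encoding) as handle:
                for line in handle:
                    text = line.rstrip("\n")
                    # Pre-processing happens after loading, as in the docs above.
                    if preprocess is not None:
                        text = preprocess(text)
                    rows.append({text_column: text})
    return rows


# Demo on a temporary file so the sketch does not depend on existing data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Hello World\nSecond Line\n", encoding="utf-8")
    rows = read_txt_rows(str(Path(tmp, "*.txt")), preprocess=str.lower)
```

In the sketch, any callable from str to str can play the preprocessor role. With the real library, that role is filled by a Preprocessor instance passed to read_text.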