nlpretext
All the go-to functions you need to handle NLP use cases, integrated in NLPretext.
- class Preprocessor
Bases: object
- pipe(operation, args=None)
Add an operation and its arguments to pipe in the preprocessor.
- Parameters:
operation (callable) – text preprocessing function
args (dict of arguments) – arguments to pass to operation
- Return type:
None
- augmentation
- basic
- preprocess
normalize_whitespace(), remove_whitespace(), lower_text(), filter_groups(), ungroup_ignored_stopwords(), remove_stopwords(), remove_eol_characters(), fix_bad_unicode(), unpack_english_contractions(), replace_urls(), replace_emails(), replace_phone_numbers(), replace_numbers(), replace_currency_symbols(), remove_punct(), remove_accents(), remove_multiple_spaces_and_strip_text(), filter_non_latin_characters()
- preprocess
- cli
- social
- token
preprocessor
- class Preprocessor
Bases: object
- pipe(operation, args=None)
Add an operation and its arguments to pipe in the preprocessor.
- Parameters:
operation (callable) – text preprocessing function
args (dict of arguments) – arguments to pass to operation
- Return type:
None
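The pipe() behaviour described above (register an operation together with its arguments, return None) can be illustrated with a small standalone sketch. This is not NLPretext's implementation: SketchPreprocessor, its run() method, and the two helper functions are hypothetical names introduced for illustration, and the sketch assumes that args is unpacked as keyword arguments when the operation is called.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class SketchPreprocessor:
    """Illustrative stand-in for a pipe-style preprocessor (not NLPretext's class)."""

    def __init__(self) -> None:
        # Each entry pairs an operation with the keyword arguments to call it with.
        self._operations: List[Tuple[Callable[..., str], Dict[str, Any]]] = []

    def pipe(self, operation: Callable[..., str], args: Optional[Dict[str, Any]] = None) -> None:
        """Register an operation and its arguments; returns None."""
        self._operations.append((operation, args or {}))

    def run(self, text: str) -> str:
        """Hypothetical runner: apply the registered operations in insertion order."""
        for operation, kwargs in self._operations:
            text = operation(text, **kwargs)
        return text


# Hypothetical stand-ins for text preprocessing functions.
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def replace_digits(text: str, replace_with: str = "*") -> str:
    return "".join(replace_with if ch.isdigit() else ch for ch in text)


preprocessor = SketchPreprocessor()
preprocessor.pipe(collapse_spaces)
preprocessor.pipe(replace_digits, args={"replace_with": "#"})
result = preprocessor.run("Call   me at  555 0199 ")
print(result)  # Call me at ### ####
```

Storing the operation and its arguments separately, rather than calling the function at registration time, lets the same pipeline be replayed on any number of texts.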
textloader
- class TextLoader(text_column='text', encoding='utf-8', file_format=None, use_dask=True)
Bases: object
- read_text(files_path, file_format=None, encoding=None, compute_to_pandas=True, preprocessor=None)
Read the text files stored in files_path.
- Parameters:
files_path (string | list[string]) – path to a single file, or a list of file paths
file_format (string) – format of the files to load: csv, json, parquet or txt
encoding (Optional[str]) – encoding of the text to load, for example utf-8 or latin-1
compute_to_pandas (bool) – True to compute the Dask DataFrame into a pandas DataFrame, False otherwise
preprocessor (nlpretext.preprocessor.Preprocessor) – optional NLPretext preprocessor applied to the text after loading
- Return type:
dask.dataframe | pandas.DataFrame
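To make the read_text parameters concrete, here is a minimal standard-library sketch of the same idea: accept one path or several, pick a reader based on the file format, decode with the given encoding, and optionally preprocess each text. It is not the TextLoader implementation. The function name read_text_sketch is hypothetical, it returns a plain list of dicts instead of a Dask or pandas DataFrame, it skips parquet (which needs a third-party reader), and two behaviours are assumptions: falling back to the file extension when file_format is None, and treating json files as one JSON object per line.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union


def read_text_sketch(
    files_path: Union[str, List[str]],
    file_format: Optional[str] = None,
    encoding: str = "utf-8",
    text_column: str = "text",
    preprocess: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, str]]:
    """Load texts from csv, json or txt files into a list of {text_column: text} rows."""
    # Accept either a single path or a list of paths.
    paths = [files_path] if isinstance(files_path, str) else list(files_path)
    rows: List[Dict[str, str]] = []
    for path in paths:
        # Assumption: with no explicit format, fall back to the file extension.
        fmt = file_format or Path(path).suffix.lstrip(".").lower()
        with open(path, encoding=encoding, newline="") as handle:
            if fmt == "csv":
                texts = [record[text_column] for record in csv.DictReader(handle)]
            elif fmt == "json":
                # Assumption: JSON Lines layout, one object per non-empty line.
                texts = [json.loads(line)[text_column] for line in handle if line.strip()]
            elif fmt == "txt":
                # One text per non-empty line.
                texts = [line.rstrip("\r\n") for line in handle if line.strip()]
            else:
                raise ValueError(f"unsupported file format: {fmt!r}")
        for text in texts:
            rows.append({text_column: preprocess(text) if preprocess else text})
    return rows
```

For example, read_text_sketch(["reviews.csv", "notes.txt"], preprocess=str.lower) would read both (hypothetical) files and lowercase every text, mirroring how a preprocessor passed to read_text is applied after loading.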