nlpretext

All the go-to functions you need to handle NLP use cases, integrated in NLPretext.

class Preprocessor[source]

Bases: object

pipe(operation, args=None)[source]

Add an operation and its arguments to the preprocessor's pipe.

Parameters:
  • operation (callable) – text preprocessing function

  • args (dict of arguments) – arguments to pass to operation; defaults to None

Return type:

None

static build_pipeline(operation_list)[source]

Build an sklearn pipeline from an operation list.

Parameters:

operation_list (iterable) – list of preprocessing operations

Return type:

sklearn.pipeline.Pipeline

run(text)[source]

Apply pipeline to text.

Parameters:

text (string) – text to preprocess

Return type:

string
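
The pipe-then-run pattern described above can be illustrated with a small stand-in class. This is a sketch only, not NLPretext's implementation: SketchPreprocessor, to_lower, replace_token and collapse_whitespace are hypothetical names, the sketch uses a plain loop rather than an sklearn pipeline, and it assumes that args is unpacked as keyword arguments to the operation.

```python
from typing import Any, Callable, Dict, List, Optional


class SketchPreprocessor:
    """Hypothetical stand-in mirroring the documented pipe/run interface."""

    def __init__(self) -> None:
        # Each entry records one operation and its optional arguments.
        self._operations: List[Dict[str, Any]] = []

    def pipe(self, operation: Callable[..., str], args: Optional[dict] = None) -> None:
        # Register the operation; nothing is executed yet.
        self._operations.append({"operation": operation, "args": args})

    def run(self, text: str) -> str:
        # Apply the operations in the order they were piped.
        # Assumption: args is passed to the operation as keyword arguments.
        for step in self._operations:
            text = step["operation"](text, **(step["args"] or {}))
        return text


# Hypothetical text operations, defined here so the example runs on its own.
def to_lower(text: str) -> str:
    return text.lower()


def replace_token(text: str, old: str, new: str) -> str:
    return text.replace(old, new)


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


preprocessor = SketchPreprocessor()
preprocessor.pipe(to_lower)
preprocessor.pipe(replace_token, args={"old": "nlp", "new": "text"})
preprocessor.pipe(collapse_whitespace)

cleaned = preprocessor.run("  NLP   pipelines are   Composable ")
print(cleaned)
```

Operations are only recorded by pipe and executed by run, so the order of the pipe calls determines the order in which the text is transformed.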

textloader

class TextLoader(text_column='text', encoding='utf-8', file_format=None, use_dask=True)[source]

Bases: object

read_text(files_path, file_format=None, encoding=None, compute_to_pandas=True, preprocessor=None)[source]

Read the text files stored in files_path.

Parameters:
  • files_path (string | list[string]) – path to a single file, or a list of file paths

  • file_format (string) – format of the files to be loaded: one of csv, json, parquet or txt

  • encoding (Optional[str]) – encoding of the text to be loaded, can be utf-8 or latin-1 for example

  • compute_to_pandas (bool) – if True, the Dask DataFrame is computed into a pandas DataFrame; if False, the Dask DataFrame is returned as is

  • preprocessor (nlpretext.preprocessor.Preprocessor) – optional NLPretext preprocessor used to pre-process the text after loading

Return type:

dask.dataframe | pandas.DataFrame
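
To make the roles of files_path, encoding and preprocessor concrete, here is a rough standard-library sketch of the same idea for txt files. It is not TextLoader's implementation and does not use Dask or pandas: read_txt_sketch is a hypothetical helper, and treating each non-empty line as one record stored under the text column is an assumption made for illustration.

```python
import os
import tempfile
from typing import Callable, Dict, List, Optional, Union


def read_txt_sketch(
    files_path: Union[str, List[str]],
    encoding: str = "utf-8",
    text_column: str = "text",
    preprocess: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, str]]:
    """Hypothetical helper: one record per non-empty line, keyed by text_column."""
    # Accept either a single path or a list of paths.
    paths = [files_path] if isinstance(files_path, str) else list(files_path)
    records: List[Dict[str, str]] = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            for line in handle:
                text = line.rstrip("\n")
                if not text:
                    continue
                # Pre-processing is applied after the text has been loaded.
                if preprocess is not None:
                    text = preprocess(text)
                records.append({text_column: text})
    return records


# Write two small latin-1 files so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, content in [("a.txt", "Café  OUVERT\n"), ("b.txt", "Déjà   vu\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="latin-1") as handle:
        handle.write(content)
    paths.append(path)

# Load several files and normalise case and whitespace after loading.
rows = read_txt_sketch(
    paths,
    encoding="latin-1",
    preprocess=lambda t: " ".join(t.lower().split()),
)

# Load a single file without any pre-processing.
single = read_txt_sketch(paths[0], encoding="latin-1")
print(rows, single)
```

Passing the wrong encoding here would either raise a decoding error or yield garbled characters, which is why the encoding of the source files should be stated explicitly when it is not utf-8.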