Scrubber
Overview
The Scrubber module provides a flexible, pipeline-based system for text cleaning and normalization as part of the Lexos project. It enables users to preprocess text by applying a customizable sequence of "scrubber components" (pipes) to remove, replace, or normalize elements such as punctuation, digits, whitespace, and more.
Features
- Modular pipeline for text scrubbing
- Built-in registry of reusable scrubber components
- Easy addition and removal of pipeline components
- Support for custom components and configuration
- Integration with other Lexos modules
- Batch processing of texts via generator interface
- Robust error handling
Submodules
Normalize
The normalize submodule contains functions that normalize bullet points, hyphenated words, letter case (to lowercase), quotation marks, repeating characters, Unicode, and whitespace by replacing them with more standardized characters.
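To make the idea of normalization concrete, here is a minimal standard-library sketch of four such operations. The function names (`to_lowercase`, `straighten_quotes`, `collapse_whitespace`, `normalize_unicode`) are hypothetical and chosen for illustration; they are not the names or signatures used by the normalize submodule.

```python
import re
import unicodedata

# Hypothetical helpers for illustration only; not the Lexos API.

def to_lowercase(text: str) -> str:
    """Convert all letters to lowercase."""
    return text.lower()

def straighten_quotes(text: str) -> str:
    """Replace curly quotation marks with straight ASCII equivalents."""
    table = str.maketrans({
        "\u2018": "'", "\u2019": "'",   # single curly quotes
        "\u201c": '"', "\u201d": '"',   # double curly quotes
    })
    return text.translate(table)

def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs to one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()

def normalize_unicode(text: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form (NFC, NFD, NFKC, or NFKD)."""
    return unicodedata.normalize(form, text)
```

Each helper takes a string and returns a string, which is the property that lets such functions be chained in any order.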
Pipeline
The pipeline submodule lets the user create a pipeline that calls functions from the other submodules in a specific order.
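The effect of calling functions in a fixed order can be sketched with plain function composition. `make_pipeline` below is a hypothetical helper written for this illustration and is not assumed to match the pipeline submodule's actual interface.

```python
from functools import reduce
from typing import Callable

def make_pipeline(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies each step to the text, left to right."""
    def run(text: str) -> str:
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Built-in string methods stand in for scrubber components here.
clean = make_pipeline(str.strip, str.lower)
result = clean("  Hello World  ")
```

Order matters: a pipeline that lowercases and then uppercases yields a different result from one that applies the same two steps in reverse.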
Registry
The registry submodule contains the functions get_component, which gets one component from a string, and get_components, which gets multiple components from a tuple.
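A name-to-function lookup of this kind might be sketched as follows. The function names come from the description above, but the dictionary-backed registry, the registered names, the error behavior, and the tuple return type are all assumptions made for this example.

```python
from typing import Callable, Dict, Tuple

Component = Callable[[str], str]

# Assumed registry contents; the real registry holds the scrubber components.
_REGISTRY: Dict[str, Component] = {
    "lower": str.lower,
    "strip": str.strip,
}

def get_component(name: str) -> Component:
    """Look up a single component by its registered name."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"No component registered under {name!r}") from None

def get_components(names: Tuple[str, ...]) -> Tuple[Component, ...]:
    """Look up several components, preserving the order of the names."""
    return tuple(get_component(name) for name in names)
```

Resolving components by string name keeps a pipeline definition serializable, since it can be stored as a list of names rather than as function objects.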
Remove
The remove submodule contains functions to remove accents, brackets ( ) [ ] { } together with the text inside them, digits, new lines, text matching a given regex pattern, Project Gutenberg headers, punctuation, tabs, and tags.
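A few of these removals can be approximated with the standard library alone. The helpers below are hypothetical illustrations, not the submodule's functions, and the bracket example handles only non-nested brackets.

```python
import re
import string

# Hypothetical helpers for illustration only; not the Lexos API.

def remove_digits(text: str) -> str:
    """Delete every digit character."""
    return re.sub(r"\d", "", text)

def remove_brackets(text: str) -> str:
    """Delete (), [], and {} groups and their contents (non-nested only)."""
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}", "", text)

def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_pattern(text: str, pattern: str) -> str:
    """Delete every match of a caller-supplied regex pattern."""
    return re.sub(pattern, "", text)
```

Removing a bracketed group leaves the surrounding spaces in place, so a whitespace normalization step is usually worth running afterwards.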
Replace
The replace submodule contains functions that replace currency symbols, digits, emails, emojis, hashtags, text matching a given regex pattern, phone numbers, punctuation, special characters, URLs, and user handles with a string of the form _TYPE_.
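The placeholder substitution described above can be sketched as follows. The regular expressions are deliberately simplified, and the helper names and the specific tokens `_EMAIL_`, `_URL_`, and `_DIGIT_` are assumptions; the source only states that replacements take the form _TYPE_.

```python
import re

# Simplified patterns for illustration; real-world matching needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")  # also swallows trailing punctuation
DIGITS_RE = re.compile(r"\d+")

def replace_emails(text: str, repl: str = "_EMAIL_") -> str:
    """Replace each email address with a placeholder token."""
    return EMAIL_RE.sub(repl, text)

def replace_urls(text: str, repl: str = "_URL_") -> str:
    """Replace each http(s) URL with a placeholder token."""
    return URL_RE.sub(repl, text)

def replace_digits(text: str, repl: str = "_DIGIT_") -> str:
    """Replace each run of digits with a placeholder token."""
    return DIGITS_RE.sub(repl, text)
```

Replacing rather than removing keeps a trace of what kind of element was present, which can matter for later token counts.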
Resources
The resources submodule contains the HTMLTextExtractor class, a subclass of html.parser.HTMLParser.
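For readers unfamiliar with html.parser.HTMLParser, the following sketch shows how a subclass can collect only the text content of a document. `SketchTextExtractor` and `strip_html` are hypothetical names; the real HTMLTextExtractor may be implemented differently.

```python
from html.parser import HTMLParser
from typing import List

class SketchTextExtractor(HTMLParser):
    """Collect the character data between tags and ignore the tags."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        # Called for each run of text; script and style content arrives here too.
        self._chunks.append(data)

    def get_text(self) -> str:
        return "".join(self._chunks)

def strip_html(html: str) -> str:
    """Return the text content of an HTML string."""
    parser = SketchTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()
```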
Scrubber
The scrubber submodule contains the main logic for the Scrubber module: the Pipe dataclass and the Scrubber class. The Pipe class contains only a call method. The Scrubber class contains an initialization method along with the methods add_pipe, pipe, remove_pipe, reset, and scrub, and an attribute, pipes, which returns a list of the pipeline components. The submodule also includes the function scrub, which takes the text to scrub, the pipeline, and an optional factory, and returns the scrubbed text.
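The relationship between a callable pipe and a class that manages an ordered list of pipes might look roughly like the sketch below. `SketchPipe` and `SketchScrubber` are hypothetical stand-ins: the method names mirror those listed above, but every signature and behavior shown is an assumption, including the choice to make `pipes` return component names and to make `pipe` a generator over several texts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

@dataclass
class SketchPipe:
    """A named, callable pipeline component (assumed shape)."""
    name: str
    func: Callable[[str], str]

    def __call__(self, text: str) -> str:
        return self.func(text)

class SketchScrubber:
    """Hold an ordered list of pipes and apply them to text."""

    def __init__(self) -> None:
        self._pipes: List[SketchPipe] = []

    @property
    def pipes(self) -> List[str]:
        # Assumption: expose the component names in pipeline order.
        return [p.name for p in self._pipes]

    def add_pipe(self, name: str, func: Callable[[str], str]) -> None:
        self._pipes.append(SketchPipe(name, func))

    def remove_pipe(self, name: str) -> None:
        self._pipes = [p for p in self._pipes if p.name != name]

    def reset(self) -> None:
        self._pipes = []

    def scrub(self, text: str) -> str:
        for pipe in self._pipes:
            text = pipe(text)
        return text

    def pipe(self, texts: Iterable[str]) -> Iterator[str]:
        # Assumption: batch processing yields one scrubbed text at a time.
        for text in texts:
            yield self.scrub(text)
```

A short usage example under the same assumptions:

```python
scrubber = SketchScrubber()
scrubber.add_pipe("strip", str.strip)
scrubber.add_pipe("lower", str.lower)
cleaned = list(scrubber.pipe(["  First Text ", "SECOND text  "]))
```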
Tags
The tags submodule uses Beautiful Soup to remove attributes, comments, doctypes, elements, and tags, and to replace attributes and tags, in HTML and XML files.
Utils
The utils submodule contains the function get_tags.