Resources

The resources component of Scrubber provides the helper class, functions, and constants (translation tables and regular-expression patterns) used for replacing strings and patterns in text.

lexos.scrubber.resources.HTMLTextExtractor

Bases: html.parser.HTMLParser

Simple subclass of html.parser.HTMLParser.

Collects data elements (non-tag, -comment, -pi, etc. elements) fed to the parser, then makes them available as stripped, concatenated text via HTMLTextExtractor.get_text().

Note

Users probably shouldn't deal with this class directly; instead, use remove.remove_html_tags().

Source code in lexos\scrubber\resources.py
class HTMLTextExtractor(html.parser.HTMLParser):
    """Simple subclass of :class:`html.parser.HTMLParser`.

    Collects data elements (non-tag, -comment, -pi, etc. elements)
    fed to the parser, then makes them available as stripped, concatenated
    text via `HTMLTextExtractor.get_text()`.

    Note:
        Users probably shouldn't deal with this class directly;
        instead, use :func:`remove.remove_html_tags()`.
    """

    def __init__(self):
        """Initialize the parser."""
        super().__init__()
        self.data = []

    def handle_data(self, data):
        """Handle data elements."""
        self.data.append(data)

    def get_text(self, sep: str = "") -> str:
        """Return the collected text."""
        return sep.join(self.data).strip()
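
As a quick illustration of how the class behaves, the sketch below repeats the class as listed (so that it runs without Lexos installed) and feeds it a small HTML snippet. The sample markup and the variable names are assumptions made for this example only.

```python
import html.parser


class HTMLTextExtractor(html.parser.HTMLParser):
    """Copy of the class listed above, repeated so the sketch is self-contained."""

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def get_text(self, sep: str = "") -> str:
        return sep.join(self.data).strip()


# Hypothetical sample input; any HTML string is handled the same way.
sample = "<div>  <p>Hello, <b>world</b>!</p>  </div>"

extractor = HTMLTextExtractor()
extractor.feed(sample)  # handle_data() receives the text found between tags
extractor.close()       # flush any text the parser is still buffering

text = extractor.get_text()
print(text)  # Hello, world!
```

The tags themselves never reach handle_data(), so only the text content survives, and the final strip() removes the whitespace that surrounded the outer elements.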

__init__()

Initialize the parser and create the empty list in which data elements are collected.

Source code in lexos\scrubber\resources.py
def __init__(self):
    """Initialize the parser."""
    super().__init__()
    self.data = []

get_text(sep='')

Return the collected text: the data elements joined with sep, with leading and trailing whitespace stripped from the result.

Source code in lexos\scrubber\resources.py
def get_text(self, sep: str = "") -> str:
    """Return the collected text."""
    return sep.join(self.data).strip()
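
To make the effect of sep concrete, here is a minimal sketch that applies the same join-then-strip expression to a hand-written list of chunks. The chunk values and the helper name join_chunks are hypothetical; they stand in for whatever handle_data() would have collected.

```python
# Hypothetical chunks, standing in for what handle_data() might collect
# from markup such as "<h1> Title</h1><p>Body text </p>".
chunks = [" Title", "Body text "]


def join_chunks(data, sep=""):
    # Same expression as in get_text(): join first, then strip the result.
    return sep.join(data).strip()


no_sep = join_chunks(chunks)
with_space = join_chunks(chunks, sep=" ")
with_newline = join_chunks(chunks, sep="\n")

print(repr(no_sep))      # 'TitleBody text'
print(repr(with_space))  # 'Title Body text'
```

With the default empty separator, text from adjacent elements runs together, so a space or newline separator may be preferable when element boundaries should stay visible. Only the ends of the joined string are stripped; whitespace inside the text is left as it is.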

handle_data(data)

Handle data elements by appending each one to the list of collected data.

Source code in lexos\scrubber\resources.py
def handle_data(self, data):
    """Handle data elements."""
    self.data.append(data)

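lexos.scrubber.resources._get_punct_translation_table() cached

Get the punctuation translation table: a dictionary that maps every Unicode code point whose category starts with "P" (punctuation) to a single space. The result is cached with functools.lru_cache, so the table is built only once.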

Source code in lexos\scrubber\resources.py
@functools.lru_cache(maxsize=None)
def _get_punct_translation_table():
    """Get the punctuation translation table."""
    return dict.fromkeys(
        (
            i
            for i in range(sys.maxunicode)
            if unicodedata.category(chr(i)).startswith("P")
        ),
        " ",
    )

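lexos.scrubber.resources.__getattr__(name)

Look up a module attribute lazily. Accessing PUNCT_TRANSLATION_TABLE returns the cached punctuation translation table; any other name raises AttributeError.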

Source code in lexos\scrubber\resources.py
def __getattr__(name: str) -> Any:
    """Call an attribute lookup from a table."""
    if name == "PUNCT_TRANSLATION_TABLE":
        return _get_punct_translation_table()
    else:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
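
The pattern above relies on Python's module-level __getattr__ hook, which is called only when normal attribute lookup on the module fails. The following sketch imitates it on a throwaway module object so that it can run as a single script; the module name demo_resources, the tiny stand-in table, and the helper names are all hypothetical.

```python
import functools
import types

build_calls = []  # records how many times the table is actually built


@functools.lru_cache(maxsize=None)
def _build_table():
    build_calls.append(1)
    return {ord("!"): " ", ord("?"): " "}  # tiny stand-in table


def _module_getattr(name):
    # Mirrors the listed function: one known name, everything else is an error.
    if name == "PUNCT_TRANSLATION_TABLE":
        return _build_table()
    raise AttributeError(f"module 'demo_resources' has no attribute {name!r}")


# Hypothetical stand-in module; placing __getattr__ in its namespace has the
# same effect as defining the function at the top level of a module file.
demo = types.ModuleType("demo_resources")
demo.__getattr__ = _module_getattr

first = demo.PUNCT_TRANSLATION_TABLE   # triggers the hook and builds the table
second = demo.PUNCT_TRANSLATION_TABLE  # triggers the hook, served from the cache
print(first is second, len(build_calls))  # True 1

try:
    demo.NOT_DEFINED
except AttributeError as err:
    print(err)
```

The benefit of this design is that the expensive table is not built at import time; the cost is paid on first access only, and the cache ensures later accesses reuse the same object.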

Constants

There are also a number of constants:

  • QUOTE_TRANSLATION_TABLE
  • RE_BRACKETS_CURLY
  • RE_BRACKETS_ROUND
  • RE_BRACKETS_SQUARE
  • RE_BULLET_POINTS
  • RE_CURRENCY_SYMBOL
  • RE_EMAIL
  • RE_EMOJI
  • RE_HASHTAG
  • RE_HYPHENATED_WORD
  • RE_LINEBREAK
  • RE_NONBREAKING_SPACE
  • RE_NUMBER
  • RE_PHONE_NUMBER
  • RE_SHORT_URL
  • RE_TAB
  • RE_URL
  • RE_USER_HANDLE
  • RE_ZWSP