Skip to content

Resources¤

Mappings for removing or transforming character patterns.

HTMLTextExtractor ¤

Bases: HTMLParser

Simple subclass of :class:`html.parser.HTMLParser`.

Collects data elements (non-tag, -comment, -pi, etc. elements) fed to the parser, then makes them available as stripped, concatenated text via HTMLTextExtractor.get_text().

Note

Users probably shouldn't deal with this class directly; instead, use :func:`remove.remove_html_tags()`.

Methods:

Name Description
__init__

Initialize the parser.

get_text

Return the collected text.

handle_data

Handle data elements.

Source code in lexos/scrubber/resources.py
class HTMLTextExtractor(html.parser.HTMLParser):
    """Simple subclass of :class:`html.parser.HTMLParser`.

    Collects data elements (non-tag, non-comment, non-pi, etc. elements)
    fed to the parser, then makes them available as stripped, concatenated
    text via `HTMLTextExtractor.get_text()`.

    Note:
        Users probably shouldn't deal with this class directly;
        instead, use :func:`remove.remove_html_tags()`.
    """

    def __init__(self):
        """Initialize the parser with an empty data buffer."""
        super().__init__()
        # Text fragments collected by handle_data(), in document order.
        self.data = []

    def handle_data(self, data: Any) -> None:
        """Handle data elements.

        Args:
            data (Any): The data element(s) to handle.
        """
        self.data.append(data)

    def get_text(self, sep: str = "") -> str:
        """Return the collected text.

        Args:
            sep (str): The separator to join the collected text with.
                Must be a string; ``None`` is not supported because the
                value is passed directly to ``str.join``.

        Returns:
            str: The collected text, stripped of leading and trailing
                whitespace.
        """
        return sep.join(self.data).strip()

__init__() ¤

Initialize the parser.

Source code in lexos/scrubber/resources.py
def __init__(self):
    """Set up the base HTML parser and start with an empty fragment list."""
    super().__init__()
    # Fragments of document text accumulated by handle_data().
    self.data = list()

get_text(sep: Optional[str] = '') -> str ¤

Return the collected text.

Parameters:

Name Type Description Default
sep Optional[str]

The separator to join the collected text with.

''

Returns:

Name Type Description
str str

The collected text.

Source code in lexos/scrubber/resources.py
def get_text(self, sep: Optional[str] = "") -> str:
    """Concatenate every collected fragment into one string.

    Args:
        sep (Optional[str]): String inserted between adjacent fragments.

    Returns:
        str: The joined fragments with surrounding whitespace removed.
    """
    joined = sep.join(self.data)
    return joined.strip()

handle_data(data: Any) -> None ¤

Handle data elements.

Parameters:

Name Type Description Default
data Any

The data element(s) to handle.

required
Source code in lexos/scrubber/resources.py
def handle_data(self, data: Any) -> None:
    """Record a text/data element for later retrieval.

    Args:
        data (Any): The data element(s) to record.
    """
    # Extending with a one-element list preserves insertion order.
    self.data += [data]

RE_LINEBREAK: Pattern = re.compile('(\\r\\n|[\\n\\v])+') module-attribute ¤

RE_NONBREAKING_SPACE: Pattern = re.compile('[^\\S\\n\\v]+') module-attribute ¤

RE_ZWSP: Pattern = re.compile('[\\u200B\\u2060\\uFEFF]+') module-attribute ¤

RE_TAB: Pattern = re.compile('[\\t\\v]+') module-attribute ¤

RE_BRACKETS_CURLY = re.compile('\\{[^{}]*?\\}') module-attribute ¤

RE_BRACKETS_ROUND = re.compile('\\([^()]*?\\)') module-attribute ¤

RE_BRACKETS_SQUARE = re.compile('\\[[^\\[\\]]*?\\]') module-attribute ¤

RE_BULLET_POINTS = re.compile('((^|\\n)\\s*?)([\\u2022\\u2023\\u2043\\u204C\\u204D\\u2219\\u25aa\\u25CF\\u25E6\\u29BE\\u29BF\\u30fb])') module-attribute ¤

RE_URL: Pattern = re.compile('(?:^|(?<![\\w/.]))(?:(?:https?://|ftp://|www\\d{0,3}\\.))(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:/\\S*)?(?:$|(?![\\w?!+&/]))', flags=(re.IGNORECASE)) module-attribute ¤