Resources
The resources component of Scrubber contains a set of functions for replacing strings and patterns in text.
lexos.scrubber.resources.HTMLTextExtractor
Bases: html.parser.HTMLParser
Simple subclass of html.parser.HTMLParser.
Collects data elements (non-tag, -comment, -pi, etc. elements) fed to the parser, then makes them available as stripped, concatenated text via HTMLTextExtractor.get_text().
Note
Users probably shouldn't deal with this class directly; instead, use remove.remove_html_tags().
Source code in lexos\scrubber\resources.py
__init__()
Initialize the parser.
Source code in lexos\scrubber\resources.py
get_text(sep='')
Return the collected text.
Source code in lexos\scrubber\resources.py
handle_data(data)
Handle data elements.
Source code in lexos\scrubber\resources.py
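The class and its three methods described above can be approximated in a few lines. This is a minimal sketch written from the descriptions on this page, not a copy of the Lexos source: the class name SketchTextExtractor, the attribute name data, and the exact stripping behavior of get_text() are assumptions.

```python
from html.parser import HTMLParser


class SketchTextExtractor(HTMLParser):
    """Hypothetical stand-in for HTMLTextExtractor (not the Lexos class)."""

    def __init__(self):
        super().__init__()
        self.data = []  # assumed attribute name for the collected text chunks

    def handle_data(self, data):
        # Called by HTMLParser for text between tags; tags, comments,
        # and processing instructions go to other handlers and are ignored.
        self.data.append(data)

    def get_text(self, sep=""):
        # Join the collected chunks and strip surrounding whitespace.
        return sep.join(self.data).strip()


parser = SketchTextExtractor()
parser.feed("<p>Hello, <b>world</b>!</p>")
parser.close()
text = parser.get_text()
print(text)
```

Passing a non-empty sep, such as a single space, keeps text from adjacent elements from running together.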
lexos.scrubber.resources._get_punct_translation_table()
cached
Get the punctuation translation table.
Source code in lexos\scrubber\resources.py
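The "cached" label indicates that the table is built once and reused on later calls. The sketch below shows one way such a table could be built, assuming it maps every Unicode punctuation code point to a single space. The function name build_punct_table and the choice of a space as the replacement are assumptions, not details confirmed by this page.

```python
import functools
import sys
import unicodedata


@functools.lru_cache(maxsize=None)
def build_punct_table():
    """Hypothetical: map each Unicode punctuation code point to a space."""
    return {
        cp: " "
        for cp in range(sys.maxunicode + 1)
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(chr(cp)).startswith("P")
    }


table = build_punct_table()
# str.translate accepts a dict keyed by code point.
cleaned = "Hello, world!".translate(table)
print(repr(cleaned))
```

Scanning the full Unicode range is slow enough that caching the result is worthwhile; a second call to build_punct_table() returns the same dictionary object.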
lexos.scrubber.resources.__getattr__(name)
Call an attribute lookup from a table.
Source code in lexos\scrubber\resources.py
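A module-level __getattr__ is called only when ordinary attribute lookup on the module fails, which lets a module build an expensive value on first access. The following sketch illustrates the mechanism on a throwaway module object so that it runs on its own. The attribute name LAZY_TABLE, the module name demo_resources, and the helper _build_table are hypothetical and do not come from Lexos.

```python
import types


def _build_table():
    # Stand-in for an expensive table; the real table is not shown on this page.
    return {ord("!"): " ", ord("?"): " "}


def _module_getattr(name):
    # Invoked only when normal attribute lookup on the module fails.
    if name == "LAZY_TABLE":  # hypothetical attribute name
        return _build_table()
    raise AttributeError(f"module 'demo_resources' has no attribute {name!r}")


# Attach the hook to a throwaway module so the sketch is self-contained.
demo = types.ModuleType("demo_resources")
demo.__getattr__ = _module_getattr

lazy_table = demo.LAZY_TABLE
print(lazy_table)
```

In a real module the function would simply be defined at the top level under the name __getattr__; unknown names should still raise AttributeError so that typos are not silently accepted.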
Constants
There are also a number of constants:
QUOTE_TRANSLATION_TABLE
RE_BRACKETS_CURLY
RE_BRACKETS_ROUND
RE_BRACKETS_SQUARE
RE_BULLET_POINTS
RE_CURRENCY_SYMBOL
RE_EMAIL
RE_EMOJI
RE_HASHTAG
RE_HYPHENATED_WORD
RE_LINEBREAK
RE_NONBREAKING_SPACE
RE_NUMBER
RE_PHONE_NUMBER
RE_SHORT_URL
RE_TAB
RE_URL
RE_USER_HANDLE
RE_ZWSP
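Constants of this kind are typically used in one of two ways: translation tables are passed to str.translate(), and compiled patterns are applied with their sub() method. The sketch below uses simplified stand-ins to show both. The DEMO_ names and their patterns are assumptions for illustration and are not the definitions found in lexos.scrubber.resources.

```python
import re

# Simplified stand-ins, NOT the patterns defined in lexos.scrubber.resources.
DEMO_RE_TAB = re.compile(r"\t+")
DEMO_RE_ZWSP = re.compile("\u200b")
DEMO_QUOTE_TABLE = {
    ord("\u201c"): '"',  # left double quotation mark
    ord("\u201d"): '"',  # right double quotation mark
    ord("\u2019"): "'",  # right single quotation mark
}

raw = "\u201cIt\u2019s\u200b fine\u201d\tshe said"

text = raw.translate(DEMO_QUOTE_TABLE)  # curly quotes to straight quotes
text = DEMO_RE_ZWSP.sub("", text)       # drop zero-width spaces
text = DEMO_RE_TAB.sub(" ", text)       # collapse tabs to one space
print(text)
```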