Resources
The resources component of Scrubber contains a set of functions for replacing strings and patterns in text.
lexos.scrubber.resources.HTMLTextExtractor
Bases: html.parser.HTMLParser
Simple subclass of html.parser.HTMLParser.
Collects the data elements (anything that is not a tag, comment, processing instruction, etc.)
fed to the parser, then makes them available as stripped, concatenated
text via HTMLTextExtractor.get_text().
Note
Users probably shouldn't deal with this class directly;
instead, use remove.remove_html_tags().
Source code in lexos\scrubber\resources.py
__init__()
Initialize the parser.
Source code in lexos\scrubber\resources.py
get_text(sep='')
Return the collected text.
Source code in lexos\scrubber\resources.py
handle_data(data)
Handle data elements.
Source code in lexos\scrubber\resources.py
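The behaviour described above (collect text in handle_data, return it joined and stripped from get_text) can be sketched as follows. This is a minimal illustration built on the standard library's html.parser.HTMLParser, not the Lexos source; the class name TextExtractorSketch and the data attribute are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractorSketch(HTMLParser):
    """Hypothetical stand-in for HTMLTextExtractor; not the Lexos source."""

    def __init__(self):
        super().__init__()
        self.data = []  # text fragments found between tags (assumed attribute)

    def handle_data(self, data):
        # HTMLParser calls this for every run of text outside a tag.
        self.data.append(data)

    def get_text(self, sep=""):
        # Join the collected fragments and strip surrounding whitespace.
        return sep.join(self.data).strip()


parser = TextExtractorSketch()
parser.feed("<p>Hello, <b>world</b>!</p>")
parser.close()
text = parser.get_text()
print(text)  # Hello, world!
```

Passing a non-empty sep (for example a single space) keeps text from adjacent elements from running together, at the cost of extra whitespace around inline tags.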
lexos.scrubber.resources._get_punct_translation_table()
cached
Get the punctuation translation table.
Source code in lexos\scrubber\resources.py
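One plausible way to build and cache such a table is sketched below. It assumes the table maps every Unicode code point in a punctuation category to a space and is suitable for str.translate; the function name build_punct_table is hypothetical, and the actual Lexos function may differ in what it maps and how.

```python
import functools
import sys
import unicodedata


@functools.lru_cache(maxsize=None)
def build_punct_table():
    """Hypothetical helper: map each Unicode punctuation code point to a space."""
    return dict.fromkeys(
        (
            cp
            for cp in range(sys.maxunicode + 1)
            # Unicode general categories starting with "P" are punctuation.
            if unicodedata.category(chr(cp)).startswith("P")
        ),
        " ",
    )


table = build_punct_table()  # the first call scans the whole Unicode range
cleaned = "Hello, world!".translate(table)
print(repr(cleaned))  # 'Hello  world '
```

Caching matters here because the scan covers more than a million code points; later calls return the same dictionary object without recomputing it.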
lexos.scrubber.resources.__getattr__(name)
Call an attribute lookup from a table.
Source code in lexos\scrubber\resources.py
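A module-level __getattr__ is a standard Python mechanism (PEP 562) for computing attributes on first access. The sketch below shows the general pattern on a throwaway module object; the attribute name PUNCT_TRANSLATION_TABLE, the helper _build_table, and the module name demo_resources are assumptions for illustration and are not taken from the Lexos source.

```python
import types

# Hypothetical module object standing in for a real resources module.
demo = types.ModuleType("demo_resources")


def _build_table():
    # Placeholder for an expensive, cacheable computation.
    return {ord("!"): " "}


def _module_getattr(name):
    # Python calls this only when normal lookup on the module fails.
    if name == "PUNCT_TRANSLATION_TABLE":
        return _build_table()
    raise AttributeError(f"module {demo.__name__!r} has no attribute {name!r}")


# In a real module file this would be written as `def __getattr__(name): ...`.
demo.__getattr__ = _module_getattr

table = demo.PUNCT_TRANSLATION_TABLE
print(table)  # {33: ' '}
```

The benefit of this pattern is that importing the module stays cheap: the table is built only if some caller actually asks for it.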
Constants
There are also a number of constants:
QUOTE_TRANSLATION_TABLE
RE_BRACKETS_CURLY
RE_BRACKETS_ROUND
RE_BRACKETS_SQUARE
RE_BULLET_POINTS
RE_CURRENCY_SYMBOL
RE_EMAIL
RE_EMOJI
RE_HASHTAG
RE_HYPHENATED_WORD
RE_LINEBREAK
RE_NONBREAKING_SPACE
RE_NUMBER
RE_PHONE_NUMBER
RE_SHORT_URL
RE_TAB
RE_URL
RE_USER_HANDLE
RE_ZWSP
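As a rough illustration of how constants of this kind are typically used, the sketch below applies a translation table with str.translate and compiled patterns with sub. The three objects defined here are simplified stand-ins with a _DEMO suffix, written for this example only; the actual Lexos definitions are not shown on this page and are likely more thorough.

```python
import re

# Simplified stand-ins (assumptions), not the Lexos definitions.
QUOTE_TABLE_DEMO = {
    ord("\u2018"): "'",  # left single quotation mark
    ord("\u2019"): "'",  # right single quotation mark
    ord("\u201C"): '"',  # left double quotation mark
    ord("\u201D"): '"',  # right double quotation mark
}
RE_ZWSP_DEMO = re.compile("[\u200B\u2060\uFEFF]+")  # zero-width characters
RE_LINEBREAK_DEMO = re.compile(r"(?:\r\n|\n)+")     # runs of line breaks

raw = "\u201CIt\u2019s\u200B done\u201D\r\n\r\nNext line"

text = raw.translate(QUOTE_TABLE_DEMO)    # curly quotes -> straight quotes
text = RE_ZWSP_DEMO.sub("", text)         # drop zero-width characters
text = RE_LINEBREAK_DEMO.sub("\n", text)  # collapse line-break runs

print(repr(text))
```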