
Remove

The remove component of Scrubber contains a set of functions for removing strings and patterns from text.

lexos.scrubber.remove.accents(text, *, fast=False, accents=None)

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters:

  text (str): The text from which accents will be removed. (required)
  fast (bool): If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless. (default: False)
  accents (Union[str, tuple]): An optional string or tuple of strings indicating the names of diacritics to be stripped. (default: None)

Returns:

  str

Note

fast=True can be significantly faster than fast=False, but its transformation of text is less "safe" and more likely to result in changes of meaning, spelling errors, etc.

See Also
  • For a chart containing Unicode standard names of diacritics, see https://en.wikipedia.org/wiki/Combining_Diacritical_Marks#Character_table
  • For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
Source code in lexos\scrubber\remove.py
def accents(text: str, *, fast: bool = False, accents: Union[str, tuple] = None) -> str:
    """Remove accents from any accented unicode characters in `text`, either
    by replacing them with ASCII equivalents or removing them entirely.

    Args:
        text (str): The text from which accents will be removed.
        fast: If False, accents are removed from any unicode symbol
            with a direct ASCII equivalent; if True, accented chars
            for all unicode symbols are removed, regardless.
        accents: An optional string or tuple of strings indicating the
            names of diacritics to be stripped.

    Returns:
        str

    Note: `fast=True` can be significantly faster than `fast=False`,
        but its transformation of `text` is less "safe" and more likely
        to result in changes of meaning, spelling errors, etc.

    See Also:
        - For a chart containing Unicode standard names of diacritics, see
        https://en.wikipedia.org/wiki/Combining_Diacritical_Marks#Character_table
        - For a more powerful (but slower) alternative, check out `unidecode`:
        https://github.com/avian2/unidecode
    """
    if fast is False:
        if accents:
            if isinstance(accents, str):
                accents = set(unicodedata.lookup(accents))
            elif len(accents) == 1:
                accents = set(unicodedata.lookup(accents[0]))
            else:
                accents = set(map(unicodedata.lookup, accents))
            return "".join(
                char
                for char in unicodedata.normalize("NFKD", text)
                if char not in accents
            )
        else:
            return "".join(
                char
                for char in unicodedata.normalize("NFKD", text)
                if not unicodedata.combining(char)
            )
    else:
        return (
            unicodedata.normalize("NFKD", text)
            .encode("ascii", errors="ignore")
            .decode("ascii")
        )
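
For illustration, here is how the function might be called. The import path follows the qualified names on this page, and the results are inferred from the NFKD-based source above rather than copied from the library's own documentation:

from lexos.scrubber import remove

remove.accents("résumé naïveté")             # -> 'resume naivete'
remove.accents("résumé naïveté", fast=True)  # -> 'resume naivete'
# Strip only acute accents; other diacritics (here the diaeresis) survive:
remove.accents("résumé naïveté", accents="COMBINING ACUTE ACCENT")  # roughly 'resume naïvete'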

lexos.scrubber.remove.brackets(text, *, only=None)

Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.

Parameters:

  text (str): The text from which brackets will be removed. (required)
  only (Optional[str | Collection[str]]): Remove only those bracketed contents as specified here: "curly", "square", and/or "round". For example, "square" removes only those contents found between square brackets, while ["round", "square"] removes those contents found between square or round brackets, but not curly. (default: None)

Returns:

  str

Note

This function relies on regular expressions, applied sequentially for curly, square, then round brackets; as such, it doesn't handle nested brackets of the same type and may behave unexpectedly on text with "wild" use of brackets. It should be fine removing structured bracketed contents, as is often used, for instance, to denote in-text citations.

Source code in lexos\scrubber\remove.py
def brackets(
    text: str,
    *,
    only: Optional[str | Collection[str]] = None,
) -> str:
    """Remove text within curly {}, square [], and/or round () brackets, as well as
    the brackets themselves.

    Args:
        text (str): The text from which brackets will be removed.
        only: Remove only those bracketed contents as specified here: "curly", "square",
            and/or "round". For example, `"square"` removes only those contents found
            between square brackets, while `["round", "square"]` removes those contents
            found between square or round brackets, but not curly.

    Returns:
        str

    Note:
        This function relies on regular expressions, applied sequentially for curly,
        square, then round brackets; as such, it doesn't handle nested brackets of the
        same type and may behave unexpectedly on text with "wild" use of brackets.
        It should be fine removing structured bracketed contents, as is often used,
        for instance, to denote in-text citations.
    """
    only = utils.to_collection(only, val_type=str, col_type=set)
    if only is None or "curly" in only:
        text = resources.RE_BRACKETS_CURLY.sub("", text)
    if only is None or "square" in only:
        text = resources.RE_BRACKETS_SQUARE.sub("", text)
    if only is None or "round" in only:
        text = resources.RE_BRACKETS_ROUND.sub("", text)
    return text
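
A short usage sketch (expected behaviour is read off the sequential regex substitutions above, not verified output):

from lexos.scrubber import remove

text = "She sailed [in 1492] to the Indies (or so she thought) {fact-check}."
remove.brackets(text)                             # all three bracketed spans are removed
remove.brackets(text, only="square")              # only "[in 1492]" is removed
remove.brackets(text, only=["round", "square"])   # curly-bracketed content is kept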

lexos.scrubber.remove.digits(text, *, only=None)

Remove digits.

Remove digits from text by replacing all instances of digits (or a subset thereof specified by only) with whitespace.

Removes signed/unsigned numbers and decimal/delimiter-separated numbers. Does not remove currency symbols. Some tokens containing digits will be modified.

Parameters:

  text (str): The text from which digits will be removed. (required)
  only (Optional[str | Collection[str]]): Remove only those digits specified here. For example, "9" removes only 9, while ["1", "2", "3"] removes 1, 2, 3; if None, all unicode digits are removed. (default: None)

Returns:

  str

Source code in lexos\scrubber\remove.py
def digits(text: str, *, only: Optional[str | Collection[str]] = None) -> str:
    """Remove digits.

    Remove digits from `text` by replacing all instances of digits
    (or a subset thereof specified by `only`) with whitespace.

    Removes signed/unsigned numbers and decimal/delimiter-separated
    numbers. Does not remove currency symbols. Some tokens containing
    digits will be modified.

    Args:
        text (str): The text from which digits will be removed.
        only: Remove only those digits specified here. For example,
            `"9"` removes only 9, while `["1", "2", "3"]` removes 1, 2, 3;
            if None, all unicode digits are removed.

    Returns:
        str
    """
    if only:
        if isinstance(only, list):
            pattern = re.compile(f'[{"".join(only)}]')
        else:
            pattern = re.compile(only)
    else:
        # Using "." to represent any unicode character used to indicate
        # a decimal number, and "***" to represent any sequence of
        # unicode digits, this pattern will match:
        # 1) ***
        # 2) ***.***
        unicode_digits = ""
        for i in range(sys.maxunicode):
            if unicodedata.category(chr(i)).startswith("N"):
                unicode_digits = unicode_digits + chr(i)
        pattern = re.compile(
            r"([+-]?["
            + re.escape(unicode_digits)
            + r"])|((?<="
            + re.escape(unicode_digits)
            + r")[\u0027|\u002C|\u002E|\u00B7|"
            r"\u02D9|\u066B|\u066C|\u2396][" + re.escape(unicode_digits) + r"]+)",
            re.UNICODE,
        )
    return str(re.sub(pattern, r"", text))
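
A brief sketch of typical calls; the exact spacing of the results depends on the pattern built above, so the comments describe rather than quote the output:

from lexos.scrubber import remove

remove.digits("Room 101, floor 3")                   # every digit is removed
remove.digits("Room 101, floor 3", only="0")         # only the zeros are removed
remove.digits("Room 101, floor 3", only=["1", "3"])  # only 1s and 3s are removed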

lexos.scrubber.remove.new_lines(text)

Remove new lines.

Remove all line-break characters.

Parameters:

  text (str): The text from which new lines will be removed. (required)

Returns:

  str: The normalized text.

Source code in lexos\scrubber\remove.py
def new_lines(text: str) -> str:
    """Remove new lines.

    Remove all line-breaking spaces.

    Args:
        text (str): The text from which new lines will be removed.

    Returns:
        The normalized text.
    """
    return resources.RE_LINEBREAK.sub("", text).strip()
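
A minimal sketch of usage, with the effect inferred from the substitution above:

from lexos.scrubber import remove

remove.new_lines("one\ntwo\r\nthree")
# Line breaks are deleted outright rather than replaced with spaces,
# so adjacent words run together (roughly 'onetwothree').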

lexos.scrubber.remove.pattern(text, *, pattern)

Remove strings from text using a regex pattern.

Parameters:

  text (str): The text from which patterns will be removed. (required)
  pattern (Union[str, Collection[str]]): The pattern to match. (required)

Returns:

  str

Source code in lexos\scrubber\remove.py
def pattern(text: str, *, pattern: Union[str, Collection[str]]) -> str:
    """Remove strings from `text` using a regex pattern.

    Args:
        text (str): The text from which patterns will be removed.
        pattern: The pattern to match.

    Returns:
        str
    """
    if isinstance(pattern, list):
        pattern = "|".join(pattern)
    pat = re.compile(pattern)
    return re.sub(pat, "", text)
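
For illustration (a hypothetical pattern; any valid regular expression works):

from lexos.scrubber import remove

remove.pattern("The year 1850 was dry.", pattern=r"\d+\s*")   # -> 'The year was dry.'
# A list of patterns is joined with "|" before matching:
remove.pattern("foo bar baz", pattern=[r"ba\w+", r"\s+$"])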

lexos.scrubber.remove.project_gutenberg_headers(text)

Remove Project Gutenberg headers and footers.

Parameters:

  text (str): The text from which headers and footers will be removed. (required)

Returns:

  str

Notes

This function is reproduced from the Gutenberg package's strip_headers() function (https://github.com/c-w/gutenberg), itself a port of the C++ utility by Johannes Krugel.

Source code in lexos\scrubber\remove.py
def project_gutenberg_headers(text: str) -> str:
    """Remove Project Gutenberg headers and footers.

    Args:
        text (str): The text from which headers and footers will be removed.

    Returns:
        str

    Notes:
        This function is reproduced from the Gutenberg package's `strip_headers()`
        function (https://github.com/c-w/gutenberg), itself a port of the C++ utility
        by Johannes Krugel.
    """
    lines = text.splitlines()
    sep = str(os.linesep)

    out = []
    i = 0
    footer_found = False
    ignore_section = False

    for line in lines:
        reset = False

        if i <= 600:
            # Check if the header ends here
            if any(line.startswith(token) for token in resources.TEXT_START_MARKERS):
                reset = True

            # If it's the end of the header, delete the output produced so far.
            # May be done several times, if multiple lines occur indicating the
            # end of the header
            if reset:
                out = []
                continue

        if i >= 100:
            # Check if the footer begins here
            if any(line.startswith(token) for token in resources.TEXT_END_MARKERS):
                footer_found = True

            # If it's the beginning of the footer, stop output
            if footer_found:
                break

        if any(line.startswith(token) for token in resources.LEGALESE_START_MARKERS):
            ignore_section = True
            continue
        elif any(line.startswith(token) for token in resources.LEGALESE_END_MARKERS):
            ignore_section = False
            continue

        if not ignore_section:
            out.append(line.rstrip(sep))
            i += 1

    return sep.join(out).strip()
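
Typical usage on a locally saved Project Gutenberg plain-text file (the filename is hypothetical):

from lexos.scrubber import remove

with open("pg2701.txt", encoding="utf-8") as f:
    raw = f.read()
body = remove.project_gutenberg_headers(raw)  # header and footer boilerplate stripped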

lexos.scrubber.remove.punctuation(text, *, exclude=None, only=None)

Remove punctuation from text.

Removes all instances of punctuation (or a subset thereof specified by only).

Parameters:

  text (str): The text from which punctuation will be removed. (required)
  exclude (Optional[str | Collection[str]]): Remove all punctuation except designated characters. (default: None)
  only (Optional[str | Collection[str]]): Remove only those punctuation marks specified here. For example, "." removes only periods, while [",", ";", ":"] removes commas, semicolons, and colons; if None, all unicode punctuation marks are removed. (default: None)

Returns:

  str

Note

When only=None, Python's built-in str.translate() is used; otherwise, a regular expression is used. The former's performance can be up to an order of magnitude faster.

Source code in lexos\scrubber\remove.py
def punctuation(
    text: str,
    *,
    exclude: Optional[str | Collection[str]] = None,
    only: Optional[str | Collection[str]] = None,
) -> str:
    """Remove punctuation from `text`.

    Removes all instances of punctuation (or a subset thereof specified by `only`).

    Args:
        text (str): The text from which punctuation will be removed.
        exclude: Remove all punctuation except designated characters.
        only: Remove only those punctuation marks specified here. For example,
            `"."` removes only periods, while `[",", ";", ":"]` removes commas,
            semicolons, and colons; if None, all unicode punctuation marks are removed.

    Returns:
        str

    Note:
        When `only=None`, Python's built-in `str.translate()` is used;
        otherwise, a regular expression is used. The former's performance
        can be up to an order of magnitude faster.
    """
    if only is not None:
        only = utils.to_collection(only, val_type=str, col_type=set)
        return re.sub("[{}]+".format(re.escape("".join(only))), "", text)
    else:
        if exclude:
            exclude = utils.ensure_list(exclude)
        else:
            exclude = []
        # Note: We can't use the cached translation table because it replaces
        # the punctuation with whitespace, so we have to build a new one.
        translation_table = dict.fromkeys(
            (
                i
                for i in range(sys.maxunicode)
                if unicodedata.category(chr(i)).startswith("P")
                and chr(i) not in exclude
            ),
            "",
        )
        return text.translate(translation_table)
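
A quick sketch of the three modes, with results inferred from the source above:

from lexos.scrubber import remove

remove.punctuation("Hello, world! (Really?)")                   # -> 'Hello world Really'
remove.punctuation("Hello, world! (Really?)", only=[",", "!"])  # -> 'Hello world (Really?)'
remove.punctuation("Hello, world! (Really?)", exclude="?")      # the question mark is kept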

lexos.scrubber.remove.tabs(text)

Remove tabs.

If you want to replace tabs with a single space, use normalize.whitespace() instead.

Parameters:

  text (str): The text from which tabs will be removed. (required)

Returns:

  str: The stripped text.

Source code in lexos\scrubber\remove.py
def tabs(text: str) -> str:
    """Remove tabs.

    If you want to replace tabs with a single space, use
    `normalize.whitespace()` instead.

    Args:
        text (str): The text from which tabs will be removed.

    Returns:
        The stripped text.
    """
    return resources.RE_TAB.sub("", text)
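
A minimal sketch of usage:

from lexos.scrubber import remove

remove.tabs("name\tvalue\tnotes")
# Tabs are deleted rather than replaced, so this yields roughly 'namevaluenotes';
# as noted above, use normalize.whitespace() to replace tabs with a single space.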

lexos.scrubber.remove.tags(text, sep=' ', remove_whitespace=True)

Remove tags from text.

Parameters:

  text (str): The text from which tags will be removed. (required)
  sep (str): A string to insert between tags and text found between them. (default: ' ')
  remove_whitespace (bool): If True, remove extra whitespace between text after tags are removed. (default: True)

Returns:

  str: A string containing just the text found between tags and other non-data elements.

Note
  • If you want to perform selective removal of tags, use replace.tag_map instead.
  • This function relies on the stdlib html.parser.HTMLParser. It appears to work for stripping tags from both html and xml. Using lxml or BeautifulSoup might be faster, but this is untested.
  • This function preserves text in comments, as well as tags.
Source code in lexos\scrubber\remove.py
def tags(text: str, sep: str = " ", remove_whitespace: bool = True) -> str:
    """Remove tags from `text`.

    Args:
        text (str): The text from which tags will be removed.
        sep: A string to insert between tags and text found between them.
        remove_whitespace: If True, remove extra whitespace between text
            after tags are removed.

    Returns:
        A string containing just the text found between tags and other
        non-data elements.

    Note:
        - If you want to perform selective removal of tags,
            use `replace.tag_map` instead.
        - This function relies on the stdlib `html.parser.HTMLParser`.
            It appears to work for stripping tags from both html and xml.
            Using `lxml` or BeautifulSoup might be faster, but this is untested.
        - This function preserves text in comments, as well as tags
    """
    parser = resources.HTMLTextExtractor()
    parser.feed(text)
    text = parser.get_text(sep=sep)
    if remove_whitespace:
        text = re.sub(r"[\n\s\t\v ]+", sep, text, flags=re.UNICODE)
    return text
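
An illustrative call (the exact spacing of the result depends on HTMLTextExtractor and the whitespace collapsing above):

from lexos.scrubber import remove

html = "<p>Call me <b>Ishmael</b>.</p><!-- narrator -->"
remove.tags(html)                                      # tags stripped; text nodes joined with sep
remove.tags(html, sep="\n", remove_whitespace=False)   # keep each text node on its own line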

Note

Tag handling has been ported over from the Lexos web app, which uses BeautifulSoup and lxml to parse the tree. It will be good to watch the development of selectolax, which claims to be more efficient, at least for HTML. An implementation with spaCy is available in the spacy-html-tokenizer, though it may not be right for integration into Lexos since the output is a doc in which tokens are sentences.