Replace

The replace component of Scrubber contains a set of functions for replacing strings and patterns in text.

Important

Some functions have the same names as functions in the remove component. To distinguish them in the registry, replace functions with the same names are prefixed with re_. When loaded into a script, they can be given any name the user desires.

lexos.scrubber.replace.currency_symbols(text, repl='_CUR_')

Replace all currency symbols in text with repl.

Parameters:

    text (str, required): The text in which currency symbols will be replaced.
    repl (str, default '_CUR_'): The replacement value for currency symbols.

Returns:

    str: The text with currency symbols replaced.

Source code in lexos\scrubber\replace.py
def currency_symbols(text: str, repl: str = "_CUR_") -> str:
    """Replace all currency symbols in `text` with `repl`.

    Args:
        text (str): The text in which currency symbols will be replaced.
        repl (str): The replacement value for currency symbols.

    Returns:
        str: The text with currency symbols replaced.
    """
    return resources.RE_CURRENCY_SYMBOL.sub(repl, text)
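
A minimal, self-contained sketch of the same substitution. `resources.RE_CURRENCY_SYMBOL` is not shown on this page, so the character class below is a hypothetical stand-in that covers only a handful of symbols; the library's pattern is presumably broader.

```python
import re

# Hypothetical stand-in for resources.RE_CURRENCY_SYMBOL (assumption:
# the real pattern covers many more currency symbols).
RE_CURRENCY_SYMBOL = re.compile(r"[$¢£¥€₹]")


def currency_symbols(text: str, repl: str = "_CUR_") -> str:
    """Replace every symbol matched by the stand-in pattern with `repl`."""
    return RE_CURRENCY_SYMBOL.sub(repl, text)


result = currency_symbols("Tickets cost $20 or €18.")
print(result)
```

Only the symbol is replaced; the amount that follows it is left in place.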

lexos.scrubber.replace.digits(text, repl='_DIGIT_')

Replace all digits in text with repl.

Parameters:

    text (str, required): The text in which digits will be replaced.
    repl (str, default '_DIGIT_'): The replacement value for digits.

Returns:

    str: The text with digits replaced.

Source code in lexos\scrubber\replace.py
def digits(text: str, repl: str = "_DIGIT_") -> str:
    """Replace all digits in `text` with `repl`.

    Args:
        text (str): The text in which digits will be replaced.
        repl (str): The replacement value for digits.

    Returns:
        str: The text with digits replaced.
    """
    return resources.RE_NUMBER.sub(repl, text)
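
The listing delegates to `resources.RE_NUMBER`, which is not reproduced here, so whether a run such as `1999` is replaced once or digit by digit depends on that pattern. The sketch below assumes a stand-in that treats each run of digits as one match.

```python
import re

# Hypothetical stand-in for resources.RE_NUMBER (assumption: one match
# per run of digits; the library's pattern may differ).
RE_NUMBER = re.compile(r"\d+")


def digits(text: str, repl: str = "_DIGIT_") -> str:
    """Replace each run of digits with `repl`."""
    return RE_NUMBER.sub(repl, text)


result = digits("Chapter 12 was printed in 1999.")
print(result)
```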

lexos.scrubber.replace.emails(text, repl='_EMAIL_')

Replace all email addresses in text with repl.

Parameters:

    text (str, required): The text in which emails will be replaced.
    repl (str, default '_EMAIL_'): The replacement value for emails.

Returns:

    str: The text with emails replaced.

Source code in lexos\scrubber\replace.py
def emails(text: str, repl: str = "_EMAIL_") -> str:
    """Replace all email addresses in `text` with `repl`.

    Args:
        text (str): The text in which emails will be replaced.
        repl (str): The replacement value for emails.

    Returns:
        str: The text with emails replaced.
    """
    return resources.RE_EMAIL.sub(repl, text)
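
The following sketch shows how the `repl` argument changes the output. The regular expression is a deliberately simple stand-in for `resources.RE_EMAIL`, which this page does not show; it is an assumption and will miss edge cases the library may handle.

```python
import re

# Simplified stand-in for resources.RE_EMAIL (assumption, not the
# library's pattern).
RE_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def emails(text: str, repl: str = "_EMAIL_") -> str:
    """Replace each address matched by the stand-in pattern with `repl`."""
    return RE_EMAIL.sub(repl, text)


sentence = "Write to jane.doe@example.org today."
default_repl = emails(sentence)
custom_repl = emails(sentence, repl="[email]")
print(default_repl)
print(custom_repl)
```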

lexos.scrubber.replace.emojis(text, repl='_EMOJI_')

Replace all emoji and pictographs in text with repl.

Parameters:

    text (str, required): The text in which emojis will be replaced.
    repl (str, default '_EMOJI_'): The replacement value for emojis.

Returns:

    str: The text with emojis replaced.

Note

If your Python has a narrow Unicode build ("UCS-2"), only dingbats and miscellaneous symbols are replaced because Python isn't able to represent the Unicode data for things like emoticons. Sorry!

Source code in lexos\scrubber\replace.py
def emojis(text: str, repl: str = "_EMOJI_") -> str:
    """
    Replace all emoji and pictographs in `text` with `repl`.

    Args:
        text (str): The text in which emojis will be replaced.
        repl (str): The replacement value for emojis.

    Returns:
        str: The text with emojis replaced.

    Note:
        If your Python has a narrow Unicode build ("UCS-2"), only dingbats
        and miscellaneous symbols are replaced because Python isn't able
        to represent the Unicode data for things like emoticons. Sorry!
    """
    return resources.RE_EMOJI.sub(repl, text)
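
The note about narrow builds can be checked at runtime through `sys.maxunicode`. The sketch below does that and then applies a stand-in pattern; the code point ranges are an assumption chosen for illustration, not the contents of `resources.RE_EMOJI`.

```python
import re
import sys

# Narrow ("UCS-2") builds cap sys.maxunicode at 0xFFFF. Python 3.3 and
# later always report 0x10FFFF, so this is True on any modern interpreter.
WIDE_BUILD = sys.maxunicode > 0xFFFF

# Hypothetical stand-in for resources.RE_EMOJI: dingbats and miscellaneous
# symbols always, plus supplementary-plane pictographs on wide builds.
BMP_RANGES = "\u2600-\u27BF"
ASTRAL_RANGES = "\U0001F300-\U0001FAFF" if WIDE_BUILD else ""
RE_EMOJI = re.compile("[" + BMP_RANGES + ASTRAL_RANGES + "]")


def emojis(text: str, repl: str = "_EMOJI_") -> str:
    """Replace each character matched by the stand-in pattern with `repl`."""
    return RE_EMOJI.sub(repl, text)


# U+1F44D (thumbs up) and U+2600 (sun), written as escapes for portability.
result = emojis("Great job \U0001F44D \u2600")
print(result)
```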

lexos.scrubber.replace.hashtags(text, repl='_HASHTAG_')

Replace all hashtags in text with repl.

Parameters:

    text (str, required): The text in which hashtags will be replaced.
    repl (str, default '_HASHTAG_'): The replacement value for hashtags.

Returns:

    str: The text with hashtags replaced.

Source code in lexos\scrubber\replace.py
def hashtags(text: str, repl: str = "_HASHTAG_") -> str:
    """Replace all hashtags in `text` with `repl`.

    Args:
        text (str): The text in which hashtags will be replaced.
        repl (str): The replacement value for hashtags.

    Returns:
        str: The text with hashtags replaced.
    """
    return resources.RE_HASHTAG.sub(repl, text)

lexos.scrubber.replace.pattern(text, *, pattern)

Replace strings in text using one or more regex patterns.

Parameters:

    text (str, required): The text in which a pattern or patterns will be replaced.
    pattern (Union[dict, Collection[dict]], required): A dictionary or list of dictionaries mapping each regex pattern to its replacement.

Returns:

    str: The text with pattern(s) replaced.

Source code in lexos\scrubber\replace.py
def pattern(text: str, *, pattern: Union[dict, Collection[dict]]) -> str:
    """Replace strings in `text` using one or more regex patterns.

    Args:
        text (str): The text in which a pattern or patterns will be replaced.
        pattern (Union[dict, Collection[dict]]): A dictionary or list of dictionaries
            mapping each regex pattern to its replacement.

    Returns:
        str: The text with pattern(s) replaced.
    """
    pattern = utils.ensure_list(pattern)
    for pat in pattern:
        # Apply every pattern/replacement pair; `str(*pat)` only worked
        # for dictionaries with exactly one key.
        for regex, repl in pat.items():
            text = re.sub(regex, repl, text)
    return text
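
A short sketch of the two accepted shapes for `pattern`: a single dictionary, or a list of dictionaries applied in order. `ensure_list` below is a hypothetical stand-in for `utils.ensure_list`, assumed to wrap a lone dictionary in a list.

```python
import re
from typing import Collection, Union


def ensure_list(value):
    """Hypothetical stand-in for utils.ensure_list (assumed behaviour)."""
    return [value] if isinstance(value, dict) else list(value)


def pattern(text: str, *, pattern: Union[dict, Collection[dict]]) -> str:
    """Apply each regex -> replacement pair in turn."""
    for mapping in ensure_list(pattern):
        for regex, repl in mapping.items():
            text = re.sub(regex, repl, text)
    return text


# One dictionary: each key is a regex, each value its replacement.
single = pattern("colour and flavour", pattern={r"our\b": "or"})

# A list of dictionaries: later rules see the output of earlier ones.
multiple = pattern(
    "The year 1984 and the year 2001",
    pattern=[{r"\d{4}": "YEAR"}, {r"\bthe\b": "a"}],
)
print(single)
print(multiple)
```

Matching is case-sensitive unless the regex says otherwise, which is why the capitalized "The" is left untouched in the second call.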

lexos.scrubber.replace.phone_numbers(text, repl='_PHONE_')

Replace all phone numbers in text with repl.

Parameters:

    text (str, required): The text in which phone numbers will be replaced.
    repl (str, default '_PHONE_'): The replacement value for phone numbers.

Returns:

    str: The text with phone numbers replaced.

Source code in lexos\scrubber\replace.py
def phone_numbers(text: str, repl: str = "_PHONE_") -> str:
    """Replace all phone numbers in `text` with `repl`.

    Args:
        text (str): The text in which phone numbers will be replaced.
        repl (str): The replacement value for phone numbers.

    Returns:
        str: The text with phone numbers replaced.
    """
    return resources.RE_PHONE_NUMBER.sub(repl, text)

lexos.scrubber.replace.process_tag_replace_options(orig_text, tag, action, attribute)

Replace HTML-style tags in text files according to user options.

Parameters:

    orig_text (str, required): The user's text containing the original tag.
    tag (str, required): The particular tag to be processed.
    action (str, required): A string specifying the action to be performed on the tag.
    attribute (str, required): Replacement value for the element when "replace_element" is specified.

Returns:

    str: The text after the specified tag is processed.

Notes
  • Action options are:
      • "remove_tag": Remove the tag
      • "remove_element": Remove the element and contents
      • "replace_element": Replace the element with the specified attribute
  • Any other action leaves the text unchanged.
  • The replacement of a tag with the value of an attribute may not be supported. This needs a second look.

Source code in lexos\scrubber\replace.py
def process_tag_replace_options(
    orig_text: str, tag: str, action: str, attribute: str
) -> str:
    """Replace html-style tags in text files according to user options.

    Args:
        orig_text: The user's text containing the original tag.
        tag: The particular tag to be processed.
        action: A string specifying the action to be performed on the tag.
        attribute: Replacement value for the element when "replace_element" is specified.

    Returns:
        str: The text after the specified tag is processed.

    Notes:
      - Action options are:
        - "remove_tag": Remove the tag
        - "remove_element": Remove the element and contents
        - "replace_element": Replace the element with the specified attribute
      - Any other action leaves the text unchanged.
      - The replacement of a tag with the value of an attribute may not be supported. This needs a second look.
    """
    if action == "remove_tag":
        # searching for variants of this specific tag:  <tag> ...
        pattern = re.compile(
            r"<(?:" + tag + r'(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)'
            r'\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?' + tag + r"\s*/?)>",
            re.MULTILINE | re.DOTALL | re.UNICODE,
        )

        # substitute all matching patterns with one space
        processed_text = re.sub(pattern, " ", orig_text)

    elif action == "remove_element":
        # <[whitespaces] TAG [SPACE attributes]> contents </[whitespaces]TAG>
        # as applied across newlines, (re.MULTILINE), on re.UNICODE,
        # and .* includes newlines (re.DOTALL)
        pattern = re.compile(
            r"<\s*" + re.escape(tag) + r"( .+?>|>).+?</\s*" + re.escape(tag) + ">",
            re.MULTILINE | re.DOTALL | re.UNICODE,
        )

        processed_text = re.sub(pattern, " ", orig_text)

    elif action == "replace_element":
        pattern = re.compile(
            r"<\s*" + re.escape(tag) + r".*?>.+?</\s*" + re.escape(tag) + ".*?>",
            re.MULTILINE | re.DOTALL | re.UNICODE,
        )

        processed_text = re.sub(pattern, attribute, orig_text)

    else:
        processed_text = orig_text  # Leave Tag Alone

    return processed_text
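
To make the three actions concrete, here is a reduced sketch applied to one sample string. The `remove_element` and `replace_element` expressions follow the listing; the `remove_tag` expression is a simplified stand-in (an assumption) for the longer attribute-aware pattern, and `process_tag` is an illustrative name, not a library function.

```python
import re

FLAGS = re.MULTILINE | re.DOTALL | re.UNICODE


def process_tag(text: str, tag: str, action: str, attribute: str = "") -> str:
    """Illustrative reduction of the three tag actions."""
    name = re.escape(tag)
    if action == "remove_tag":
        # Simplified stand-in: strip <tag ...> and </tag>, keep the contents.
        return re.sub(r"</?\s*" + name + r"\b[^>]*>", " ", text, flags=FLAGS)
    if action == "remove_element":
        # As in the listing: drop the element together with its contents.
        expr = r"<\s*" + name + r"( .+?>|>).+?</\s*" + name + ">"
        return re.sub(expr, " ", text, flags=FLAGS)
    if action == "replace_element":
        # As in the listing: swap the whole element for `attribute`.
        expr = r"<\s*" + name + r".*?>.+?</\s*" + name + ".*?>"
        return re.sub(expr, attribute, text, flags=FLAGS)
    return text  # any other action leaves the text alone


sample = "Keep<note>drop me</note>this."
tag_removed = process_tag(sample, "note", "remove_tag")
element_removed = process_tag(sample, "note", "remove_element")
element_replaced = process_tag(sample, "note", "replace_element", "[NOTE]")
untouched = process_tag(sample, "note", "leave_alone")
print(tag_removed, element_removed, element_replaced, untouched, sep="\n")
```

Both removal actions substitute a single space rather than an empty string, so adjacent words do not run together.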

lexos.scrubber.replace.punctuation(text, *, exclude=None, only=None)

Replace punctuation in text.

Replaces all instances of punctuation (or a subset thereof specified by only) with whitespace.

Parameters:

    text (str, required): The text in which punctuation will be replaced.
    exclude (Optional[str | Collection[str]], default None): Replace all punctuation except designated characters. Ignored when only is given.
    only (Optional[str | Collection[str]], default None): Replace only those punctuation marks specified here. For example, "." replaces only periods, while [",", ";", ":"] replaces commas, semicolons, and colons; if None, all Unicode punctuation marks are replaced.

Returns:

    str: The text with punctuation replaced by whitespace.

Note

When only=None, Python's built-in str.translate() is used; otherwise, a regular expression is used. The former's performance can be up to an order of magnitude faster.

Source code in lexos\scrubber\replace.py
def punctuation(
    text: str,
    *,
    exclude: Optional[str | Collection[str]] = None,
    only: Optional[str | Collection[str]] = None,
) -> str:
    """Replace punctuation in `text`.

    Replaces all instances of punctuation (or a subset thereof specified by `only`)
    with whitespace.

    Args:
        text (str): The text in which punctuation will be replaced.
        exclude: Replace all punctuation except designated characters.
            Ignored when `only` is given.
        only: Replace only those punctuation marks specified here. For example,
            `"."` replaces only periods, while `[",", ";", ":"]` replaces commas,
            semicolons, and colons; if None, all Unicode punctuation marks are replaced.

    Returns:
        str: The text with punctuation replaced by whitespace.

    Note:
        When `only=None`, Python's built-in `str.translate()` is used;
        otherwise, a regular expression is used. The former's performance
        can be up to an order of magnitude faster.
    """
    if only is not None:
        only = utils.to_collection(only, val_type=str, col_type=set)
        return re.sub("[{}]+".format(re.escape("".join(only))), " ", text)
    else:
        if exclude:
            exclude = utils.ensure_list(exclude)
            translation_table = dict.fromkeys(
                (
                    i
                    for i in range(sys.maxunicode + 1)
                    if unicodedata.category(chr(i)).startswith("P")
                    and chr(i) not in exclude
                ),
                " ",
            )
        else:
            translation_table = resources.PUNCT_TRANSLATION_TABLE
        return text.translate(translation_table)
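
The two code paths described in the note behave slightly differently, which the sketch below illustrates. The table is a stand-in for `resources.PUNCT_TRANSLATION_TABLE`, built here on the assumption that the library maps every code point in a Unicode "P" category to a space.

```python
import re
import sys
import unicodedata

# Stand-in for resources.PUNCT_TRANSLATION_TABLE (assumption: every code
# point whose Unicode category starts with "P" maps to a space).
PUNCT_TRANSLATION_TABLE = dict.fromkeys(
    (
        i
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith("P")
    ),
    " ",
)

text = "Wait... what?! (Really.)"

# only=None path: str.translate swaps each mark for one space, so the
# length of the string is unchanged.
all_replaced = text.translate(PUNCT_TRANSLATION_TABLE)

# only="." path: the regex ends in "+", so a run of periods collapses
# into a single space.
only_periods = re.sub("[{}]+".format(re.escape(".")), " ", text)

print(repr(all_replaced))
print(repr(only_periods))
```

Building the table scans every code point, so it is worth doing once and reusing, as the module-level table in the listing does.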

lexos.scrubber.replace.special_characters(text, *, is_html=False, ruleset=None)

Replace special characters in text, either by unescaping HTML entities or by applying a ruleset of regex patterns.

Parameters:

    text (str, required): The text in which special characters will be replaced.
    is_html (bool, default False): Whether to unescape HTML entities instead of applying ruleset.
    ruleset (dict, default None): A dict containing the special characters to match and their replacements. Ignored when is_html is True.

Returns:

    str: The text with special characters replaced.

Source code in lexos\scrubber\replace.py
def special_characters(
    text: str, *, is_html: bool = False, ruleset: Optional[dict] = None,
) -> str:
    """Replace special characters in `text`.

    Args:
        text (str): The text in which special characters will be replaced.
        is_html (bool): Whether to unescape HTML entities instead of applying `ruleset`.
        ruleset (dict): A dict containing the special characters to match and their
            replacements. Ignored when `is_html` is True.

    Returns:
        str: The text with special characters replaced.
    """
    if is_html:
        text = html.unescape(text)
    elif ruleset:
        # Skip the loop for the default `ruleset=None`, which has no `.items()`.
        for k, v in ruleset.items():
            text = re.sub(k, v, text)
    return text
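
The two modes are mutually exclusive: with `is_html=True` the ruleset is never consulted. A self-contained sketch of both calls, using a small illustrative ruleset that is not taken from Lexos:

```python
import html
import re
from typing import Optional


def special_characters(
    text: str, *, is_html: bool = False, ruleset: Optional[dict] = None
) -> str:
    """Unescape HTML entities, or apply regex -> replacement rules."""
    if is_html:
        return html.unescape(text)
    for regex, repl in (ruleset or {}).items():
        text = re.sub(regex, repl, text)
    return text


# HTML mode: entities are decoded by the standard library.
decoded = special_characters("Caf&eacute; &amp; cr&egrave;me", is_html=True)

# Ruleset mode: an illustrative mapping for two Old English letters.
transliterated = special_characters("þæt wæs", ruleset={"þ": "th", "æ": "ae"})

print(decoded)
print(transliterated)
```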

lexos.scrubber.replace.tag_map(text, map, remove_comments=True, remove_doctype=True, remove_whitespace=False)

Handle tags that are found in the text.

Parameters:

    text (str, required): The text in which tags will be replaced.
    map (dict, required): A dict mapping each tag name to a dict with "action" and "attribute" keys, which are passed to process_tag_replace_options.
    remove_comments (bool, default True): Whether to remove comments.
    remove_doctype (bool, default True): Whether to remove the doctype or xml declaration.
    remove_whitespace (bool, default False): Whether to remove whitespace.

Returns:

    str: The text after tags have been replaced.

Source code in lexos\scrubber\replace.py
def tag_map(
    text: str,
    # xmlhandlingoptions: List[dict],
    map: Dict[str, dict],
    remove_comments: bool = True,
    remove_doctype: bool = True,
    remove_whitespace: bool = False,
) -> str:
    """Handle tags that are found in the text.

    Args:
        text (str): The text in which tags will be replaced.
        map (Dict[str, dict]): A dict mapping each tag name to a dict with
            "action" and "attribute" keys.
        remove_comments (bool): Whether to remove comments.
        remove_doctype (bool): Whether to remove the doctype or xml declaration.
        remove_whitespace (bool): Whether to remove whitespace.

    Returns:
        str: The text after tags have been replaced.
    """
    if remove_whitespace:
        # `flags` must be passed by keyword: the fourth positional
        # argument of re.sub is `count`, not `flags`.
        text = re.sub(
            r"[\n\s\t\v ]+", " ", text, flags=re.UNICODE
        )  # Remove extra white space
    if remove_doctype:
        # This matches the DOCTYPE and all internal entity declarations
        doctype = re.compile(r"<!DOCTYPE.*?>", re.DOTALL)
        text = re.sub(doctype, "", text)  # Remove DOCTYPE declarations
        text = re.sub(r"(<\?.*?>)", "", text)  # Remove xml declarations
    if remove_comments:
        text = re.sub(r"(<!--.*?-->)", "", text)  # Remove comments

    # Visit each tag:
    for tag, opts in map.items():
        action = opts["action"]
        attribute = opts["attribute"]
        text = process_tag_replace_options(text, tag, action, attribute)

    # One last catch-all removes extra whitespace from all the removed tags
    if remove_whitespace:
        text = re.sub(r"[\n\s\t\v ]+", " ", text, flags=re.UNICODE)

    return text
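
The shape of the `map` argument is easiest to see in a worked call. The sketch below is a reduced re-implementation for illustration only: `process_tag` uses simplified expressions (an assumption) in place of `process_tag_replace_options`, and the sample document and rules are invented.

```python
import re


def process_tag(text: str, tag: str, action: str, attribute: str) -> str:
    """Simplified stand-in for process_tag_replace_options."""
    name = re.escape(tag)
    element = rf"<\s*{name}\b[^>]*>.*?</\s*{name}\s*>"
    if action == "remove_tag":
        return re.sub(rf"</?\s*{name}\b[^>]*>", " ", text)
    if action == "remove_element":
        return re.sub(element, " ", text, flags=re.DOTALL)
    if action == "replace_element":
        return re.sub(element, attribute, text, flags=re.DOTALL)
    return text


def tag_map(text, map, remove_comments=True, remove_doctype=True,
            remove_whitespace=False):
    """Reduced version of the listing; `map` mirrors its parameter name."""
    if remove_doctype:
        text = re.sub(r"<!DOCTYPE.*?>", "", text, flags=re.DOTALL)
        text = re.sub(r"<\?.*?>", "", text)  # xml declarations
    if remove_comments:
        text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    for tag, opts in map.items():
        text = process_tag(text, tag, opts["action"], opts["attribute"])
    if remove_whitespace:
        text = re.sub(r"\s+", " ", text)
    return text


doc = '<?xml version="1.0"?><!-- draft --><p>Hello <b>world</b><note>cut</note></p>'

# One entry per tag: the action to take and the replacement value, if any.
rules = {
    "p": {"action": "remove_tag", "attribute": ""},
    "b": {"action": "remove_tag", "attribute": ""},
    "note": {"action": "replace_element", "attribute": "[note]"},
}

cleaned = tag_map(doc, rules, remove_whitespace=True).strip()
print(cleaned)
```

Every entry needs both keys, because the loop reads `opts["attribute"]` even for actions that do not use it.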

lexos.scrubber.replace.urls(text, repl='_URL_')

Replace all URLs in text with repl.

Parameters:

    text (str, required): The text in which URLs will be replaced.
    repl (str, default '_URL_'): The replacement value for URLs.

Returns:

    str: The text with URLs replaced.

Source code in lexos\scrubber\replace.py
def urls(text: str, repl: str = "_URL_") -> str:
    """Replace all URLs in `text` with `repl`.

    Args:
        text (str): The text in which URLs will be replaced.
        repl (str): The replacement value for URLs.

    Returns:
        str: The text with URLs replaced.
    """
    return resources.RE_SHORT_URL.sub(repl, resources.RE_URL.sub(repl, text))
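
The function makes two passes: full URLs first, then shortened links on the result. The sketch below reproduces that order with two stand-in patterns; both are assumptions, far simpler than whatever `resources.RE_URL` and `resources.RE_SHORT_URL` contain, and the shortener domains are illustrative.

```python
import re

# Hypothetical stand-ins for resources.RE_URL and resources.RE_SHORT_URL.
RE_URL = re.compile(r"https?://\S+")
RE_SHORT_URL = re.compile(r"\b(?:bit\.ly|t\.co|goo\.gl)/\w+")


def urls(text: str, repl: str = "_URL_") -> str:
    """Replace full URLs, then scheme-less short links, with `repl`."""
    return RE_SHORT_URL.sub(repl, RE_URL.sub(repl, text))


result = urls("See https://example.com/a?b=1 or bit.ly/3xYz for details")
print(result)
```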

lexos.scrubber.replace.user_handles(text, repl='_USER_')

Replace all (Twitter-style) user handles in text with repl.

Parameters:

    text (str, required): The text in which user handles will be replaced.
    repl (str, default '_USER_'): The replacement value for user handles.

Returns:

    str: The text with user handles replaced.

Source code in lexos\scrubber\replace.py
def user_handles(text: str, repl: str = "_USER_") -> str:
    """Replace all (Twitter-style) user handles in `text` with `repl`.

    Args:
        text (str): The text in which user handles will be replaced.
        repl (str): The replacement value for user handles.

    Returns:
        str: The text with user handles replaced.
    """
    return resources.RE_USER_HANDLE.sub(repl, text)
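
Because each of these functions takes a string and returns a string, they can be applied one after another. The sketch below chains two of them; the patterns are simplified stand-ins for `resources.RE_USER_HANDLE` and `resources.RE_HASHTAG`, and `scrub` is a hypothetical helper, not part of Lexos.

```python
import re
from typing import Callable, Iterable

# Hypothetical stand-ins for the library's patterns (assumptions).
RE_USER_HANDLE = re.compile(r"(?<!\w)@\w+")
RE_HASHTAG = re.compile(r"(?<!\w)#\w+")


def user_handles(text: str, repl: str = "_USER_") -> str:
    return RE_USER_HANDLE.sub(repl, text)


def hashtags(text: str, repl: str = "_HASHTAG_") -> str:
    return RE_HASHTAG.sub(repl, text)


def scrub(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Hypothetical helper: apply each str -> str step in order."""
    for step in steps:
        text = step(text)
    return text


result = scrub(
    "Thanks @lexos_team for the #textanalysis tips!",
    [user_handles, hashtags],
)
print(result)
```

The lookbehind `(?<!\w)` in the stand-in patterns keeps the `@` inside an email address from being treated as the start of a handle.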