Replace
The `replace` component of Scrubber contains a set of functions for replacing strings and patterns in text.
Important
Some functions have the same names as functions in the `remove` component. To distinguish them in the registry, `replace` functions with the same names are prefixed with `re_`. When loaded into a script, they can be given any name the user desires.
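The naming rule above can be illustrated with a small sketch. It does not use the Lexos registry: a plain dict stands in for it, and `remove_digits`, `replace_digits`, and `mask_numbers` are hypothetical names invented for this example.

```python
import re


def remove_digits(text: str) -> str:
    """Hypothetical stand-in for a `remove` function named `digits`."""
    return re.sub(r"\d+", "", text)


def replace_digits(text: str, repl: str = "_DIGIT_") -> str:
    """Hypothetical stand-in for the `replace` function named `digits`."""
    return re.sub(r"\d+", repl, text)


# A plain dict plays the role of the registry in this sketch.
registry = {
    "digits": remove_digits,      # the `remove` version keeps the bare name
    "re_digits": replace_digits,  # the `replace` version gets the `re_` prefix
}

# Once looked up, the function can be bound to any local name.
mask_numbers = registry["re_digits"]
print(mask_numbers("Chapter 12"))  # Chapter _DIGIT_
```

The prefix only matters at lookup time; after that, the local name is the caller's choice.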
lexos.scrubber.replace.currency_symbols(text, repl='_CUR_')
Replace all currency symbols in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which currency symbols will be replaced. | required |
repl | str | The replacement value for currency symbols. | '_CUR_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with currency symbols replaced. |
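As a rough illustration of the behavior described above, the sketch below replaces every character in Unicode category `Sc` (currency symbol) using only the standard library. The helper name `replace_currency_symbols` is hypothetical, and the way Lexos itself detects currency symbols may differ.

```python
import unicodedata


def replace_currency_symbols(text: str, repl: str = "_CUR_") -> str:
    """Swap each currency symbol for `repl` (illustrative sketch only)."""
    # Unicode general category "Sc" marks currency symbols such as $ and €.
    return "".join(
        repl if unicodedata.category(ch) == "Sc" else ch for ch in text
    )


print(replace_currency_symbols("Tickets cost $5 or €4."))
# Tickets cost _CUR_5 or _CUR_4.
```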
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.digits(text, repl='_DIGIT_')
Replace all digits in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which digits will be replaced. | required |
repl | str | The replacement value for digits. | '_DIGIT_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with digits replaced. |
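A minimal standard-library sketch of this kind of replacement follows. It assumes each run of consecutive digits counts as one match; the reference above does not say whether Lexos substitutes per digit or per run. The helper name `replace_digits` is hypothetical.

```python
import re


def replace_digits(text: str, repl: str = "_DIGIT_") -> str:
    """Replace each run of digits with `repl` (illustrative sketch only)."""
    # Assumption: "\d+" treats "1818" as a single match, not four.
    return re.sub(r"\d+", repl, text)


print(replace_digits("Published in 1818, revised in 1831."))
# Published in _DIGIT_, revised in _DIGIT_.
```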
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.emails(text, repl='_EMAIL_')
Replace all email addresses in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which emails will be replaced. | required |
repl | str | The replacement value for emails. | '_EMAIL_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with emails replaced. |
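The sketch below shows the general idea with a deliberately simple regular expression. Both the pattern and the helper name `replace_emails` are assumptions made for illustration; real email matching has many edge cases, and the pattern Lexos uses is not shown in this reference.

```python
import re

# Simplified pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_emails(text: str, repl: str = "_EMAIL_") -> str:
    """Replace anything that looks like an email address (sketch only)."""
    return EMAIL_RE.sub(repl, text)


print(replace_emails("Write to jane.doe@example.org today."))
# Write to _EMAIL_ today.
```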
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.emojis(text, repl='_EMOJI_')
Replace all emoji and pictographs in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which emojis will be replaced. | required |
repl | str | The replacement value for emojis. | '_EMOJI_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with emojis replaced. |
Note
If your Python has a narrow Unicode build ("UCS-2"), only dingbats and miscellaneous symbols are replaced because Python isn't able to represent the Unicode data for things like emoticons. Sorry!
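For illustration, here is a small sketch that replaces characters in two code point ranges: U+2600 to U+27BF (miscellaneous symbols and dingbats) and U+1F300 to U+1FAFF (most pictographs and emoticons). The ranges and the helper name `replace_emojis` are assumptions chosen for this example, not the ranges Lexos uses. Since Python 3.3 there are no narrow builds, so the second range works on any current interpreter.

```python
import re

# Assumed ranges: misc symbols + dingbats, then the main pictograph planes.
EMOJI_RE = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")


def replace_emojis(text: str, repl: str = "_EMOJI_") -> str:
    """Replace each matching character with `repl` (sketch only)."""
    return EMOJI_RE.sub(repl, text)


print(replace_emojis("Nice \U0001F600!"))  # Nice _EMOJI_!
```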
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.hashtags(text, repl='_HASHTAG_')
Replace all hashtags in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which hashtags will be replaced. | required |
repl | str | The replacement value for hashtags. | '_HASHTAG_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with hashtags replaced. |
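A minimal sketch of hashtag replacement with the standard library is shown below. The pattern (a `#` not preceded by a word character, followed by word characters) and the helper name `replace_hashtags` are assumptions for illustration; Lexos may define a hashtag differently.

```python
import re

# "#" must not follow a word character, so "C#minor" is left alone.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def replace_hashtags(text: str, repl: str = "_HASHTAG_") -> str:
    """Replace each hashtag with `repl` (illustrative sketch only)."""
    return HASHTAG_RE.sub(repl, text)


print(replace_hashtags("Loving #DigitalHumanities and #NLP."))
# Loving _HASHTAG_ and _HASHTAG_.
```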
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.pattern(text, *, pattern)
Replace strings from `text` using a regex pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which a pattern or patterns will be replaced. | required |
pattern | Union[dict, Collection[dict]] | A dictionary or list of dictionaries containing the pattern(s) and replacement(s). | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with pattern(s) replaced. |
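The reference above does not spell out the layout of the dictionaries. The sketch below assumes each dict maps a regex string to its replacement string and applies the rules in order; treat that layout, and the helper name `replace_patterns`, as assumptions rather than documented behavior.

```python
import re
from typing import Collection, Union


def replace_patterns(text: str, *, pattern: Union[dict, Collection[dict]]) -> str:
    """Apply regex -> replacement rules in order (illustrative sketch only)."""
    # Accept a single dict or a collection of dicts.
    rules = [pattern] if isinstance(pattern, dict) else list(pattern)
    for rule in rules:
        for regex, repl in rule.items():
            text = re.sub(regex, repl, text)
    return text


print(replace_patterns("colour and flavour", pattern={r"colou?r": "color", "flavour": "flavor"}))
# color and flavor
```

Because the rules run sequentially, a later rule sees the output of an earlier one, so their order can matter.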
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.phone_numbers(text, repl='_PHONE_')
Replace all phone numbers in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which phone numbers will be replaced. | required |
repl | str | The replacement value for phone numbers. | '_PHONE_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with phone numbers replaced. |
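Phone number formats vary widely, so the sketch below only covers North American style numbers with an optional `+1` prefix. The pattern and the helper name `replace_phone_numbers` are assumptions made for this example and are likely narrower than whatever Lexos matches.

```python
import re

# Optional "+1"/"1" prefix, 3-digit area code (with or without parentheses),
# then 3 and 4 digits, each optionally separated by space, dot, or hyphen.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def replace_phone_numbers(text: str, repl: str = "_PHONE_") -> str:
    """Replace phone-number-like strings with `repl` (sketch only)."""
    return PHONE_RE.sub(repl, text)


print(replace_phone_numbers("Call 555-123-4567 or (555) 987-6543."))
# Call _PHONE_ or _PHONE_.
```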
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.process_tag_replace_options(orig_text, tag, action, attribute)
Replace html-style tags in text files according to user options.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
orig_text | str | The user's text containing the original tag. | required |
tag | str | The particular tag to be processed. | required |
action | str | A string specifying the action to be performed on the tag. | required |
attribute | str | Replacement value for tag when "replace_with_attribute" is specified. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text after the specified tag is processed. |
Notes
- Action options are:
  - "remove_tag": Remove the tag
  - "remove_element": Remove the element and contents
  - "replace_element": Replace the tag with the specified attribute
- The replacement of a tag with the value of an attribute may not be supported. This needs a second look.
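To make the three action options concrete, here is a regex-based sketch. It is one reading of the notes above, not the Lexos implementation: it assumes "replace_element" swaps the whole element for the `attribute` string, it does not handle nested elements of the same tag, and the helper name `process_tag` is hypothetical.

```python
import re


def process_tag(orig_text: str, tag: str, action: str, attribute: str = "") -> str:
    """Apply one action to every occurrence of `tag` (sketch only)."""
    name = re.escape(tag)
    open_tag = rf"<{name}(?:\s[^>]*)?>"               # <b> or <b class="x">
    element = rf"{open_tag}.*?</{name}>"              # opening tag through closing tag
    if action == "remove_tag":
        # Drop the opening and closing tags but keep the contents.
        return re.sub(rf"</?{name}(?:\s[^>]*)?>", "", orig_text)
    if action == "remove_element":
        # Drop the tags together with everything between them.
        return re.sub(element, "", orig_text, flags=re.DOTALL)
    if action == "replace_element":
        # Assumption: the whole element is swapped for `attribute`.
        return re.sub(element, lambda m: attribute, orig_text, flags=re.DOTALL)
    raise ValueError(f"Unknown action: {action!r}")


sample = "<p>Hello <b>big</b> world</p>"
print(process_tag(sample, "b", "remove_tag"))  # <p>Hello big world</p>
```

A real HTML or XML parser would be a safer choice for messy markup; the regex approach is used here only to keep the example short.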
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.punctuation(text, *, exclude=None, only=None)
Replace punctuation in `text`.
Replaces all instances of punctuation (or a subset thereof specified by `only`) with whitespace.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which punctuation will be replaced. | required |
exclude | Optional[str \| Collection[str]] | Replace all punctuation except designated characters. | None |
only | Optional[str \| Collection[str]] | Replace only those punctuation marks specified here. | None |
Returns:
Type | Description |
---|---|
str | str |
Note
When `only=None`, Python's built-in `str.translate()` is used; otherwise, a regular expression is used. The former can be up to an order of magnitude faster.
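The two code paths mentioned in the note can be sketched as follows. This version only knows ASCII punctuation from `string.punctuation`, which is an assumption made to keep the example self-contained; Lexos may also cover Unicode punctuation. The helper name `replace_punctuation` is hypothetical.

```python
import re
import string
from typing import Collection, Optional, Union

Chars = Optional[Union[str, Collection[str]]]


def replace_punctuation(text: str, *, exclude: Chars = None, only: Chars = None) -> str:
    """Replace punctuation marks with spaces (illustrative sketch only)."""
    if only:
        # Regex path: replace just the marks listed in `only`.
        marks = "".join(only)
        return re.sub(f"[{re.escape(marks)}]", " ", text)
    # Translate path: map every mark (minus `exclude`) to a space.
    keep = set("".join(exclude)) if exclude else set()
    table = {ord(ch): " " for ch in string.punctuation if ch not in keep}
    return text.translate(table)


print(replace_punctuation("a.b;c,d", only=[".", ";"]))  # a b c,d
```

Each mark becomes exactly one space, so "Hello, world!" turns into "Hello", two spaces, "world", and a trailing space.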
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.special_characters(text, *, is_html=False, ruleset=None)
Replace strings from `text` using a regex pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which special characters will be replaced. | required |
is_html | bool | Whether to replace HTML entities. | False |
ruleset | dict | A dict containing the special characters to match and their replacements. | None |
Returns:
Type | Description |
---|---|
str | str |
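A possible shape for this behavior is sketched below. It assumes the `ruleset` keys are regex strings mapped to replacement strings, and that "replace HTML entities" means decoding them with the standard library's `html.unescape`. Both points are assumptions, as is the helper name `replace_special_characters`.

```python
import html
import re
from typing import Optional


def replace_special_characters(
    text: str, *, is_html: bool = False, ruleset: Optional[dict] = None
) -> str:
    """Apply a character ruleset, then optionally decode entities (sketch)."""
    for regex, repl in (ruleset or {}).items():
        text = re.sub(regex, repl, text)
    if is_html:
        # Assumption: entities such as &eacute; become the characters they name.
        text = html.unescape(text)
    return text


print(replace_special_characters("caf&eacute; and æther", is_html=True, ruleset={"æ": "ae"}))
# café and aether
```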
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.tag_map(text, map, remove_comments=True, remove_doctype=True, remove_whitespace=False)
Handle tags that are found in the text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which tags will be replaced. | required |
remove_comments | bool | Whether to remove comments. | True |
remove_doctype | bool | Whether to remove the doctype or xml declaration. | True |
remove_whitespace | bool | Whether to remove whitespace. | False |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text after tags have been replaced. |
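The layout of the `map` argument is not described in this reference, so the sketch below leaves it out and illustrates only the three boolean options. It assumes "remove whitespace" means collapsing runs of whitespace into single spaces, which may not match Lexos. The helper name `strip_markup_extras` is hypothetical.

```python
import re


def strip_markup_extras(
    text: str,
    remove_comments: bool = True,
    remove_doctype: bool = True,
    remove_whitespace: bool = False,
) -> str:
    """Illustrate the three boolean options only (sketch, no tag map)."""
    if remove_comments:
        text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    if remove_doctype:
        # Covers both an XML declaration and a DOCTYPE declaration.
        text = re.sub(r"<\?xml[^>]*\?>|<!DOCTYPE[^>]*>", "", text, flags=re.IGNORECASE)
    if remove_whitespace:
        # Assumption: collapse whitespace runs rather than delete all spaces.
        text = re.sub(r"\s+", " ", text).strip()
    return text


doc = '<?xml version="1.0"?><!DOCTYPE html><!-- note --><p>Hi   there</p>'
print(strip_markup_extras(doc))  # <p>Hi   there</p>
```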
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.urls(text, repl='_URL_')
Replace all URLs in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which URLs will be replaced. | required |
repl | str | The replacement value for URLs. | '_URL_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with URLs replaced. |
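A simplified sketch of URL replacement follows. The pattern matches anything starting with `http://`, `https://`, or `www.` up to the next whitespace, quote, or angle bracket, so it will also swallow trailing punctuation such as a closing period. The pattern and the helper name `replace_urls` are assumptions for illustration only.

```python
import re

# Simplified: scheme or "www." followed by non-space, non-quote characters.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def replace_urls(text: str, repl: str = "_URL_") -> str:
    """Replace URL-like strings with `repl` (illustrative sketch only)."""
    return URL_RE.sub(repl, text)


print(replace_urls("Docs at https://example.com/docs and www.example.org today"))
# Docs at _URL_ and _URL_ today
```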
Source code in lexos\scrubber\replace.py
lexos.scrubber.replace.user_handles(text, repl='_USER_')
Replace all (Twitter-style) user handles in `text` with `repl`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text in which user handles will be replaced. | required |
repl | str | The replacement value for user handles. | '_USER_' |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with user handles replaced. |
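The sketch below shows one way to match Twitter-style handles: an `@` that does not follow a word character, then 1 to 15 word characters. The length limit, the pattern, and the helper name `replace_user_handles` are assumptions for this example rather than details taken from Lexos.

```python
import re

# The lookbehind keeps email addresses like a@b.com from matching.
HANDLE_RE = re.compile(r"(?<!\w)@\w{1,15}\b")


def replace_user_handles(text: str, repl: str = "_USER_") -> str:
    """Replace @handles with `repl` (illustrative sketch only)."""
    return HANDLE_RE.sub(repl, text)


print(replace_user_handles("Thanks @jack and @lexos_dev!"))
# Thanks _USER_ and _USER_!
```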
Source code in lexos\scrubber\replace.py