Remove
A set of functions for removing strings and patterns from text.
accents(text: str, *, fast: Optional[bool] = False, accents: Optional[str | tuple[str, ...]] = None) -> str
Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which accents will be removed. | required |
| fast | Optional[bool] | If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless. | False |
| accents | Optional[str \| tuple[str, ...]] | An optional string or tuple of strings indicating the names of diacritics to be stripped. | None |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with accents removed. |
fast=True can be significantly faster than fast=False,
but its transformation of text is less "safe" and more likely
to result in changes of meaning, spelling errors, etc.
See Also
- For a chart containing Unicode standard names of diacritics, see https://en.wikipedia.org/wiki/Combining_Diacritical_Marks#Character_table
- For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
Source code in lexos/scrubber/remove.py
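A minimal usage sketch, assuming the module is importable as lexos.scrubber.remove (as suggested by the source path above); the sample text, the output in the comment, and the diacritic name format are illustrative assumptions, not captured output.

```python
from lexos.scrubber import remove

text = "El niño comió crème brûlée."

# Default: map accented characters to ASCII equivalents where possible.
print(remove.accents(text))              # e.g. "El nino comio creme brulee."

# fast=True is quicker but less "safe" (see the note above).
print(remove.accents(text, fast=True))

# Strip only specific diacritics by name (the exact name format is an assumption;
# see the Combining Diacritical Marks chart linked above).
print(remove.accents(text, accents=("ACUTE", "GRAVE")))
```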
brackets(text: str, *, only: Optional[str | Collection[str]] = ['curly', 'square', 'round']) -> str
Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which brackets will be removed. | required |
| only | Optional[str \| Collection[str]] | Remove only those bracketed contents as specified here: "curly", "square", and/or "round". For example, passing "square" removes only the contents of square brackets. | ['curly', 'square', 'round'] |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with brackets removed. |
Note
This function relies on regular expressions, applied sequentially for curly, square, then round brackets; as such, it doesn't handle nested brackets of the same type and may behave unexpectedly on text with "wild" use of brackets. It should be fine for removing structured bracketed contents, such as those often used to denote in-text citations.
Source code in lexos/scrubber/remove.py
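A short sketch of the only parameter, under the same import assumption as above; the sample text is illustrative.

```python
from lexos.scrubber import remove

text = "The result [see Table 2] was significant (p < .05) {internal note}."

# Default: strip curly, square, and round brackets and their contents.
print(remove.brackets(text))

# Restrict removal to square brackets only.
print(remove.brackets(text, only="square"))

# Or to a subset of bracket types.
print(remove.brackets(text, only=["square", "round"]))
```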
digits(text: str, *, only: Optional[str | Collection[str]] = None) -> str
Remove digits.
Remove digits from text by replacing all instances of digits
(or a subset thereof specified by only) with whitespace.
Removes signed/unsigned numbers and decimal/delimiter-separated numbers. Does not remove currency symbols. Some tokens containing digits will be modified.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which digits will be removed. | required |
| only | Optional[str \| Collection[str]] | Remove only those digits specified here. For example, passing "9" removes only the digit 9. | None |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with digits removed. |
Source code in lexos/scrubber/remove.py
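A short usage sketch under the same import assumption as above; the sample text is illustrative.

```python
from lexos.scrubber import remove

text = "In 1848, prices rose 3.5% to $1,234.56."

# Replace every digit with whitespace (currency symbols are left in place).
print(remove.digits(text))

# Replace only the listed digits, leaving the rest intact.
print(remove.digits(text, only=["4", "5"]))
```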
new_lines(text: str) -> str
Remove new lines.
Remove all line-breaking whitespace.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which new lines will be removed. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with line-breaking whitespace removed. |
Source code in lexos/scrubber/remove.py
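A short usage sketch under the same import assumption as above; the sample text is illustrative.

```python
from lexos.scrubber import remove

text = "First line\nSecond line\r\nThird line"

# Strip the line-breaking characters from the text.
print(remove.new_lines(text))
```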
pattern(text: str, *, pattern: Optional[str | Collection[str]]) -> str
Remove strings from text using one or more regex patterns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which patterns will be removed. | required |
| pattern | Optional[str \| Collection[str]] | The pattern to match. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with the pattern removed. |
Source code in lexos/scrubber/remove.py
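A short usage sketch under the same import assumption as above; the sample text and regexes are illustrative.

```python
from lexos.scrubber import remove

text = "Contact info@example.com or visit https://example.com for details."

# Remove substrings matching a single regex pattern (a rough URL pattern here).
print(remove.pattern(text, pattern=r"https?://\S+"))

# A collection of patterns can be passed as well.
print(remove.pattern(text, pattern=[r"https?://\S+", r"\S+@\S+"]))
```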
project_gutenberg_headers(text: str) -> str
Remove Project Gutenberg headers and footers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which headers and footers will be removed. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with Project Gutenberg boilerplate removed. |
Notes
This function is reproduced from the Gutenberg package's strip_headers() function (https://github.com/c-w/gutenberg), itself a port of the C++ utility by Johannes Krugel.
Source code in lexos/scrubber/remove.py
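A short usage sketch under the same import assumption as above; the filename is a placeholder, not a real fixture.

```python
from lexos.scrubber import remove

# `raw` should hold the full text of a Project Gutenberg ebook,
# including its boilerplate header and footer blocks.
with open("pg345.txt", encoding="utf-8") as f:
    raw = f.read()

body = remove.project_gutenberg_headers(raw)
```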
punctuation(text: str, *, exclude: Optional[str | Collection[str]] = None, only: Optional[str | Collection[str]] = None) -> str
Remove punctuation from text.
Removes all instances of punctuation (or a subset thereof specified by only).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which punctuation will be removed. | required |
| exclude | Optional[str \| Collection[str]] | Remove all punctuation except designated characters. | None |
| only | Optional[str \| Collection[str]] | Remove only those punctuation marks specified here. For example, passing "." removes only periods. | None |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with punctuation removed. |
Note
When only=None, Python's built-in str.translate() is used; otherwise, a regular expression is used. The former can be up to an order of magnitude faster.
Source code in lexos/scrubber/remove.py
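A short usage sketch under the same import assumption as above; the sample text is illustrative.

```python
from lexos.scrubber import remove

text = "Wait -- what?! (Really, truly?)"

# Remove all punctuation (uses str.translate; see the note above).
print(remove.punctuation(text))

# Remove everything except question marks.
print(remove.punctuation(text, exclude="?"))

# Remove only commas and periods.
print(remove.punctuation(text, only=[",", "."]))
```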
tabs(text: str) -> str
Remove tabs.
If you want to replace tabs with a single space, use
normalize.whitespace() instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which tabs will be removed. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The text with tabs removed. |
Source code in lexos/scrubber/remove.py
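A short usage sketch under the same import assumption as above; the sample text is illustrative.

```python
from lexos.scrubber import remove

text = "name\tcount\tfrequency"

# Tabs are removed outright; use normalize.whitespace() to replace them with spaces instead.
print(remove.tabs(text))
```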
tags(text: str, sep: Optional[str] = ' ', remove_whitespace: Optional[bool] = True) -> str
Remove tags from text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text from which tags will be removed. | required |
| sep | Optional[str] | A string to insert between tags and text found between them. | ' ' |
| remove_whitespace | Optional[bool] | If True, remove extra whitespace between text after tags are removed. | True |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | A string containing just the text found between tags and other non-data elements. |
Note
- If you want to perform selective removal of tags, use replace.tag_map instead.
- This function relies on the stdlib html.parser.HTMLParser. It appears to work for stripping tags from both HTML and XML. Using lxml or BeautifulSoup might be faster, but this is untested.
- This function preserves text in comments, as well as tags
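
A short usage sketch under the same import assumption as above; the sample markup is illustrative.

```python
from lexos.scrubber import remove

html = "<p>Call me <b>Ishmael</b>.</p><!-- editorial comment -->"

# Strip the markup, keeping the text (and, per the note above, text in comments).
print(remove.tags(html))

# Keep the whitespace introduced where tags were removed.
print(remove.tags(html, sep=" ", remove_whitespace=False))
```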