Normalize
The "Normalize" functions take notations that are not standardized (such as -, *, or ~ for a bullet point) and replace them all with a single, normalized notation.
bullet_points(text: str) -> str
Normalize bullet points.
Normalizes all "fancy" bullet point symbols in text to just the basic
ASCII "-", provided they are the first non-whitespace characters on a new
line (like a list of items). Duplicates Textacy's utils.normalize_bullets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
Source code in lexos/scrubber/normalize.py
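The behavior described above can be approximated with a regular expression. The following is a minimal sketch using only the standard library, not the Lexos or Textacy source; the character set and the name normalize_bullets_sketch are assumptions made for illustration.

```python
import re

# Assumption: a small, illustrative set of "fancy" bullet characters.
# The set recognized by the library may be larger or different.
FANCY_BULLETS = "\u2022\u2023\u2043\u25aa\u25cf\u25e6"  # • ‣ ⁃ ▪ ● ◦

# Match a fancy bullet only when it is the first non-whitespace
# character on a line; group 1 preserves the indentation.
BULLET_RE = re.compile(r"^([ \t]*)[" + FANCY_BULLETS + r"]", flags=re.MULTILINE)


def normalize_bullets_sketch(text: str) -> str:
    """Replace leading fancy bullets with an ASCII hyphen (hypothetical helper)."""
    return BULLET_RE.sub(r"\1-", text)


sample = "\u2022 apples\n  \u25e6 pears\nsalt \u2022 pepper"
print(normalize_bullets_sketch(sample))
```

In this sketch the bullet in "salt • pepper" is left alone because it is not the first non-whitespace character on its line.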
hyphenated_words(text: str) -> str
Normalize hyphenated words.
Normalize words in text that have been split across lines by a hyphen
for visual consistency (aka hyphenated) by joining the pieces back together,
sans hyphen and whitespace. Duplicates Textacy's utils.normalize_hyphens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
Source code in lexos/scrubber/normalize.py
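As a rough illustration of joining words split across lines, here is a simplified sketch built on the standard re module. It is not the library's implementation: the pattern and the name join_hyphenated_sketch are assumptions, and the real function may apply stricter rules (for example, about digits or minimum word length).

```python
import re

# Assumption: a word fragment, a hyphen, optional spaces, a newline,
# optional indentation, then the rest of the word.
LINEBREAK_HYPHEN_RE = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_hyphenated_sketch(text: str) -> str:
    """Rejoin words split by a hyphen at a line break (hypothetical helper)."""
    return LINEBREAK_HYPHEN_RE.sub(r"\1\2", text)


sample = "a normal-\n  ization step for well-known cases"
print(join_hyphenated_sketch(sample))
```

A hyphen inside a line, as in "well-known", is untouched. A genuine compound that happens to break at the end of a line would lose its hyphen under this approach.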
lower_case(text: str) -> str
Convert text to lower case.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to convert to lower case. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The converted text. |
quotation_marks(text: str) -> str
Normalize quotation marks.
Normalize all "fancy" single- and double-quotation marks in text
to just the basic ASCII equivalents. Note that this will also normalize fancy
apostrophes, which are typically represented as single quotation marks.
Duplicates Textacy's utils.normalize_quotation_marks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
Source code in lexos/scrubber/normalize.py
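One way to picture this mapping is a character translation table. The sketch below uses str.translate from the standard library; the particular set of quotation marks and the name normalize_quotes_sketch are assumptions for illustration, and the library may cover more characters.

```python
# Assumption: an illustrative subset of "fancy" quotation marks.
QUOTE_MAP = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark (also used as apostrophe)
        "\u201a": "'",  # single low-9 quotation mark
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
        "\u201e": '"',  # double low-9 quotation mark
    }
)


def normalize_quotes_sketch(text: str) -> str:
    """Map fancy quotation marks to ASCII equivalents (hypothetical helper)."""
    return text.translate(QUOTE_MAP)


sample = "\u201cIt\u2019s fine,\u201d she said."
print(normalize_quotes_sketch(sample))
```

The apostrophe in "It’s" is converted along with the quotation marks, which matches the note above about fancy apostrophes.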
repeating_chars(text: str, *, chars: Optional[str], maxn: Optional[int] = 1) -> str
Normalize repeating characters in text.
Truncates the number of consecutive repetitions of chars to maxn.
Duplicates Textacy's utils.normalize_repeating_chars.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
| chars | Optional[str] | One or more characters whose consecutive repetitions are to be normalized, e.g. "." or "?!". | required |
| maxn | Optional[int] | Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated. | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
Source code in lexos/scrubber/normalize.py
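The truncation can be sketched with a single regular expression substitution. This is an illustrative approximation rather than the library's code; the name truncate_repeats_sketch is hypothetical, and the sketch assumes that a multi-character chars value such as "?!" is treated as one repeating unit.

```python
import re


def truncate_repeats_sketch(text: str, *, chars: str, maxn: int = 1) -> str:
    """Cap consecutive repetitions of `chars` at `maxn` (hypothetical helper)."""
    # Match maxn + 1 or more back-to-back copies of `chars`.
    pattern = "(" + re.escape(chars) + "){" + str(maxn + 1) + ",}"
    # A function replacement avoids interpreting backslashes in `chars`.
    return re.sub(pattern, lambda _match: chars * maxn, text)


sample = "Wait" + "." * 7 + " what" + "?!" * 3
print(truncate_repeats_sketch(sample, chars=".", maxn=3))
print(truncate_repeats_sketch(sample, chars="?!", maxn=1))
```

Runs that are already at or below maxn do not match the pattern and are left unchanged.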
unicode(text: str, *, form: Optional[Literal['NFC', 'NFD', 'NFKC', 'NFKD']] = 'NFC') -> str
Normalize unicode characters in text into canonical forms.
Duplicates Textacy's utils.normalize_unicode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
| form | Optional[Literal['NFC', 'NFD', 'NFKC', 'NFKD']] | Form of normalization applied to unicode characters. For example, an "e" with acute accent "´" can be written as "e´" (canonical decomposition, "NFD") or "é" (canonical composition, "NFC"). Unicode can be normalized to NFC form without any change in meaning, so it's usually a safe bet. If "NFKC", additional normalizations are applied that can change characters' meanings, e.g. ellipsis characters are replaced with three periods. | 'NFC' |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
Source code in lexos/scrubber/normalize.py
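The difference between the forms can be seen directly with Python's standard unicodedata module. The example below does not call the Lexos function itself; it only demonstrates the normalization forms named in the form parameter.

```python
import unicodedata

# "e" followed by a combining acute accent (decomposed form).
decomposed = "e\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# NFD reverses the composition.
assert unicodedata.normalize("NFD", composed) == decomposed

# NFC leaves the ellipsis character alone, while NFKC
# replaces it with three periods.
ellipsis_text = "Wait\u2026"
nfc_text = unicodedata.normalize("NFC", ellipsis_text)
nfkc_text = unicodedata.normalize("NFKC", ellipsis_text)
print(nfc_text == ellipsis_text, nfkc_text)
```

The ellipsis case shows why the compatibility forms ("NFKC", "NFKD") can alter the text more than the canonical ones.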
whitespace(text: str) -> str
Normalize whitespace.
Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to normalize. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The normalized text. |
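The sequence of replacements described for whitespace can be sketched as three regular expression passes followed by a strip. The patterns and the name normalize_whitespace_sketch are assumptions for illustration and may not match the library's own definitions.

```python
import re

# Assumption: illustrative patterns for each class of whitespace.
ZERO_WIDTH_RE = re.compile(r"[\u200b\u2060\ufeff]+")  # zero-width characters
LINEBREAK_RE = re.compile(r"(\r\n|\n)+")  # runs of line breaks
OTHER_SPACE_RE = re.compile(r"[^\S\n]+")  # any whitespace except newline


def normalize_whitespace_sketch(text: str) -> str:
    """Collapse whitespace as described above (hypothetical helper)."""
    text = ZERO_WIDTH_RE.sub("", text)  # drop zero-width spaces
    text = LINEBREAK_RE.sub("\n", text)  # collapse line breaks to one newline
    text = OTHER_SPACE_RE.sub(" ", text)  # collapse other spaces to one space
    return text.strip()


sample = "  Hello\u200b,\u00a0\u00a0world\r\n\r\n\nbye  \t "
print(repr(normalize_whitespace_sketch(sample)))
```

The order matters in this sketch: line breaks are collapsed before the remaining whitespace so that carriage returns in "\r\n" pairs are not turned into stray spaces.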