Normalize
The `normalize` component of Scrubber contains functions that perform a variety of text manipulations. These functions are frequently applied at the beginning of a scrubbing pipeline.
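
As a rough illustration of applying normalization at the start of a pipeline, the sketch below chains a few steps in order. It uses standard-library stand-ins rather than the Lexos functions, and `run_pipeline` is a hypothetical helper, not part of Scrubber.

```python
import unicodedata
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(text: str, steps: Iterable[Step]) -> str:
    """Apply each step to the text in order (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Stand-in normalization steps built only from the standard library.
steps = [
    lambda t: unicodedata.normalize("NFC", t),  # compose accented characters
    str.lower,                                  # lower-case the text
    str.strip,                                  # trim leading/trailing whitespace
]

result = run_pipeline("  Cafe\u0301 OLE\u0301  ", steps)
print(result)
```

Because each step takes and returns a string, steps can be reordered or swapped for other callables with the same shape.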
lexos.scrubber.normalize.bullet_points(text)
Normalize bullet points.
Normalize all "fancy" bullet point symbols in `text` to the basic ASCII "-", provided they are the first non-whitespace characters on a new line (like a list of items). Duplicates Textacy's `utils.normalize_bullets`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
Returns:
Type | Description |
---|---|
`str` | The normalized text. |
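
The behaviour described above can be approximated with a multiline regular expression. This is a minimal sketch, not the Lexos or Textacy source: `sketch_bullet_points` is a hypothetical name, and the set of bullet symbols is an assumed, non-exhaustive sample.

```python
import re

# Assumed, non-exhaustive sample of "fancy" bullet symbols.
_BULLETS = "\u2022\u2023\u25e6\u2043\u2219\u25aa\u25cf"

# Match a bullet only when it is the first non-whitespace character on a line.
_BULLET_RE = re.compile(rf"^([ \t]*)[{_BULLETS}]", flags=re.MULTILINE)


def sketch_bullet_points(text: str) -> str:
    """Replace leading bullet symbols with an ASCII hyphen, keeping indentation."""
    return _BULLET_RE.sub(r"\1-", text)


sample = "Items:\n  \u2022 one\n\u25e6 two\na \u2022 b"
print(sketch_bullet_points(sample))
```

The bullet in the last line of `sample` is left alone because it is not the first non-whitespace character on its line.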
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.hyphenated_words(text)
Normalize hyphenated words.
Normalize words in `text` that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace. Duplicates Textacy's `utils.normalize_hyphens`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
Returns:
Type | Description |
---|---|
`str` | The normalized text. |
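
A minimal sketch of the rejoining step, assuming a word is "split" only when the hyphen is followed by a line break. The name `sketch_hyphenated_words` and the exact pattern are illustrative assumptions and may differ from what Lexos does internally.

```python
import re

# A word character, a hyphen, a line break (with optional surrounding
# spaces or tabs), then another word character.
_SPLIT_WORD_RE = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")


def sketch_hyphenated_words(text: str) -> str:
    """Join word pieces split across lines, dropping the hyphen and whitespace."""
    return _SPLIT_WORD_RE.sub(r"\1\2", text)


sample = "a demon-\nstration of hyphen-\n  ation, not well-known words"
print(sketch_hyphenated_words(sample))
```

One caveat of this approach: a genuinely hyphenated compound that happens to break at its hyphen will also lose the hyphen.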
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.lower_case(text)
Convert `text` to lower case.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to convert to lower case. | required |
Returns:
Type | Description |
---|---|
`str` | The converted text. |
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.quotation_marks(text)
Normalize quotation marks.
Normalize all "fancy" single- and double-quotation marks in `text` to the basic ASCII equivalents. Note that this also normalizes fancy apostrophes, which are typically represented as single quotation marks. Duplicates Textacy's `utils.normalize_quotation_marks`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
Returns:
Type | Description |
---|---|
`str` | The normalized text. |
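
One way to picture this normalization is a character translation table. The sketch below is an illustration only: `sketch_quotation_marks` is a hypothetical name, and the mapping is an assumed, non-exhaustive list of curly quotes.

```python
# Assumed, non-exhaustive mapping from "fancy" quotes to ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
})


def sketch_quotation_marks(text: str) -> str:
    """Translate curly single and double quotes to ASCII ' and "."""
    return text.translate(_QUOTE_MAP)


sample = "\u201cDon\u2019t,\u201d she said."
print(sketch_quotation_marks(sample))
```

The apostrophe in "Don't" is converted along with the quotation marks, which matches the note above about fancy apostrophes.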
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.repeating_chars(text, *, chars, maxn=1)
Normalize repeating characters in `text` by truncating their number of consecutive repetitions to `maxn`.
Duplicates Textacy's `utils.normalize_repeating_chars`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
chars | `str` | One or more characters whose consecutive repetitions are to be normalized, e.g. "." or "?!". | required |
maxn | `int` | Maximum number of consecutive repetitions of `chars` to keep; longer runs are truncated to this length. | `1` |
Returns:
Type | Description |
---|---|
`str` | The normalized text. |
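
The truncation can be sketched with a quantified regular expression. This is an approximation under one stated assumption: a multi-character `chars` value such as "?!" is treated as a single repeating unit. `sketch_repeating_chars` is a hypothetical name that mirrors the documented signature.

```python
import re


def sketch_repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Truncate runs of `chars` longer than `maxn` repetitions down to `maxn`."""
    # Match maxn + 1 or more consecutive repetitions of the escaped unit.
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    # A function replacement avoids backslash handling in the replacement text.
    return re.sub(pattern, lambda _match: chars * maxn, text)


print(sketch_repeating_chars("Wait..... what?!?!?!", chars=".", maxn=3))
print(sketch_repeating_chars("Wait..... what?!?!?!", chars="?!", maxn=1))
```

Runs that are already at or below `maxn` repetitions are left unchanged, since the pattern only matches longer runs.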
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.unicode(text, *, form='NFC')
Normalize unicode characters in `text` into canonical forms.
Duplicates Textacy's `utils.normalize_unicode`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
form | `Literal['NFC', 'NFD', 'NFKC', 'NFKD']` | Form of normalization applied to unicode characters. For example, an "e" with acute accent "´" can be written as "e´" (canonical decomposition, "NFD") or "é" (canonical composition, "NFC"). Unicode can be normalized to NFC form without any change in meaning, so it's usually a safe bet. If "NFKC", additional normalizations are applied that can change characters' meanings, e.g. ellipsis characters are replaced with three periods. | `'NFC'` |
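
The difference between the forms can be checked with the standard library's `unicodedata.normalize`, which accepts the same four form names. Whether Lexos calls this function internally is an assumption; the sketch only illustrates the forms described in the table.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
roundtrip = unicodedata.normalize("NFD", composed)   # back to two code points

ellipsis = "\u2026"
nfc_ellipsis = unicodedata.normalize("NFC", ellipsis)    # left unchanged
nfkc_ellipsis = unicodedata.normalize("NFKC", ellipsis)  # three ASCII periods

print(len(decomposed), len(composed), repr(nfkc_ellipsis))
```

The composed and decomposed strings render identically but compare unequal, which is why normalizing early in a pipeline helps later matching and counting steps.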
Source code in lexos\scrubber\normalize.py
lexos.scrubber.normalize.whitespace(text)
Normalize whitespace.
Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | The text to normalize. | required |
Returns:
Type | Description |
---|---|
`str` | The normalized text. |
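
The three replacements and the final strip can be sketched as a sequence of regular-expression substitutions. The patterns below are assumptions chosen to match the description above, and `sketch_whitespace` is a hypothetical name, not the Lexos source.

```python
import re

_ZERO_WIDTH_RE = re.compile(r"[\u200b\u2060\ufeff]+")  # zero-width characters
_LINEBREAK_RE = re.compile(r"(?:\r\n|[\n\v])+")        # runs of line breaks
_OTHER_SPACE_RE = re.compile(r"[^\S\n\v]+")            # whitespace except line breaks


def sketch_whitespace(text: str) -> str:
    """Collapse whitespace as described above, then strip both ends."""
    text = _ZERO_WIDTH_RE.sub("", text)    # drop zero-width spaces
    text = _LINEBREAK_RE.sub("\n", text)   # one newline per run of line breaks
    text = _OTHER_SPACE_RE.sub(" ", text)  # one space per run of other spaces
    return text.strip()


sample = "  Hello\u200b\u00a0\u00a0world\r\n\r\n\nbye\t\t!  "
print(repr(sketch_whitespace(sample)))
```

The order matters in this sketch: line breaks are collapsed before other whitespace so that the final substitution never has to handle `\r\n` pairs.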
Source code in lexos\scrubber\normalize.py