Normalize

The normalize component of Scrubber contains functions to perform a variety of text manipulations. The functions are frequently applied at the beginning of a scrubbing pipeline.

lexos.scrubber.normalize.bullet_points(text)

Normalize bullet points.

Normalises all "fancy" bullet point symbols in text to just the basic ASCII "-", provided they are the first non-whitespace characters on a new line (like a list of items). Duplicates Textacy's utils.normalize_bullets.

Parameters:

    text (str): The text to normalize. (required)

Returns:

    str: The normalized text.

Source code in lexos\scrubber\normalize.py
def bullet_points(text: str) -> str:
    """Normalize bullet points.

    Normalises all "fancy" bullet point symbols in `text` to just the basic
    ASCII "-", provided they are the first non-whitespace characters on a new
    line (like a list of items). Duplicates Textacy's `utils.normalize_bullets`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return resources.RE_BULLET_POINTS.sub(r"\1-", text)
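The function can be tried without installing lexos by standing in a pattern for `resources.RE_BULLET_POINTS` (the real regex lives in `lexos.scrubber.resources` and is not shown here, so the one below is a hypothetical approximation):

```python
import re

# Hypothetical approximation of resources.RE_BULLET_POINTS: a "fancy" bullet
# as the first non-whitespace character on a line, with the leading
# newline/indent captured in group 1 so it can be preserved.
RE_BULLET_POINTS = re.compile(r"((^|\n)\s*?)[\u2022\u2023\u25E6\u2043\u2219*]")

def bullet_points(text: str) -> str:
    # Keep the leading newline/indent (group 1) and swap the bullet for "-".
    return RE_BULLET_POINTS.sub(r"\1-", text)

print(bullet_points("\u2022 apples\n\u2022 pears"))  # -> "- apples\n- pears"
```

A bullet appearing mid-line is left alone, since the pattern anchors to the start of a line.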

lexos.scrubber.normalize.hyphenated_words(text)

Normalize hyphenated words.

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace. Duplicates Textacy's utils.normalize_hyphens.

Parameters:

    text (str): The text to normalize. (required)

Returns:

    str: The normalized text.

Source code in lexos\scrubber\normalize.py
def hyphenated_words(text: str) -> str:
    """Normalize hyphenated words.

    Normalize words in `text` that have been split across lines by a hyphen
    for visual consistency (aka hyphenated) by joining the pieces back together,
    sans hyphen and whitespace. Duplicates Textacy's `utils.normalize_hyphens`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return resources.RE_HYPHENATED_WORD.sub(r"\1\2", text)
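As a runnable sketch, a hypothetical stand-in for `resources.RE_HYPHENATED_WORD` (the real pattern is defined in `lexos.scrubber.resources`) shows the rejoining behavior:

```python
import re

# Hypothetical stand-in for resources.RE_HYPHENATED_WORD: a word piece,
# a hyphen at a line break, then the continuation on the next line.
RE_HYPHENATED_WORD = re.compile(r"(\w{2,})-\s*\n\s*(\w{2,})")

def hyphenated_words(text: str) -> str:
    # Rejoin the two pieces (groups 1 and 2), dropping hyphen and whitespace.
    return RE_HYPHENATED_WORD.sub(r"\1\2", text)

print(hyphenated_words("reading a news-\npaper article"))
# -> "reading a newspaper article"
```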

lexos.scrubber.normalize.lower_case(text)

Convert text to lower case.

Parameters:

    text (str): The text to convert to lower case. (required)

Returns:

    str: The converted text.

Source code in lexos\scrubber\normalize.py
def lower_case(text: str) -> str:
    """Convert `text` to lower case.

    Args:
        text (str): The text to convert to lower case.

    Returns:
        The converted text.
    """
    return text.lower()

lexos.scrubber.normalize.quotation_marks(text)

Normalize quotation marks.

Normalize all "fancy" single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks. Duplicates Textacy's utils.normalize_quotation_marks.

Parameters:

    text (str): The text to normalize. (required)

Returns:

    str: The normalized text.

Source code in lexos\scrubber\normalize.py
def quotation_marks(text: str) -> str:
    """Normalize quotation marks.

    Normalize all "fancy" single- and double-quotation marks in `text`
    to just the basic ASCII equivalents. Note that this will also normalize fancy
    apostrophes, which are typically represented as single quotation marks.
    Duplicates Textacy's `utils.normalize_quotation_marks`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return text.translate(resources.QUOTE_TRANSLATION_TABLE)
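The translation-table approach can be sketched with a hypothetical miniature of `resources.QUOTE_TRANSLATION_TABLE`; the real table in `lexos.scrubber.resources` covers many more codepoints:

```python
# Hypothetical miniature of resources.QUOTE_TRANSLATION_TABLE: map "fancy"
# quotation marks (and apostrophes) to their ASCII equivalents.
QUOTE_TRANSLATION_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u201C": '"', "\u201D": '"',   # curly double quotes
})

def quotation_marks(text: str) -> str:
    return text.translate(QUOTE_TRANSLATION_TABLE)

print(quotation_marks("\u201CDon\u2019t\u201D"))  # -> "Don't" (with ASCII quotes)
```

`str.translate` makes a single pass over the text, which is why a translation table is preferred here over a chain of `str.replace` calls.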

lexos.scrubber.normalize.repeating_chars(text, *, chars, maxn=1)

Normalize repeating characters in text.

Truncates their number of consecutive repetitions to maxn. Duplicates Textacy's utils.normalize_repeating_chars.

Parameters:

    text (str): The text to normalize. (required)
    chars (str): One or more characters whose consecutive repetitions are to be normalized, e.g. "." or "?!". (required)
    maxn (int): Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated. (default: 1)

Returns:

    str: The normalized text.

Source code in lexos\scrubber\normalize.py
def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Normalize repeating characters in `text`.

    Truncates their number of consecutive repetitions to `maxn`.
    Duplicates Textacy's `utils.normalize_repeating_chars`.

    Args:
        text (str): The text to normalize.
        chars: One or more characters whose consecutive repetitions are to be
            normalized, e.g. "." or "?!".
        maxn: Maximum number of consecutive repetitions of `chars` to which
            longer repetitions will be truncated.

    Returns:
        The normalized text.
    """
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)
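Since the body above is self-contained apart from the `re` import, it can be exercised directly. For `chars="!"` and `maxn=1` the built pattern is `(!){2,}`, i.e. any run of two or more exclamation marks collapses to one:

```python
import re

def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    # Any run of more than `maxn` repetitions of `chars` collapses to `maxn`.
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)

print(repeating_chars("Wait!!!!! What????", chars="!", maxn=1))  # -> "Wait! What????"
print(repeating_chars("and then...........", chars=".", maxn=3))  # -> "and then..."
```

Note that `chars` is treated as one repeating unit, so `chars="?!"` collapses runs of the two-character sequence "?!", not runs of "?" or "!" individually.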

lexos.scrubber.normalize.unicode(text, *, form='NFC')

Normalize unicode characters in text into canonical forms.

Duplicates Textacy's utils.normalize_unicode.

Parameters:

    text (str): The text to normalize. (required)
    form (Literal["NFC", "NFD", "NFKC", "NFKD"]): Form of normalization applied to unicode characters. For example, an "e" with acute accent "´" can be written as "e´" (canonical decomposition, "NFD") or "é" (canonical composition, "NFC"). Unicode can be normalized to NFC form without any change in meaning, so it's usually a safe bet. If "NFKC", additional normalizations are applied that can change characters' meanings, e.g. ellipsis characters are replaced with three periods. (default: 'NFC')

See Also:

    https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
Source code in lexos\scrubber\normalize.py
def unicode(text: str, *, form: Literal["NFC", "NFD", "NFKC", "NFKD"] = "NFC") -> str:
    """Normalize unicode characters in `text` into canonical forms.

    Duplicates Textacy's `utils.normalize_unicode`.

    Args:
        text (str): The text to normalize.
        form: Form of normalization applied to unicode characters.
        For example, an "e" with acute accent "´" can be written as "e´"
            (canonical decomposition, "NFD") or "é" (canonical composition, "NFC").
            Unicode can be normalized to NFC form without any change in meaning,
            so it's usually a safe bet. If "NFKC", additional normalizations are applied
            that can change characters' meanings, e.g. ellipsis characters are replaced
            with three periods.

    See Also:
        https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
    """
    return unicodedata.normalize(form, text)
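The function is a thin wrapper over the standard library, so its effect is easy to demonstrate (renamed here to avoid shadowing the module-level name `unicode` on import):

```python
import unicodedata

# Mirrors the one-line body above.
def normalize_unicode(text: str, *, form: str = "NFC") -> str:
    return unicodedata.normalize(form, text)

decomposed = "cafe\u0301"            # "e" + combining acute accent, length 5
print(normalize_unicode(decomposed)) # -> "café", length 4: NFC composes the accent
print(normalize_unicode("wait\u2026", form="NFKC"))  # -> "wait...": NFKC expands the ellipsis
```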

lexos.scrubber.normalize.whitespace(text)

Normalize whitespace.

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.

Parameters:

    text (str): The text to normalize. (required)

Returns:

    str: The normalized text.

Source code in lexos\scrubber\normalize.py
def whitespace(text: str) -> str:
    """Normalize whitespace.

    Replace all contiguous zero-width spaces with an empty string,
    line-breaking spaces with a single newline, and non-breaking spaces
    with a single space, then strip any leading/trailing whitespace.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    text = resources.RE_ZWSP.sub("", text)
    text = resources.RE_LINEBREAK.sub(r"\n", text)
    text = resources.RE_NONBREAKING_SPACE.sub(" ", text)
    return text.strip()
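The three-pass pipeline can be sketched with hypothetical stand-ins for the patterns in `lexos.scrubber.resources` (the real `RE_ZWSP`, `RE_LINEBREAK`, and `RE_NONBREAKING_SPACE` may differ):

```python
import re

# Hypothetical stand-ins for the patterns in lexos.scrubber.resources.
RE_ZWSP = re.compile(r"[\u200B\u2060\uFEFF]+")    # zero-width spaces
RE_LINEBREAK = re.compile(r"(\r\n|[\n\v])+")      # runs of line-breaking space
RE_NONBREAKING_SPACE = re.compile(r"[^\S\n\v]+")  # runs of horizontal space

def whitespace(text: str) -> str:
    text = RE_ZWSP.sub("", text)                  # drop zero-width spaces
    text = RE_LINEBREAK.sub(r"\n", text)          # collapse line breaks to one "\n"
    text = RE_NONBREAKING_SPACE.sub(" ", text)    # collapse other runs to one " "
    return text.strip()                           # trim leading/trailing whitespace

print(whitespace("Hello\u200b world\n\nbye  now "))  # -> "Hello world\nbye now"
```

The order matters: line breaks must be collapsed before horizontal whitespace, so that a blank line becomes a single newline rather than a space.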