Normalize

The normalize component of Scrubber contains functions to perform a variety of text manipulations. The functions are frequently applied at the beginning of a scrubbing pipeline.

`lexos.scrubber.normalize.bullet_points(text)`

Normalize bullet points.

Normalizes all "fancy" bullet point symbols in text to the basic ASCII "-", provided they are the first non-whitespace characters on a new line (like a list of items). Duplicates Textacy's utils.normalize_bullets.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required

Returns:

Type	Description
`str`	The normalized text.

Source code in lexos\scrubber\normalize.py

def bullet_points(text: str) -> str:
    """Normalize bullet points.

    Normalises all "fancy" bullet point symbols in `text` to just the basic
    ASCII "-", provided they are the first non-whitespace characters on a new
    line (like a list of items). Duplicates Textacy's `utils.normalize_bullets`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return resources.RE_BULLET_POINTS.sub(r"\1-", text)
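
The listing above relies on `resources.RE_BULLET_POINTS`, which is not shown here. The following is a minimal sketch of the same idea using only the standard library. The pattern `BULLET_RE` and the name `bullet_points_sketch` are assumptions made for illustration; the pattern Lexos actually uses may cover a different set of symbols.

```python
import re

# Assumed stand-in for resources.RE_BULLET_POINTS (not the Lexos pattern).
# Group 1 captures the line start plus any leading whitespace; group 2 is
# one of a few "fancy" bullet symbols.
BULLET_RE = re.compile(r"((?:^|\n)\s*?)([\u2022\u2023\u2043\u25AA\u25CF\u25E6])")


def bullet_points_sketch(text: str) -> str:
    """Replace a leading fancy bullet with "-", keeping the indentation."""
    return BULLET_RE.sub(r"\1-", text)


sample = "Shopping:\n\u2022 apples\n  \u25E6 pears\ncost \u2022 value"
result = bullet_points_sketch(sample)
print(result)
```

In this sketch the bullet in `cost • value` is left alone because it is not the first non-whitespace character on its line.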

`lexos.scrubber.normalize.hyphenated_words(text)`

Normalize hyphenated words.

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace. Duplicates Textacy's utils.normalize_hyphens.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required

Returns:

Type	Description
`str`	The normalized text.

Source code in lexos\scrubber\normalize.py

def hyphenated_words(text: str) -> str:
    """Normalize hyphenated words.

    Normalize words in `text` that have been split across lines by a hyphen
    for visual consistency (aka hyphenated) by joining the pieces back together,
    sans hyphen and whitespace. Duplicates Textacy's `utils.normalize_hyphens`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return resources.RE_HYPHENATED_WORD.sub(r"\1\2", text)
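
As a rough illustration of the substitution above, the sketch below joins words that were split across a line break. `HYPHENATED_RE` and `hyphenated_words_sketch` are hypothetical names, and the pattern is an assumption standing in for `resources.RE_HYPHENATED_WORD`, which is not reproduced in this page.

```python
import re

# Assumed stand-in for resources.RE_HYPHENATED_WORD (not the Lexos pattern).
# Group 1: a word fragment of 2+ characters that does not end in a digit.
# Group 2: a fragment of 2+ characters that does not start with a digit.
# The hyphen must be followed by whitespace, typically a line break.
HYPHENATED_RE = re.compile(r"(\w{2,}(?<!\d))-\s+((?!\d)\w{2,})")


def hyphenated_words_sketch(text: str) -> str:
    """Join the two halves of a line-broken word, dropping hyphen and whitespace."""
    return HYPHENATED_RE.sub(r"\1\2", text)


sample = "The experi-\nment was a well-known suc-\n  cess."
result = hyphenated_words_sketch(sample)
print(result)
```

Under this assumed pattern, an ordinary compound such as `well-known` is untouched because no whitespace follows its hyphen.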

`lexos.scrubber.normalize.lower_case(text)`

Convert text to lower case.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to convert to lower case.	required

Returns:

Type	Description
`str`	The converted text.

Source code in lexos\scrubber\normalize.py

def lower_case(text: str) -> str:
    """Convert `text` to lower case.

    Args:
        text (str): The text to convert to lower case.

    Returns:
        The converted text.
    """
    return text.lower()

`lexos.scrubber.normalize.quotation_marks(text)`

Normalize quotation marks.

Normalize all "fancy" single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks. Duplicates Textacy's utils.normalize_quotation_marks.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required

Returns:

Type	Description
`str`	The normalized text.

Source code in lexos\scrubber\normalize.py

def quotation_marks(text: str) -> str:
    """Normalize quotation marks.

    Normalize all "fancy" single- and double-quotation marks in `text`
    to just the basic ASCII equivalents. Note that this will also normalize fancy
    apostrophes, which are typically represented as single quotation marks.
    Duplicates Textacy's `utils.normalize_quotation_marks`.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    return text.translate(resources.QUOTE_TRANSLATION_TABLE)
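
The function is a single `str.translate` call, so its behavior depends entirely on the contents of `resources.QUOTE_TRANSLATION_TABLE`. That table is not shown here; the sketch below builds a small hypothetical one (`QUOTE_TABLE`) with `str.maketrans` to show the mechanism. The set of characters Lexos maps may be larger.

```python
# Hypothetical translation table; an assumption, not the Lexos resource.
QUOTE_TABLE = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark, also used as apostrophe
        "\u201A": "'",  # single low-9 quotation mark
        "\u201C": '"',  # left double quotation mark
        "\u201D": '"',  # right double quotation mark
        "\u201E": '"',  # double low-9 quotation mark
    }
)


def quotation_marks_sketch(text: str) -> str:
    """Map fancy quotation marks to their ASCII equivalents."""
    return text.translate(QUOTE_TABLE)


sample = "\u201CDon\u2019t panic,\u201D she said."
result = quotation_marks_sketch(sample)
print(result)
```

The curly apostrophe in "Don’t" is converted along with the double quotes, which matches the note above about fancy apostrophes.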

`lexos.scrubber.normalize.repeating_chars(text, *, chars, maxn=1)`

Normalize repeating characters in text.

Runs of more than maxn consecutive repetitions of chars are truncated to exactly maxn repetitions. Duplicates Textacy's utils.normalize_repeating_chars.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required
`chars`	`str`	One or more characters whose consecutive repetitions are to be normalized, e.g. "." or "?!".	required
`maxn`	`int`	Maximum number of consecutive repetitions of `chars` to which longer repetitions will be truncated.	`1`

Returns:

Type	Description
`str`	The normalized text.

Source code in lexos\scrubber\normalize.py

def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Normalize repeating characters in `text`.

    Truncating their number of consecutive repetitions to `maxn`.
    Duplicates Textacy's `utils.normalize_repeating_chars`.

    Args:
        text (str): The text to normalize.
        chars: One or more characters whose consecutive repetitions are to be
            normalized, e.g. "." or "?!".
        maxn: Maximum number of consecutive repetitions of `chars` to which
            longer repetitions will be truncated.

    Returns:
        str
    """
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)
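
To make the effect of `chars` and `maxn` concrete, the sketch below copies the one-line body from the listing above into a local function so that it runs without Lexos installed. The sample strings are illustrative only.

```python
import re


def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Local copy of the body shown above, for demonstration purposes."""
    return re.sub(r"({}){{{},}}".format(re.escape(chars), maxn + 1), chars * maxn, text)


# Four periods exceed maxn=3, so the run is cut back to three.
dots = repeating_chars("Wait.... what?!?!?!", chars=".", maxn=3)

# "?!" is matched as a two-character sequence, not as a set of characters.
marks = repeating_chars("Wait.... what?!?!?!", chars="?!", maxn=1)

# A run of "?" alone does not contain the sequence "?!", so nothing changes.
unchanged = repeating_chars("So???", chars="?!", maxn=1)

print(dots)
print(marks)
print(unchanged)
```

Note that a multi-character `chars` value is treated as one repeating unit, so collapsing both `?` and `!` runs independently would take two separate calls.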

`lexos.scrubber.normalize.unicode(text, *, form='NFC')`

Normalize unicode characters in text into canonical forms.

Duplicates Textacy's utils.normalize_unicode.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required
`form`	`Literal['NFC', 'NFD', 'NFKC', 'NFKD']`	Form of normalization applied to unicode characters. For example, an "e" with acute accent "´" can be written as "e´" (canonical decomposition, "NFD") or "é" (canonical composition, "NFC"). Unicode can be normalized to NFC form without any change in meaning, so it's usually a safe bet. If "NFKC", additional normalizations are applied that can change characters' meanings, e.g. ellipsis characters are replaced with three periods.	`'NFC'`
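
The source for this function is not reproduced on this page. The sketch below therefore uses the standard library's `unicodedata.normalize` directly to illustrate the four forms described in the table; that Lexos wraps this call is an assumption, not something confirmed here.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = "\u00e9"     # precomposed "é"

# NFC composes the pair into a single code point; NFD splits it again.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# NFC leaves the ellipsis character alone; NFKC replaces it with three periods.
ellipsis_nfc = unicodedata.normalize("NFC", "\u2026")
ellipsis_nfkc = unicodedata.normalize("NFKC", "\u2026")

print(len(decomposed), len(nfc))  # 2 1
print(ellipsis_nfkc)              # ...
```

Normalizing to a single form before tokenizing or counting means that visually identical strings also compare as equal.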

`lexos.scrubber.normalize.whitespace(text)`

Normalize whitespace.

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to normalize.	required

Returns:

Type	Description
`str`	The normalized text.

Source code in lexos\scrubber\normalize.py

def whitespace(text: str) -> str:
    """Normalize whitespace.

    Replace all contiguous zero-width spaces with an empty string,
    line-breaking spaces with a single newline, and non-breaking spaces
    with a single space, then strip any leading/trailing whitespace.

    Args:
        text (str): The text to normalize.

    Returns:
        The normalized text.
    """
    text = resources.RE_ZWSP.sub("", text)
    text = resources.RE_LINEBREAK.sub(r"\n", text)
    text = resources.RE_NONBREAKING_SPACE.sub(" ", text)
    return text.strip()
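
The three substitutions above run in a fixed order, and the result depends on patterns in `resources` that are not shown on this page. The sketch below follows the same order with assumed patterns (`ZWSP_RE`, `LINEBREAK_RE`, `SPACE_RE` are hypothetical names). In particular, `SPACE_RE` here matches any run of whitespace that is not a line break, which is broader than non-breaking spaces alone and may differ from what Lexos does.

```python
import re

# Assumed stand-ins for the patterns in lexos.scrubber.resources.
ZWSP_RE = re.compile(r"[\u200B\u2060\uFEFF]+")  # zero-width characters
LINEBREAK_RE = re.compile(r"(\r\n|[\n\v])+")    # runs of line breaks
SPACE_RE = re.compile(r"[^\S\n\v]+")            # other whitespace runs


def whitespace_sketch(text: str) -> str:
    """Apply the same sequence of steps as the listing above."""
    text = ZWSP_RE.sub("", text)         # drop zero-width characters
    text = LINEBREAK_RE.sub("\n", text)  # collapse line breaks to one "\n"
    text = SPACE_RE.sub(" ", text)       # collapse other whitespace to one space
    return text.strip()


sample = "  Hello\u200b world\u00a0\u00a0again\r\n\r\n\nBye  \n"
result = whitespace_sketch(sample)
print(repr(result))
```

Collapsing line breaks before spaces matters in this sketch: a stray `\r` would otherwise be treated as ordinary whitespace and turned into a space.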
