Cutter Registry

The Ginsu class allows the user to cut documents pre-tokenized with spaCy. Documents can be split into a predetermined number of segments, by number of tokens, or at tokens designated as milestones.

lexos.cutter.registry.character_tokenizer(text)

Tokenize by single characters, keeping whitespace.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | `str` | The text to tokenize. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list` | A list of character tokens. |

Source code in lexos/cutter/registry.py
def character_tokenizer(text: str) -> list:
    """Tokenize by single characters, keeping whitespace.

    Args:
        text: The text to tokenize.

    Returns:
        list: A list of character tokens.
    """
    return [char for char in text]
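
For example, applying the tokenizer to a short string produces one token per character, with whitespace preserved:

```python
from lexos.cutter.registry import character_tokenizer

# Every character, including spaces, becomes its own token
tokens = character_tokenizer("ab c")
print(tokens)  # ['a', 'b', ' ', 'c']
```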

lexos.cutter.registry.linebreak_tokenizer(text)

Tokenize by linebreaks, keeping whitespace.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | `str` | The text to tokenize. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list` | A list of line tokens. |

Source code in lexos/cutter/registry.py
def linebreak_tokenizer(text: str) -> list:
    """Tokenize by linebreaks, keeping whitespace.

    Args:
        text: The text to tokenize.

    Returns:
        list: A list of line tokens.
    """
    return text.splitlines(keepends=True)
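
Because `splitlines(keepends=True)` is used, each token keeps its trailing line break:

```python
from lexos.cutter.registry import linebreak_tokenizer

# Each line becomes one token, newline characters included
tokens = linebreak_tokenizer("first line\nsecond line\n")
print(tokens)  # ['first line\n', 'second line\n']
```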

lexos.cutter.registry.whitespace_tokenizer(text)

Tokenize on whitespace, keeping whitespace.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | `str` | The text to tokenize. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list` | A list of pseudo-word tokens. |

Source code in lexos/cutter/registry.py
def whitespace_tokenizer(text: str) -> list:
    """Tokenize on whitespace, keeping whitespace.

    Args:
        text: The text to tokenize.

    Returns:
        list: A list of pseudo-word tokens.
    """
    return re.findall(r"\S+\s*", text)
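The regular expression `\S+\s*` captures each run of non-whitespace characters together with the whitespace that follows it, so segments can later be rejoined without losing spacing:

```python
from lexos.cutter.registry import whitespace_tokenizer

# Each pseudo-word token keeps the whitespace that follows it
tokens = whitespace_tokenizer("one two\nthree")
print(tokens)  # ['one ', 'two\n', 'three']
```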

lexos.cutter.registry.load(s)

Load a single tokenizer from a string.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `s` | `str` | The name of the function. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Callable` | A tokenizer function. |

Source code in lexos/cutter/registry.py
def load(s: str) -> Callable:
    """Load a single tokenizer from a string.

    Args:
        s: The name of the function.

    Returns:
        Callable: A tokenizer function.
    """
    # Look up the name in the module-level `tokenizers` registry;
    # dict.get returns None if nothing is registered under `s`.
    return tokenizers.get(s)
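
A minimal sketch of looking a tokenizer up by name. The exact keys depend on how the module-level `tokenizers` registry is populated, so the name used below is an assumption; adjust it to match the registered names in your installation:

```python
from lexos.cutter import registry

# Assumption: the registry exposes the whitespace tokenizer under this key.
tokenizer = registry.load("whitespace_tokenizer")

# load() returns None when the name is not registered, so check before calling.
if tokenizer is not None:
    print(tokenizer("cut me up"))  # ['cut ', 'me ', 'up']
```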