# Cutter Registry
The `Ginsu` class allows the user to cut documents pre-tokenized with spaCy. Documents can be split into a predetermined number of segments, into segments containing a set number of tokens, or at tokens designated as milestones.
### `lexos.cutter.registry.character_tokenizer(text)`
Tokenize by single characters, keeping whitespace.
Parameters:

Name | Type | Description | Default
---|---|---|---
`text` | `str` | The text to tokenize. | *required*

Returns:

Name | Type | Description
---|---|---
`list` | `list` | A list of character tokens.
Source code in `lexos/cutter/registry.py`, lines 24-33.
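A minimal usage sketch; the output shown in the comment is an assumption based on the description above (one token per character, whitespace included):

```python
from lexos.cutter.registry import character_tokenizer

tokens = character_tokenizer("To be")
# Every character, including the space, becomes its own token:
# ['T', 'o', ' ', 'b', 'e']
```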
### `lexos.cutter.registry.linebreak_tokenizer(text)`
Tokenize by linebreaks, keeping whitespace.
Parameters:

Name | Type | Description | Default
---|---|---|---
`text` | `str` | The text to tokenize. | *required*

Returns:

Name | Type | Description
---|---|---
`list` | `list` | A list of line tokens.
Source code in `lexos/cutter/registry.py`, lines 37-46.
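A minimal usage sketch; the exact treatment of a final line without a trailing newline is an assumption:

```python
from lexos.cutter.registry import linebreak_tokenizer

tokens = linebreak_tokenizer("line one\nline two\nline three")
# Each line is one token with its trailing newline retained:
# ['line one\n', 'line two\n', 'line three']
```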
### `lexos.cutter.registry.whitespace_tokenizer(text)`
Tokenize on whitespace, keeping whitespace.
Parameters:

Name | Type | Description | Default
---|---|---|---
`text` | `str` | The text to tokenize. | *required*

Returns:

Name | Type | Description
---|---|---
`list` | `list` | A list of pseudo-word tokens.
Source code in `lexos/cutter/registry.py`, lines 11-20.
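A minimal usage sketch; where the kept whitespace attaches (leading vs. trailing) is an assumption:

```python
from lexos.cutter.registry import whitespace_tokenizer

tokens = whitespace_tokenizer("To be, or  not")
# Pseudo-word tokens that retain the whitespace around them, shown
# here with trailing whitespace attached (an assumption):
# ['To ', 'be, ', 'or  ', 'not']
```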
### `lexos.cutter.registry.load(s)`
Load a single tokenizer from a string.
Parameters:

Name | Type | Description | Default
---|---|---|---
`s` | `str` | The name of the function. | *required*

Returns:

Name | Type | Description
---|---|---
`Callable` | `Callable` | A tokenizer function.
Source code in `lexos/cutter/registry.py`, lines 55-64.
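A minimal usage sketch; that the registry accepts the plain function name shown here is an assumption based on the parameter description:

```python
from lexos.cutter import registry

# Look up a tokenizer by its function name, then apply it.
tokenize = registry.load("whitespace_tokenizer")
tokens = tokenize("Hamlet, Prince of Denmark")
```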