Whitespace Counter Tokenizer

This class inherits from the main Tokenizer class and extends it by counting runs of spaces and line breaks.

WhitespaceCounter (pydantic-model)

Bases: Tokenizer

Whitespace tokenizer that captures line breaks and counts runs of spaces.

Fields:

- model (Optional[str])
- max_length (Optional[int])
- disable (Optional[list[str]])
- stopwords (Optional[list[str] | str])
- nlp (Optional[Language])

Source code in lexos/tokenizer/whitespace_counter.py
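As a rough usage sketch (the import path, field names, and defaults are taken from this page; everything else is illustrative), the tokenizer can be instantiated with its pydantic fields and called directly:

```python
# Sketch only: assumes WhitespaceCounter is importable from the module path
# shown on this page. The field values below are the documented defaults.
from lexos.tokenizer.whitespace_counter import WhitespaceCounter

tokenizer = WhitespaceCounter(
    model="xx_sent_ud_sm",  # spaCy model used for tokenization
    max_length=2000000,     # maximum doc length
    disable=[],             # spaCy pipeline components to disable
    stopwords=[],           # stop words to apply to docs
)

doc = tokenizer("One  two\n\nthree")  # __call__ returns a spaCy Doc
print([token.text for token in doc])
```

The later examples on this page reuse this tokenizer instance.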
components: list[str] (property)

Return the spaCy pipeline components.

disable: Optional[list[str]] = [] (pydantic-field)

A list of spaCy pipeline components to disable.

disabled: list[str] (property)

Return the disabled spaCy pipeline components.

max_length: Optional[int] = 2000000 (pydantic-field)

The maximum length of the doc.

model: Optional[str] = 'xx_sent_ud_sm' (pydantic-field)

The name of the spaCy model to be used for tokenization.

pipeline: list[str] (property)

Return the spaCy pipeline components.

stopwords: Optional[list[str] | str] = [] (pydantic-field)

A list of stop words to apply to docs.
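The read-only properties above can be inspected on an instance; for example, reusing the tokenizer from the sketch at the top of the page:

```python
print(tokenizer.pipeline)    # names of the spaCy pipeline components
print(tokenizer.components)  # also the pipeline components, per the description above
print(tokenizer.disabled)    # components that are currently disabled
```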
__call__(texts: str | Iterable[str]) -> Doc | Iterable[Doc]

Tokenize a string or an iterable of strings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | str \| Iterable[str] | The text(s) to be tokenized. | required |

Returns:

| Type | Description |
|---|---|
| Doc \| Iterable[Doc] | The tokenized doc(s). |

Source code in lexos/tokenizer/__init__.py
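A brief sketch of the call interface (the types follow the signature above; the texts are illustrative):

```python
# A single string returns a single Doc.
doc = tokenizer("A line\nwith   irregular   spacing")

# An iterable of strings returns an iterable of Docs.
for d in tokenizer(["first text", "second  text"]):
    print(len(d), "tokens")
```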
__init__(**data) -> None

Initialise the Tokenizer class.

Source code in lexos/tokenizer/__init__.py
add_extension(name: str, default: str) -> None

Add an extension to the spaCy Token class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the extension. | required |
| default | str | The default value of the extension. | required |

Source code in lexos/tokenizer/__init__.py
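A minimal sketch of registering a custom token extension through this helper. The extension name is hypothetical, and the example assumes the method wraps spaCy's Token.set_extension:

```python
# "note" is a made-up extension name used only for illustration.
tokenizer.add_extension(name="note", default="")

doc = tokenizer("some text")
print(doc[0]._.note)  # spaCy exposes custom extensions under the ._ namespace
```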
add_stopwords(stopwords: str | list[str]) -> None

Add stopwords to the tokenizer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| stopwords | str \| list[str] | A list of stopwords to add to the model. | required |

Source code in lexos/tokenizer/__init__.py
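Per the signature, stopwords can be added as a single string or a list; an illustrative sketch:

```python
tokenizer.add_stopwords("the")          # single stop word
tokenizer.add_stopwords(["and", "of"])  # list of stop words
```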
make_doc(text: str, max_length: int = None, disable: list[str] = []) -> Doc

Return a doc from a text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to be parsed. | required |
| max_length | int | The maximum length of the doc. | None |
| disable | list[str] | A list of spaCy pipeline components to disable. | [] |

Returns:

| Name | Type | Description |
|---|---|---|
| Doc | Doc | A spaCy doc object. |

Source code in lexos/tokenizer/whitespace_counter.py
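A sketch of building a single doc with the parameters listed above; the component name passed to disable is only an example and should match the loaded model's pipeline:

```python
doc = tokenizer.make_doc(
    "Line one\n\nLine   two with a run of spaces",
    max_length=10000,    # override the default maximum doc length
    disable=["senter"],  # example component name; adjust to the model's pipeline
)
print(doc.text)
```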
make_docs(texts: Iterable[str], max_length: int = None, disable: Iterable[str] = [], chunk_size: int = 1000) -> Iterable[Doc]

Return a generator of docs from an iterable of texts, processing in chunks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | Iterable[str] | The texts to process. | required |
| max_length | int | Maximum doc length. | None |
| disable | Iterable[str] | Pipeline components to disable. | [] |
| chunk_size | int | Number of docs to process per chunk. | 1000 |

Yields:

| Name | Type | Description |
|---|---|---|
| Doc | Iterable[Doc] | spaCy Doc objects. |

Source code in lexos/tokenizer/whitespace_counter.py
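Because make_docs returns a generator, it can be consumed lazily; a sketch under the same assumptions as the earlier examples:

```python
texts = ["first document", "second  document", "third\ndocument"]

# A chunk_size of 2 is only for illustration; the documented default is 1000.
for doc in tokenizer.make_docs(texts, chunk_size=2):
    print(doc.text[:30])
```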
remove_extension(name: str) -> None

Remove an extension from the spaCy Token class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the extension. | required |
remove_stopwords(stopwords: str | list[str]) -> None

Remove stopwords from the tokenizer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| stopwords | str \| list[str] | A list of stopwords to remove from the model. | required |

Source code in lexos/tokenizer/__init__.py
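The remove_* helpers mirror their add_* counterparts; an illustrative sketch that undoes the earlier hypothetical additions:

```python
tokenizer.remove_stopwords(["and", "of"])
tokenizer.remove_stopwords("the")
tokenizer.remove_extension("note")  # same hypothetical extension name as above
```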
nlp: Optional[Language] (pydantic-field)

model_config = ConfigDict(arbitrary_types_allowed=True, json_schema_extra=(DocJSONSchema.schema()), validate_assignment=True) (class-attribute, instance-attribute)
_get_token_widths(text: str) -> tuple[list[str], list[int]]

Get the widths of tokens in a doc.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text. | required |

Returns:

| Type | Description |
|---|---|
| tuple[list[str], list[int]] | A tuple containing the tokens and widths. |

Source code in lexos/tokenizer/whitespace_counter.py
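This is a private helper, but its return shape follows from the signature: a list of token strings and a parallel list of integer widths. A sketch of a call (the exact width values are not specified on this page):

```python
tokens, widths = tokenizer._get_token_widths("one  two   three")
for tok, width in zip(tokens, widths):
    print(repr(tok), width)
```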