Tokenizer

The Tokenizer uses spaCy to convert texts to documents using one of the functions below. If no language model is specified, spaCy's multi-lingual xx_sent_ud_sm model is used.
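For example, a single text can be turned into a doc with make_doc (documented below). A minimal sketch, assuming the lexos package and the xx_sent_ud_sm model are installed:

```python
from lexos import tokenizer

# Tokenise a short text with the default multi-lingual model.
doc = tokenizer.make_doc("The end is the beginning is the end.")
print([token.text for token in doc])
```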
lexos.tokenizer._add_remove_stopwords(nlp, add_stopwords, remove_stopwords)
Add and remove stopwords from the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`nlp` | `spacy.Vocab` | The model to add stopwords to. | required |
`add_stopwords` | `Union[List[str], str]` | A list of stopwords to add to the model. | required |
`remove_stopwords` | `Union[bool, List[str], str]` | A list of stopwords to remove from the model. | required |
Returns:
Type | Description |
---|---|
`spacy.Vocab` | The model with stopwords added and removed. |
Source code in lexos\tokenizer\__init__.py, lines 24-55.
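This helper is internal; in normal use it is reached through the add_stopwords and remove_stopwords arguments of make_doc or make_docs (documented below). A hedged sketch of that route:

```python
from lexos import tokenizer

# make_doc applies the stopword changes to the loaded model before parsing the text.
doc = tokenizer.make_doc(
    "lorem ipsum dolor sit amet",
    add_stopwords=["lorem", "ipsum"],
)
print([(token.text, token.is_stop) for token in doc])
```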
lexos.tokenizer._get_disabled_components(disable=None, pipeline_components=None)
Get a list of components to disable in the pipeline.
Source code in lexos\tokenizer\__init__.py, lines 72-83.
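This is an internal helper; the disable list is normally supplied to make_doc or make_docs, which resolve it with this function. A hedged sketch of that route (per the note under _load_model below, names not found in the pipeline are simply ignored):

```python
from lexos import tokenizer

# "ner" is not part of the default xx_sent_ud_sm pipeline, so it is ignored.
doc = tokenizer.make_doc("A short text.", disable=["ner"])
```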
lexos.tokenizer._get_excluded_components(exclude=None, pipeline_components=None)
Get a list of components to exclude from the pipeline.
Source code in lexos\tokenizer\__init__.py, lines 58-69.
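As with the disable list, the exclude list is normally supplied to make_doc or make_docs, which resolve it with this helper. A hedged sketch:

```python
from lexos import tokenizer

# Excluded components are not loaded at all; unknown names are ignored
# (see the note under _load_model below).
doc = tokenizer.make_doc("A short text.", exclude=["ner"])
```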
lexos.tokenizer._load_model(model, disable=None, exclude=None)
Load a model from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`model` | `str` | The path to the model. | required |
`disable` | `List[str]` | A list of spaCy pipeline components to disable. | `None` |
`exclude` | `List[str]` | A list of spaCy pipeline components to exclude. | `None` |
Returns:
Name | Type | Description |
---|---|---|
`object` | `object` | The loaded model. |
Note
Attempts to disable or exclude components not found in the pipeline are ignored without raising an error.
Source code in lexos\tokenizer\__init__.py, lines 86-108.
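A hedged sketch of the behaviour described in the note above; this helper is what make_doc and make_docs use to load the language model:

```python
from lexos.tokenizer import _load_model

# "ner" is not in the xx_sent_ud_sm pipeline, so the request to disable it
# is ignored rather than raising an error.
nlp = _load_model("xx_sent_ud_sm", disable=["ner"])
print(nlp.pipe_names)
```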
lexos.tokenizer._validate_input(input)
Ensure that input is a string, Doc, or bytes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`input` | `Any` | The input to be tested. | required |
Returns:
Type | Description |
---|---|
`None` | None |
Raises:
Type | Description |
---|---|
`LexosException` | Raised if the input is not valid. |
Source code in lexos\tokenizer\__init__.py, lines 111-130.
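A small sketch of the behaviour described above. The import path of LexosException is not shown in this section, so the sketch catches the exception generically:

```python
from lexos.tokenizer import _validate_input

_validate_input("plain text")  # a str is valid, so nothing is raised

try:
    _validate_input(42)        # not a str, Doc, or bytes
except Exception as exc:       # LexosException
    print(type(exc).__name__, exc)
```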
lexos.tokenizer.doc_from_ngrams(ngrams, model='xx_sent_ud_sm', strict=False, disable=[], exclude=[])
Generate spaCy doc from a list of ngrams.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`ngrams` | `list` | A list of ngrams. | required |
`model` | `object` | The language model to use for tokenisation. | `'xx_sent_ud_sm'` |
`strict` | `bool` | Whether to preserve token divisions, including whitespace, in the source. | `False` |
`disable` | `List[str]` | A list of spaCy pipeline components to disable. | `[]` |
`exclude` | `List[str]` | A list of spaCy pipeline components to exclude. | `[]` |
Returns:
Name | Type | Description |
---|---|---|
`object` | `object` | A spaCy doc. |
Notes
The `strict=False` setting will allow spaCy's language model to remove whitespace from ngrams and split punctuation into separate tokens. `strict=True` will preserve the sequences in the source list.
Source code in lexos\tokenizer\__init__.py, lines 213-247.
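A minimal sketch combining this function with generate_character_ngrams (documented below); with `strict=True` the ngram sequences from the source list are preserved as tokens:

```python
from lexos import tokenizer

ngrams = tokenizer.generate_character_ngrams("banana", size=2)
doc = tokenizer.doc_from_ngrams(ngrams, strict=True)
print([token.text for token in doc])
```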
lexos.tokenizer.docs_from_ngrams(ngrams, model='xx_sent_ud_sm', strict=False, disable=[], exclude=[])
Generate spaCy docs from a list of ngram lists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`ngrams` | `List[list]` | A list of ngram lists. | required |
`model` | `object` | The language model to use for tokenisation. | `'xx_sent_ud_sm'` |
`strict` | `bool` | Whether to preserve token divisions, including whitespace, in the source. | `False` |
`disable` | `List[str]` | A list of spaCy pipeline components to disable. | `[]` |
`exclude` | `List[str]` | A list of spaCy pipeline components to exclude. | `[]` |
Returns:
Type | Description |
---|---|
`List[object]` | A list of spaCy docs. |
Source code in lexos\tokenizer\__init__.py, lines 250-275.
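A hedged sketch that produces one doc per ngram list:

```python
from lexos import tokenizer

ngram_lists = [
    tokenizer.generate_character_ngrams("banana", size=2),
    tokenizer.generate_character_ngrams("bandana", size=2),
]
docs = tokenizer.docs_from_ngrams(ngram_lists, strict=True)
print(len(docs))
```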
lexos.tokenizer.generate_character_ngrams(text, size=1, drop_whitespace=True)
Generate character n-grams from raw text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The source text. | required |
`size` | `int` | The size of the ngram. | `1` |
`drop_whitespace` | `bool` | Whether to drop whitespace from the ngram list. | `True` |
Returns:
Type | Description |
---|---|
`List[str]` | A list of ngrams. |
Source code in lexos\tokenizer\__init__.py, lines 278-293.
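A minimal sketch; with the default `drop_whitespace=True`, whitespace is dropped from the ngram list:

```python
from lexos import tokenizer

# Character bigrams of a two-word string.
print(tokenizer.generate_character_ngrams("to be", size=2))
```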
lexos.tokenizer.make_doc(text, model='xx_sent_ud_sm', max_length=2000000, disable=[], exclude=[], add_stopwords=[], remove_stopwords=[], pipeline_components=[])
Return a doc from a text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The text to be parsed. | required |
`model` | `object` | The model to be used. | `'xx_sent_ud_sm'` |
`max_length` | `int` | The maximum length of the doc. | `2000000` |
`disable` | `List[str]` | A list of spaCy pipeline components to disable. | `[]` |
`exclude` | `List[str]` | A list of spaCy pipeline components to exclude. | `[]` |
`add_stopwords` | `Union[List[str], str]` | A list of stop words to add to the model. | `[]` |
`remove_stopwords` | `Union[bool, List[str], str]` | A list of stop words to remove from the model. | `[]` |
`pipeline_components` | `List[dict]` | A list of custom component dicts to add to the pipeline. See https://spacy.io/api/language/#add_pipe for more information. | `[]` |
Returns:
Name | Type | Description |
---|---|---|
`object` | `object` | A spaCy doc object. |
Source code in lexos\tokenizer\__init__.py, lines 133-170.
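A hedged usage sketch; the values below are illustrative only:

```python
from lexos import tokenizer

text = "the quick brown fox jumps over the lazy dog " * 1000
# Raise the length ceiling and skip sentence segmentation while tokenising.
doc = tokenizer.make_doc(text, max_length=3000000, disable=["senter"])
print(len(doc))
```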
lexos.tokenizer.make_docs(texts, model='xx_sent_ud_sm', max_length=2000000, disable=[], exclude=[], add_stopwords=[], remove_stopwords=[], pipeline_components=[])
Return a list of docs from a text or list of texts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`texts` | `Union[List[str], str]` | The text(s) to be parsed. | required |
`model` | `object` | The model to be used. | `'xx_sent_ud_sm'` |
`max_length` | `int` | The maximum length of the doc. | `2000000` |
`disable` | `List[str]` | A list of spaCy pipeline components to disable. | `[]` |
`exclude` | `List[str]` | A list of spaCy pipeline components to exclude. | `[]` |
`add_stopwords` | `Union[List[str], str]` | A list of stop words to add to the model. | `[]` |
`remove_stopwords` | `Union[bool, List[str], str]` | A list of stop words to remove from the model. | `[]` |
`pipeline_components` | `List[dict]` | A list of custom component dicts to add to the pipeline. See https://spacy.io/api/language/#add_pipe for more information. | `[]` |
Returns:
Type | Description |
---|---|
`List[object]` | A list of spaCy doc objects. |
Source code in lexos\tokenizer\__init__.py, lines 173-210.
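A minimal sketch parsing two texts with the default model:

```python
from lexos import tokenizer

docs = tokenizer.make_docs(["First text to parse.", "Second text to parse."])
for doc in docs:
    print([token.text for token in doc])
```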
lexos.tokenizer.ngrams_from_doc(doc, size=2)
Generate a list of ngrams from a spaCy doc.
A wrapper for textacy.extract.basics.ngrams with basic functionality. Further functionality can be accessed by calling textacy directly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`doc` | `object` | A spaCy doc. | required |
`size` | `int` | The size of the ngrams. | `2` |
Returns:
Type | Description |
---|---|
`List[str]` | A list of ngrams. |
Source code in lexos\tokenizer\__init__.py, lines 296-315.
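A minimal sketch extracting bigrams from a doc built with make_doc; since the wrapper exposes only basic options, textacy's own defaults presumably govern any filtering of the returned ngrams:

```python
from lexos import tokenizer

doc = tokenizer.make_doc("the end is the beginning is the end")
print(tokenizer.ngrams_from_doc(doc, size=2))
```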