Util
utils.py.
This file contains helper functions used by multiple modules.
Last Updated: June 24, 2025
Last Tested: June 24, 2025
Functions:
| Name | Description |
|---|---|
| ensure_list | Ensure an item is wrapped in a list. |
| ensure_path | Ensure string is converted to a Path. |
| get_encoding | Use chardet to return the encoding type of a string. |
| get_paths | Get a list of paths in a directory. |
| get_token_extension_names | Get the names of token extensions from a spaCy Doc. |
| is_valid_colour | Check if a string is a valid colour. |
| load_spacy_model | Load a spaCy language model. |
| normalize | Normalise a string to LexosFile format. |
| normalize_file | Normalise a file to LexosFile format and save the file. |
| normalize_files | Normalise a list of files to LexosFile format and save the files. |
| normalize_strings | Normalise a list of strings to LexosFile format. |
| strip_doc | Strip leading and normalise trailing whitespace in a spaCy Doc. |
| to_collection | Validate and cast a value or values to a collection. |
ensure_list(item: Any) -> list
Ensure an item is wrapped in a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | Any | Anything. | required |
Returns:
| Type | Description |
|---|---|
| list | The item inside a list if it is not already a list. |
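The behaviour described above can be illustrated with a minimal sketch. This is not the Lexos implementation: `ensure_list_sketch` is a hypothetical name, and the assumption that only `list` instances pass through unchanged (so tuples and sets get wrapped) is inferred from the Returns description.

```python
from typing import Any


def ensure_list_sketch(item: Any) -> list:
    """Hypothetical stand-in for ensure_list: wrap non-list items in a list."""
    # Assumption: only real lists are returned unchanged.
    if isinstance(item, list):
        return item
    return [item]


print(ensure_list_sketch("a.txt"))
print(ensure_list_sketch(["a.txt", "b.txt"]))
```

A helper like this lets calling code accept either a single value or a list and then iterate without further type checks.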
ensure_path(path: Any) -> Any
Ensure string is converted to a Path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Any | Anything. If string, it's converted to Path. | required |
Returns:
| Type | Description |
|---|---|
| Any | Path or original argument. |
Source code in lexos/util.py
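As a rough illustration of the contract above (strings become `Path` objects, everything else is returned as-is), here is a minimal sketch. `ensure_path_sketch` is a hypothetical name and the body is an assumption based only on the parameter and return descriptions, not the actual Lexos source.

```python
from pathlib import Path
from typing import Any


def ensure_path_sketch(path: Any) -> Any:
    """Hypothetical stand-in for ensure_path."""
    # Only strings are converted; any other value is returned untouched.
    if isinstance(path, str):
        return Path(path)
    return path


print(ensure_path_sketch("data/a.txt"))
print(ensure_path_sketch(42))
```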
get_encoding(input_string: bytes) -> str
Use chardet to return the encoding type of a string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_string | bytes | A bytestring. | required |
Returns:
| Type | Description |
|---|---|
| str | The string's encoding type. |
get_paths(path: Path | str) -> list
Get a list of paths in a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | The path to the directory. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list | list | A list of file paths. |
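One way such a directory listing could work is sketched below. The name `get_paths_sketch` is hypothetical, and several details are assumptions the documentation does not confirm: that only files (not subdirectories) are returned, that the listing is not recursive, and that the result is sorted.

```python
import tempfile
from pathlib import Path
from typing import Union


def get_paths_sketch(path: Union[Path, str]) -> list:
    """Hypothetical stand-in for get_paths: list files directly inside a directory."""
    directory = Path(path)
    if not directory.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    # Assumption: files only, no recursion, sorted for a stable order.
    return sorted(p for p in directory.iterdir() if p.is_file())


# Build a throwaway directory to demonstrate the call.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("beta")
    (Path(tmp) / "a.txt").write_text("alpha")
    (Path(tmp) / "subdir").mkdir()
    names = [p.name for p in get_paths_sketch(tmp)]
    print(names)
```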
get_token_extension_names(doc: Doc) -> list[str]
Get the names of token extensions from a spaCy Doc.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc | Doc | spaCy Doc to analyze. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of token extension names. |
is_valid_colour(color: str) -> bool
Check if a string is a valid colour.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| color | str | A string representing a colour. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the string is a valid colour, False otherwise. |
Note: Implements Pydantic's Color type for validation. See https://docs.pydantic.dev/2.0/usage/types/extra_types/color_types/ for more information.
load_spacy_model(model: Language | str) -> Language
Load a spaCy language model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Language \| str | The spaCy model to load, either as a Language object or a string representing the model name. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Language | Language | The loaded spaCy language model. |
Raises:
| Type | Description |
|---|---|
| LexosException | If the model cannot be loaded or if the model type is incorrect. |
normalize(raw_bytes: bytes | str) -> str
Normalise a string to LexosFile format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_bytes | bytes \| str | The input bytestring. | required |
Returns:
| Type | Description |
|---|---|
| str | Normalised version of the input string. |
normalize_file(filepath: Path | str, destination_dir: Path | str = '.') -> None
Normalise a file to LexosFile format and save the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | Path \| str | The path to the input file. | required |
| destination_dir | Path \| str | The path to the directory where the file will be saved. | '.' |
normalize_files(filepaths: list[Path | str], destination_dir: Path | str = '.') -> None
Normalise a list of files to LexosFile format and save the files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepaths | list[Path \| str] | The list of paths to input files. | required |
| destination_dir | Path \| str | The path to the directory where the files will be saved. | '.' |
normalize_strings(strings: list[str]) -> list[str]
Normalise a list of strings to LexosFile format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| strings | list[str] | The list of input strings. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of normalised versions of the input strings. |
strip_doc(doc: Doc) -> Doc
Strip leading and normalise trailing whitespace in a spaCy Doc.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc | Doc | spaCy Doc to analyze. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Doc | Doc | The Doc with leading and trailing whitespace removed. |
Raises:
| Type | Description |
|---|---|
| ValueError | If Doc is empty or contains only whitespace. |
Note: If the final token has trailing whitespace, it is preserved. You can remove it with:
```python
from spacy.tokens import Doc

# `doc` is an existing spaCy Doc; rebuild it without the final trailing space.
words = [t.text for t in doc]
spaces = [bool(t.whitespace_) for t in doc]
spaces[-1] = False
doc = Doc(doc.vocab, words=words, spaces=spaces)
```
However, rebuilding the Doc this way discards all entities and custom extensions, so it usually makes more sense to call doc.text.strip() when needed instead.
to_collection(val: AnyVal | Collection[AnyVal], val_type: type[Any] | tuple[type[Any], ...], col_type: type[Any]) -> Collection[AnyVal]
Validate and cast a value or values to a collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| val | AnyVal \| Collection[AnyVal] | Value or values to validate and cast. | required |
| val_type | type[Any] \| tuple[type[Any], ...] | Type of each value in the collection. | required |
| col_type | type[Any] | Type of collection to return. | required |
Returns:
| Type | Description |
|---|---|
| Collection[AnyVal] | Collection of type col_type. |
Raises:
| Type | Description |
|---|---|
| TypeError | An invalid value was passed. |
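The validate-and-cast pattern described above might look roughly like the following sketch. `to_collection_sketch` is a hypothetical name, and the set of container types treated as "values" (list, tuple, set, frozenset) is an assumption; the actual Lexos function may accept or reject different inputs.

```python
def to_collection_sketch(val, val_type, col_type):
    """Hypothetical stand-in for to_collection."""
    # A single value of the expected type is wrapped in the collection type.
    if isinstance(val, val_type):
        return col_type([val])
    # Assumption: only these built-in containers are treated as multiple values.
    if isinstance(val, (list, tuple, set, frozenset)):
        if not all(isinstance(v, val_type) for v in val):
            raise TypeError(f"not all values are of type {val_type}")
        return col_type(val)
    raise TypeError(
        f"expected {val_type} or a collection of {val_type}, got {type(val)}"
    )


print(to_collection_sketch(1, int, list))
print(to_collection_sketch((1, 2.5), (int, float), tuple))
```

Checking the single-value case first means a lone string is wrapped whole instead of being split into characters.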