Tokenizing Texts

Language Models¤

The tokenizer module is a big change for Lexos, as it formally separates tokenization from preprocessing. In the Lexos app, users employ Scrubber to massage the text into shape using their implicit knowledge about the text's language. Tokenization then takes place by splitting the text according to a regular expression pattern (normally whitespace). By contrast, the Lexos API uses a language model that formalizes the implicit rules and thus automates the tokenization process. Language models can implement both rule-based and probabilistic strategies for separating document strings into tokens. Because they have built-in procedures appropriate to specific languages, language models can often do a better job of tokenization than the approach used in the Lexos app.

Important

There are some trade-offs to using language models. Because the algorithm does more than split strings, processing times can be longer. In addition, tokenization is no longer (explicitly) language agnostic. A language model is "opinionated", and it may overfit the data. Furthermore, if no language model exists for the language being tokenized, the results may not be satisfactory. The Lexos strategy for handling this situation is described below.

Tokenized Documents¤

A tokenized document can be defined as a text split into tokens in which each token is stored with any number of annotations assigned by the model. These annotations are token "attributes". The structure of a tokenized document can then be conceived in theory as a list of dicts like the following, where each keyword is an attribute.

tokenized_doc = [
    {"text": "The", "part_of_speech": "determiner", "is_stopword": True},
    {"text": "end", "part_of_speech": "noun", "is_stopword": False}
]

It is then a simple matter to iterate through the document and retrieve all the tokens that are not stopwords using a Python list comprehension.

non_stopwords = [
    token for token in tokenized_doc
    if not token["is_stopword"]
]

Many filtering procedures are easy to implement in this way.

For languages such as Modern English, language models exist that can automatically annotate tokens with information like parts of speech, lemmas, stop words, and other information. However, token attributes can also be set after the text has been tokenized.

If no language model exists for the text's language, it will only be possible to tokenize using general rules, and it will not be possible to add other annotations (at the tokenization stage). But new language models, including models for historical languages, are being produced all the time, and this is a growing area of interest in the Digital Humanities.

spaCy Docs¤

The Lexos API wraps the spaCy Natural Language Processing (NLP) library for loading language models and tokenizing texts. Because spaCy has excellent documentation and fairly wide acceptance in the Digital Humanities community, it is a good tool to use under the bonnet. spaCy has a growing number of language models for many languages, as well as wrappers for loading models from other common NLP libraries such as Stanford Stanza.

Note

The architecture of the Scrubber module is partially built on top of the preprocessing functions in Textacy, which also accesses and extends spaCy.

In spaCy, texts are parsed into spacy.Doc objects consisting of sequences of annotated tokens.

Note

In order to formalize the difference between a text string that has been scrubbed and one that has been tokenized, we refer wherever possible to the string as a "text" and to the tokenized spacy.Doc object as a "document" (or just "doc"). We continue to refer to the individual items as "documents" if we are not concerned with their data type.

Each token is a spacy.Token object, which stores all the token's attributes.

The Lexos API wraps this procedure in the tokenizer.make_doc() function:

from lexos import tokenizer

doc = tokenizer.make_doc(text)

This returns a spacy.Doc object.

By default the tokenizer uses spaCy's "xx_sent_ud_sm" language model, which has been trained for tokenization and sentence segmentation on multiple languages. This model performs statistical sentence segmentation and possesses general rules for token segmentation that work well for a variety of languages.

If you were making a document from a text in a language that required a more language-specific model, you would specify the model to be used. For instance, to use spaCy's small English model trained on web texts, you would call

doc = tokenizer.make_doc(text, model="en_core_web_sm")

tokenizer also has a make_docs() function to parse a list of texts into spaCy docs.
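
For example, a minimal sketch, assuming texts is a list of scrubbed strings:

from lexos import tokenizer

texts = ["The end is nigh.", "The beginning was long ago."]

# Parse each text into a spacy.Doc using the default multilingual model
docs = tokenizer.make_docs(texts)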

Important

Tokenization using spaCy uses a lot of memory. For a small English-language model, the parser and named entity recognizer (NER) can require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the memory limit with the max_length parameter. The limit is in number of characters (the default is set to 2,000,000 for Lexos), so you can check whether your inputs are too long by checking len(text). If you are not using RAM-hungry pipeline components, you can disable or exclude them to avoid errors and increase efficiency (see the discussion on the spaCy pipeline below). In some cases, it may also be possible to cut the texts into segments before tokenization.
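
As a hedged sketch, assuming make_doc() accepts the max_length keyword described above (check the API reference for the exact signature):

# Check whether the input exceeds the default 2,000,000-character limit
if len(text) > 2_000_000:
    doc = tokenizer.make_doc(
        text,
        model="en_core_web_sm",
        max_length=len(text) + 1,   # assumed keyword: raises the character limit
        disable=["parser", "ner"],  # skip the RAM-hungry components
    )
else:
    doc = tokenizer.make_doc(text, model="en_core_web_sm")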

A list of individual tokens can be obtained by iterating over the spaCy doc:

# Get a list of tokens
tokens = [token.text for token in doc]

Here the text attribute stores the original text form of the token. SpaCy docs are non-destructive because they preserve the original text alongside the list of tokens and their attributes. You can access the original text of the entire doc by calling doc.text (assuming you have assigned the Doc object to the doc variable). Indeed, calling doc.to_json() will return a JSON representation which gives the start and end position of each token in the original text!
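
Because the doc is non-destructive, you can recover the original text or the token offsets at any time:

# The original, untokenized text
print(doc.text)

# JSON representation with the start and end position of each token
print(doc.to_json())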

As mentioned above, you can use a Python list comprehension to filter the contents of the doc using the information in the tokens' attributes. For instance:

# Get a list of non-punctuation tokens
non_punct_tokens = [token.text for token in doc if not token.is_punct]

The example above leverages the built-in is_punct attribute to indicate whether the token is defined as (or predicted to be) a punctuation mark in the language model. spaCy tokens have a number of built-in attributes, which are described in the spaCy API reference.
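
For example, with a model that includes a tagger and lemmatizer, such as "en_core_web_sm", several built-in attributes can be inspected at once:

# Print a few built-in attributes for each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_punct, token.is_stop)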

Note

It is possible to extend spaCy's Doc and Token objects with custom extension attributes. Lexos has a sample is_fruit extension (borrowed from the spaCy docs), which is illustrated below. Note that extensions are accessed via the underscore prefix, as shown.

# Indicate whether the token is labelled as fruit
for token in doc:
    print(token._.is_fruit)

The sample extension can be found in lexos.tokenizer.extensions.

The spaCy Pipeline¤

Once spaCy tokenizes a text, it normally passes the resulting document to a pipeline of functions to parse it for other features. Typically, these functions perform actions such as part-of-speech tagging, labelling syntactic dependencies, and identifying named entities. Processing times can be reduced by disabling pipeline components if they are unavailable in the language model or not needed for the application's purposes. make_doc() and make_docs() will automatically run all pipeline components in the model unless they are disabled or excluded with the disable or exclude parameter. Check the model's documentation for the names of the components it includes.
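
For example, to skip components that are not needed for simple tokenization (the component names here are those used by spaCy's English models, and the disable/exclude semantics follow spaCy's: disabled components are loaded but not run, excluded components are not loaded at all):

doc = tokenizer.make_doc(
    text,
    model="en_core_web_sm",
    disable=["parser", "ner"],  # loaded but skipped
    exclude=["lemmatizer"]      # not loaded at all
)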

It is also possible to include custom pipeline components, which can be inserted at any point in the pipeline order. Custom components are supplied with the pipeline_components parameter, which takes a dictionary containing the keyword "custom". The value is a list of dictionaries where each dictionary contains information about the component as described in spaCy's documentation.

Note

The pipeline_components dict also contains disable and exclude keywords. The values are lists of components which will be merged with any components supplied in the disable or exclude parameters of make_doc() and make_docs().

The ability to add custom pipeline components is valuable for certain language- or application-specific scenarios. It also opens Lexos up to the wealth of third-party pipeline components available through the spaCy Universe.
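
As a rough sketch of the expected shape (the keys inside each component dict are assumptions modelled on spaCy's add_pipe() arguments, and "sentencizer" is simply an illustrative built-in factory name):

pipeline_components = {
    "custom": [
        # Each dict describes one component to insert into the pipeline
        {"name": "sentencizer", "before": "parser"}
    ],
    # Merged with the disable/exclude parameters of make_doc()
    "disable": ["ner"],
    "exclude": []
}

doc = tokenizer.make_doc(
    text,
    model="en_core_web_sm",
    pipeline_components=pipeline_components
)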

Handling Stop Words¤

Every token in a spaCy doc has an is_stop attribute. Most language models have a list of default stop words, and this list is used to set the is_stop attribute to True for every matching token when the document is parsed. It is possible to add stop words to the default list by passing a list to make_doc() or make_docs() with the add_stopwords argument:

doc = tokenizer.make_doc(
    text,
    model="en_core_web_sm",
    add_stopwords=["yes", "no", "maybe"]
)

The remove_stopwords argument removes stop words from the default list. If remove_stopwords=True, all stop words are removed.
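
For instance, a minimal sketch removing a few words from the default list (pass remove_stopwords=True instead to remove them all):

doc = tokenizer.make_doc(
    text,
    model="en_core_web_sm",
    remove_stopwords=["no", "not", "nor"]
)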

Important

add_stopwords and remove_stopwords do not remove stop word tokens from the doc; rather, they modify the stop word list used to set the is_stop attribute of individual tokens. To get a list of tokens without stop words, you must do something like [token for token in doc if not token.is_stop]. If you are producing a corpus of documents in which the documents will be processed by different models, it is most efficient to process the documents in batches, one batch for each model.

LexosDocs¤

The Lexos API also has a LexosDoc class, which provides a wrapper for spaCy docs. Its use is illustrated below.

from lexos.tokenizer.lexosdoc import LexosDoc

lexos_doc = LexosDoc(doc)
tokens = lexos_doc.get_tokens()

This example just returns [token.text for token in doc], so it is not strictly necessary. But using the LexosDoc wrapper can be useful for producing clean code. In other cases, it might be useful to manipulate spaCy docs with methods that go beyond their built-in or extended attributes and methods. For instance, LexosDoc.get_token_attrs() shows what attributes are available for tokens in the doc, and LexosDoc.to_dataframe() exports the tokens and their attributes to a pandas dataframe.
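
A quick sketch using the methods named above:

# List the attributes available for tokens in this doc
attrs = lexos_doc.get_token_attrs()

# Export the tokens and their attributes to a pandas dataframe
df = lexos_doc.to_dataframe()
print(df.head())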

Ngrams¤

Both texts and documents can be parsed into sequences of two or more tokens called ngrams. Many spaCy models can identify syntactic units such as noun chunks. These capabilities are not covered here since they are language specific. Instead, the sections below describe how to obtain more general ngram sequences.

Generating Word Ngrams¤

The easiest method of obtaining ngrams from a text is to create a spaCy doc and then call Textacy's textacy.extract.basics.ngrams() function:

import spacy
from textacy.extract.basics import ngrams as ng

nlp = spacy.load("xx_sent_ud_sm")

text = "The end is nigh."

doc = nlp(text)

ngrams = list(ng(doc, 2, min_freq=1))

This will produce [The end, end is, is nigh]. The output is a list of spaCy Span objects rather than strings. (An additional [ngram.text for ngram in ngrams] is required to ensure that you have quoted strings: ["The end", "end is", "is nigh"].)

Textacy has a lot of additional options, which are documented in the Textacy API reference under textacy.extract.basics.ngrams. However, if you do not need these options, you can use Tokenizer's helper function ngrams_from_doc():

import spacy

from lexos import tokenizer

nlp = spacy.load("xx_sent_ud_sm")

text = "The end is nigh."

doc = nlp(text)

ngrams = tokenizer.ngrams_from_doc(doc, size=2)

Notice that in both cases, the output will be a list of overlapping ngrams generated by a rolling window across the pre-tokenized document. If you want your document to contain ngrams as tokens, you will need to create a new document using Tokenizer's doc_from_ngrams() function:

doc = tokenizer.doc_from_ngrams(ngrams, strict=True)

Note

Setting strict=False will preserve all the whitespace in the ngrams; otherwise, your language model may modify the output by doing things like splitting punctuation into separate tokens.

There is also a docs_from_ngrams() function to which you can feed multiple lists of ngrams.

A possible workflow might call Textacy directly to take advantage of some of its filters when generating ngrams and then call doc_from_ngrams() to pipe the extracted tokens back into a doc. textacy.extract.basics.ngrams has sister functions that do things like extract noun chunks (if available in the language model), making this a very powerful approach for generating ngrams with semantic information.

Generating Character Ngrams¤

Character ngrams at their most basic level split the untokenized string every N characters. So "The end is nigh." would produce something like ["Th", "e ", "nd", " i", "s ", "ni", "gh", "."] (if we wanted to preserve the whitespace). Tokenizer does this with the generate_character_ngrams() function:

text = "The end is nigh."

ngrams = generate_character_ngrams(text, 2, drop_whitespace=False)

This will produce the output shown above. If we wanted to output ["Th", "en", "di", "sn", "ig", "h."], we would use drop_whitespace=True (which is the default).

Note

generate_character_ngrams() is a wrapper for Python's textwrap.wrap method, which can also be called directly.
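
For instance, calling the standard library directly:

import textwrap

# Fixed-width chunks of the raw string, keeping the whitespace
chunks = textwrap.wrap("The end is nigh.", 2, drop_whitespace=False)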

Once you have produced a list of ngrams, you can create a doc from them using doc_from_ngrams(), as shown above.
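
Putting the pieces together, a minimal sketch of this workflow using the tokenizer functions introduced above:

from lexos import tokenizer

text = "The end is nigh."

# Non-overlapping character ngrams with whitespace dropped (the default)
ngrams = tokenizer.generate_character_ngrams(text, 2)

# Build a new doc whose tokens are the ngrams themselves
doc = tokenizer.doc_from_ngrams(ngrams, strict=True)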

Use generate_character_ngrams() (a) when you simply want a list of non-overlapping ngrams, or (b) when you want to produce docs with non-overlapping ngrams as tokens.

Note that your language model may not be able to apply labels effectively to ngram tokens, so working with character ngrams is primarily useful if you are planning to work with the token forms only, or if the ngram size you use maps closely to the character lengths of words in the language you are working in.