Preprocessing Steps
Overview
Before you begin to analyze your documents, you may wish to apply a number of preprocessing procedures to prepare them for analysis.
Lexos offers four methods of preprocessing, illustrated end to end in the sketch after this list:
- Scrubbing (also called text cleaning): Transform the raw text string of your data.
- Tokenization: Split your text into countable tokens, optionally applying language-specific rules that can also annotate tokens with linguistic metadata.
- Cutting: Split your text into smaller texts (or perform the reverse by merging smaller texts). Boundaries for cutting can be based on a variety of string or token patterns, as well as structural divisions (milestones) in your texts.
- Filtering: Remove tokens from pre-tokenized texts based on token annotations.
Click on each of the links to learn more.
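To make the four steps concrete, here is a minimal sketch in plain Python with spaCy. The function names (`scrub`, `cut`, `filter_tokens`) are illustrative stand-ins, not the Lexos API, and spaCy's blank English pipeline is assumed only because it tokenizes and exposes lexeme-level flags without a trained model.

```python
# Minimal sketch of the four preprocessing steps.
# Hypothetical helper names; not the Lexos API.
import re
import spacy

def scrub(text: str) -> str:
    """Scrubbing: transform the raw string (lowercase, drop digits, collapse whitespace)."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)            # remove digits
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Blank pipeline: tokenizer plus lexical attributes, no trained model required.
nlp = spacy.blank("en")

def cut(doc, size: int):
    """Cutting: split the token stream into fixed-size segments."""
    tokens = list(doc)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def filter_tokens(tokens):
    """Filtering: drop tokens whose annotations flag them as stop words or punctuation."""
    return [t for t in tokens if not (t.is_stop or t.is_punct)]

raw = "Call me Ishmael. Some years ago, in 1851, I went to sea."
doc = nlp(scrub(raw))                                    # tokenization
segments = [filter_tokens(seg) for seg in cut(doc, 6)]   # cutting, then filtering
print([[t.text for t in seg] for seg in segments])
```

In Lexos itself each step is configurable (for example, cutting on milestones rather than fixed token counts); this sketch only shows how the stages feed one another.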