Preprocessing Steps

Overview

Before you begin to analyze your documents, you may wish to perform a number of preprocessing procedures to prepare them for analysis.

Lexos offers four methods of preprocessing:

  1. Scrubbing (also called text cleaning): Transform the raw text string of your data.
  2. Tokenization: Split your text into countable tokens, optionally with language-specific rules that may also annotate tokens with linguistic metadata.
  3. Cutting: Split your text into smaller texts (or perform the reverse by merging smaller texts). Boundaries for cutting can be based on a variety of string or token patterns, as well as structural divisions (milestones) in your texts.
  4. Filtering: Remove tokens from pre-tokenized texts based on token annotations.

Click on each of the links to learn more. Brief, illustrative sketches of each step follow below.
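As a rough illustration of scrubbing, the sketch below applies a few common text-cleaning transformations using only the Python standard library. The particular choices (lowercasing, removing digits and punctuation, collapsing whitespace) are assumptions made for demonstration, not the exact set of rules Lexos applies.

```python
import re

def scrub(text: str) -> str:
    """Clean a raw text string before tokenization (illustrative rules only)."""
    text = text.lower()                       # fold case
    text = re.sub(r"\d+", "", text)           # remove digits
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(scrub("  Hello, World! 42 times  "))   # -> "hello world times"
```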
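Tokenization can be as simple as splitting on whitespace, but language-aware tokenizers also attach linguistic metadata to each token. The sketch below uses spaCy as one such tokenizer; this is an illustration, not necessarily how Lexos tokenizes internally, and it assumes the `en_core_web_sm` model has been downloaded.

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    # Each token carries annotations such as part of speech and stopword status.
    print(token.text, token.pos_, token.is_stop)
```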
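Cutting splits a text into smaller segments. Two common strategies are fixed token counts and structural milestones; the helpers below are hypothetical illustrations of both, not Lexos functions.

```python
def cut_by_count(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def cut_by_milestone(text: str, milestone: str) -> list[str]:
    """Split raw text wherever a structural marker (e.g. 'CHAPTER') occurs."""
    return [part.strip() for part in text.split(milestone) if part.strip()]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(cut_by_count(tokens, 4))  # segments of 4, 4, and 1 tokens
print(cut_by_milestone("CHAPTER I Call me Ishmael. CHAPTER II The Carpet-Bag.",
                       "CHAPTER"))
```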
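Filtering removes tokens from a pre-tokenized text based on their annotations. Continuing with spaCy annotations as a stand-in, the sketch below keeps only content-bearing parts of speech; the specific rule is an assumption chosen for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Drop stopwords and keep only nouns, verbs, and adjectives (illustrative rule).
kept = [t.text for t in doc if not t.is_stop and t.pos_ in {"NOUN", "VERB", "ADJ"}]
print(kept)  # e.g. ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```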