Preprocessing Steps
Overview
Before you begin to analyze your documents, you may wish to apply a number of preprocessing procedures to prepare them for analysis.
Lexos offers four methods of preprocessing, illustrated end to end in the sketch after this list:
- Scrubbing (also called text cleaning): Transform the raw text string of your data.
- Tokenization: Split your text into countable tokens, optionally applying language-specific rules that can also annotate tokens with linguistic metadata.
- Cutting: Split your text into smaller texts (or perform the reverse by merging smaller texts). Boundaries for cutting can be based on a variety of string or token patterns, as well as structural divisions (milestones) in your texts.
- Filtering: Remove tokens from pre-tokenized texts based on token annotations.
Click on each of the links to learn more.
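To make the four steps concrete, here is a minimal sketch in plain Python with spaCy. The function names (`scrub`, `cut`, `filter_tokens`) are illustrative stand-ins, not the Lexos API, and spaCy's blank English pipeline is assumed only because it tokenizes and exposes lexeme-level flags without a trained model.

```python
# Minimal sketch of the four preprocessing steps.
# Hypothetical helper names; not the Lexos API.
import re
import spacy

def scrub(text: str) -> str:
    """Scrubbing: transform the raw string (lowercase, drop digits, collapse whitespace)."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)            # remove digits
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Blank pipeline: tokenizer plus lexical attributes, no trained model required.
nlp = spacy.blank("en")

def cut(doc, size: int):
    """Cutting: split the token stream into fixed-size segments."""
    tokens = list(doc)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def filter_tokens(tokens):
    """Filtering: drop tokens whose annotations flag them as stop words or punctuation."""
    return [t for t in tokens if not (t.is_stop or t.is_punct)]

raw = "Call me Ishmael. Some years ago, in 1851, I went to sea."
doc = nlp(scrub(raw))                                    # tokenization
segments = [filter_tokens(seg) for seg in cut(doc, 6)]   # cutting, then filtering
print([[t.text for t in seg] for seg in segments])
```

In Lexos itself each step is configurable (for example, cutting on milestones rather than fixed token counts); this sketch only shows how the stages feed one another.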