Keywords in Context (KWIC)
Overview
Keywords in Context (KWIC) is a common method of displaying every occurrence of a term in a document together with the text immediately before and after it. Lexos provides sophisticated methods of searching for keywords that make the process easy.
Basic Usage
The basic procedure is as follows:
# Import the Kwic class
from lexos.kwic import Kwic
# Define a text
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife."
# Define a pattern
pattern = "universally"
# Create an instance of a Kwic object
kwic = Kwic()
# Pass the object the desired parameters
kwic(docs=text, patterns=pattern, window=10)
This will display
| | doc | context_before | keyword | context_after |
|---|---|---|---|---|
| 0 | Doc 1 | s a truth | universally | acknowled |
You will notice that the keywords docs and patterns are plural. This is because the Kwic class also accepts multiple documents and lists of patterns. A document, in this case, can be a raw text string, but spaCy Doc objects are also accepted. Patterns may be raw strings or regex patterns.
By default, the window keyword provides a window of n characters on either side of each keyword found.
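To make the character window concrete, here is a minimal sketch in plain Python. It is not the Lexos implementation: the helper name kwic_chars is hypothetical and it uses only the standard re module. It reproduces the "before" and "after" strings shown in the table above for a window of 10.

```python
import re

def kwic_chars(text, pattern, window=10):
    """Return (before, keyword, after) tuples using a character-based window.

    Hypothetical helper for illustration only; not part of Lexos.
    """
    hits = []
    for m in re.finditer(pattern, text):
        # Slice up to `window` characters on each side of the match.
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        hits.append((before, m.group(), after))
    return hits

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife.")
hits = kwic_chars(text, "universally", window=10)
print(hits)  # [('s a truth ', 'universally', ' acknowled')]
```

Note that the window is clipped at the start and end of the text, so a match near either edge simply gets a shorter context.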
The standard output is a pandas DataFrame. You can either display this DataFrame directly, or assign it to a variable for further processing.
By default, your documents will be labelled "Doc 1", "Doc 2", "Doc 3", etc. However, you can supply a list of labels with the labels keyword.
Here are some other useful parameters:
- case_sensitive: If set to False, the Kwic class will perform case-insensitive searches.
- sort_by: If you wish to sort the DataFrame, use this keyword to set the column for sorting.
- ascending: Set to True or False to set the order for sorting the DataFrame.
- as_df: If set to False, the output will be a list of tuples, where the first item is the "before" window, the second item is the matched pattern, and the third item is the "after" window.
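The effect of these options can be sketched without Lexos. The example below is an illustration under assumptions, not the library's code: kwic_hits is a hypothetical helper, case-insensitive search is modelled with re.IGNORECASE, the tuple output mirrors the as_df=False description, and sorting is done with Python's sorted as a stand-in for sorting a DataFrame column.

```python
import re

def kwic_hits(text, pattern, window=10, case_sensitive=True):
    """Hypothetical helper: return (before, keyword, after) tuples."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return [
        (text[max(0, m.start() - window):m.start()],
         m.group(),
         text[m.end():m.end() + window])
        for m in re.finditer(pattern, text, flags)
    ]

text = "A single man met a Single woman."

# Case-sensitive search finds only the lowercase occurrence.
sensitive = kwic_hits(text, "single", window=5)

# Case-insensitive search finds both occurrences.
insensitive = kwic_hits(text, "single", window=5, case_sensitive=False)

# Sort on the "after" window (tuple index 2) in descending order,
# analogous to choosing a sort column and setting ascending to False.
ordered = sorted(insensitive, key=lambda hit: hit[2], reverse=True)
print(ordered)
```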
Searching Token Patterns
If you have spaCy Docs, you may wish to search for token patterns. A simple way to do this is to set the matcher to "tokens".
Consider the example above. If we have a spaCy Doc, we can pass it to the Kwic class with matcher set to "tokens", which produces output such as
| | doc | context_before | keyword | context_after |
|---|---|---|---|---|
| 0 | Doc 1 | that a single man in | possession | of a good fortune must |
Now the Kwic class will search for the pattern as a token, and the "before" and "after" windows will be counted in tokens, rather than characters.
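A token-based window can be sketched as follows. This is only an approximation: the tokenize function is a crude regex stand-in for spaCy's tokenizer, and kwic_tokens is a hypothetical helper, not a Lexos function. With a window of 5 it yields the same context tokens as the table above.

```python
import re

def tokenize(text):
    # Crude stand-in for a spaCy tokenizer (an assumption for this sketch):
    # runs of word characters and single punctuation marks become tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def kwic_tokens(tokens, keyword, window=5):
    """Hypothetical helper: return (before, keyword, after) with token windows."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            hits.append((before, tok, after))
    return hits

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife.")
tokens = tokenize(text)
hits = kwic_tokens(tokens, "possession", window=5)
print(hits)
```

Because the window counts tokens, punctuation marks such as the comma take up one slot each.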
Setting the use_regex keyword to True lets us supply a regular expression as the pattern:
patterns = r".ingle"
kwic(docs=doc, labels=None, patterns=patterns, window=5, matcher="tokens", use_regex=True, case_sensitive=False)
This will find any token containing a character followed by "ingle" (and the search will be case-insensitive).
| | doc | context_before | keyword | context_after |
|---|---|---|---|---|
| 0 | Doc 1 | universally acknowledged, that a | single | man in possession of a |
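The idea of testing a regular expression against each token can be sketched in plain Python. This is an assumption about the general approach, not Lexos's internals: the tokenizer is a simple regex stand-in for spaCy, and re.search is used because the description says the token only needs to contain the pattern.

```python
import re

def tokenize(text):
    # Simplified tokenizer used only for this sketch (not spaCy's).
    return re.findall(r"\w+|[^\w\s]", text)

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife.")
tokens = tokenize(text)

# Compile once; IGNORECASE mirrors a case-insensitive search.
rx = re.compile(r".ingle", re.IGNORECASE)
window = 5

hits = [
    (tokens[max(0, i - window):i], tok, tokens[i + 1:i + 1 + window])
    for i, tok in enumerate(tokens)
    if rx.search(tok)  # the token must contain a match somewhere
]
print(hits)
```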
We can also perform more sophisticated token-based searches using spaCy's rule-matching syntax. To use it, we set the matcher parameter to "rule". See the spaCy documentation for details of how to construct rules for token-based matching.
pattern1 = [{"LOWER": "truth"}, {"LOWER": "universally"}, {"LOWER": "acknowledged"}]
pattern2 = [{"TEXT": "possession"}]
patterns = [pattern1, pattern2]
kwic(docs=doc, patterns=patterns, window=5, matcher="rule")
| | doc | context_before | keyword | context_after |
|---|---|---|---|---|
| 0 | Doc 1 | It is a | truth universally acknowledged | , that a single man |
| 1 | Doc 1 | that a single man in | possession | of a good fortune must |
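To show what such rules express, here is a toy matcher in plain Python. It is deliberately not spaCy's Matcher: it supports only the TEXT and LOWER keys, works on plain strings from a simplified tokenizer, and the helper names token_matches and find_rule are hypothetical. Each rule dict describes one token, and a pattern matches where consecutive tokens satisfy consecutive rules.

```python
import re

def tokenize(text):
    # Simplified tokenizer for this sketch (not spaCy's).
    return re.findall(r"\w+|[^\w\s]", text)

def token_matches(token, rule):
    """Check one token against one rule dict (only TEXT and LOWER supported)."""
    checks = {
        "TEXT": lambda t, v: t == v,
        "LOWER": lambda t, v: t.lower() == v,
    }
    return all(checks[key](token, value) for key, value in rule.items())

def find_rule(tokens, pattern):
    """Yield (start, end) spans where the token sequence satisfies the pattern."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        if all(token_matches(tokens[start + k], pattern[k]) for k in range(n)):
            yield start, start + n

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife.")
tokens = tokenize(text)

pattern1 = [{"LOWER": "truth"}, {"LOWER": "universally"}, {"LOWER": "acknowledged"}]
pattern2 = [{"TEXT": "possession"}]

spans = [span for p in (pattern1, pattern2) for span in find_rule(tokens, p)]
print(spans)  # [(3, 6), (12, 13)]
```

The spans are token offsets, so the five-token context windows in the table can be read off as slices on either side of each span.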
Finally, we can also search multi-token patterns using spaCy's PhraseMatcher by setting matcher to "phrase".
patterns = ["truth universally acknowledged", "possession"]
kwic(docs=doc, patterns=patterns, window=5, matcher="phrase")
| | doc | context_before | keyword | context_after |
|---|---|---|---|---|
| 0 | Doc 1 | It is a | truth universally acknowledged | , that a single man |
| 1 | Doc 1 | that a single man in | possession | of a good fortune must |
This differs from the previous example because the patterns are first pre-processed into Doc objects, which can be significantly faster to process if you have large numbers of patterns.
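The benefit of pre-processing patterns can be illustrated with a small sketch. It does not reproduce how spaCy's PhraseMatcher works internally. It only assumes the general idea that each phrase is tokenized once up front, so that every candidate span in the document can be checked with a fast set lookup. The tokenizer and variable names are hypothetical.

```python
import re

def tokenize(text):
    # Simplified tokenizer for this sketch (not spaCy's).
    return re.findall(r"\w+|[^\w\s]", text)

phrases = ["truth universally acknowledged", "possession"]

# Pre-process each phrase once into a tuple of tokens, grouped by length.
by_length = {}
for phrase in phrases:
    toks = tuple(tokenize(phrase))
    by_length.setdefault(len(toks), set()).add(toks)

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife.")
tokens = tokenize(text)

# Slide a window of each phrase length over the document; set membership
# is constant time on average, however many phrases share that length.
spans = []
for n, wanted in by_length.items():
    for start in range(len(tokens) - n + 1):
        if tuple(tokens[start:start + n]) in wanted:
            spans.append((start, start + n))
spans.sort()
print(spans)  # [(3, 6), (12, 13)]
```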