## Overview
The `Corpus` module consists of a `Corpus` class that helps you manage assets in your workflow and serialize them to disk for later use. It is strictly optional; you may find it sufficient to load your documents into memory with a `Loader` or to manage your corpus assets independently through a different application.
It is important to realise that a Lexos `Corpus` is primarily a manager for project assets; it is not used for acquiring those assets and is not used for analysing them, apart from the generation of a few statistics. In general, using a `Corpus` will require a workflow like the following:
- Use `Loader` to acquire texts.
- Optionally, use `Scrubber` to perform any required preprocessing on the texts in `Loader.texts`.
- Optionally, use `Tokenizer` to convert the original or scrubbed texts to spaCy `Doc` objects.
- Add the documents (texts or spaCy `Doc` objects) to the `Corpus`.
If you wished to analyse the documents, you would get them from the `Corpus` and run them through `Tokenizer` if you did not do so before adding them.
From this workflow, you should be able to see that you can skip the `Corpus` entirely. The `Corpus` simply allows you to attach metadata to the documents, such as a name, description, or classification label, and to save them to and retrieve them from disk easily. One of the important metadata categories is whether or not a document is active. A `Corpus` allows you to retrieve subsets of your documents based on this and other metadata categories.
## Creating a Corpus
Begin by importing the `Corpus` module:

```python
import lexos.corpus as cp
```
We use the `cp` alias so that we can use `corpus` as a variable below:

```python
corpus = cp.Corpus(
    name="My Corpus",
    description="My test corpus",
    corpus_dir="corpus"
)
```
The `name`, `description`, and `corpus_dir` arguments are all optional. `corpus_dir` is the directory where the corpus will be stored; the default is "corpus" in the current directory. You can use additional keywords to instantiate the corpus with other metadata such as an author or creation date. When you run the code above, the corpus directory will be created if it does not already exist.
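For example (a hedged sketch: the `author` and `created` keyword names below are illustrative, not documented fields):

```python
# Per the note above, extra keywords are stored as corpus metadata.
# The "author" and "created" names here are illustrative.
corpus = cp.Corpus(
    name="My Corpus",
    description="My test corpus",
    corpus_dir="corpus",
    author="John Smith",
    created="2011",
)
```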
You can view your corpus metadata in a number of ways. `Corpus.meta_table()` will return an overview of your corpus as a pandas dataframe. You can also view this information as a dict with `corpus.meta`. The following individual properties can also be viewed:
- `ids`: A list of document ids in the Corpus.
- `names`: A list of document names in the Corpus.
- `docs`: A dict of document ids and docs in the Corpus if you have opted to cache them to RAM.
- `num_docs`: The number of docs in the Corpus.
- `num_active_docs`: The number of active docs in the Corpus.
- `num_tokens`: The number of tokens in the Corpus.
- `num_terms`: The number of terms in the Corpus.
- `terms`: A set of unique terms in the Corpus.
Access these properties with code like `Corpus.num_tokens`. You can also get a Python `collections.Counter` object containing the corpus term counts with `Corpus.get_term_counts()`.
These attributes should all be empty or 0 when the corpus is first created.
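For example, you can sanity-check a freshly created corpus like this:

```python
# All counts should be 0 and all collections empty for a new corpus
print(corpus.num_docs)    # 0
print(corpus.num_tokens)  # 0
print(corpus.terms)       # set()

# Overview of the corpus as a pandas dataframe
df = corpus.meta_table()

# Term counts as a collections.Counter (empty at this point)
counts = corpus.get_term_counts()
```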
**Note**

The `Corpus` class is constructed using Pydantic's `BaseModel` class. This means that it has access to any of Pydantic's attributes and methods, such as `dict()` and `json()`.
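Since `Corpus` subclasses Pydantic's `BaseModel`, the standard serialization methods are available:

```python
# Standard Pydantic (v1) serialization methods
corpus_dict = corpus.dict()  # corpus metadata as a Python dict
corpus_json = corpus.json()  # corpus metadata as a JSON string
```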
## Corpus Records
The basic unit of storage in a `Corpus` is a `Record` object. This is a Python object that provides access to the record's content and its metadata. Constructing a `Record` is simple; you just have to feed it some `content` and, in most cases, give it a name:

```python
from lexos.corpus import Record

record = Record(content=mydoc, name="greeting")
```
Behind the scenes, the `Record` class will give the record a default `id` of 1 (unless you specify a different integer) and set the `is_active` property to `True` (unless you instantiate the object with it set to `False`). See `lexos.corpus.Record` for other arguments that can be passed to the `Record` class.
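For example, to override these defaults at instantiation:

```python
# Give the record an explicit id and create it in an inactive state
record = Record(content=mydoc, name="greeting", id=2, is_active=False)
```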
You can also create a `Record` from a dict using Pydantic's `parse_obj()` method:

```python
record = Record.parse_obj({"content": mydoc, "name": "greeting"})
```
See the Pydantic documentation for helper functions that parse JSON or file content into objects.
Once instantiated, a record provides access to the following information:

- `preview`: A preview of the first 50 characters of the record's text.
- `terms`: A set of unique terms in the record's content.
- `text`: The full text of the record's content.
- `tokens`: A list of tokens in the record's content.
- `num_terms`: The number of unique terms in the record's content.
- `num_tokens`: The number of tokens in the record's content.
**Important**

Term counts do not collapse upper- and lower-case words, so, if this distinction matters, you must get the tokens, convert them to lower case, and then generate the list of terms yourself. Alternatively, you may use `Scrubber` to preprocess your data before creating the `Record` object.
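A minimal sketch of the first approach, assuming `record.tokens` yields plain strings (if it yields spaCy `Token` objects, use `token.text.lower()` instead):

```python
# Case-fold the tokens before collecting unique terms.
# Assumes record.tokens is a list of strings; substitute
# token.text.lower() if the tokens are spaCy Token objects.
lowercase_terms = {token.lower() for token in record.tokens}
num_folded_terms = len(lowercase_terms)
```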
`Record.set()` allows you to set arbitrary `Record` attributes (such as author or date), and `Record.save()` allows you to save the record to disk.
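The exact signatures are not shown in this guide; the sketch below assumes that `Record.set()` accepts a dict of attributes, by analogy with `Corpus.set()` described later, and that `Record.save()` uses the record's stored filename when called without arguments:

```python
# Hedged sketch: a dict of arbitrary attributes is assumed here
record.set({"author": "John Smith", "date": "2011"})

# Serialize the record to disk (assumed to use the record's filename)
record.save()
```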
**Important**

When a `Record` object is saved to disk, it is serialized as a binary pickle file, which is not human readable. To restore it, you use the normal Python method of reading a binary file:

```python
import pickle

with open(filename, "rb") as f:
    record = pickle.load(f)
```

The pickle format is not considered secure, so never unpickle a file you do not trust.
In the latest version of spaCy, it is possible to serialize to JSON, but these methods have not yet been integrated into the Lexos API.
The `Record` class accepts content only in the form of a pre-tokenized spaCy doc. However, it is possible to store an untokenized text by creating a blank spaCy language model and feeding it the `lexos.corpus.NullTokenizer` class. This simply returns a spaCy doc with the text as a single token.

```python
import spacy

from lexos.corpus import NullTokenizer, Record

# Build a blank pipeline whose tokenizer leaves the text whole
nlp = spacy.blank("xx")
nlp.tokenizer = NullTokenizer(nlp.vocab)

# content is an untokenized text string
content = nlp(content)
record = Record(content=content)
```

Note that the entire text will be counted as a single token and a single term, so it is preferable to tokenize the text first or to plan to do so later.
If the content is already a tokenized document, it is necessary to label it as such in the metadata. Here is an example of how you would do it:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hi there!")
record = Record(content=doc, name="greeting", is_parsed=True)
```
The `is_parsed` attribute allows `Corpus` to know that it is dealing with a tokenized document. You can still access the full text by calling `record.text`, but you can also access individual tokens by calling `record.content[0]` (to get the first token).
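For example, with the record created above:

```python
# Full text of the record
print(record.text)        # "Hi there!"

# Individual spaCy tokens by index
first_token = record.content[0]
print(first_token.text)   # "Hi"
```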
If you want a dictionary with a record's full metadata, probably the easiest method is `metadata = record.dict(exclude={"content"})`.
**Why serialize records with `pickle`?**

A Lexos `Record` is a Python object which contains a spaCy `Doc` object, which in turn contains spaCy `Token` objects. This complex structure creates a scenario which cannot be handled by other serialization formats without some serious hacks. There are some concerns about whether serialization and de-serialization will be fast enough when working with many records in a corpus (and lesser concerns about the security of the format), but for the moment it is the easiest and most straightforward format to work with. This is something to be revisited at a future date, especially now that spaCy has added a `Doc.to_json()` method.
## Adding Records to a Corpus
Adding records to a corpus is simple with `Corpus.add_record()`:

```python
record = Record(content=doc, name="greeting", is_parsed=True)
corpus.add_record(record, cache=True)
```
There is also a `Corpus.add_records()` method, which takes a list of records.
By default, the record's content is not cached in memory; instead, the entire record is serialized to disk. If you want to keep it in memory, you can set `cache=True` (as above). This will allow you to access the record from `corpus.docs` without having to fetch it from disk.
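For example, assuming `corpus.docs` is keyed by record id (as described in the properties list above):

```python
# Read the cached doc back from memory instead of from disk
cached_doc = corpus.docs[record.id]
```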
**Note**

At present, the `docs` property in the `Corpus` class is the only place where a clear distinction between a "record" and a "document" is made.
## Adding Documents to a Corpus
It is not necessary to pre-generate records from documents before adding them to a corpus. You can also use `Corpus.add()` to add a document directly:

```python
# Use a text string
unparsed_doc = "Hi there!"
corpus.add(content=unparsed_doc, name="greeting")

# Use a parsed spaCy doc
parsed_doc = nlp("This is a full sentence.")
corpus.add(content=parsed_doc, name="sentence", is_parsed=True)
```
By default, the `is_active` attribute is `True`.
You can set additional metadata properties by supplying a `metadata` dict:

```python
metadata = {"author": "John Smith", "date": "2011"}

corpus.add(
    content=parsed_doc,
    name="sentence",
    is_parsed=True,
    metadata=metadata,
    cache=True
)
```
The corresponding `Corpus.add_docs()` allows you to insert multiple documents. The format is a little more complicated: it takes a list of dicts with the document as the value of the `content` key:

```python
docs = [
    {"content": doc1, "name": "Doc 1"},
    {"content": doc2, "name": "Doc 2"},
]
corpus.add_docs(docs, cache=True)
```

All the arguments accepted by `Corpus.add()` can be set as keys in each of the `docs` dicts.
**Important**

Whether you are adding documents or records to a corpus, a check is made to ensure that the stored records have unique `id`, `name`, and `filename` attributes. If you do not specify a `name` for a document or record, a UUID will be used instead, and it will also be used to generate a corresponding filename. The results can be unwieldy. In the future, some other method of ensuring uniqueness will be explored.
## Getting Records from the Corpus
Individual records can be fetched using `Corpus.get()` with a record `id`:

```python
record = corpus.get(1)
doc = record.content
```

The second line above extracts the spaCy doc from the record, and it can be treated like any spaCy doc.
You can also supply a list of ids to `Corpus.get_records()`. If you pass nothing to the method, all the records in the corpus will be retrieved.
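For example:

```python
# Fetch specific records by id
records = corpus.get_records([1, 2, 3])

# Fetch every record in the corpus
all_records = corpus.get_records()
```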
If you do not know the id(s) of the document(s) you want, you can provide a query to `Corpus.get_records()`:

```python
records = corpus.get_records(query="id < 10")
for record in records:
    print(record.name)
```

This will yield a generator over the records with an `id` less than 10.
**Note**

On the back end, `Corpus.get()` and `Corpus.get_records()` call `Corpus.meta`, which contains a subset of the metadata for each record. A pandas dataframe is constructed from this metadata. The `query` can therefore be anything acceptable to `pandas.DataFrame.query()`. This allows complex queries to be performed on the corpus.
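For example, conditions can be combined with pandas query syntax. The sketch below assumes `is_active` and `num_tokens` are among the metadata columns, as the records table described in the next section suggests:

```python
# Any expression accepted by pandas.DataFrame.query() will work
records = corpus.get_records(query="is_active == True and num_tokens > 100")
```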
If you want just the metadata for a record, probably the easiest method is `metadata = corpus.get(1).dict(exclude={"content"})`.
## Viewing the Records Table
`Corpus.records_table()` generates a pandas dataframe with each record in a separate row. By default, the `id`, `name`, `filename`, `num_tokens`, `num_terms`, `is_active`, and `is_parsed` attributes are displayed as columns. You can supply your own list of columns with the `columns` argument, or you can exclude specific columns with the `exclude` argument.
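For example:

```python
# Show only selected columns
df = corpus.records_table(columns=["id", "name", "num_tokens"])

# Or show the default columns minus the filename
df = corpus.records_table(exclude=["filename"])
```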
## Setting Record Properties
After a corpus is instantiated, you can set the properties of individual records with `Corpus.set()`:

```python
corpus.set(1, {"name": "John Smith"})
```
## Removing Records
`Corpus.remove()` and `Corpus.remove_records()` can be used to remove records from a corpus. The former takes an id number, and the latter takes a list of ids.
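For example:

```python
# Remove a single record by id
corpus.remove(1)

# Remove several records by id
corpus.remove_records([2, 3])
```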
## Using Records
Typically, you would retrieve records using `Corpus.get_records()` and then pass their content to another Lexos module. For example, here is how you would create a document-term matrix:

```python
from lexos.dtm import DTM

# Get the records (wrapped in a list so they can be iterated more than once)
records = list(corpus.get_records())

# Extract the documents and labels
docs = [record.content for record in records]
labels = [record.name for record in records]

# Build the document-term matrix
dtm = DTM(docs, labels)
```