LexosDoc
A wrapper class for a spaCy doc which allows for extra methods.
A convenience that allows you to use Doc extensions without the underscore prefix.
Note
There is probably no need for this class. We can just keep a library of functions in a file called tokenizer.py and import them. If certain functions get used commonly, they can be turned into Doc extensions.
lexos.tokenizer.lexosdoc.LexosDoc
Source code in lexos\tokenizer\lexosdoc.py (lines 13–227).
__init__(doc)
Initialize a LexosDoc object.
Source code in lexos\tokenizer\lexosdoc.py (lines 25–30).
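A minimal construction sketch (the en_core_web_sm pipeline and the sample sentence are only illustrative; any spaCy Doc will do):

```python
import spacy

from lexos.tokenizer.lexosdoc import LexosDoc

# Build an ordinary spaCy Doc with whatever pipeline is installed
# (en_core_web_sm is an assumption here).
nlp = spacy.load("en_core_web_sm")
doc = nlp("The end is the beginning, and the beginning is the end.")

# Wrap it so the extra LexosDoc methods are available.
lexos_doc = LexosDoc(doc)
```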
get_term_counts(limit=None, start=0, end=None, filters=None, regex=False, normalize=False, normalize_with_filters=False, as_df=False)
Get a list of word counts for each token in the doc.
Parameters:

Name | Type | Description | Default
---|---|---|---
self | object | A spaCy doc. | required
limit | int | The maximum number of tokens to count. | None
start | Any | The index of the first token to count. | 0
end | Any | The index of the last token to count after limit is applied. | None
filters | List[Union[Dict[str, str], str]] | A list of Doc attributes to ignore. | None
regex | bool | Whether to match the dictionary value using regex. | False
normalize | bool | Whether to return raw counts or relative frequencies. | False
normalize_with_filters | bool | Whether to normalize based on the number of tokens after filters are applied. | False
as_df | bool | Whether to return a pandas dataframe. | False
Returns:

Type | Description
---|---
Union[List, pd.DataFrame] | A list of word count tuples for each token in the doc. Alternatively, a pandas dataframe.
Source code in lexos\tokenizer\lexosdoc.py (lines 32–90).
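A usage sketch continuing the construction example above. The exact filter format is an assumption based on the documented List[Union[Dict[str, str], str]] type; consult the source for the accepted values.

```python
# Raw counts for the first ten terms.
counts = lexos_doc.get_term_counts(limit=10)

# Relative frequencies as a pandas dataframe, ignoring punctuation tokens.
# The "is_punct" filter string is an assumption based on the documented
# filters type; check the source for the exact accepted format.
df = lexos_doc.get_term_counts(
    filters=["is_punct"],
    normalize=True,
    normalize_with_filters=True,
    as_df=True,
)
```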
get_token_attrs()
Get a list of attributes for each token in the doc.
Returns a dict with "spacy_attributes" and "extensions".
Note: This function relies on sampling the first token in a doc to compile the list of attributes. It does not check for consistency. Currently, it is up to the user to reconcile inconsistencies between docs.
Source code in lexos\tokenizer\lexosdoc.py (lines 96–111).
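A short sketch of inspecting the available attributes, continuing the example above (the two dictionary keys are the ones named in the description):

```python
# Compile the attribute names available on this doc's tokens.
attrs = lexos_doc.get_token_attrs()

# The returned dict has "spacy_attributes" and "extensions" keys.
print(attrs["spacy_attributes"])
print(attrs["extensions"])
```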
get_tokens()
Return a list of tokens in the doc.
Source code in lexos\tokenizer\lexosdoc.py (lines 92–94).
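For example, continuing the sketch above:

```python
# Retrieve the tokens from the wrapped doc.
tokens = lexos_doc.get_tokens()
print(tokens[:10])
```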
to_dataframe(cols=['text'], show_ranges=True)
Get a pandas dataframe of the doc attributes.
Parameters:

Name | Type | Description | Default
---|---|---|---
cols | List[str] | A list of columns to include in the dataframe. | ['text']
show_ranges | bool | Whether to include the token start and end positions in the dataframe. | True
Returns a pandas dataframe of the doc attributes.
Note: It is a good idea to call LexosDoc.get_token_attrs() first to check which attributes are available for the doc.
Source code in lexos\tokenizer\lexosdoc.py (lines 113–139).
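A sketch of building the dataframe, continuing the example above. The pos_ and lemma_ columns are assumptions based on standard spaCy token attributes; call get_token_attrs() first to confirm they exist for your doc.

```python
# Build a dataframe with a few token attributes and the token start/end ranges.
# "pos_" and "lemma_" are standard spaCy token attributes and are assumed to be
# present for the loaded pipeline; confirm with get_token_attrs().
df = lexos_doc.to_dataframe(cols=["text", "pos_", "lemma_"], show_ranges=True)
print(df.head())
```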