Skip to content

Utils¤

The utils component of Scrubber contains helper functions shared by the other components.

lexos.scrubber.utils.get_tags(text) ¤

Get information about the tags in a text.

Parameters:

Name Type Description Default
text str

The text to be analyzed.

required

Returns:

Name Type Description
dict dict

A dict with the keys "tags" and "attributes". "Tags is a list of unique tag names

dict

in the data and "attributes" is a list of dicts containing the attributes and values

dict

for those tags that have attributes.

Note

The procedure tries to parse the markup as well-formed XML using ETree; otherwise, it falls back to BeautifulSoup's parser.

Source code in lexos\scrubber\utils.py
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
def get_tags(text: str) -> dict:
    """Get information about the tags in a text.

    Args:
        text (str): The text to be analyzed.

    Returns:
        dict: A dict with the keys "tags" and "attributes". "Tags is a list of unique tag names
        in the data and "attributes" is a list of dicts containing the attributes and values
        for those tags that have attributes.

    Note:
        The procedure tries to parse the markup as well-formed XML using ETree; otherwise, it falls back to BeautifulSoup's parser.
    """
    import json
    import re
    from xml.etree import ElementTree

    from natsort import humansorted

    tags = []
    attributes = []

    try:
        root = ElementTree.fromstring(text)
        for element in root.iter():
            if re.sub("{.+}", "", element.tag) not in tags:
                tags.append(re.sub("{.+}", "", element.tag))
            if element.attrib != {}:
                attributes.append({re.sub("{.+}", "", element.tag): element.attrib})
        tags = humansorted(tags)
        attributes = json.loads(json.dumps(attributes, sort_keys=True))
    except ElementTree.ParseError:
        import bs4
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(text, "xml")
        for e in soup:
            if isinstance(e, bs4.element.ProcessingInstruction):
                e.extract()
        [tags.append(tag.name) for tag in soup.find_all()]
        [attributes.append({tag.name: tag.attrs}) for tag in soup.find_all()]
        tags = humansorted(tags)
        attributes = json.loads(json.dumps(attributes, sort_keys=True))
    return {"tags": tags, "attributes": attributes}