Tags
A set of functions to remove or replace HTML/XML tags, elements, attributes, comments, and document type declarations using Beautiful Soup.
remove_attribute(text: str, selector: str, attribute: str = None, mode: str = 'html', matcher_type: str = 'exact', attribute_value: Optional[str] = None, attribute_filter: Optional[str] = None) -> str
Removes attributes from HTML/XML elements.
Removes specified attributes from elements matching the selector. Can filter elements by specific attribute or attribute value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `selector` | `str` | Tag name or CSS selector to match elements | required |
| `attribute` | `str` | Attribute name to remove. | `None` |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
| `matcher_type` | `str` | Type of match to perform, either "exact", "contains", or "regex" | `'exact'` |
| `attribute_value` | `Optional[str]` | Optional value for the attribute filter | `None` |
| `attribute_filter` | `Optional[str]` | Optional attribute name to filter elements | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | Processed text with attributes removed from matching elements |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> text = '<div class="main" id="content">Text</div>'
>>> remove_attribute(text, "div", "class")
'<div id="content">Text</div>'
>>> text = '<p class="a">Keep</p><p class="b" id="x">Remove attrs</p>'
>>> remove_attribute(text, "p", attribute_filter="class", attribute_value="b")
'<p class="a">Keep</p><p>Remove attrs</p>'
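The single-attribute case from the first example can be approximated directly with Beautiful Soup. This is a minimal sketch rather than the Lexos implementation: `strip_attribute` is a hypothetical helper, it matches by tag name only (no `matcher_type` or attribute filter), and it assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup


def strip_attribute(text: str, tag_name: str, attribute: str) -> str:
    """Hypothetical helper: delete one attribute from every <tag_name> element."""
    soup = BeautifulSoup(text, "html.parser")
    for element in soup.find_all(tag_name):
        if element.has_attr(attribute):
            del element[attribute]  # Tag objects support dict-style deletion
    return str(soup)


result = strip_attribute('<div class="main" id="content">Text</div>', "div", "class")
print(result)  # <div id="content">Text</div>
```

Elements that lack the attribute are left untouched, so the helper is safe to run over mixed markup.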
Source code in lexos/scrubber/tags.py
remove_comments(text: str, mode: str = 'html') -> str
Removes comments from HTML or XML text.
Uses BeautifulSoup to find and remove all comments from HTML or XML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
Returns:
| Type | Description |
|---|---|
| `str` | String containing the HTML/XML content with all comments removed |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> html = '<!-- Header comment --><div>Content</div><!-- Footer -->'
>>> remove_comments(html)
'<div>Content</div>'
>>> xml = '<?xml version="1.0"?><!-- Config --><root>Data</root>'
>>> remove_comments(xml, mode="xml")
'<?xml version="1.0"?><root>Data</root>'
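Comment removal with Beautiful Soup usually relies on its `Comment` string class. The sketch below shows that idiom under stated assumptions: `strip_comments` is a hypothetical name, only the `html.parser` backend is used, and nothing here is taken from the Lexos source.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(text: str) -> str:
    """Hypothetical helper: remove every comment node from the parsed tree."""
    soup = BeautifulSoup(text, "html.parser")
    # Comments are a special kind of string node, so filter strings by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()  # detach the comment from the tree
    return str(soup)


html = '<!-- Header comment --><div>Content</div><!-- Footer -->'
print(strip_comments(html))  # <div>Content</div>
```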
remove_doctype(text: str) -> str
Removes a document type declaration from HTML or XML text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
Returns:
| Type | Description |
|---|---|
| `str` | String containing the HTML/XML content with document type declaration removed |
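No example accompanies this function, so here is one way the behavior could look with Beautiful Soup's `Doctype` class. It is only a sketch: `strip_doctype` is a hypothetical helper, it assumes the `html.parser` backend, and the real function may work differently (for instance with a regular expression).

```python
from bs4 import BeautifulSoup, Doctype


def strip_doctype(text: str) -> str:
    """Hypothetical helper: drop top-level document type declarations."""
    soup = BeautifulSoup(text, "html.parser")
    # A doctype is parsed as a top-level Doctype node, so scan soup.contents.
    for node in list(soup.contents):
        if isinstance(node, Doctype):
            node.extract()
    return str(soup)


page = "<!DOCTYPE html><html><body><p>Hi</p></body></html>"
print(strip_doctype(page))  # <html><body><p>Hi</p></body></html>
```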
remove_element(text: str, selector: str, mode: str = 'html', matcher_type: str = 'exact', attribute: str = None, attribute_value: str = None) -> str
Removes HTML/XML elements using BeautifulSoup.
Removes elements that match the given selector from HTML or XML text. Can further filter elements by specific attribute or attribute value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `selector` | `str` | Tag name or CSS selector to match elements for removal | required |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
| `matcher_type` | `str` | Type of match to perform, either "exact", "contains", or "regex" | `'exact'` |
| `attribute` | `str` | Optional attribute name to filter elements | `None` |
| `attribute_value` | `str` | Optional value for the attribute filter | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | Processed text with matching elements removed |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> text = "<p class='a'>Keep</p><p class='b'>Remove</p><div>Remove</div>"
>>> remove_element(text, "div")
"<p class='a'>Keep</p><p class='b'>Remove</p>"
>>> remove_element(text, "p", attribute="class", attribute_value="b")
"<p class='a'>Keep</p><div>Remove</div>"
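Removing an element together with its content corresponds to Beautiful Soup's `decompose()`. The following sketch illustrates that call with an optional attribute filter. `drop_elements` is a hypothetical helper that matches by tag name only and assumes `html.parser`. Beautiful Soup serializes attribute values with double quotes, so the output quoting differs from the input.

```python
from typing import Optional

from bs4 import BeautifulSoup


def drop_elements(
    text: str,
    tag_name: str,
    attribute: Optional[str] = None,
    attribute_value: Optional[str] = None,
) -> str:
    """Hypothetical helper: delete matching elements together with their content."""
    soup = BeautifulSoup(text, "html.parser")
    attrs = {}
    if attribute is not None:
        # True matches any value of the attribute; a string must match the value.
        attrs[attribute] = True if attribute_value is None else attribute_value
    for element in soup.find_all(tag_name, attrs=attrs):
        element.decompose()  # removes the tag and everything inside it
    return str(soup)


text = "<p class='a'>Keep</p><p class='b'>Remove</p><div>Remove</div>"
print(drop_elements(text, "div"))
# <p class="a">Keep</p><p class="b">Remove</p>
print(drop_elements(text, "p", attribute="class", attribute_value="b"))
# <p class="a">Keep</p><div>Remove</div>
```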
remove_tag(text: str, selector: str, mode: str = 'html', matcher_type: str = 'exact', attribute: str = None, attribute_value: str = None) -> str
Removes HTML/XML tags but keeps their inner content.
Removes tags matching the selector while preserving their inner content. Can filter elements by specific attribute or attribute value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `selector` | `str` | Tag name or CSS selector to match elements for unwrapping | required |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
| `matcher_type` | `str` | Type of match to perform, either "exact", "contains", or "regex" | `'exact'` |
| `attribute` | `str` | Optional attribute name to filter elements | `None` |
| `attribute_value` | `str` | Optional value for the attribute filter | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | Processed text with matching tags unwrapped but content preserved |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> text = "<div><p>Keep this</p></div><span>And this</span>"
>>> remove_tag(text, "div")
'<p>Keep this</p><span>And this</span>'
>>> text = "<p class='a'>Keep tag</p><p class='b'>Remove tag only</p>"
>>> remove_tag(text, "p", attribute="class", attribute_value="b")
"<p class='a'>Keep tag</p>Remove tag only"
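The difference from `remove_element` is that only the tag is dropped while its children stay in place, which is what Beautiful Soup's `unwrap()` does. A minimal sketch, assuming `html.parser` and a hypothetical helper named `unwrap_tags` that is not part of Lexos:

```python
from typing import Dict, Optional

from bs4 import BeautifulSoup


def unwrap_tags(text: str, tag_name: str, attrs: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical helper: remove matching tags but keep their inner content."""
    soup = BeautifulSoup(text, "html.parser")
    for element in soup.find_all(tag_name, attrs=attrs or {}):
        element.unwrap()  # replaces the tag with its own children
    return str(soup)


print(unwrap_tags("<div><p>Keep this</p></div><span>And this</span>", "div"))
# <p>Keep this</p><span>And this</span>
```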
replace_attribute(text: str, selector: str, old_attribute: str, new_attribute: str, mode: str = 'html', matcher_type: str = 'exact', attribute_value: Optional[str] = None, replace_value: Optional[str] = None, attribute_filter: Optional[str] = None, filter_value: Optional[str] = None) -> str
Replaces HTML/XML element attributes or their values.
This function finds elements matching the selector and replaces attribute names or attribute values. It can filter elements by a specific attribute/value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `selector` | `str` | Tag name or CSS selector to match elements | required |
| `old_attribute` | `str` | Name of the attribute to replace | required |
| `new_attribute` | `str` | Name of the new attribute (or same name if only changing value) | required |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
| `matcher_type` | `str` | Type of match to perform, either "exact", "contains", or "regex" | `'exact'` |
| `attribute_value` | `Optional[str]` | Only replace attributes with this specific value | `None` |
| `replace_value` | `Optional[str]` | New value to use (keeps original value if None) | `None` |
| `attribute_filter` | `Optional[str]` | Optional attribute name to filter elements | `None` |
| `filter_value` | `Optional[str]` | Optional value for the attribute filter | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | Processed text with attributes replaced in matching elements |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> # Replace class attribute with data-type, keeping the value
>>> text = '<div class="main">Text</div>'
>>> replace_attribute(text, "div", "class", "data-type")
'<div data-type="main">Text</div>'
>>> # Replace class="info" with class="highlight"
>>> text = '<p class="info">Text</p><p class="data">More</p>'
>>> replace_attribute(text, "p", "class", "class", attribute_value="info", replace_value="highlight")
'<p class="highlight">Text</p><p class="data">More</p>'
>>> # Only replace attributes on elements with a specific attribute value
>>> text = '<div class="main" id="content">Text</div><div class="sidebar">Side</div>'
>>> replace_attribute(text, "div", "class", "role", attribute_filter="id", filter_value="content")
'<div role="main" id="content">Text</div><div class="sidebar">Side</div>'
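Renaming an attribute while keeping its value, as in the first example, comes down to moving an entry in the tag's attribute dictionary. The sketch below shows one way to do that with Beautiful Soup. `rename_attribute` is a hypothetical helper, it ignores the filtering and value-replacement parameters, and it assumes `html.parser`.

```python
from bs4 import BeautifulSoup


def rename_attribute(text: str, tag_name: str, old: str, new: str) -> str:
    """Hypothetical helper: rename an attribute on every <tag_name>, keeping its value."""
    soup = BeautifulSoup(text, "html.parser")
    for element in soup.find_all(tag_name):
        if element.has_attr(old):
            value = element.attrs.pop(old)
            # Multi-valued attributes such as class are parsed into a list.
            if isinstance(value, list):
                value = " ".join(value)
            element[new] = value
    return str(soup)


print(rename_attribute('<div class="main">Text</div>', "div", "class", "data-type"))
# <div data-type="main">Text</div>
```

In this sketch the renamed attribute is appended after any attributes that remain, so attribute order may change on elements that carry several attributes.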
replace_tag(text: str, selector: str, replacement: str, mode: str = 'html', matcher_type: str = 'exact', attribute: str = None, attribute_value: str = None, preserve_attributes: bool = True) -> str
Replaces HTML/XML tags with another tag while preserving content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | HTML or XML text to process | required |
| `selector` | `str` | Tag name or CSS selector to match elements for replacement | required |
| `replacement` | `str` | New tag name to replace the matched elements with | required |
| `mode` | `str` | Parser mode, either "html" or "xml" | `'html'` |
| `matcher_type` | `str` | Type of match to perform, either "exact", "contains", or "regex" | `'exact'` |
| `attribute` | `str` | Optional attribute name to filter elements | `None` |
| `attribute_value` | `str` | Optional value for the attribute filter | `None` |
| `preserve_attributes` | `bool` | Whether to preserve original tag attributes | `True` |
Returns:
| Type | Description |
|---|---|
| `str` | Processed text with matching tags replaced but content preserved |
Raises:
| Type | Description |
|---|---|
| `LexosException` | If mode is not "html" or "xml" |
Examples:
>>> text = "<div><p>Keep this</p></div>"
>>> replace_tag(text, "div", "section")
'<section><p>Keep this</p></section>'
>>> text = "<p class='a'>Keep</p><p class='b' id='x'>Replace tag</p>"
>>> replace_tag(text, "p", "span", attribute="class", attribute_value="b")
"<p class='a'>Keep</p><span class='b' id='x'>Replace tag</span>"
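In Beautiful Soup a tag can be renamed in place by assigning to its `name`, which keeps both content and attributes. The sketch below uses that to mimic the `preserve_attributes` switch. `rename_tag` is a hypothetical helper that matches by tag name only and assumes `html.parser`. It is not the Lexos implementation.

```python
from bs4 import BeautifulSoup


def rename_tag(text: str, old_name: str, new_name: str, preserve_attributes: bool = True) -> str:
    """Hypothetical helper: swap one tag name for another, keeping inner content."""
    soup = BeautifulSoup(text, "html.parser")
    for element in soup.find_all(old_name):
        element.name = new_name  # renames both the opening and closing tag
        if not preserve_attributes:
            element.attrs.clear()
    return str(soup)


print(rename_tag("<div><p>Keep this</p></div>", "div", "section"))
# <section><p>Keep this</p></section>
print(rename_tag("<p class='b' id='x'>Replace tag</p>", "p", "span", preserve_attributes=False))
# <span>Replace tag</span>
```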