`libearth.sanitizer` — Sanitize HTML tags

class libearth.sanitizer.HtmlSanitizer

HTML parser that is used internally by the sanitize_html() function.

DISALLOWED_SCHEMES = frozenset(['about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript', 'data'])

(collections.Set) The set of disallowed URI schemes, e.g. javascript:.

DISALLOWED_STYLE_PATTERN = <_sre.SRE_Pattern object at 0x3135bd0>

(re.RegexObject) The regular expression pattern that matches disallowed CSS properties.

class libearth.sanitizer.MarkupTagCleaner

HTML parser that is used internally by the clean_html() function.

libearth.sanitizer.clean_html(html)

Strip all markup tags from the given html string. In other words, it turns the given html document into plain text.
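The idea of tag stripping can be sketched with the standard library alone. The following is a minimal illustration, not libearth's actual implementation: `TagStripper` and `strip_tags` are hypothetical names, and the real `MarkupTagCleaner` may treat entities, comments, and script content differently.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical stand-in for MarkupTagCleaner: keeps text, drops tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        # Only text nodes are collected; tag callbacks are left as no-ops.
        self.fragments.append(data)


def strip_tags(html):
    """Return the text content of ``html`` with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.fragments)


print(strip_tags('<p>Hello, <b>world</b>!</p>'))
```

Collecting fragments in a list and joining them once at the end avoids repeated string concatenation while the parser walks the document.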

libearth.sanitizer.sanitize_html(html)

Sanitize the given html string. It removes the following tags and attributes, which are neither secure nor useful for RSS reader layout:

<script> tags
display: none; styles
JavaScript event attributes e.g. onclick, onload
href attributes that start with javascript:, jscript:, livescript:, vbscript:, data:, about:, or mocha:.

Parameters:	html (`str`) – html string to sanitize
Returns:	sanitized html string
Return type:	`str`
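The four rules listed above can be approximated with a small standard-library parser. This is a rough sketch under stated assumptions, not the library's code: `SketchSanitizer`, `sketch_sanitize`, and `HIDDEN_STYLE` are hypothetical names, the style regex is a guess (the real `DISALLOWED_STYLE_PATTERN` is not shown here), and edge cases such as void elements and comments are ignored.

```python
import re
from html import escape
from html.parser import HTMLParser

# Values copied from the documented DISALLOWED_SCHEMES attribute.
DISALLOWED_SCHEMES = frozenset([
    'about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript',
    'data',
])
# Assumption: a simple pattern for "display: none", not libearth's own regex.
HIDDEN_STYLE = re.compile(r'display\s*:\s*none\s*;?', re.IGNORECASE)


class SketchSanitizer(HTMLParser):
    """Hypothetical parser that re-emits HTML minus the unsafe parts."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.fragments = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True  # drop the tag and everything inside it
            return
        kept = []
        for name, value in attrs:
            value = value or ''
            if name.startswith('on'):  # event attributes: onclick, onload
                continue
            if name == 'href' and ':' in value:
                scheme = value.strip().partition(':')[0].lower()
                if scheme in DISALLOWED_SCHEMES:
                    continue
            if name == 'style':
                value = HIDDEN_STYLE.sub('', value).strip()
                if not value:
                    continue
            kept.append(' {0}="{1}"'.format(name, escape(value, quote=True)))
        self.fragments.append('<{0}{1}>'.format(tag, ''.join(kept)))

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
            return
        self.fragments.append('</{0}>'.format(tag))

    def handle_data(self, data):
        if not self.in_script:
            self.fragments.append(data)

    def handle_entityref(self, name):
        self.fragments.append('&{0};'.format(name))  # keep entities as-is

    def handle_charref(self, name):
        self.fragments.append('&#{0};'.format(name))


def sketch_sanitize(html):
    """Return ``html`` with scripts, event handlers, and bad hrefs removed."""
    parser = SketchSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.fragments)


print(sketch_sanitize('<a href="javascript:alert(1)" onclick="x()">link</a>'))
```

Note that the output is still HTML: only the offending tags and attributes are dropped, while the remaining markup and text pass through.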
