libearth.sanitizer - Sanitize HTML tags
class libearth.sanitizer.HtmlSanitizer
    HTML parser that is used internally by the sanitize_html() function.

    DISALLOWED_SCHEMES = frozenset(['about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript', 'data'])
        (collections.Set) The set of disallowed URI schemes, e.g. javascript:.
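To illustrate how a scheme set like this can be applied, here is a minimal sketch. It is not libearth's own code: the helper name has_disallowed_scheme() is hypothetical, and the set is copied from the listing above so that the example is self-contained.

```python
# Copied from the documented value so the sketch runs on its own.
DISALLOWED_SCHEMES = frozenset(['about', 'jscript', 'livescript', 'javascript',
                                'mocha', 'vbscript', 'data'])


def has_disallowed_scheme(uri):
    """Hypothetical helper: True if ``uri`` starts with a disallowed scheme."""
    # Everything before the first colon is treated as the scheme.
    scheme, colon, _ = uri.strip().partition(':')
    # No colon means a relative reference, which has no scheme at all.
    return bool(colon) and scheme.lower() in DISALLOWED_SCHEMES


print(has_disallowed_scheme('javascript:alert(1)'))   # True
print(has_disallowed_scheme('http://example.com/'))   # False
```

The comparison is case-insensitive because URI schemes are case-insensitive, so JavaScript: is caught as well.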
    DISALLOWED_STYLE_PATTERN
        (re.RegexObject) The regular expression pattern that matches disallowed CSS properties.
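The actual pattern is not shown on this page, so the following is only an assumed example of what a pattern for disallowed CSS could look like. Both the name HIDDEN_DISPLAY and the expression itself are hypothetical; it covers only the display: none case mentioned in the sanitize_html() description.

```python
import re

# Assumed pattern, not libearth's: a "display: none" declaration at the
# start of a style value or right after a semicolon.
HIDDEN_DISPLAY = re.compile(r'(?:^|;)\s*display\s*:\s*none\s*(?:;|$)',
                            re.IGNORECASE)


def hides_content(style):
    """Hypothetical helper: True if ``style`` contains ``display: none``."""
    return HIDDEN_DISPLAY.search(style) is not None


print(hides_content('display: none;'))            # True
print(hides_content('color: red; DISPLAY:none'))  # True
print(hides_content('display: block'))            # False
```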

class libearth.sanitizer.MarkupTagCleaner
    HTML parser that is used internally by the clean_html() function.

libearth.sanitizer.clean_html(html)
    Strip all markup tags from the html string. In other words, it turns the given html document into plain text.

    Parameters: html (str) – html string to clean
    Returns: cleaned plain text
    Return type: str

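The general technique of reducing markup to plain text can be sketched with the standard library's html.parser module. This is an illustration under assumptions, not the MarkupTagCleaner implementation: the names TagStripper and strip_tags() are hypothetical, and the sketch keeps every text node, including the contents of script and style elements.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical parser that keeps text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never recorded.
        self.fragments.append(data)


def strip_tags(html):
    """Return the text content of ``html`` with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.fragments)


print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```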
libearth.sanitizer.sanitize_html(html)
    Sanitize the given html string. It removes the following tags and attributes that are neither secure nor useful for RSS reader layout:

    - <script> tags
    - display: none; styles
    - JavaScript event attributes, e.g. onclick, onload
    - href attributes that start with javascript:, jscript:, livescript:, vbscript:, data:, about:, or mocha:

    Parameters: html (str) – html string to sanitize
    Returns: sanitized html string
    Return type: str
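
The rules listed for sanitize_html() can be approximated with a small html.parser subclass. The sketch below is an assumption-laden illustration, not the HtmlSanitizer implementation: SketchSanitizer, sanitize_sketch() and HIDDEN_DECLARATION are hypothetical names, styles are split naively on semicolons, any attribute whose name starts with "on" is treated as an event handler, and comments are dropped.

```python
import re
from html import escape
from html.parser import HTMLParser

DISALLOWED_SCHEMES = frozenset(['about', 'jscript', 'livescript', 'javascript',
                                'mocha', 'vbscript', 'data'])
# Assumed pattern for one "display: none" declaration (not libearth's own).
HIDDEN_DECLARATION = re.compile(r'^display\s*:\s*none$', re.IGNORECASE)


class SketchSanitizer(HTMLParser):
    """Hypothetical sanitizer that re-emits HTML minus the unsafe parts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True  # drop the tag and everything inside it
            return
        kept = []
        for name, value in attrs:
            value = value or ''
            if name.startswith('on'):  # onclick, onload, ...
                continue
            if name == 'href':
                scheme, colon, _ = value.strip().partition(':')
                if colon and scheme.lower() in DISALLOWED_SCHEMES:
                    continue
            if name == 'style':
                # Keep every declaration except "display: none".
                declarations = [d.strip() for d in value.split(';')]
                value = '; '.join(d for d in declarations
                                  if d and not HIDDEN_DECLARATION.match(d))
                if not value:
                    continue
            kept.append(' {0}="{1}"'.format(name, escape(value, quote=True)))
        self.fragments.append('<{0}{1}>'.format(tag, ''.join(kept)))

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self.fragments.append('</{0}>'.format(tag))

    def handle_data(self, data):
        if not self.in_script:
            # Text was unescaped by the parser, so escape it again on output.
            self.fragments.append(escape(data, quote=False))


def sanitize_sketch(html):
    parser = SketchSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.fragments)


print(sanitize_sketch('<p onclick="x()">Hi<script>alert(1)</script></p>'))
# <p>Hi</p>
```

Unlike clean_html(), the result is still markup: safe tags and attributes pass through, and only the items listed above are removed.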