libearth.sanitizer — Sanitize HTML tags

class libearth.sanitizer.HtmlSanitizer

HTML parser used internally by the sanitize_html() function.

DISALLOWED_SCHEMES = frozenset(['about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript', 'data'])

(collections.Set) The set of disallowed URI schemes, e.g. javascript:.
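As a rough illustration of how such a set can be consulted, the sketch below checks whether a URI starts with one of the listed schemes. The helper name has_disallowed_scheme is hypothetical and is not part of libearth; the way HtmlSanitizer actually performs this check may differ.

```python
# Same members as the DISALLOWED_SCHEMES value documented above.
DISALLOWED_SCHEMES = frozenset([
    'about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript',
    'data',
])


def has_disallowed_scheme(uri):
    """Hypothetical helper: True if ``uri`` starts with a disallowed scheme."""
    # Everything before the first colon is treated as the scheme.
    scheme, separator, _ = uri.strip().partition(':')
    # Without a colon there is no scheme at all (e.g. a relative path).
    return bool(separator) and scheme.lower() in DISALLOWED_SCHEMES
```

The comparison is case-insensitive and ignores surrounding whitespace, since browsers tolerate both in URIs.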

DISALLOWED_STYLE_PATTERN = <_sre.SRE_Pattern object at 0x3135bd0>

(re.RegexObject) The regular expression pattern that matches disallowed CSS properties.
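The pattern itself is not shown on this page, so the sketch below uses a hypothetical stand-in that matches only display: none declarations, to illustrate how a style attribute could be filtered with a regular expression. Both DISPLAY_NONE and drop_disallowed_declarations are assumed names, not libearth API.

```python
import re

# Hypothetical stand-in; the real DISALLOWED_STYLE_PATTERN may match more.
DISPLAY_NONE = re.compile(r'^\s*display\s*:\s*none\s*$', re.IGNORECASE)


def drop_disallowed_declarations(style):
    """Return ``style`` without declarations matched by DISPLAY_NONE."""
    # Naive split on ';': good enough for a sketch, but it would mishandle
    # semicolons inside quoted strings or url(...) values.
    declarations = [d.strip() for d in style.split(';') if d.strip()]
    kept = [d for d in declarations if not DISPLAY_NONE.match(d)]
    return '; '.join(kept)
```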

class libearth.sanitizer.MarkupTagCleaner

HTML parser used internally by the clean_html() function.


libearth.sanitizer.clean_html(html)

Strip all markup tags from the given HTML string, turning the HTML document into plain text.

Parameters: html (str) – HTML string to clean
Returns: cleaned plain text
Return type: str
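To make the idea of tag stripping concrete, here is a minimal sketch built on the standard library's html.parser module. It is not the MarkupTagCleaner implementation; TagStripper and strip_tags are hypothetical names chosen for illustration, and edge cases (for example the text inside script elements) may be handled differently by clean_html().

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical parser that keeps text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        self.fragments.append(data)


def strip_tags(html):
    """Return the text content of ``html`` with every tag removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.fragments)
```

Assuming libearth is installed, the documented function would be called the same way: clean_html(html) takes one HTML string and returns a str.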

libearth.sanitizer.sanitize_html(html)

Sanitize the given HTML string. It removes the following tags and attributes, which are neither secure nor useful for an RSS reader layout:

  • <script> tags
  • display: none; styles
  • JavaScript event attributes e.g. onclick, onload
  • href attributes that start with javascript:, jscript:, livescript:, vbscript:, data:, about:, or mocha:.

Parameters: html (str) – HTML string to sanitize
Returns: sanitized HTML string
Return type: str
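The four rules listed above can be sketched with the standard library's html.parser as follows. This is an illustrative approximation, not the HtmlSanitizer source: SketchSanitizer, sanitize_sketch, and the DISPLAY_NONE pattern are all assumptions, and the real function may treat void elements, malformed markup, and other CSS properties differently.

```python
import re
from html import escape
from html.parser import HTMLParser

DISALLOWED_SCHEMES = frozenset([
    'about', 'jscript', 'livescript', 'javascript', 'mocha', 'vbscript',
    'data',
])
# Hypothetical pattern; the real DISALLOWED_STYLE_PATTERN is not shown here.
DISPLAY_NONE = re.compile(r'^\s*display\s*:\s*none\s*$', re.IGNORECASE)


class SketchSanitizer(HTMLParser):
    """Hypothetical parser that re-emits HTML minus the unsafe parts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self.in_script = False  # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
            return
        kept = []
        for name, value in attrs:
            value = value or ''
            if name.startswith('on'):  # event attributes: onclick, onload
                continue
            if name == 'href':
                scheme, sep, _ = value.strip().partition(':')
                if sep and scheme.lower() in DISALLOWED_SCHEMES:
                    continue
            if name == 'style':
                decls = [d.strip() for d in value.split(';') if d.strip()]
                value = '; '.join(
                    d for d in decls if not DISPLAY_NONE.match(d)
                )
                if not value:
                    continue
            kept.append(' {0}="{1}"'.format(name, escape(value, quote=True)))
        self.fragments.append('<{0}{1}>'.format(tag, ''.join(kept)))

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
            return
        self.fragments.append('</{0}>'.format(tag))

    def handle_data(self, data):
        if not self.in_script:  # drop the script body along with its tags
            self.fragments.append(escape(data, quote=False))


def sanitize_sketch(html):
    """Return ``html`` with the constructs listed above removed."""
    parser = SketchSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.fragments)
```

Unlike tag stripping, the result is still HTML: allowed tags and attributes are written back out, and only the listed constructs are dropped.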