Safely remove all HTML code from a string in Python
I've been reading many Q&As on how to remove all the HTML code from a string using Python, but none was satisfying. I need a way to remove all the tags, preserve/convert the HTML entities, and work well with UTF-8 strings.
Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to get just the text, but I'm losing the entities:
    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.data = []

        def handle_data(self, data):
            self.data.append(data)

        def handle_charref(self, name):
            self.data.append(name)

        def handle_entityref(self, ent):
            self.data.append(ent)
This gives me something like:
[u'asia, sp', u'cialiste du voyage ', ...
That is, I'm losing the entity for the accented "e" in spécialiste.
Using one of the many regexps found in answers to similar questions is not an option, because they all have edge cases that were not considered.
Is there a module I could use?
Bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.
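As a point of comparison for the entity problem in the question, here is a minimal standard-library sketch, assuming Python 3 rather than the Python 2 `HTMLParser` module used above. With `convert_charrefs=True`, `html.parser.HTMLParser` decodes entity and character references before the text reaches `handle_data`, so the accented character is no longer dropped. The names `TextExtractor` and `strip_tags` are hypothetical, chosen only for illustration. This only extracts text (including the contents of `<script>` and `<style>` elements); it is not a sanitizer and is not a substitute for Bleach on untrusted input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and drop tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &eacute; or &#38; and pass the result to handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, already unescaped.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text of `markup` with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


result = strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>")
# result == "Asia, spécialiste du voyage"
```

Joining the collected chunks into one string avoids the split seen in the question's output, where the text on either side of the entity ended up in separate list items.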