Safely remove all HTML code from a string in Python

- March 15, 2013

I've been reading many Q&As on how to remove all the HTML code from a string using Python, but none of them was satisfying. I need a way to remove all the tags, preserve/convert the HTML entities, and work well with UTF-8 strings.

Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to get just the text, but I am losing the entities:

    # Python 2: the HTMLParser module (renamed html.parser in Python 3)
    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.data = []

        def handle_data(self, data):
            self.data.append(data)

        def handle_charref(self, name):
            self.data.append(name)

        def handle_entityref(self, ent):
            self.data.append(ent)

This gives me something like:

[u'asia, sp', u'cialiste du voyage ', ...

losing the entity for the accented "e" in "spécialiste".

Using one of the many regexps I can find in answers to similar questions always leaves some edge cases that were not considered.

Is there a module I could use?

Bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.
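For comparison with the parser in the question, here is a minimal sketch using only the Python 3 standard library. It assumes Python 3.5 or later, where html.parser.HTMLParser accepts convert_charrefs=True and decodes entity and character references before handing text to handle_data, so the accented "e" is no longer lost. The names TextExtractor and strip_tags are hypothetical, chosen for illustration. This only extracts text; it is not a sanitizer on the level of Bleach.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and drop tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &eacute; or &#233; before calling handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called with the decoded text found between tags.
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text of html_text with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
# prints: Asia, spécialiste du voyage
```

Note that this sketch also keeps the contents of script and style elements as plain text, and its output should still be escaped before being placed back into an HTML page.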
