java - Parse broken HTML sites with XPath


I have seen questions about this for Python here, but the tools I found are all for Python, so here is a new question: I need to query things from an HTML site with XPath in Java.

My current code looks like this:

    URL url = new URL("http://somesite.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.connect();

    // Parse the response as XML (this is the step that fails on malformed HTML)
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(connection.getInputStream()));

    // Evaluate the XPath expression and read the result as a string
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xpath = xPathFactory.newXPath();
    XPathExpression expr = xpath.compile("//span[@class='a-class']");
    String price = (String) expr.evaluate(doc, XPathConstants.STRING);

The problem is that either the page is broken or XPath has problems reading it:

[Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)

Is there a tool that can read HTML sites better? Or should I use regex on the page?
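For context, the exception above is what a strict XML parser reports when it meets a bare `&` that does not start an entity reference, which is legal enough for browsers but not well-formed XML. The following is a minimal sketch of that behavior using only the JDK; the class name `StrictParseDemo`, the helper `query`, and both markup strings are made up for illustration and are not taken from the real page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class StrictParseDemo {

    // Hypothetical helper: parse markup with the strict JAXP XML parser,
    // then evaluate the XPath expression and return the result as a string.
    static String query(String markup, String xpath) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String xpath = "//span[@class='a-class']";

        // Well-formed: the ampersand is escaped as &amp;
        String wellFormed =
                "<html><body><p>Tom &amp; Jerry</p><span class=\"a-class\">9.99</span></body></html>";
        System.out.println(query(wellFormed, xpath));

        // Not well-formed XML: a bare '&' followed by a space
        String broken =
                "<html><body><p>Tom & Jerry</p><span class=\"a-class\">9.99</span></body></html>";
        try {
            query(broken, xpath);
        } catch (SAXParseException e) {
            // The strict parser rejects the document before XPath ever runs
            System.out.println("Parse failed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber());
        }
    }
}
```

The XPath expression itself is fine in both cases. The failure happens while building the DOM, so the fix has to be a more tolerant parser rather than a different query.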

Is there a tool that can read HTML sites better?

People speak highly of jsoup.
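As a rough illustration of that suggestion, here is a minimal sketch that assumes the jsoup library is on the classpath. The class name `JsoupPriceDemo`, the helper `firstPrice`, and the sample HTML are hypothetical. The sketch uses jsoup's CSS selector `span.a-class` in place of the XPath `//span[@class='a-class']`; the two are close but not identical, since the CSS form also matches spans that carry additional classes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupPriceDemo {

    // Hypothetical helper: return the text of the first <span> whose class
    // list contains "a-class", or an empty string if there is no such element.
    static String firstPrice(String html) {
        Document doc = Jsoup.parse(html);
        Element span = doc.selectFirst("span.a-class");
        return span == null ? "" : span.text();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: bare '&', unquoted attribute, unclosed tags
        String html = "<html><body><p>Tom & Jerry<span class=a-class>9.99</span></body>";
        System.out.println(firstPrice(html));

        // To fetch a live page instead of a string, jsoup can do the HTTP
        // request itself, for example:
        // Document doc = Jsoup.connect("http://somesite.com").get();
    }
}
```

If XPath is a hard requirement, newer jsoup releases also provide a `selectXpath` method on elements; whether it is available depends on the jsoup version in use, so check the documentation for that version.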

