java - Parse broken HTML sites with XPath

- March 15, 2014

This question already has answers here:

How to “scan” a website (or page) for info, and bring it into my program? 10 answers
Parse web site HTML with Java [duplicate] 3 answers

I have seen questions like this for Python here, and the tools I found were for Python, so here is a new question: I need to query things from an HTML site with XPath.

My current code looks like this:

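    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    URL url = new URL("http://somesite.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.connect();

    // Parse the response as strict XML; this is the call that fails.
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(connection.getInputStream()));

    XPathFactory xpathFactory = XPathFactory.newInstance();
    XPath xpath = xpathFactory.newXPath();
    XPathExpression expr = xpath.compile("//span[@class='a-class']");
    String price = (String) expr.evaluate(doc, XPathConstants.STRING);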

The problem is that either the page is broken or XPath has problems reading it:

    [Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
    org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
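The stack trace shows the exception being thrown in DocumentBuilder.parse, that is, by the XML parser before any XPath expression is evaluated. The sketch below tries to reproduce that behaviour with the JDK alone. The class name AmpersandDemo, its helper methods, and the sample markup are made up for illustration and are not taken from the real page. It assumes the page contains a bare '&' in its text, which is common in HTML but not allowed in XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class AmpersandDemo {

    // Parses the markup as strict XML, then evaluates the XPath from the question.
    static String extractPrice(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate("//span[@class='a-class']", doc, XPathConstants.STRING);
    }

    // Returns true when the XML parser refuses the markup outright.
    static boolean isRejected(String markup) throws Exception {
        try {
            extractPrice(markup);
            return false;
        } catch (SAXParseException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical markup: a bare '&' is fine for browsers but not for an XML parser.
        String broken = "<html><body><p>Fish & Chips</p>"
                + "<span class='a-class'>9.99</span></body></html>";
        // Same markup with the ampersand escaped, which makes it well-formed XML.
        String wellFormed = broken.replace(" & ", " &amp; ");

        System.out.println("broken markup rejected: " + isRejected(broken));
        System.out.println("price from well-formed markup: " + extractPrice(wellFormed));
    }
}
```

If this assumption matches the real page, the XPath expression itself is fine and only the parsing step needs a more forgiving, HTML-aware parser.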

Is there a tool that can read HTML sites better? Or should I use a regex on the page?

"Is there a tool that can read HTML sites better?"

People speak highly of jsoup.
