java - Parse broken HTML Sites with XPath
This question already has answers here:
- How to “scan” a website (or page) for info, and bring it into my program? 10 answers
- Parse web site HTML with Java [duplicate] 3 answers
I asked some questions about Python here and found tools for Python. Now a new question: I need to query some things from an HTML site with XPath.
My current code looks like this:
    URL url = new URL("http://somesite.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.connect();
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(connection.getInputStream()));
    XPathFactory xpathFactory = XPathFactory.newInstance();
    XPath xpath = xpathFactory.newXPath();
    XPathExpression expr = xpath.compile("//span[@class='a-class']");
    String price = (String) expr.evaluate(doc, XPathConstants.STRING);
The problem is that either the page is broken or XPath has problems reading it:
    [Fatal Error] :4:254: The entity name must immediately follow the '&' in the entity reference.
    org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 254; The entity name must immediately follow the '&' in the entity reference.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
Is there a tool that can read HTML sites better? Or should I use regex on the page?
"Is there a tool that can read HTML sites better?"
People speak highly of jsoup.
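A rough sketch of what that could look like, assuming jsoup is on the classpath (it is a third-party library, not part of the JDK). The class name `JsoupPriceSketch`, the method `firstPrice`, and the sample markup are made up for illustration. The error in the question comes from a bare `&` in the page, which is common in real HTML but not well-formed XML; jsoup's parser tolerates it. The sketch uses jsoup's CSS selectors instead of XPath: `span.a-class` is the rough counterpart of `//span[@class='a-class']`, except that it also matches spans carrying additional classes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupPriceSketch {

    // Returns the text of the first <span> with class "a-class", or null if none matches.
    public static String firstPrice(String html) {
        // jsoup parses leniently: a bare '&' or an unclosed tag does not abort parsing.
        Document doc = Jsoup.parse(html);
        // CSS selector instead of the XPath expression from the question.
        Element span = doc.select("span.a-class").first();
        return span == null ? null : span.text();
    }

    public static void main(String[] args) {
        // Hypothetical markup containing the kind of bare '&' that broke the XML parser.
        String html = "<html><body><p>Tom & Jerry"
                + "<span class='a-class'>19.99</span></body></html>";
        System.out.println(firstPrice(html));
    }
}
```

For a live page, `Jsoup.connect("http://somesite.com").get()` returns the same kind of `Document`, so the selector part stays unchanged. Newer jsoup releases (1.14.3 and later) also offer `selectXpath` if you would rather keep the XPath expression.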