public class JsoupBasedHtmlParser extends HTMLParser
LagartoBasedHtmlParser and this one (adapter pattern)

Fields inherited from class HTMLParser: ATT_BACKGROUND, ATT_CODE, ATT_CODEBASE, ATT_DATA, ATT_HREF, ATT_IS_IMAGE, ATT_REL, ATT_SRC, ATT_STYLE, ATT_TYPE, DEFAULT_PARSER, PARSER_CLASSNAME, STYLESHEET, TAG_APPLET, TAG_BASE, TAG_BGSOUND, TAG_BODY, TAG_EMBED, TAG_FRAME, TAG_IFRAME, TAG_IMAGE, TAG_INPUT, TAG_LINK, TAG_OBJECT, TAG_SCRIPT

| Constructor and Description |
|---|
| JsoupBasedHtmlParser() |

| Modifier and Type | Method and Description |
|---|---|
| Iterator<URL> | getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection coll, String encoding) Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
| protected boolean | isReusable() Parsers should override this method if the parser class is reusable, in which case the class will be cached for the next getParser() call. |
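
The caching behaviour described for isReusable() can be illustrated with a small standalone sketch. It does not use JMeter's classes: SketchParser, ReusableParser, OneShotParser and the getParser lookup below are hypothetical stand-ins, and the real HTMLParser.getParser() may differ in its details. The sketch only assumes that a parser returning true from isReusable() is kept for later lookups, while other parsers are created afresh each time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ParserCacheSketch {

    /** Hypothetical stand-in for the parser base class (not JMeter's HTMLParser). */
    abstract static class SketchParser {
        /** Not reusable by default; subclasses opt in by overriding. */
        protected boolean isReusable() {
            return false;
        }
    }

    /** A stateless parser that declares itself safe to share between calls. */
    static class ReusableParser extends SketchParser {
        @Override
        protected boolean isReusable() {
            return true;
        }
    }

    /** A parser that keeps the default and is therefore never cached. */
    static class OneShotParser extends SketchParser {
    }

    /** Cache of reusable parser instances, keyed by class name. */
    private static final Map<String, SketchParser> CACHE = new ConcurrentHashMap<>();

    /** Returns the cached instance if there is one, otherwise creates a new parser. */
    static SketchParser getParser(String className) {
        SketchParser cached = CACHE.get(className);
        if (cached != null) {
            return cached;
        }
        SketchParser parser;
        try {
            parser = (SketchParser) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create parser " + className, e);
        }
        // Only parsers that opt in are kept for the next lookup.
        if (parser.isReusable()) {
            CACHE.put(className, parser);
        }
        return parser;
    }

    public static void main(String[] args) {
        String reusable = ReusableParser.class.getName();
        String oneShot = OneShotParser.class.getName();
        System.out.println("reusable parser shared: "
                + (getParser(reusable) == getParser(reusable)));
        System.out.println("one-shot parser shared: "
                + (getParser(oneShot) == getParser(oneShot)));
    }
}
```

In this sketch a parser that holds per-document state would keep the default and be rebuilt on every lookup, whereas a stateless one can safely return true.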

Methods inherited from class HTMLParser: getEmbeddedResourceURLs, getEmbeddedResourceURLs, getParser, getParser

public Iterator<URL> getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException

Description copied from class: HTMLParser

All URLs should be added to the Collection.

Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException. N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
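
The contract above (resolve each reference against the base URL, keep the raw string when it cannot be turned into a URL) can be sketched with the standard library alone. This is an illustration, not the JMeter implementation: EmbeddedUrlSketch, Result, resolveAll and url are hypothetical names, the references are passed in as a ready-made list instead of being extracted from HTML, and plain lists stand in for URLCollection and URLString.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmbeddedUrlSketch {

    /** Resolved URLs plus the raw strings that could not be turned into URLs. */
    static final class Result {
        final List<URL> resolved = new ArrayList<>();
        final List<String> malformed = new ArrayList<>();
    }

    /** Helper that converts a checked parse failure into an unchecked one. */
    static URL url(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    /** Resolves each reference against baseUrl; unparseable ones are kept as strings. */
    static Result resolveAll(URL baseUrl, List<String> references) {
        Result result = new Result();
        for (String ref : references) {
            try {
                // Relative references are resolved against the page's base URL.
                result.resolved.add(new URL(baseUrl, ref));
            } catch (MalformedURLException e) {
                // Report the problem by keeping the original string.
                result.malformed.add(ref);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        URL base = url("http://example.com/app/index.html");
        Result result = resolveAll(base, Arrays.asList(
                "img/logo.png",                  // relative to the page
                "/css/site.css",                 // relative to the host root
                "http://cdn.example.com/lib.js", // already absolute
                "bogus://x/y"));                 // unknown protocol, cannot be resolved
        System.out.println("resolved:  " + result.resolved);
        System.out.println("malformed: " + result.malformed);
    }
}
```

The unknown-protocol reference ends up in the malformed list, while the other three are resolved to absolute URLs.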

Specified by: getEmbeddedResourceURLs in class HTMLParser

Parameters:

html - HTML code

baseUrl - Base URL from which the HTML code was obtained

coll - URLCollection

encoding - Charset

Throws: HTMLParseException

protected boolean isReusable()

Description copied from class: HTMLParser

Overrides: isReusable in class HTMLParser

Copyright © 1998-2015 Apache Software Foundation. All Rights Reserved.