public class Jsoup extends Object
Modifier and Type | Method and Description |
---|---|
static String | clean(String bodyHtml, String baseUri, Whitelist whitelist): Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
static String | clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings): Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
static String | clean(String bodyHtml, Whitelist whitelist): Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
static Connection | connect(String url): Creates a new Connection to a URL. |
static boolean | isValid(String bodyHtml, Whitelist whitelist): Test if the input HTML has only tags and attributes allowed by the Whitelist. |
static Document | parse(File in, String charsetName): Parse the contents of a file as HTML. |
static Document | parse(File in, String charsetName, String baseUri): Parse the contents of a file as HTML. |
static Document | parse(InputStream in, String charsetName, String baseUri): Read an input stream, and parse it to a Document. |
static Document | parse(InputStream in, String charsetName, String baseUri, Parser parser): Read an input stream, and parse it to a Document. |
static Document | parse(String html): Parse HTML into a Document. |
static Document | parse(String html, String baseUri): Parse HTML into a Document. |
static Document | parse(String html, String baseUri, Parser parser): Parse HTML into a Document, using the provided Parser. |
static Document | parse(URL url, int timeoutMillis): Fetch a URL, and parse it as HTML. |
static Document | parseBodyFragment(String bodyHtml): Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
static Document | parseBodyFragment(String bodyHtml, String baseUri): Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |

public static Document parse(String html, String baseUri)
Parse HTML into a Document.
Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs that occur before the HTML declares a <base href> tag.

public static Document parse(String html, String baseUri, Parser parser)
Parse HTML into a Document, using the provided Parser.
Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs that occur before the HTML declares a <base href> tag.
parser - alternate parser to use.

public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
Parameters:
html - HTML to parse
See Also:
parse(String, String)
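
The effect of baseUri and of an alternate parser is easier to see in a small program. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ParseStringExample, its helper methods, and the sample markup are hypothetical and not part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical example class; only the Jsoup, Document, and Parser calls are jsoup API.
final class ParseStringExample {

    // Parse with a base URI, then resolve the first link's relative href against it.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first().absUrl("href");
    }

    // Parse the input as XML instead of HTML by supplying an alternate parser.
    static String xmlTitle(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("item > title").text();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/index.html\">Docs</a></p>";
        System.out.println(firstLinkAbsUrl(html, "http://example.com/"));
        System.out.println(xmlTitle("<item><title>First</title></item>"));
    }
}
```

When the same HTML is parsed with parse(String html) and contains no <base href> tag, absUrl("href") has nothing to resolve the relative link against, which is why the two-argument overload is usually the better fit for fetched pages.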

public static Connection connect(String url)
Creates a new Connection to a URL. Use to fetch and parse an HTML page.
Use examples:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
Parameters:
url - URL to connect to. The protocol must be http or https.

public static Document parse(File in, String charsetName, String baseUri) throws IOException
Parse the contents of a file as HTML.
Parameters:
in - file to load HTML from
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse(File in, String charsetName) throws IOException
Parse the contents of a file as HTML.
Parameters:
in - file to load HTML from
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
See Also:
parse(File, String, String)
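
As an illustration of the file overloads, the sketch below writes some HTML to a temporary file and parses it back with an explicit charset and base URI. It assumes jsoup is on the classpath and Java 8 or later; the class ParseFileExample and the method titleOfSavedPage are hypothetical names chosen for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class; Jsoup.parse(File, String, String) is the documented call.
final class ParseFileExample {

    // Save the HTML to a temporary file, parse that file, and return the page title.
    static String titleOfSavedPage(String html, String baseUri) {
        try {
            File file = File.createTempFile("page", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
            // The charset is given explicitly; passing null would let jsoup detect it.
            Document doc = Jsoup.parse(file, "UTF-8", baseUri);
            return doc.title();
        } catch (IOException e) {
            // Wrapped so callers of this sketch need not declare IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Saved page</title></head><body></body></html>";
        System.out.println(titleOfSavedPage(html, "http://example.com/"));
    }
}
```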

public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException
Read an input stream, and parse it to a Document.
Parameters:
in - input stream to read. Make sure to close it after parsing.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException
Read an input stream, and parse it to a Document.
Parameters:
in - input stream to read. Make sure to close it after parsing.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
parser - alternate parser to use.
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parseBodyFragment(String bodyHtml, String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameters:
bodyHtml - body HTML fragment
baseUri - URL to resolve relative URLs against.
See Also:
Document.body()

public static Document parseBodyFragment(String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameters:
bodyHtml - body HTML fragment
See Also:
Document.body()
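
A short sketch of fragment parsing follows, assuming jsoup is on the classpath. The class BodyFragmentExample and the method countBodyChildren are hypothetical; the point is that the fragment ends up under Document.body(), even when the input leaves a tag unclosed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; parseBodyFragment and Document.body() are jsoup API.
final class BodyFragmentExample {

    // Parse a body fragment and count the element children placed under <body>.
    static int countBodyChildren(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element body = doc.body();
        return body.children().size();
    }

    public static void main(String[] args) {
        // The second paragraph is deliberately left unclosed.
        System.out.println(countBodyChildren("<p>One</p><p>Two"));
    }
}
```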

public static Document parse(URL url, int timeoutMillis) throws IOException
Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.
The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
Parameters:
url - URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis - Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
Throws:
MalformedURLException - if the request URL is not an HTTP or HTTPS URL, or is otherwise malformed
HttpStatusException - if the response is not OK and HTTP response errors are not ignored
UnsupportedMimeTypeException - if the response mime type is not supported and those errors are not ignored
SocketTimeoutException - if the connection times out
IOException - if a connection or read error occurs
See Also:
connect(String)
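
Since connect(String) is the suggested route for fetching, here is one way the same request might be written with it, including handling for the exceptions listed above. This is a sketch under assumptions: jsoup is on the classpath, the class FetchExample and its methods are hypothetical, and the "Mozilla" user agent and error strings are placeholders. Running main performs a real network request.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class; the Connection methods used here are jsoup API.
final class FetchExample {

    // Build a configured connection without sending anything yet.
    static Connection prepare(String url, int timeoutMillis) {
        return Jsoup.connect(url).userAgent("Mozilla").timeout(timeoutMillis);
    }

    // Execute a GET and return the page title, or a short description of the failure.
    static String fetchTitle(String url, int timeoutMillis) {
        try {
            Document doc = prepare(url, timeoutMillis).get();
            return doc.title();
        } catch (HttpStatusException e) {
            return "HTTP error " + e.getStatusCode();
        } catch (SocketTimeoutException e) {
            return "timed out";
        } catch (IOException e) {
            // Covers unsupported mime types and other connection or read errors.
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com", 5000));
    }
}
```

The more specific exceptions are caught before IOException because both are subclasses of it.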
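
public static String clean(String bodyHtml, String baseUri, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
whitelist - white-list of permitted HTML elements
See Also:
Cleaner.clean(Document)

public static String clean(String bodyHtml, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
whitelist - white-list of permitted HTML elements
See Also:
Cleaner.clean(Document)

public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
whitelist - white-list of permitted HTML elements
outputSettings - document output settings; use to control pretty-printing and entity escape modes
See Also:
Cleaner.clean(Document)

public static boolean isValid(String bodyHtml, Whitelist whitelist)
Test if the input HTML has only tags and attributes allowed by the Whitelist.
Parameters:
bodyHtml - HTML to test
whitelist - whitelist to test against
See Also:
clean(String, org.jsoup.safety.Whitelist)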
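
The relationship between clean and isValid can be sketched as follows. The example assumes a jsoup release that still ships org.jsoup.safety.Whitelist, as documented here; the class CleanExample, its methods, and the sample input are hypothetical. Whitelist.basic() is used as one possible policy, not as a recommendation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical example class; clean, isValid, and Whitelist.basic() are jsoup API.
final class CleanExample {

    // Filter untrusted HTML through the basic whitelist, without pretty-printing.
    static String sanitize(String untrustedHtml) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(untrustedHtml, "", Whitelist.basic(), settings);
    }

    // Check whether the input already satisfies the same whitelist.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hi <script>alert(1)</script>"
                + "<a href=\"http://example.com/\" onclick=\"x()\">link</a></p>";
        System.out.println(isAlreadySafe(dirty));
        System.out.println(sanitize(dirty));
    }
}
```

A typical pattern is to call isValid when rejecting input outright is acceptable, and clean when the input should be accepted with the disallowed parts removed.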
Copyright © 2009–2016 Jonathan Hedley. All rights reserved.