The Open Source HTML Parsers in Java page is useful in listing the HTML parsers that are out there. But it doesn't give much of a clue about which are the "best" in a given situation. In other words, how should one decide which HTML parser to use? And, doesn't the proliferation of HTML parsers out there imply that there is something wrong with the JDK's own HTML parser, javax.swing.text.html.HTMLEditorKit.Parser?
How do you parse HTML in Java?
At 6:58 AM on Jan 16, 2008, Geertjan wrote:
All things being equal, shouldn't one prefer to use a utility provided by the JDK over one provided by a third party library? (For this reason, I'm assuming that all things are not equal in this case.) I've been parsing HTML using the JDK's HTML parser, based on the approach described in Parsing HTML with Swing , although that's an article written in 2003, so it may be dated. The author of that article points to this weakness of the Swing HTML parser, at least, at the time of writing, back in 2003: "The biggest downside to this HTML parser is that it is not thread safe (thread safety has always been a problem with Swing components). This HTML processor is no different. I have used the Swing parser in heavily threaded environments, and it has resulted in a crash—eventually. If you want to use this HTML processor in a heavily threaded environment, you need to take steps to ensure that only one thread uses it at a time."
Is that the only weakness here? (By the way, on the positive side, the author writes: "I have used this parser with a number of programs that I have written, and I have found it to be very useful. It is particularly helpful for handling improperly formatted HTML, which can trip up some HTML parsers.") I guess the other HTML parsers may have additional features, those that relate to transformation in addition to parsing. And the other parsers probably allow for walking the DOM, rather than inspecting tags in the way that the Swing HTML Parser does. I have used JTidy before, but didn't find the benefits to outweigh the cumbersomeness of having to deal with a third party library.
Anyone care to share their experiences with these utilities?
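For readers who have not seen the JDK parser in use, here is a minimal sketch of the callback style discussed above. It is not taken from the "Parsing HTML with Swing" article; the class name SwingLinkExtractor and the method extractLinks are made up for illustration. It relies only on javax.swing.text.html.parser.ParserDelegator and HTMLEditorKit.ParserCallback from the JDK, and it collects the href of every anchor start tag the parser reports.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here only for illustration.
class SwingLinkExtractor {

    /** Returns the href value of every <a> start tag the JDK parser reports. */
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser pushes events to this callback; there is no DOM to walk.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/\">one</a>"
                + "<a href=\"/two.html\">two</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Note that the callback only sees tags, text, and attributes as they stream past, which is the "inspecting tags" style mentioned in the post, as opposed to walking a DOM.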
12 replies so far
Re: How do you parse HTML in Java?
I use the HTMLParser and HTMLLexer libraries: http://htmlparser.sourceforge.net/
It has a somewhat unique API but it's very usable and wonderful for parsing sometimes woefully formatted HTML.
Re: How do you parse HTML in Java?
Depends on what you want to do with it, and of course you should use whatever works best. I generally end up using lots of custom libraries on any Java project (much better than reinventing the wheel yourself).
Basically, in the few cases where I had to process HTML, I wanted to be able to treat it like a DOM tree, so I ended up using JTidy. This allowed me to use simple XPath expressions to extract bits and pieces of the HTML.
Main issues:
- JTidy does not seem to have been maintained for several years; I ended up pulling an unreleased version from the source repository that had several quite essential fixes
- it is quite slow
- HTML documents can be quite big, and processing large DOM trees can be expensive
- JTidy was dropping tags it didn't understand (e.g. abbr, which is legal HTML and commonly used in microformats)
I don't see how thread safety is essential for a parser. Making things thread safe is generally quite expensive, and unless you explicitly need to mess with the AST from multiple threads, you should not even be trying to do that.
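The DOM-plus-XPath approach described in this reply can be sketched with nothing but the JDK's own XML APIs, under one big assumption: the markup has already been cleaned into well-formed XHTML by JTidy, TagSoup, or a similar tool. That cleaning step is not shown, and no JTidy API is used here. The class name XhtmlXPathDemo and the method select are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
class XhtmlXPathDemo {

    /**
     * Evaluates an XPath expression against well-formed XHTML and returns the
     * text content of each matching node (for attributes, the attribute value).
     * Assumes the input declares no XML namespace and no DOCTYPE; namespaced
     * XHTML would additionally need a NamespaceContext on the XPath object.
     */
    static List<String> select(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad expression or malformed input", e);
        }
    }
}
```

For example, calling select(xhtml, "//a/@href") on a cleaned-up page would return the link targets in document order. The whole document is built into a DOM first, which is where the memory and speed costs listed above come from.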
Re: How do you parse HTML in Java?
I have used and been fairly happy with the Jericho HTML Parser: http://jerichohtml.sourceforge.net/doc/index.html
Open Source, forgiving of badly formatted HTML (e.g. missing end tags etc), more lightweight than something DOM-based
Alasdair
Re: How do you parse HTML in Java?
For complex HTML querying, I use TagSoup ( http://home.ccil.org/~cowan/XML/tagsoup/ ) to fix the HTML into proper XHTML form first; then I can use XPath or XSLT to retrieve the information I need. You can even have the XHTML displayed on screen with Flying Saucer.
Re: How do you parse HTML in Java?
What's the reason for all of you not using the JDK's HTML parser? (1) Lack of knowledge about its existence, (2) Frustration with it, (3) Doesn't matter that it exists, since there are great to excellent 3rd party HTML parsers out there anyway, (4) Something else?
Re: How do you parse HTML in Java?
...and as a corollary, I wanted a streaming parser that would basically let me apply a filter on tags as they were being shot out the socket to the client. I inevitably ended up writing my own for that.
Re: How do you parse HTML in Java?
> (thread safety has always been a problem with Swing components).
Problem? I wouldn't say thread safety is a problem in Swing unless you don't use the components correctly. The idea is thread confinement. If thread safety were added to all Swing components, the API would be too slow to use.
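To make the thread confinement idea concrete for the parser case, here is one possible sketch, not something any poster in this thread described: all parsing is funneled through a single worker thread, so only one thread ever uses the Swing parser at a time, while callers on any thread simply wait for their result. The class name ConfinedHtmlParser and the method textOf are made up for illustration.

```java
import java.io.StringReader;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical wrapper class, named here only for illustration.
class ConfinedHtmlParser implements AutoCloseable {

    // Every parse call runs on this one thread, one after another.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final ParserDelegator parser = new ParserDelegator();

    /** Returns the text chunks of the document joined by spaces; callable from any thread. */
    String textOf(final String html) {
        Future<String> result = worker.submit(() -> {
            final StringBuilder text = new StringBuilder();
            parser.parse(new StringReader(html), new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    text.append(data).append(' ');
                }
            }, true);
            return text.toString().trim();
        });
        try {
            return result.get(); // block the caller until the worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Stops the worker thread once queued parses are done. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

The trade-off is that parsing is fully serialized. If throughput matters more, an alternative would be to synchronize less coarsely or to pick a parser documented as safe for concurrent use.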
Re: How do you parse HTML in Java?
You can also use HPricot from JRuby; it works great. HPricot is great for handling invalid HTML or grabbing parts of the HTML using CSS or XPath selectors (which you can get from the Firefox Firebug plugin).
Docs here: http://code.whytheluckystiff.net/hpricot/
Also bear in mind that since HPricot in Ruby uses a C library to run faster, a specific version of it has been ported to JRuby; read this:
http://ola-bini.blogspot.com/2007/02/hpricot-goodness.html
Then you can do your parsing directly from JRuby, which is great, but you can also put the results inside JavaBeans and use them from Java.
If you are looking for flexibility and simplicity, HPricot is for you; if you are rather looking for seamless Java language integration, it's not (yet).
Regards,
Raphaël.
Re: How do you parse HTML in Java?
I have used Cobra:
Cobra is a pure Java HTML renderer and DOM parser that is being developed to support HTML 4, Javascript and CSS 2.
Cobra can be used as a Javascript-aware and CSS-aware HTML DOM parser, independently of the Cobra rendering engine. Javascript DOM modifications that occur during parsing (e.g. via document.write) will be reflected in the parsed DOM, unless Javascript is disabled.
http://html.xamjwg.org/cobra.jsp
Re: How do you parse HTML in Java?
I use Apache Axiom. Works very well for me.
http://ws.apache.org/commons/axiom/
Re: How do you parse HTML in Java?
There are too many parsers; I don't know which is better.
Re: How do you parse HTML in Java?
I use JTidy or NekoHTML. Both of them can serialize the result to DOM.
http://jtidy.sourceforge.net/
http://nekohtml.sourceforge.net/