
How do you parse HTML in Java?

At 6:58 AM on Jan 16, 2008, Geertjan wrote:

The Open Source HTML Parsers in Java page is useful in listing the HTML parsers that are out there. But it doesn't give much of a clue about which are the "best" in a given situation. In other words, how should one decide which HTML parser to use? And doesn't the proliferation of HTML parsers out there imply that there is something wrong with the JDK's own HTML parser, javax.swing.text.html.HTMLEditorKit.Parser?

All things being equal, shouldn't one prefer to use a utility provided by the JDK over one provided by a third party library? (For this reason, I'm assuming that all things are not equal in this case.) I've been parsing HTML using the JDK's HTML parser, based on the approach described in Parsing HTML with Swing , although that's an article written in 2003, so it may be dated. The author of that article points to this weakness of the Swing HTML parser, at least, at the time of writing, back in 2003: "The biggest downside to this HTML parser is that it is not thread safe (thread safety has always been a problem with Swing components). This HTML processor is no different. I have used the Swing parser in heavily threaded environments, and it has resulted in a crash—eventually. If you want to use this HTML processor in a heavily threaded environment, you need to take steps to ensure that only one thread uses it at a time."

Is that the only weakness here? (By the way, on the positive side, the author writes: "I have used this parser with a number of programs that I have written, and I have found it to be very useful. It is particularly helpful for handling improperly formatted HTML, which can trip up some HTML parsers.") I guess the other HTML parsers may have additional features, those that relate to transformation in addition to parsing. And the other parsers probably allow for walking the DOM, rather than inspecting tags in the way that the Swing HTML Parser does. I have used JTidy before, but didn't find the benefits to outweigh the cumbersomeness of having to deal with a third party library.
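
For anyone who hasn't seen the callback style of the Swing parser, here is a minimal sketch of it (not the code from the article). It assumes `ParserDelegator` as the entry point, and the class name `SwingLinkExtractor` is made up for illustration. It collects `href` values from anchor tags as the parser reports them, rather than building a DOM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class SwingLinkExtractor {

    // Returns the href of every <a> start tag, in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser pushes events at this callback; nothing is kept in a tree.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>No closing p tag here"
                + "<a href=\"http://example.com/a\">A</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Since everything happens inside the callback, there is no tree to query afterwards; if XPath-style lookups are what you need, this style is probably the wrong fit.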

Anyone care to share their experiences with these utilities?

1 . At 9:11 AM on Jan 16, 2008, Aaron Bonner wrote:

Re: How do you parse HTML in Java?

I use the HTMLParser and HTMLLexer libraries

http://htmlparser.sourceforge.net/

It has a somewhat unique API but it's very usable and wonderful for parsing sometimes woefully formatted HTML.
2 . At 9:42 AM on Jan 16, 2008, Jilles van Gurp wrote:

Re: How do you parse HTML in Java?

Depends on what you want to do with it. And you should use what works best of course. I generally end up using lots of custom libraries on any Java project (much better than reinventing the wheel yourself).

Basically in the few cases that I had to process HTML, I wanted to be able to treat it like a DOM tree so I ended up using jtidy. This allowed me to use simple xpath to extract bits and pieces out of the html.
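
Roughly what that looks like, as a sketch using only the JDK's own DOM and XPath classes. It assumes the markup has already been cleaned into well-formed XML by a tidying step (not shown here, and no JTidy calls are used); the class name, sample markup, and expressions are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathOverTidiedHtml {

    // Evaluates an XPath expression and returns the text of each matching node.
    // Assumption: the input is already well-formed (the tidy step is not shown).
    static List<String> select(String wellFormedHtml, String expression) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                    builder.parse(new InputSource(new StringReader(wellFormedHtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query input", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p class=\"intro\">Hello "
                + "<abbr title=\"World Wide Web\">WWW</abbr></p></body></html>";
        System.out.println(select(html, "//abbr/@title"));
    }
}
```

The whole document is held in memory as a tree here, which is where the cost comes from on large pages.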

Main issues:
- JTidy seems not to have been maintained for several years; I ended up pulling an unreleased version from the version repository with several quite essential fixes
- it is quite slow
- HTML documents can be quite big, and processing large DOM trees can be expensive
- JTidy was dropping tags it didn't understand (e.g. abbr, which is legal HTML and commonly used in microformats)

I don't see how thread safety is essential for a parser. Making things thread safe is generally quite expensive, and unless you explicitly need to mess with the AST from multiple threads, you should not even be trying to do that.
3 . At 9:59 AM on Jan 16, 2008, Alasdair Gilmour wrote:

Re: How do you parse HTML in Java?

I have used and been fairly happy with the Jericho HTML Parser:

http://jerichohtml.sourceforge.net/doc/index.html

Open Source, forgiving of badly formatted HTML (e.g. missing end tags etc), more lightweight than something DOM-based

Alasdair
4 . At 10:41 AM on Jan 16, 2008, jacklty wrote:

Re: How do you parse HTML in Java?

For complex html querying, I use tagsoup ( http://home.ccil.org/~cowan/XML/tagsoup/ ) to fix the html into a proper xhtml form first, then I can use xpath or xslt to retrieve information I need. You can even have the xhtml displayed on screen with flying saucer.
5 . At 12:19 PM on Jan 16, 2008, Geertjan wrote:

Re: How do you parse HTML in Java?

What's the reason for all of you not using the JDK's HTML parser? (1) Lack of knowledge about its existence, (2) Frustration with it, (3) Doesn't matter that it exists, since there are great to excellent 3rd party HTML parsers out there anyway, (4) Something else?
6 . At 12:48 PM on Jan 16, 2008, Will Hartung wrote:

Re: How do you parse HTML in Java?

...and as a corollary, I wanted a streaming parser that would basically let me apply a filter on tags as they were being shot out the socket to the client.

I inevitably ended up writing my own for that.
7 . At 2:36 PM on Jan 16, 2008, Tony Childs wrote:

Re: How do you parse HTML in Java?

> (thread safety has always been a problem with Swing
> components).

Problem? I wouldn't say Thread safety is a problem in Swing unless you don't use the components correctly. The idea is thread confinement. If Thread safety were added to all Swing components, the API would be too slow to use.
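
To illustrate what thread confinement means outside of Swing itself, here is a small sketch: a plain `HashMap` (not thread-safe) is only ever touched by one owner thread, and other threads hand work to that thread instead of taking locks. The class and method names are hypothetical. Swing does the analogous thing by funnelling component access through the event dispatch thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical class, for illustration only.
class ConfinedTagCounter {

    // Not thread-safe on its own; safe here because only the owner thread touches it.
    private final Map<String, Integer> counts = new HashMap<>();

    // The single owner thread. Tasks run one at a time, in submission order.
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Callable from any thread: the update itself runs on the owner thread.
    void record(String tag) {
        owner.execute(() -> counts.merge(tag, 1, Integer::sum));
    }

    // Reads also go through the owner thread, so they see all earlier updates.
    int count(String tag) {
        Callable<Integer> read = () -> counts.getOrDefault(tag, 0);
        try {
            return owner.submit(read).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        owner.shutdown();
    }

    // Several threads record the same tag concurrently; returns the final count.
    static int demo(int threads, int perThread) {
        ConfinedTagCounter counter = new ConfinedTagCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.record("a");
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
            return counter.count("a");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            counter.shutdown();
        }
    }
}
```

The map itself needs no synchronization because it never leaves its owner thread; the only synchronized piece is the executor's task queue.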
8 . At 8:43 PM on Jan 17, 2008, Raphael Valyi wrote:

Re: How do you parse HTML in Java?

You can also use HPricot from JRuby; it works great. HPricot is great for handling invalid HTML or grabbing parts of the HTML using CSS or XPath selectors (which you can get from the Firefox Firebug plugin).

doc here: http://code.whytheluckystiff.net/hpricot/

Also bear in mind that since HPricot in Ruby uses a C library to run faster, a specific version of it has been ported to JRuby. Read this:
http://ola-bini.blogspot.com/2007/02/hpricot-goodness.html

Then you can deal with your parsing directly from JRuby which is great, but you can also put results inside javabeans and use them back from Java.

If you are looking for flexibility and simplicity, HPricot is for you; if you are rather looking for seamless integration with the Java language, it's not (yet).

Regards,

Raphaël.
9 . At 5:44 AM on Jan 18, 2008, Ivica Loncar wrote:

Re: How do you parse HTML in Java?

I have used Cobra:

Cobra is a pure Java HTML renderer and DOM parser that is being developed to support HTML 4, Javascript and CSS 2.

Cobra can be used as a Javascript-aware and CSS-aware HTML DOM parser, independently of the Cobra rendering engine. Javascript DOM modifications that occur during parsing (e.g. via document.write) will be reflected in the parsed DOM, unless Javascript is disabled.


http://html.xamjwg.org/cobra.jsp
10 . At 8:52 AM on Jan 18, 2008, Derek Smith wrote:

Re: How do you parse HTML in Java?

I use Apache Axiom.
Works very well for me.

http://ws.apache.org/commons/axiom/
http://www.programmersbible.com - Online resource for programmers
11 . At 6:54 AM on Jan 20, 2008, patence wrote:

Re: How do you parse HTML in Java?

There are too many parsers; I don't know which is better.
12 . At 8:09 AM on Jan 21, 2008, yves zoundi wrote:

Re: How do you parse HTML in Java?

I use JTidy or NekoHTML. Both of them can serialize the result to DOM.

http://jtidy.sourceforge.net/
http://nekohtml.sourceforge.net/
