RedHanded

Okay, Give Hpricot 0.2 a Go

by why in inspect

This time I’m giving a balloon out which can be used for quick testing.

http://balloon.hobix.com/hpricot

Or, if you want to install Hpricot 0.2:

gem install hpricot --source code.whytheluckystiff.net

So the Hpricot parser is basically complete. There’s still lots of fiddling ahead: it doesn’t handle Javascript whatsoever and it’s not yet as flexible as HTree. However, it does fix a lot of HTML that RubyfulSoup and the htmltools won’t.

Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and there’s some bogus end tags and old-style tables and break tags.

                     user     system      total        real
 hpricot:       10.515625   0.000000  10.515625 ( 10.610571)
 scrapi:        32.546875   0.093750  32.640625 ( 32.923535)
 htree:         56.609375   0.023438  56.632812 ( 57.096530)
 rubyfulsoup:   29.289062   0.046875  29.335938 ( 29.586510)
 mechanize:(*) 148.132812   1.101562 149.234375 (150.621922)
 htmltok:(*)    19.632812   0.007812  19.640625 ( 19.795446)

(*) These libs are a bit more primitive, focusing only on reading documents, no calls are given for modifying documents.

The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.
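The table above has the shape that Ruby's standard `Benchmark.bm` prints. A minimal sketch of such a harness follows, assuming each library is wrapped in a lambda that takes the HTML string. The two "parsers" here are regex stand-ins invented for illustration so the script runs without any of the gems installed; `compare_parsers` and `stand_in_parsers` are hypothetical names, not part of the original benchmark script.

```ruby
require 'benchmark'

# Stand-in "parsers" for illustration only. In a real comparison each
# lambda would call one of the libraries being measured.
def stand_in_parsers
  {
    'tag_scan'  => ->(html) { html.scan(/<[^>]+>/) },
    'tag_split' => ->(html) { html.split(/(?=<)/) }
  }
end

# Run every parser `runs` times over the same document and return a
# hash of label => Benchmark::Tms, printing the usual report as it goes.
def compare_parsers(html, parsers, runs = 50)
  results = {}
  width = parsers.keys.map(&:length).max + 2
  Benchmark.bm(width) do |bm|
    parsers.each do |label, parser|
      results[label] = bm.report("#{label}:") do
        runs.times { parser.call(html) }
      end
    end
  end
  results
end

# A small synthetic page with unclosed break tags, standing in for a
# real home page.
sample_html = '<html><body>' + ('<p class="posted">post</p><br>' * 200) + '</body></html>'
compare_parsers(sample_html, stand_in_parsers)
```

Feeding every parser the same in-memory string keeps network and disk time out of the measurement, so the numbers reflect parsing alone.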

I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, when the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.

Update: Thanks to assaf, I got scrapi working with libtidy and reporting back the right answers. Thankya! Update #2: An htmltokenizer benchmark.

said on

Excellent work. Who’s going to be the first to make a Ruby version of Pornolize then?

said on

Sounds great, I’ll give it a try on http://news.bbc.co.uk – RubyfulSoup handled that so slowly last time I tried.

said on

I don’t know what I’ll use it for… but I have a deep desire to play with this. Much love for the JQuery style expressions! They made me want to use JQuery… but I couldn’t figure out what I’d use it for.

said on

W00t! Balloon is truly useful!

said on

So, it works if I pass in an IO object to Hpricot.parse (from either a file or a url, like the balloon). It doesn’t work if I pass in a string (i.e., the name of the file to parse, like the example.)

said on

For scrapi, you’ll have to use Tidy for now. The non-tidy parser doesn’t deal with bad HTML, which is why I’m looking for an alternative that can clean HTML well and fast.

With today’s code drop you can do something like:
# Set it to use Tidy.
Scraper::Base.tidy_options({})

# Define a scraper.
boing_boing = Scraper.define do
  array :posts
  process "p.posted", :posts=>:node
  result :posts
end

# Scrape away!
puts boing_boing.scrape(html).size
said on

seg: Ohhh, you’re right. That example was totally wrong! Hpricot.parse takes an HTML string or an IO object containing HTML.
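To illustrate the string-or-IO distinction: a string argument is treated as the markup itself, not as a file name. Here is a minimal sketch of how a function can accept either form. `html_text` is a hypothetical helper written for this example and says nothing about how Hpricot handles it internally.

```ruby
require 'stringio'

# Hypothetical helper (not part of Hpricot): return the HTML text whether
# the caller passed a String of markup or an IO-like object.
def html_text(input)
  input.respond_to?(:read) ? input.read : input.to_s
end

puts html_text('<p>from a string</p>')
puts html_text(StringIO.new('<p>from an IO</p>'))
```

Passing a file name as a string would hand over the name itself as "markup", which matches the failure described above.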

assaf: Hurray, that works.

said on

Just to ask a really dumb question, but if I have an Element, how can I get the text found within the tag?

Thanks for the library and sorry for the question.

said on

For now, you’ll need to loop through the children of the Element. Some of those will be Text objects which have a content property containing the string.

The next version will have an innerHTML property on every element.
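The child-walking approach can be sketched like this. `TextNode` and `ElementNode` are stand-in structs made up for the example, not Hpricot's own classes; they only mirror the shape described above, where an element has children and text nodes carry their string in `content`.

```ruby
# Stand-ins for illustration; these are NOT Hpricot's classes.
TextNode = Struct.new(:content)
ElementNode = Struct.new(:name, :children)

# Walk a node's children recursively and join the content of every
# text node found along the way.
def collect_text(node)
  case node
  when TextNode then node.content
  when ElementNode then node.children.map { |child| collect_text(child) }.join
  else ''
  end
end

link = ElementNode.new('a', [TextNode.new('RedHanded')])
para = ElementNode.new('p', [TextNode.new('Read '), link, TextNode.new(' daily.')])
puts collect_text(para) # prints "Read RedHanded daily."
```

The recursion matters because text can sit inside nested tags; looping over only the direct children would miss the link text here.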

said on

Is there a reason there was no comparison with HTMLTokenizer? Or is that because it’s not even in the same ballpark as the rest?

said on
Okay, for anyone else searching for some straight example code from the Hpricot posts, here’s some that works:
wget http://redhanded.hobix.com/index.html

require 'rubygems'
require_gem 'hpricot'
require 'open-uri'
doc = Hpricot.parse(open("index.html"))
(doc/:p/:a).each do |link|
  p link.attributes
end

said on

Jerome: Okay. Htmltokenizer is pretty quick, but read-only. But I’m really glad you mentioned this one, because I could offer access to the Hpricot tokenizer, which would speed things up by literally an order of magnitude.

In fact, you can already get access to this by using Hpricot.scan.

 doc = Hpricot.scan(open("index.html")) do |token|
   p token
 end

Which gives you back:

 [:doctype, "html", {"system_id"=>"\"DTD/xhtml1-transitional.dtd\"", "publid_id"=>"PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "}, "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"DTD/xhtml1-transitional.dtd\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "html", {"xml:lang"=>"en", "lang"=>"en", "xmlns"=>"http://www.w3.org/1999/xhtml"}, "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "head", nil, "<head>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "meta", {"content"=>"text/html; charset=utf-8", "http-equiv"=>"Content-Type"}, "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"]
 [:text, "\n", nil, "\n"]
 [:stag, "title", nil, "<title>"]
 [:text, "RedHanded &raquo; sneaking Ruby through the system", nil, "RedHanded &raquo; sneaking Ruby through the system"]
 [:etag, "title", nil, "</title>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "link", {"href"=>"http://redhanded.hobix.com/index.xml", "title"=>"RSS", "rel"=>"alternate", "type"=>"application/rss+xml"}, "<link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://redhanded.hobix.com/index.xml\" />"]
 [:text, "\n", nil, "\n"]
 ...

Basically: (1) a symbol describing the element type, (2) the tag name or text content, (3) an attributes hash, and (4) the raw string which formed this token.

The scanning stage is easy. It’s the figuring out the layout of the document and coercing wellformedness that’s the spiny one.
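A small sketch of working with that four-part token shape. The tokens are hard-coded in the same layout as the output above so the example runs without Hpricot installed. `sample_tokens` and `attribute_values` are hypothetical helpers for illustration.

```ruby
# A few tokens in the shape shown above: [type, name_or_text, attributes, raw].
def sample_tokens
  [
    [:stag, 'head', nil, '<head>'],
    [:text, "\n", nil, "\n"],
    [:stag, 'title', nil, '<title>'],
    [:text, 'RedHanded', nil, 'RedHanded'],
    [:etag, 'title', nil, '</title>'],
    [:emptytag, 'link',
     { 'href' => 'http://redhanded.hobix.com/index.xml', 'rel' => 'alternate' },
     '<link rel="alternate" href="http://redhanded.hobix.com/index.xml" />']
  ]
end

# Collect one attribute's value from every start tag or empty tag
# with the given name.
def attribute_values(tokens, tag_name, attr_name)
  tokens.each_with_object([]) do |(type, name, attrs, _raw), found|
    next unless %i[stag emptytag].include?(type) && name == tag_name
    value = (attrs || {})[attr_name] # attrs is nil when a tag has no attributes
    found << value if value
  end
end

p attribute_values(sample_tokens, 'link', 'href')
```

Because the attributes slot is `nil` rather than an empty hash for bare tags, any consumer of the token stream needs a guard like the `attrs || {}` above.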

said on

anon: news.bbc.co.uk was broken (for me) in Hpricot 0.2, but it’s working in trunk. So is McSweeney’s (awful HTML.) More, more, any more really, really bad HTML sites I can use?

said on

Found a “bug”. The Scanner fails when it encounters <!---->

msg = “negative string size (or size too big) (ArgumentError)”

said on

hey Preview shows something else than the actual Comment! Well anyways, the scanner fails when it encounters an empty HTML Comment. See if this is right <!---->

said on

thomas: That little oddity is fixed in trunk. McSweeney’s has one of those suckers.

said on

Tried trunk, didn’t work.

 svn co https://code.whytheluckystiff.net/svn/hpricot/trunk hpricot
 cd hpricot
 rake install

“Successfully installed hpricot, version 0.2”

 require 'rubygems'
 require_gem 'hpricot', '>=0.2'

 doc = Hpricot.parse("<!>")

Fails .. missing something?

said on

sorry these comments are killing me, I should RTFM

said on

Oh, do:

 cd hpricot
 rake ragel
 rake install

You’ll need Ragel installed to build the new scanner.

said on

Thanks, that did it.

Not sure if that is of any use to you. But I needed it: http://rafb.net/paste/results/bVlGWd11.html

doc.get_elements_by_tag_name('h3').each { |tag| puts tag.inner_text }

said on

Great little library – love it. I’ve written a small extension for Test::Unit that lets you test your Rails views using hpricot instead of the clunky assert_tag function.

Hpricot Test Extension for Rails

said on

Clearly Hpricot is for HTML, but how might it fare with strict XML?

said on

In the benchmark I miss a comparison with ruby-libxml!

said on

so does this not work on windows?

said on

If you can parse this site then Hpricot is the magical!

said on

probablyCorey: Oh, wow, that is hideous. Three nested HTML pages. Hpricot does it, but I really don’t know what’s correct in this case.

said on

Any chance of making a “pure” ruby version?

said on

Binaries will be out in 0.5. Watch the map.

said on

I found a borken page for you. At least it broke Hpricot 0.3.

Broken

`build_node': [bug] unknown structure: [:xmlprocins, "@include(\"ocregister/includes/global/login_table.php\");", nil, nil] (Exception)
said on

will hpricot work in ruby 1.8.2?
