Extract HTML Structure From Page With Simple Script

I often want to extract parts of an HTML page, and I usually end up parsing the document with Nokogiri and pulling out the pieces with a CSS selector. It’s pretty straightforward to write from scratch.

To follow the Don’t Repeat Yourself principle, I created a script that turns this into a one-liner. Just run: ruby parsepage.rb [url] [css_selector]

Feel free to use and modify the code (which I put on GitHub so you can fork and pull from it): http://gist.github.com/199558

I created the script because I wanted to extract the headings from a recent article. You could also use it to create a table of contents or something similar.

Usage examples


$ U=http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/
$ parsepage.rb $U "h3" --html-list
#=>
  <li>1. Form Labels Work Best Above The Field</li>
  <li>2. Users Focus On Faces</li>
  <li>3. Quality Of Design Is An Indicator Of Credibility</li>
  <li>4. Most Users Do Not Scroll</li>
  <li>5. Blue Is The Best Color For Links</li>
  <li>6. The Ideal Search Box Is 27-Characters Wide</li>
  <li>7. White Space Improves Comprehension</li>
  <li>8. Effective User Testing Doesn’t Have To Be Extensive</li>
  <li>9. Informative Product Pages Help You Stand Out</li>
  <li>10. Most Users Are Blind To Advertising</li>
  <li>Bonus: Findings From Our Case-Studies</li>
  <li>Other Resources</li>
  <li>Sponsors</li>
  <li>Smashing Links</li>
  <li>Stay in touch</li>
  <li>Popular Posts</li>
  <li>All Posts</li>
  <li>Blogroll</li>

$ parsepage.rb $U "h1" --text-only
#=>
  Smashing Magazine we smash you with the information that will make your life easier. really.
  10 Useful Usability Findings and Guidelines

$ parsepage.rb $U "h1:first-child"
#=>
<h1 class="title"><a href="http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/" rel="bookmark" title="10 Useful Usability Findings and Guidelines">10 Useful Usability Findings and Guidelines</a></h1>

Let me know if you find it useful. I’ll happily update the script with any general contributions.

3 Responses to “Extract HTML Structure From Page With Simple Script”

  1. Spyros Says:

    Thanx for the script, it will definitely come in handy.

  2. TechnoKyle Says:

    Cool! Definitely useful if you want to copy the site design by knowing the structure of the page

  3. Scott Roberts Says:

    Now, will this simply make the bulleted list shown above, or give you the underlying HTML code behind it?