
Extract HTML Structure From Page With Simple Script

I often want to extract parts of an HTML page. I usually end up parsing the document with Nokogiri and pulling out the pieces with a CSS selector. It’s pretty straightforward to do from scratch.

To follow the Don’t Repeat Yourself principle, I created a script to make it a one-liner. Just run

ruby parsepage.rb [url] [css_selector]

Feel free to use and modify the code (which I put on GitHub so you can fork and pull from it): http://gist.github.com/199558

I created the script because I wanted to extract the headings from a recent article. You could also use it to create a table of contents or something similar.
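For the table-of-contents idea, the extracted heading texts need to be wrapped in list markup. The following is a rough sketch, assuming the headings are already available as plain strings. `html_list` is a hypothetical helper, not the script's own `--html-list` implementation. It closes each item with `</li>` and escapes special characters.

```ruby
require "cgi"

# Hypothetical helper: wrap each heading text in a closed <li> element
# and surround the items with <ul> ... </ul>.
def html_list(headings)
  items = headings.map { |text| "  <li>#{CGI.escapeHTML(text)}</li>" }
  ["<ul>", *items, "</ul>"].join("\n")
end

puts html_list([
  "1. Form Labels Work Best Above The Field",
  "2. Users Focus On Faces"
])
```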

Usage examples

$ U=http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/
$ parsepage.rb $U "h3" --html-list
#=>
<li>1. Form Labels Work Best Above The Field<li>
<li>2. Users Focus On Faces<li>
<li>3. Quality Of Design Is An Indicator Of Credibility<li>
<li>4. Most Users Do Not Scroll<li>
<li>5. Blue Is The Best Color For Links<li>
<li>6. The Ideal Search Box Is 27-Characters Wide<li>
<li>7. White Space Improves Comprehension<li>
<li>8. Effective User Testing Doesn’t Have To Be Extensive<li>
<li>9. Informative Product Pages Help You Stand Out<li>
<li>10. Most Users Are Blind To Advertising<li>
<li>Bonus: Findings From Our Case-Studies<li>
<li>Other Resources<li>
<li>Sponsors<li>
<li>Smashing Links<li>
<li>Stay in touch<li>
<li>Popular Posts<li>
<li>All Posts<li>
<li>Blogroll<li>

$ parsepage.rb $U "h1" --text-only
#=> Smashing Magazine we smash you with the information that will make your life easier. really. 10 Useful Usability Findings and Guidelines

$ parsepage.rb $U "h1:first-child"
#=> <h1 class="title"><a href="http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/" rel="bookmark" title="10 Useful Usability Findings and Guidelines">10 Useful Usability Findings and Guidelines</a></h1>

Let me know if you find it useful. I’ll happily update the script with any general contributions.

3 Responses to “Extract HTML Structure From Page With Simple Script”

  1. Spyros Says:

    Thanks for the script, it will definitely come in handy.

  2. TechnoKyle Says:

    Cool! Definitely useful if you want to copy the site design by knowing the structure of the page

  3. Scott Roberts Says:

    Now, will this simply make the bulleted list shown above, or give you the underlying HTML code behind it?