
Extract HTML Structure From Page With Simple Script

I often want to extract parts of an HTML page. I usually end up parsing the document with Nokogiri and pulling out the parts I need with a CSS selector. It’s pretty straightforward to do from scratch.

To follow the Don’t Repeat Yourself principle, I created a script to make it a one-liner. Just run

ruby parsepage.rb [url] [css_selector]

Feel free to use and modify the code (which I put on GitHub so you can fork and pull from it): http://gist.github.com/199558

I created the script because I wanted to extract headings from a recent article. You could also use it for creating a table of contents or similar.

Usage examples

$ U=http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/

$ parsepage.rb $U "h3" --html-list
#=>
<li>1. Form Labels Work Best Above The Field<li>
<li>2. Users Focus On Faces<li>
<li>3. Quality Of Design Is An Indicator Of Credibility<li>
<li>4. Most Users Do Not Scroll<li>
<li>5. Blue Is The Best Color For Links<li>
<li>6. The Ideal Search Box Is 27-Characters Wide<li>
<li>7. White Space Improves Comprehension<li>
<li>8. Effective User Testing Doesn’t Have To Be Extensive<li>
<li>9. Informative Product Pages Help You Stand Out<li>
<li>10. Most Users Are Blind To Advertising<li>
<li>Bonus: Findings From Our Case-Studies<li>
<li>Other Resources<li>
<li>Sponsors<li>
<li>Smashing Links<li>
<li>Stay in touch<li>
<li>Popular Posts<li>
<li>All Posts<li>
<li>Blogroll<li>

$ parsepage.rb $U "h1" --text-only
#=> Smashing Magazine we smash you with the information that will make your life easier. really. 10 Useful Usability Findings and Guidelines

$ parsepage.rb $U "h1:first-child"
#=> <h1 class="title"><a href="http://www.smashingmagazine.com/2009/09/24/10-useful-usability-findings-and-guidelines/" rel="bookmark" title="10 Useful Usability Findings and Guidelines">10 Useful Usability Findings and Guidelines</a></h1>

Let me know if you find it useful. I’ll happily update the script with any general contributions.

3 Responses to “Extract HTML Structure From Page With Simple Script”

  1. Spyros Says:

    Thanx for the script, it will definitely come in handy.

  2. TechnoKyle Says:

    Cool! Definitely useful if you want to copy the site design by knowing the structure of the page

  3. Scott Roberts Says:

    Now, will this simply make the bulleted list shown above, or give you the underlying HTML code behind it?