[SUMMARY] Gathering Ruby Quiz 2 Data (#189)

From: Daniel Moore <yahivin@...>
Date: 2009-01-31 19:36:35 UTC
List: ruby-talk #326410
This quiz was an exercise in web scraping
[http://en.wikipedia.org/wiki/Web_scraping]. As more and more
information becomes available on the internet, it is useful to have a
programmatic way to access it. This can be done through web APIs, but
not every website offers one, and those that do may not expose all of
their information through it. Scraping may be against the terms of use
for some sites, and smaller sites may suffer if large amounts of data
are pulled, so be sure to ask permission and be prudent!

The one solution to this week's quiz comes from Peter Szinek, using
scRUBYt [http://scrubyt.org/]. Despite being just over fifty lines
long, there is a lot packed in here, so let's dive in.

We begin by setting up a scRUBYt Extractor and telling it to fetch the
main Ruby Quiz 2 page.

  # Scrape the stuff with scRUBYt!
  data = Scrubyt::Extractor.define do
    fetch 'http://splatbang.com/rubyquiz/'

The 'quiz' pattern sets up a node in the XML document for each element
that matches the XPath. Here the XPath yields all the links in the side
area, that is, the links to all the quizzes.

    quiz "//div[@id='side']/ol/li/a[1]" do
      link_url do
        quiz_id /id=(\d+)/
        quiz_link /id=(.+)/ do

These next two sections download the description and summary for each
quiz. They are saved into temporary files to be loaded into the XML
document at the end. Notice the use of lambda: it takes in the match
from /id=(.+)/ in the quiz_link. For example, when the link is
'quiz.rhtml?id=157_The_Smallest_Circle', the pattern captures
'157_The_Smallest_Circle' and passes it to the lambda, which returns
"http://splatbang.com/rubyquiz/157_The_Smallest_Circle/quiz.txt",
the URL of the quiz text. The summary is gathered in the same way.

          quiz_desc_url(lambda {|quiz_dir| "http://splatbang.com/rubyquiz/#{quiz_dir}/quiz.txt"},
                        :type => :script) do
            quiz_dl 'descriptions', :type => :download
          end
          quiz_summary_url(lambda {|quiz_dir| "http://splatbang.com/rubyquiz/#{quiz_dir}/summ.txt"},
                           :type => :script) do
            quiz_dl 'summaries', :type => :download
          end
        end
      end
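
The capture-then-build step can be tried on its own, outside scRUBYt.
This is only a sketch in plain Ruby: the href is the example from the
paragraph above, and the variable names (href, quiz_dir, desc_url,
summary_url) are made up here for illustration.

```ruby
# Example href taken from the text above; in the real run scRUBYt supplies it.
href = 'quiz.rhtml?id=157_The_Smallest_Circle'

# Same pattern as quiz_link: capture everything after "id=".
quiz_dir = href[/id=(.+)/, 1]

# Same pattern as quiz_id: capture only the leading digits.
quiz_id = href[/id=(\d+)/, 1]

# The two lambdas from the extractor, pulled out so they can be called directly.
desc_url    = lambda { |dir| "http://splatbang.com/rubyquiz/#{dir}/quiz.txt" }
summary_url = lambda { |dir| "http://splatbang.com/rubyquiz/#{dir}/summ.txt" }

puts quiz_id
puts desc_url.call(quiz_dir)
puts summary_url.call(quiz_dir)
```

Inside the extractor, scRUBYt does the calling; the :type => :script
option is what tells it the argument is code to run rather than a
pattern to match.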

This next part gets all the solutions for each quiz. It follows the
link_url from the side area. Once on the new page, it creates a node
for each solution, again using XPath to get all the links in the
list on the side. It populates each solution with an author (the text
of the HTML anchor tag) and a ruby_talk_reference (the href attribute
of the tag). To get the solution text, it follows (resolves) the link
and returns the text within the "//pre[1]" element, again specified
with XPath. The text node is added as a child node of the solution.

      quiz_detail :resolve => "http://splatbang.com/rubyquiz" do
        solution "/html/body/div/div[2]/ol/li/a" do
          author lambda {|solution_link_text| solution_link_text}, :type => :script
          ruby_talk_reference "href", :type => :attribute
          solution_detail :resolve => :full do
            text "//pre[1]"
          end
        end
      end

This select_indices limits the scope of the quiz gathering to just the
first three quizzes, which is useful for testing, since we don't want
to traverse the entire site just to see whether the code works. I
removed it when gathering the full dataset.

    end.select_indices(0..2)
  end
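
For a feel of what the 0..2 range selects, the same Range applied to a
plain Ruby Array behaves the same way: indices 0, 1 and 2, i.e. the
first three items. The list below is invented for illustration and has
nothing to do with scRUBYt's internals.

```ruby
# Made-up stand-ins for the quiz nodes the extractor would yield.
quizzes = %w[quiz_a quiz_b quiz_c quiz_d quiz_e]

# An inclusive range 0..2 covers indices 0, 1 and 2.
first_three = quizzes[0..2]

p first_three
```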

This next part, using Nokogiri, loads the files that were saved
temporarily and inserts them into the XML document. It also removes
the link_url nodes to clean up the final output to match the output
specified in the quiz.

  result = Nokogiri::XML(data.to_xml)

  (result/"//quiz").each do |quiz|
    quiz_id = quiz.text[/\s(\d+)\s/,1].to_i
    file_index = quiz_id > 157 ? "_#{(quiz_id - 157)}" : ""
    (quiz/"//link_url").first.unlink

    desc = Nokogiri::XML::Element.new("description", quiz.document)
    desc.content = open("descriptions/quiz#{file_index}.txt").read
    quiz.add_child(desc)

    summary = Nokogiri::XML::Element.new("summary", quiz.document)
    summary.content = open("summaries/summ#{file_index}.txt").read
    quiz.add_child(summary)
  end
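
The file_index expression is easier to follow with a few values run
through it. Below is the same arithmetic as a standalone sketch; the
helper name file_index_for and the sample strings are hypothetical, and
the real node text comes from the XML that scRUBYt produced.

```ruby
# Hypothetical helper wrapping the two lines from the loop above.
def file_index_for(quiz_text)
  # First run of digits that has whitespace on both sides, as an Integer.
  quiz_id = quiz_text[/\s(\d+)\s/, 1].to_i
  # Quiz 157 maps to no suffix; later quizzes get _1, _2, ...
  quiz_id > 157 ? "_#{quiz_id - 157}" : ""
end

# Invented sample node texts, each holding an id padded with whitespace.
[" 157 ", " 158 ", " 160 "].each do |text|
  puts "descriptions/quiz#{file_index_for(text)}.txt"
end
```

So the loop expects the first downloaded description in quiz.txt, the
next in quiz_1.txt, and so on, with the summaries following the same
scheme under summaries/.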

And finally, save the result to an XML file on the filesystem:

  open("ruby_quiz_archive.xml", "w") {|f| f.write result}

This was my first experience with scRUBYt, and it took me a little
while to "get it". It packs a lot of power into a concise syntax and
is definitely worth considering for your next web scraping project.

-- 
-Daniel
http://rubyquiz.strd6.com
