Why is Ruby so slow?

From: Venherm.Borchers@... (Venherm Borchers)
Date: 2002-03-18 18:41:31 UTC
List: ruby-talk #36142
WHY IS RUBY SO SLOW?

I implemented a _DataReader_ class in Ruby and Python. The reader:

  - reads in a CSV file, in this case tab-separated,
  - gets variable names from the header line,
  - splits up each row into single items,
  - checks for and counts missing values,
  - determines the type of the item - using regular expressions -
    (integer, float, or else classified as string), and
  - counts the number of unique items in each column

It finally outputs a short report on what it found. This result is
quite useful even if you later perform data mining tasks on the
data with other tools.
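
The checks listed above can be sketched for a single column roughly as
follows. This is only a minimal illustration, not the DataReader code
posted further down: the names classify, summarize, INTEGER_RE and
FLOAT_RE are hypothetical, the regular expressions are anchored with \A
and \z rather than ^ and $, and unique items are counted with a Hash,
which is a different technique from the array difference used in the
posted class.

```ruby
# Hypothetical helpers for illustration; not part of the posted DataReader.
INTEGER_RE = /\A\s*[+\-]?\d+\s*\z/
FLOAT_RE   = /\A\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*\z/

# Classify one item as :missing, :integer, :float or :string.
def classify(item, missing = '?')
  return :missing if item == missing
  return :integer if item =~ INTEGER_RE
  return :float   if item =~ FLOAT_RE
  :string
end

# Summarize one column in a single pass: widest type seen,
# number of distinct non-missing items, number of missing items.
def summarize(column, missing = '?')
  rank = { :missing => 0, :integer => 1, :float => 2, :string => 3 }
  seen = {}          # Hash keys act as a set of distinct items
  widest = :missing
  nmissing = 0
  column.each do |item|
    kind = classify(item, missing)
    nmissing += 1 if kind == :missing
    widest = kind if rank[kind] > rank[widest]
    seen[item] = true unless kind == :missing
  end
  { :type => widest, :unique => seen.size, :missing => nmissing }
end

p summarize(%w[1 2 2 ? 3.5])
```

For the sample column the summary reports the type :float, 3 unique
items and 1 missing value.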

The implementation is straightforward, with no attempt to optimize in
the first run. I tested it on a fairly large data file of 4.3 MB with
1.6 million data items, most of them integers.

Here are the running times for some available Ruby implementations
under Windows:

        ________data items______1,600,000_________320,000_______

Ruby 1.6.5-2                    17:10 min          46 sec
Ruby 1.6.6-0                    18:43 min          58 sec
Ruby 1.7.2 (i586-mswin32)       18:05 min          54 sec

As a comparison, I implemented the method in Python too, with the
following results:

Python 2.1.1 (Zope)                58 sec          10 sec
Python 2.2                         49 sec           9 sec
Active State Python 2.2a           49 sec          11 sec

I also tested the data with the _read.table_ function of the
public domain statistical package *R*, which has almost the same
functionality (in a way, I tried to model my reader on it):

R::read.table                      30 sec           2 sec

One can see that the Python implementation compares reasonably with
such a well-known package.  Unfortunately, the Ruby implementation of
the same method is *unacceptably* slow.
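
To put numbers on the gap, the timings from the tables above can be
turned into ratios. The sketch below is only arithmetic on the posted
figures; the names TIMINGS and slowdown are hypothetical, and Python 2.2
is picked as the baseline purely for illustration.

```ruby
# Timings in seconds, copied from the tables above
# (first value: 1,600,000 items, second value: 320,000 items).
TIMINGS = {
  'Ruby 1.6.5-2' => [17 * 60 + 10, 46],
  'Ruby 1.6.6-0' => [18 * 60 + 43, 58],
  'Ruby 1.7.2'   => [18 * 60 + 5, 54],
  'Python 2.2'   => [49, 9]
}

# Ratio of a run's time to the baseline's time, for both data sizes.
def slowdown(name, baseline = 'Python 2.2')
  big, small = TIMINGS[name]
  base_big, base_small = TIMINGS[baseline]
  [big.to_f / base_big, small.to_f / base_small]
end

TIMINGS.each_key do |name|
  next if name == 'Python 2.2'
  big, small = slowdown(name)
  printf("%-14s %5.1fx (large)  %4.1fx (small)\n", name, big, small)
end
```

The data grows by a factor of 5 between the two columns. The Ruby 1.6.5
time grows by about 22 (1030 sec / 46 sec), while the Python 2.2 time
grows by about 5.4 (49 sec / 9 sec).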

I have earlier experience with a text analysis task in which I
split some 5,000 news messages into words and then counted and stored
these words for retrieval and for determining similarity between the
news articles.

Ruby was 20-30% slower than Python in this task, which I could really
accept because Ruby is such a nice language. But the time differences
above will kill my project, I'm afraid.

The tests were done on a 1.1 GHz Pentium III PC under Windows 2000 and
with 512 MB main memory. I didn't try Linux for that because the final
application has to run under MS Windows anyway.

So for me the question remains: Why is Ruby so unbelievably slow (5 to
20 times slower than Python, or more) in this task -- esp. for larger
data sets?
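
One way to narrow the question down would be to time each step
separately. The sketch below does that on synthetic data with the
standard Benchmark module. It makes no assumption about which step is
the expensive one, the data sizes are arbitrary, and Array#uniq stands
in for the unique count instead of the array difference in the posted
class.

```ruby
require 'benchmark'

# Synthetic stand-in for the real file: tab-separated integer rows.
# 2,000 rows of 32 columns is an arbitrary size that keeps the run short.
rows = Array.new(2_000) { |i| Array.new(32) { |j| (i * j) % 97 }.join("\t") }

# Step 1: split each line into fields.
fields = nil
split_time = Benchmark.realtime { fields = rows.map { |line| line.split("\t") } }

# Step 2: turn the rows into columns.
columns = nil
transpose_time = Benchmark.realtime { columns = fields.transpose }

# Step 3: match every item against the integer pattern.
int_count = 0
regex_time = Benchmark.realtime do
  columns.each do |col|
    col.each { |item| int_count += 1 if item =~ /\A\s*[+\-]?\d+\s*\z/ }
  end
end

# Step 4: count the distinct items per column.
uniques = nil
unique_time = Benchmark.realtime { uniques = columns.map { |col| col.uniq.size } }

printf("split:     %.3f s\n", split_time)
printf("transpose: %.3f s\n", transpose_time)
printf("regex:     %.3f s\n", regex_time)
printf("unique:    %.3f s\n", unique_time)
```

Running the same measurement at two data sizes shows which step grows
faster than the data does.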

Many thanks,  Hans Werner.
______________________________________________________________________

Loading data set test.dat...
10001 rows loaded, of required length 32.
2.824 secs needed.

  0              Id:	TYPE Integer (10000 items, 0 missing).
  1              V1:	TYPE Set (2 items, 0 missing).
  2              V2:	TYPE Integer (75 items, 0 missing).
  3              V3:	TYPE Set (2 items, 0 missing).
  4              V4:	TYPE Set (6 items, 0 missing).
  5              V5:	TYPE Integer (885 items, 0 missing).
  6              V6:	TYPE Integer (467 items, 0 missing).
  7              V7:	TYPE Integer (402 items, 0 missing).
  8              V8:	TYPE Set (9 items, 0 missing).
  9              V9:	TYPE Integer (19 items, 0 missing).
 10             V10:	TYPE Integer (70 items, 0 missing).
 11             V11:	TYPE Integer (1653 items, 0 missing).
 12             V12:	TYPE Integer (1316 items, 0 missing).
 13             V13:	TYPE Integer (52 items, 0 missing).
 14             V14:	TYPE Set (6 items, 0 missing).
 15             V15:	TYPE Set (2 items, 0 missing).
 16             V16:	TYPE Integer (29 items, 0 missing).
 17             V17:	TYPE Integer (49 items, 0 missing).
 18             V18:	TYPE Integer (69 items, 0 missing).
 19             V19:	TYPE Integer (13 items, 0 missing).
 20             V20:	TYPE Set (11 items, 0 missing).
 21             V21:	TYPE Set (9 items, 0 missing).
 22             V22:	TYPE Integer (15 items, 0 missing).
 23             V23:	TYPE Integer (19 items, 0 missing).
 24             V24:	TYPE Set (10 items, 0 missing).
 25             V25:	TYPE Integer (15 items, 0 missing).
 26             V26:	TYPE Set (12 items, 0 missing).
 27             V27:	TYPE Integer (17 items, 0 missing).
 28             V28:	TYPE Integer (15 items, 0 missing).
 29             V29:	TYPE Integer (25 items, 0 missing).
 30             V30:	TYPE Set (2 items, 0 missing).
 31          Target:	TYPE Set (2 items, 0 missing).

48.655 secs needed.
______________________________________________________________________

module CSV

def parse_line(line, sep="\t", missing='?', comment='#')
    line.chomp!
    # line[0] is a Fixnum in Ruby 1.6, so compare a one-character substring
    if line == '' or line[0, 1] == comment
        fields  = []
        nfields = 0
    else
        fields  = line.split(sep)
        nfields = fields.length
    end
    return nfields, fields
end

end #module

### --  c l a s s  DataReader  ---------------------------------------

class DataReader

include CSV

def initialize(fname, header=true, sep="\t", missing="?", comment="#")
### ------------------------------------------------
    @fname    = fname;
    @header   = header;         @hfields = []
    @dtypes   = [];             @dfields = []
    @nrows    = 0;              @ncols   = 0
    @sep      = sep;            @missing  = missing
    @comment  = comment
### ------------------------------------------------
end

def load(logging=false)
    t1 = Time.now
    if logging
        puts
        puts "---------------------------------------------- LOADING DATA ----"
        puts "Loading data set #{@fname}..."
    end
    csvFile = File.open(@fname, 'r')

    if @header
        @ncols, @hfields = parse_line(csvFile.gets, \
                            sep=@sep, missing=@missing, comment=@comment)
    else
        raise "Not Implemented Error."
    end
    @row = []; @col = []
    @row[0] = @hfields
    (0...@ncols).each { |j| @col << [] }

    no_short = 0;  no_long = 0
    ln_short = []; ln_long = []

    n = 0
    while line = csvFile.gets
        n += 1
        m, fields = parse_line(line, \
                            sep=@sep, missing=@missing, comment=@comment)
        if m == 0 then next end
        # fill row up with NA character or cut if too long
        if m < @ncols
            no_short +=1; ln_short << n+1
            (@ncols - m).times { fields << @missing }
        elsif m > @ncols
            no_long += 1; ln_long << n+1
            fields = fields[0...@ncols]
        end

        @row[n] = fields
        (0...@ncols).each { |j| @col[j] << fields[j] }
    end
    csvFile.close
    @nrows = @row.size

    t2 = Time.now
    if logging
        puts "#{@nrows} rows loaded, of required length #{@ncols}."
        if no_short > 0
            puts "#{no_short} rows too short: #{ln_short[0]}, ..."
        end
        if no_long > 0
            puts "#{no_long} rows too long: #{ln_long[0]}, ..."
        end
        puts "#{t2 - t1} secs needed."
        puts
    end
    
end

def prelyze(logging=false, missing=@missing)
    t1 = Time.now
    dtypes = {0 => 'NA', 1 => 'Integer', 2 => 'Continuous',
              3 => 'String', 4 => 'Set'}
    @dtypes = []
    for j in (0...@ncols) do
        ctype = 0; mitms = 0
        @col[j].each { |item|
            if item == missing
                ctype = [ctype, 0].max
                mitms += 1
            elsif item =~ /^\s*[+\-]?\d+\s*$/
                ctype = [ctype, 1].max
            elsif item =~ /^\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*$/
                ctype = [ctype, 2].max
            else
                ctype = [ctype, 3].max
            end
        }

        nitms = (@col[j]-['']).nitems
        if 0 < nitms and nitms <= 12 and nitms <= 0.1*(@nrows-mitms) then ctype = 4 end
        ctype = dtypes[ctype]
        @dtypes << ctype

        if logging
            puts "#{j.to_s.rjust(3)} #{(@row[0][j]).rjust(15)}:\tTYPE #{ctype} (#{nitms} items, #{mitms} missing)."
        end
    end

    t2 = Time.now
    if logging
        puts
        puts "#{t2 - t1} secs needed."
        puts "----------------------------------------------------------------"
        puts "    Copyright (C) 2001, Data Mining Center."
        puts
    end
    
end

### --  accessor functions --

attr_reader :nrows, :ncols
attr_reader :dtypes

def nrow(); @nrows; end
def ncol(); @ncols; end
def hfields(); @row[0]; end
def [](i, j); @row[i][j]; end
def col(j); @col[j]; end
def row(i); @row[i]; end

end #class

### --  m a i n ( )  ------------------------------------------------#

    tData = DataReader.new("test2.dat", header=true, \
                sep="\t", missing="", comment="%")
    tData.load(logging=true)
    tData.prelyze(logging=true)
