[#6143] — Christophe Poucet <christophe.poucet@...>

Hello,

17 messages 2005/10/04
[#6147] Re: patch.tgz — nobu.nokada@... 2005/10/04

Hi,

[#6199] Kernel rdoc HTML file not being created when rdoc is run on 1.8.3 — James Britt <ruby@...>

When 1.8.3 came out, I grabbed the source and ran rdoc on it. After

9 messages 2005/10/08

[#6251] RubyGems, upstream releases and idempotence of packaging — Mauricio Fern疣dez <mfp@...>

[sorry for the very late reply; I left this message in +postponed and forgot

14 messages 2005/10/12

[#6282] Wilderness: Need Code to invoke ELTS_SHARED response — "Charles E. Thornton" <ruby-core@...>

Testing the My Object Dump and I am trying to cause creation

13 messages 2005/10/14
[#6283] Re: Wilderness: Need Code to invoke ELTS_SHARED response — Mauricio Fern疣dez <mfp@...> 2005/10/14

On Fri, Oct 14, 2005 at 05:04:59PM +0900, Charles E. Thornton wrote:

[#6288] Re: Wilderness: Need Code to invoke ELTS_SHARED response — "Charles E. Thornton" <ruby-core@...> 2005/10/14

Mauricio Fern疣dez wrote:

[#6365] Time for built-in Rational and Complex classes? — Gavin Sinclair <gsinclair@...>

There has been some support for, but no comment on, RCR #260 ("Make

12 messages 2005/10/24
[#6366] Re: Time for built-in Rational and Complex classes? — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/24

On Mon, 24 Oct 2005, Gavin Sinclair wrote:

[#6405] Re: [PATCH] Pathname.exists?() — "Berger, Daniel" <Daniel.Berger@...>

12 messages 2005/10/25
[#6406] Re: [PATCH] Pathname.exists?() — TRANS <transfire@...> 2005/10/25

On 10/25/05, Berger, Daniel <Daniel.Berger@qwest.com> wrote:

[#6408] Re: [PATCH] Pathname.exists?() — Gavin Sinclair <gsinclair@...> 2005/10/25

On 10/26/05, TRANS <transfire@gmail.com> wrote:

[#6442] Wilderness: I Have formatted README.EXT into an HTML Document — "Charles E. Thornton" <ruby-core@...>

I have taken README.EXT (English Version Only) and have reformatted

14 messages 2005/10/27

[#6469] csv.rb a start on refactoring. — Hugh Sasse <hgs@...>

For a database application I found using CSV to be rather slow.

50 messages 2005/10/28
[#6470] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/28

[#6471] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/28

On Oct 28, 2005, at 8:53 AM, Ara.T.Howard wrote:

[#6474] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/28

On Fri, 28 Oct 2005, James Edward Gray II wrote:

[#6484] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/29

On Oct 28, 2005, at 9:58 AM, Ara.T.Howard wrote:

[#6485] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/29

On Sat, 29 Oct 2005, James Edward Gray II wrote:

[#6486] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/29

On Oct 28, 2005, at 8:25 PM, Ara.T.Howard wrote:

[#6487] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/29

On Sat, 29 Oct 2005, James Edward Gray II wrote:

[#6491] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/29

On Oct 28, 2005, at 8:43 PM, Ara.T.Howard wrote:

[#6493] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/29

On Oct 28, 2005, at 10:06 PM, James Edward Gray II wrote:

[#6496] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/29

On Sun, 30 Oct 2005, James Edward Gray II wrote:

[#6502] Re: csv.rb a start on refactoring. — James Edward Gray II <james@...> 2005/10/30

On Oct 29, 2005, at 12:11 PM, Ara.T.Howard wrote:

[#6505] Re: csv.rb a start on refactoring. — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/30

On Mon, 31 Oct 2005, James Edward Gray II wrote:

[#6511] Planning FasterCSV (was Re: csv.rb a start on refactoring.) — James Edward Gray II <james@...> 2005/10/30

I've decided to create a FasterCSV library, based on the code we

[#6516] Re: Planning FasterCSV (was Re: csv.rb a start on refactoring.) — "Ara.T.Howard" <Ara.T.Howard@...> 2005/10/31

On Mon, 31 Oct 2005, James Edward Gray II wrote:

[#6518] Re: Planning FasterCSV (was Re: csv.rb a start on refactoring.) — "NAKAMURA, Hiroshi" <nakahiro@...> 2005/10/31

-----BEGIN PGP SIGNED MESSAGE-----

Re: csv.rb a start on refactoring.

From: "Ara.T.Howard" <Ara.T.Howard@...>
Date: 2005-10-28 13:53:58 UTC
List: ruby-core #6470
wow - your mailer sent everything as attachments, including your message....
strange.  maybe it's on my end...  so sorry for not quoting context here.

anyhow - thanks for doing this.  fyi, i've used the following approach many
times for loading huge csv files in an attempt to squeeze out speed - it
works.  the approach is simple:

   - parse the first line __only__ using the built-in csv class, note the
     number of columns as n_columns.

   - for each subsequent line

       fields = row.strip.split(%r/\s*,\s*/)

       if fields.size == n_columns
         yield fields
       else
         fields = CSV::parse row
         yield fields
       end

this apprach would need to be expanded slightly to deal with awful csv lines
like

     foo, "bar
   and more 
and more and more bar"

but it can still be done.  essentially the parser must remain optimistic at
all times - assuming a simple split and stripping of all fields is sufficient.
__only__ upon finding it not so should it degrade to the slow, but very
complete, built-in csv parser.  tweaking my appoach to be buffer-of-lines
based rather than lines based should do the trick.

food for thought.

regards.

-a
-- 
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| anything that contradicts experience and logic should be abandoned.
| -- h.h. the 14th dalai lama
===============================================================================
For a database application I found using CSV to be rather slow.
I have made an attempt to speed it up, which, frankly has not been
very successuful.  However I think I have improved the code a
little.

I've added an error for if the state machine should get into an
unknown state, rather than assuming that "it must be in the other
one" after several tests.  This slows things down, but is probably
safer, given how hard state machines can be to debug.

I've changed if block to if block_given?  -- more idiomatic,
possibly faster??

I've tried to simplify 
  if something.is_a?(Klass)
    something.method
  end
to 
  if something.respond_to?(:method)
    something.method
  end

I think its faster, and its more duck-typing style.

I've tried to handle state transitions with a hash, as it is
probably clearer than if...elsif...else...end.  Likely to be faster,
too.

I've changed 
  if expr
    singlestatement
  end

to 
   singlestatement if expr

to shorten the code.  Don't know how that impacts speed.

I've changed if...elsif...else... to case statements where I can.  I
think that ought to be faster, but it is clearer code, I think.

There's a while statement with an if immediately inside it.  Since
thre is no else for that if, and the loop condition doesn't change
if the if-condition is false, then the condition must always apply.
For, if it did not apply, the while loop would never exit.  I have
commented out the condition and corresponding end.

idx_is_eos? is private, and thus not really part of the external
interface.  It takes a parameter that it doesn't use. I have removed
that and made it parameterless.  


I have tested the code and it still works.  I could not find
exhaustive test cases for CSV (looked in Rubycon and on the web) so
hope I haven't broken anything.  It does seem to be a tiny bit
faster, for the given case.  My tests are attached.  The test
program takes the file to test (csv or nova_csv) as an argument, to
allow comparison.  The name?  nova_csv because new_csv was already
being used for doc patches.

        HTH
        Hugh

Attachments (2)

csv_code_diffs (8.86 KB, text/plain)
--- csv.rb	2004-05-27 15:39:10.000000000 +0100
+++ nova_csv.rb	2005-10-28 11:05:48.751478000 +0100
@@ -10,6 +10,7 @@
   
 class CSV
   class IllegalFormatError < RuntimeError; end
+  class UnknownState < RuntimeError; end
 
   # deprecated
   class Cell < String
@@ -118,7 +119,7 @@
         "  Use CSV.open(filename, 'r') instead.")
       return open_reader(str_or_readable, 'r', fs, rs, &block)
     end
-    if block
+    if block_given?
       CSV::Reader.parse(str_or_readable, fs, rs) do |row|
         yield(row)
       end
@@ -136,10 +137,10 @@
   # not, use CSV.parse_row instead of this method.
   def CSV.parse_line(src, fs = nil, rs = nil)
     fs ||= ','
-    if fs.is_a?(Fixnum)
+    if fs.respond_to? :chr
       fs = fs.chr
     end
-    if !rs.nil? and rs.is_a?(Fixnum)
+    if rs.respond_to? :chr
       rs = rs.chr
     end
     idx = 0
@@ -162,10 +163,10 @@
       return ''
     end
     fs ||= ','
-    if fs.is_a?(Fixnum)
+    if fs.respond_to? :chr
       fs = fs.chr
     end
-    if !rs.nil? and rs.is_a?(Fixnum)
+    if rs.respond_to? :chr
       rs = rs.chr
     end
     res_type = :DT_COLSEP
@@ -213,10 +214,10 @@
   #
   def CSV.parse_row(src, idx, out_dev, fs = nil, rs = nil)
     fs ||= ','
-    if fs.is_a?(Fixnum)
+    if fs.respond_to? :chr
       fs = fs.chr
     end
-    if !rs.nil? and rs.is_a?(Fixnum)
+    if rs.respond_to? :chr
       rs = rs.chr
     end
     idx_backup = idx
@@ -270,10 +271,10 @@
   #
   def CSV.generate_row(src, cells, out_dev, fs = nil, rs = nil)
     fs ||= ','
-    if fs.is_a?(Fixnum)
+    if fs.respond_to :chr
       fs = fs.chr
     end
-    if !rs.nil? and rs.is_a?(Fixnum)
+    if rs.respond_to :chr
       rs = rs.chr
     end
     src_size = src.size
@@ -303,6 +304,9 @@
   # Private class methods.
   class << self
   private
+    DATA_TRANSITION1 = Hash.new{|h,k| k}.merge!(:ST_START => :ST_DATA,
+                                                :ST_QUOTE => :ILLEGAL_FORMAT)
+    # return the key as default
 
     def open_reader(path, mode, fs, rs, &block)
       file = File.open(path, mode)
@@ -340,6 +344,13 @@
       end
     end
 
+
+    def to_data_unless_quote(state)
+      state = DATA_TRANSITION1[state]
+      raise IllegalFormatError if state == :ILLEGAL_FORMAT
+      return state
+    end
+
     def parse_body(src, idx, fs, rs)
       fs_str = fs
       fs_size = fs_str.size
@@ -359,54 +370,37 @@
           if !fschar and c == fs_str[0]
             fs_idx = 0
             fschar = true
-            if state == :ST_START
-              state = :ST_DATA
-            elsif state == :ST_QUOTE
-              raise IllegalFormatError
-            end
+            state = to_data_unless_quote(state)
           end
           if !rschar and c == rs_str[0]
             rs_idx = 0
             rschar = true
-            if state == :ST_START
-              state = :ST_DATA
-            elsif state == :ST_QUOTE
-              raise IllegalFormatError
-            end
+            state = to_data_unless_quote(state)
           end
         end
         if c == ?"
           fs_idx = rs_idx = 0
-          if cr
-            raise IllegalFormatError
-          end
+          raise IllegalFormatError if cr
           cell << src[last_idx, (idx - last_idx)]
-          last_idx = idx
-          if state == :ST_DATA
-            if quoted
-              last_idx += 1
-              quoted = false
-              state = :ST_QUOTE
-            else
-              raise IllegalFormatError
-            end
-          elsif state == :ST_QUOTE
+          last_idx = idx + 1
+          case state 
+          when :ST_DATA
+            raise IllegalFormatError unless quoted
+            quoted = false  # Redundant, surely?
+            state = :ST_QUOTE
+          when :ST_QUOTE
             cell << c.chr
-            last_idx += 1
             quoted = true
             state = :ST_DATA
-          else  # :ST_START
+          when :ST_START
             quoted = true
-            last_idx += 1
             state = :ST_DATA
+          else
+            raise UnknownState
           end
         elsif fschar or rschar
-          if fschar
-            fs_idx += 1
-          end
-          if rschar
-            rs_idx += 1
-          end
+          fs_idx += 1 if fschar
+          rs_idx += 1 if rschar
           sep = nil
           if fs_idx == fs_size
             if state == :ST_START and rs_idx > 0 and fs_idx < rs_idx
@@ -415,9 +409,7 @@
             cell << src[last_idx, (idx - last_idx - (fs_size - 1))]
             last_idx = idx
             fs_idx = rs_idx = 0
-            if cr
-              raise IllegalFormatError
-            end
+            raise IllegalFormatError if cr
             sep = :DT_COLSEP
           elsif rs_idx == rs_size
             if state == :ST_START and fs_idx > 0 and rs_idx < fs_idx
@@ -431,20 +423,19 @@
             sep = :DT_ROWSEP
           end
           if sep
-            if state == :ST_DATA
-              return sep, idx + 1, cell;
-            elsif state == :ST_QUOTE
-              return sep, idx + 1, cell;
-            else  # :ST_START
-              return sep, idx + 1, nil
+            return sep, idx + 1, case state
+            when :ST_DATA, :ST_QUOTE
+               cell
+            when :ST_START
+               nil
+            else  
+              raise UnknownState
             end
           end
         elsif rs.nil? and c == ?\r
           # special \r treatment for backward compatibility
           fs_idx = rs_idx = 0
-          if cr
-            raise IllegalFormatError
-          end
+          raise IllegalFormatError if cr
           cell << src[last_idx, (idx - last_idx)]
           last_idx = idx
           if quoted
@@ -454,13 +445,14 @@
           end
         else
           fs_idx = rs_idx = 0
-          if state == :ST_DATA or state == :ST_START
-            if cr
-              raise IllegalFormatError
-            end
+          case state 
+          when :ST_DATA, :ST_START
+            raise IllegalFormatError if cr
             state = :ST_DATA
-          else  # :ST_QUOTE
+          when :ST_QUOTE
             raise IllegalFormatError
+          else 
+            raise UnknownState
           end
         end
         idx += 1
@@ -471,9 +463,7 @@
         else
           return :DT_EOS, idx, nil
         end
-      elsif quoted
-        raise IllegalFormatError
-      elsif cr
+      elsif quoted or cr
         raise IllegalFormatError
       end
       cell << src[last_idx, (idx - last_idx)]
@@ -804,7 +794,7 @@
       if idx < 0
         return nil
       end
-      if (idx_is_eos?(idx))
+      if (idx_is_eos?)
         if n and (@offset + idx == buf_size(@cur_buf))
           # Like a String, 'abc'[4, 1] returns nil and
           # 'abc'[3, 1] returns '' not nil.
@@ -818,13 +808,12 @@
       next_idx = idx
       while (my_offset + next_idx >= buf_size(my_buf))
         if (my_buf == @buf_tail_idx)
-          unless add_buf
-            break
-          end
+          break unless add_buf
         end
         next_idx = my_offset + next_idx - buf_size(my_buf)
         my_buf += 1
         my_offset = 0
+        # i.e. loc = my_offset + next_idx because myoffset is 0
       end
       loc = my_offset + next_idx
       if !n
@@ -855,9 +844,9 @@
       if is_eos?
         return 0
       end
-      size_dropped = 0
+      bsc = size_dropped = 0
       while (n > 0)
-        if !@is_eos or (@cur_buf != @buf_tail_idx)
+        # if !@is_eos or (@cur_buf != @buf_tail_idx) # redundant? If it fails loop never exits
           if (@offset + n < buf_size(@cur_buf))
             size_dropped += n
             @offset += n
@@ -868,19 +857,17 @@
             n -= size
             @offset = 0
             unless rel_buf
-              unless add_buf
-                break
-              end
+              break unless add_buf
               @cur_buf = @buf_tail_idx
             end
           end
-        end
+        # end
       end
       size_dropped
     end
   
     def is_eos?
-      return idx_is_eos?(0)
+      return idx_is_eos?
     end
   
     # WARN: Do not instantiate this class directly.  Define your own class
@@ -929,13 +916,12 @@
       if str_read.nil?
         @is_eos = true
         @buf_list.push('')
-        @buf_tail_idx += 1
         false
       else
         @buf_list.push(str_read)
-        @buf_tail_idx += 1
         true
       end
+      @buf_tail_idx += 1
     end
   
     def rel_buf
@@ -952,8 +938,8 @@
       end
     end
   
-    def idx_is_eos?(idx)
-      (@is_eos and ((@cur_buf < 0) or (@cur_buf == @buf_tail_idx)))
+    def idx_is_eos?
+      (@is_eos and ((@cur_buf == @buf_tail_idx) or (@cur_buf < 0)))
     end
   
     BufSize = 1024 * 8
test_csv.rb (938 Bytes, application/x-sh)

In This Thread