[#14696] Inconsistency in rescuability of "return" — Charles Oliver Nutter <charles.nutter@...>

Why can you not rescue return, break, etc when they are within

21 messages 2008/01/02

[#14738] Enumerable#zip Needs Love — James Gray <james@...>

The community has been building a Ruby 1.9 compatibility tip list on

15 messages 2008/01/03
[#14755] Re: Enumerable#zip Needs Love — Martin Duerst <duerst@...> 2008/01/04

Hello James,

[#14772] Manual Memory Management — Pramukta Kumar <prak@...>

I was thinking it would be nice to be able to free large objects at

36 messages 2008/01/04
[#14788] Re: Manual Memory Management — Marcin Raczkowski <mailing.mr@...> 2008/01/05

I would only like to add that RMgick for example provides free method to

[#14824] Re: Manual Memory Management — MenTaLguY <mental@...> 2008/01/07

On Sat, 5 Jan 2008 15:49:30 +0900, Marcin Raczkowski <mailing.mr@gmail.com> wrote:

[#14825] Re: Manual Memory Management — "Evan Weaver" <evan@...> 2008/01/07

Python supports 'del reference', which decrements the reference

[#14838] Re: Manual Memory Management — Marcin Raczkowski <mailing.mr@...> 2008/01/08

Evan Weaver wrote:

[#14911] Draft of some pages about encoding in Ruby 1.9 — Dave Thomas <dave@...>

Folks:

24 messages 2008/01/10

[#14976] nil encoding as synonym for binary encoding — David Flanagan <david@...>

The following just appeared in the ChangeLog

37 messages 2008/01/11
[#14977] Re: nil encoding as synonym for binary encoding — Yukihiro Matsumoto <matz@...> 2008/01/11

Hi,

[#14978] Re: nil encoding as synonym for binary encoding — Dave Thomas <dave@...> 2008/01/11

[#14979] Re: nil encoding as synonym for binary encoding — David Flanagan <david@...> 2008/01/11

Dave Thomas wrote:

[#14993] Re: nil encoding as synonym for binary encoding — Dave Thomas <dave@...> 2008/01/11

[#14980] Re: nil encoding as synonym for binary encoding — Gary Wright <gwtmp01@...> 2008/01/11

[#14981] Re: nil encoding as synonym for binary encoding — Yukihiro Matsumoto <matz@...> 2008/01/11

Hi,

[#14995] Re: nil encoding as synonym for binary encoding — David Flanagan <david@...> 2008/01/11

Yukihiro Matsumoto writes:

[#15050] how to "borrow" the RDoc::RubyParser and HTMLGenerator — Phlip <phlip2005@...>

Core Rubies:

17 messages 2008/01/13
[#15060] Re: how to "borrow" the RDoc::RubyParser and HTMLGenerator — Eric Hodel <drbrain@...7.net> 2008/01/14

On Jan 13, 2008, at 08:54 AM, Phlip wrote:

[#15062] Re: how to "borrow" the RDoc::RubyParser and HTMLGenerator — Phlip <phlip2005@...> 2008/01/14

Eric Hodel wrote:

[#15073] Re: how to "borrow" the RDoc::RubyParser and HTMLGenerator — Eric Hodel <drbrain@...7.net> 2008/01/14

On Jan 13, 2008, at 20:35, Phlip wrote:

[#15185] Friendlier methods to compare two Time objects — "Jim Cropcho" <jim.cropcho@...>

Hello,

10 messages 2008/01/22

[#15194] Can large scale projects be successful implemented around a dynamic programming language? — Jordi <mumismo@...>

A good article I have found (may have been linked by slashdot, don't know)

8 messages 2008/01/24

[#15248] Symbol#empty? ? — "David A. Black" <dblack@...>

Hi --

24 messages 2008/01/28
[#15250] Re: Symbol#empty? ? — Yukihiro Matsumoto <matz@...> 2008/01/28

Hi,

Re: multibyte strings & bucket-of-bytes efficiency under 1.9.0

From: Martin Duerst <duerst@...>
Date: 2008-01-08 04:48:31 UTC
List: ruby-core #14835
Hello Brent,

Many thanks for your examples. I'm sure others will also have
a look at them.

At 15:21 08/01/07, Brent Roman wrote:
>
>Martin,
>
>I did some analysis of the log parsing application in which
>I observe a 5% slowdown under ruby 1.9.  It comes down
>to regex performance.  In fact, my log reader app spends > 50%
>of its runtime processing regex's.  The strings are all US-ASCII
>encoded.
>
>Per your request, I've distilled the most common regex into
>the attached simple benchmark:
>
>http://www.nabble.com/file/p14659564/regexbench.rb regexbench.rb 
>
>It merely scans a test string repeatedly for an escape sequence
>it does not contain.
>
>Ruby16 takes 9.3 seconds
>Ruby18 takes 9.7 seconds
>Ruby19 takes 12.8 seconds
>
>Ruby19 takes 18.0 seconds if the string encoding is forced to UTF-8
>
>So, in "US-ASCII", regexes are about 25% slower
>
>Are you able to confirm these benchmarks?
>
>Are you surprised by this?
>
>Would you agree that some of this slowdown is the result of the new 
>"encoding aware" regex engine in ruby19?
>Or, is it a "bug" that can be easily fixed?
>
>I included the UTF-8 case for comparison only.  
>It shows a 50% slowdown.
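[Editorial note: the attached regexbench.rb is only linked above, not reproduced. The kind of benchmark Brent describes — repeatedly scanning a string for an escape sequence it does not contain, under different encoding labels — can be approximated with a sketch like this; the test string, pattern, and iteration counts here are illustrative, not Brent's actual values:]

```ruby
require 'benchmark'

# Illustrative stand-ins for the string and regex in regexbench.rb.
line    = "plain log text with no escape sequences in it " * 10
pattern = /\\x[0-9a-f]{2}/   # an escape sequence the string does not contain

Benchmark.bm(9) do |bm|
  bm.report("us-ascii") do
    s = line.dup.force_encoding("US-ASCII")
    50_000.times { s =~ pattern }   # scan always fails to match
  end
  bm.report("utf-8") do
    s = line.dup.force_encoding("UTF-8")
    50_000.times { s =~ pattern }
  end
end
```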

Your numbers for 1.9, both with and without UTF-8, are somewhat higher
than I would expect from simple encoding dispatching alone, but not
high enough to call this a bug.

When looking at an earlier example from Wolfgang, I confirmed the
suspicion expressed in my paper that providing more, less-primitive
functions for encoding dispatching would improve performance.
I think I got about a 20% improvement in one case for
true UTF-8.

The example is as follows:
The function rb_enc_nth in encoding.c is used to find the n'th
character from a particular point in a string. It's used for
cases such as string[i] and so on. What this function currently
does is check for single-byte encodings first, then check for
fixed-width encoding, and in both cases use a simple multiplication.
For all other encodings, it repeatedly uses rb_enc_mbclen to get
the length (in bytes) of the next character, which then
somehow calls the actual primitive for the encoding.
Adding an additional, somewhat less primitive, per-encoding
function that finds the n'th character directly may improve
performance somewhat. The way to implement these functions
is to provide them only for the encodings that really matter,
and to use a generic implementation (falling back to the
lower-level primitive that's currently used) for odd,
rarely-used encodings.
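[Editorial note: the asymmetry rb_enc_nth dispatches on is visible from Ruby: for a single-byte or fixed-width encoding, string[i] reduces to a multiplication, while for variable-width UTF-8 it must scan character by character. A rough illustration, with arbitrary sizes:]

```ruby
require 'benchmark'

n     = 20_000
ascii = "a" * n          # single-byte encoding: s[i] is an O(1) multiply
utf8  = "\u3042" * n     # 3-byte UTF-8 chars: s[i] scans from the start

Benchmark.bm(6) do |bm|
  bm.report("ascii") { 1_000.times { ascii[n - 1] } }
  bm.report("utf8")  { 1_000.times { utf8[n - 1] } }
end
```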

I explained the general principle of this a few months ago
to Matz. I'm sure he wanted to concentrate on getting
1.9 out on time, so adding such (somewhat less primitive) functions wasn't
too urgent. It may be that they get picked up, or not,
depending on how much improvement it might be possible to
show. One problem with examples such as the above is that
it's not too difficult to tweak something for highest performance
for very very long strings. But most Ruby strings are very
short, and it's important to make sure that we don't
decrease short string performance when trying to increase
very long string performance.


>I suspect that the folks observing huge slowdowns in string
>performance are using UTF-8 or other multi-byte encodings.

Another aspect is that there should be a difference between
'real' UTF-8 strings and ASCII strings labeled as UTF-8.
There's a flag for strings that indicates whether they are
actually just all plain ASCII. But either that's not set
in your example, or it's not used, or both (I suspect at
least the latter, because this flag is a Ruby mechanism,
and I don't know whether it's being used in Oniguruma or
not).
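[Editorial note: at the Ruby level, 1.9 exposes the all-ASCII flag Martin mentions as String#ascii_only?, so one can check whether an ASCII string labeled as UTF-8 is recognized as plain ASCII:]

```ruby
# An all-ASCII string carrying a UTF-8 label: the flag is set.
s = "just ascii".dup.force_encoding("UTF-8")
puts s.encoding      # UTF-8
puts s.ascii_only?   # true: all bytes are 7-bit despite the UTF-8 label

# A string with a real non-ASCII character: the flag is off.
t = "caf\u00e9"
puts t.ascii_only?   # false
```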

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

