ruby-core

David A. Black wrote:

> > > In the /abc.*def/ case, though, you'd always have to take all the
> > > input (at least up to the third-to-last character in the file),
> > > even if you had an intermediate match.  So "needs more data" would
> > > not be something the regex could tell you.  It would say, "Yes,
> > > there's a match", but you would have to know that the "yes" didn't
> > > mean you could stop.

Wow, I totally misread this paragraph.  But yes, that’s precisely the
point.  We don’t consider intermediate matches as a match.  Why should
we?  We don’t do that for a String, so why would we do it for something
that builds a string?  (Rhetorical questions aside, I explain what I
mean below.)

> > .* needs more data until there is no more data (#read returns nil),
> > then it fails as it hasn’t been able to match 'def' and backtracks
> > until that part of the regex does.  Then it has a match.  (This
> > ignores newline conventions, but let’s ignore them for now.)  You
> > have the same problem when doing this on a regular string.

> I understand what .* means :-)  My point is that you would have to
> have some mechanism for telling the file stream that the regex was
> doing a greedy .* and therefore needed more.  I'm not saying it's
> impossible -- just that it changes the whole concept of the state of a
> regex.

I know you do, that’s why I was wondering whether I was understanding
you correctly.

The thing is, forget the file stream.  Just think of some source that
builds us the input as we go along.  As soon as the regex needs more
data to process (for a .*, for example) it’ll ask the source for some
more.  There doesn’t have to be any real complexity in this.  The regex
will be responsible for storing the data already processed (for
backtracking purposes, MatchData#pre_match, and so on).

> > > But if the regex were /abc.*?def/, then as soon as there was a
> > > "yes", you could stop.

> > > There's also a question of: if the first 4096 bytes started with
> > > "abc" and ended with "de", then you'd add the next 4096 -- but
> > > you'd have to perform the match again.  Or else you'd have to know
> > > to rewind by exactly two characters.  But if you're changing where
> > > you start the match, that could affect how anchors worked.

> > Why?  If the first character in the next 4096 bytes is a "f" we’d
> > have a match.

> Again, I understand that def matches /def/ :-)  But I'm
> troubleshooting the process by which you would determine that you have
> a match.  Illustrating with chunks of eight bytes:
> 
>   str = "abc123de"
>   /abc.*?def/.match(str)  # false
>   str = "abc123def1234567"
>   /abc.*?def/.match(str)  # true
> 
> My point is that you've done the match twice from the beginning.  This
> could get very inefficient, if you have to re-match constantly.
> 
> You'd probably have to arrange to store where the match failed (in the
> regex, not in the string) and resume from there.  I'm not sure whether
> that's possible in general, but I'm sure the answer is known.

The semantics I’m suggesting is something like:

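class Source
  def initialize
    # Instance variable, so that #read can see the remaining input.
    @str = "abc123def1234567"
  end

  # Hands out the next (up to) four characters, or nil once exhausted.
  def read
    @str.length > 0 ? @str.slice!(0..3) : nil
  end
end

/abc.*?def/.match(Source.new) # true (under the suggested semantics)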

The regexp object can’t stop scanning until Source#read returns nil.
There’s really nothing hard about that.  It’s like stopping when you
reach RSTRING(str)->len or however Oniguruma decides when it has reached
the end of a String.
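
To make that concrete, here is a rough sketch of how the "keep reading
until #read returns nil" rule could be emulated from the outside with
today's Regexp.  ChunkSource and match_source are names I'm making up
for illustration; nothing like them exists in Ruby.  The sketch also
cheats by collecting everything before matching, whereas the idea is
that the engine would pull the chunks itself as it goes.

```ruby
# Hypothetical source, like Source above but with the input and the
# chunk size passed in.
class ChunkSource
  def initialize(str, size = 4)
    @str = str.dup
    @size = size
  end

  # Returns the next chunk, or nil when the input is exhausted.
  def read
    @str.empty? ? nil : @str.slice!(0, @size)
  end
end

# Hypothetical helper: drains the source, then matches once, so an
# intermediate match on a prefix of the input never counts.
def match_source(regex, source)
  buffer = +""
  while (chunk = source.read)
    buffer << chunk
  end
  regex.match(buffer)
end

m = match_source(/abc.*def/, ChunkSource.new("abc1def2def3"))
m[0]          # => "abc1def2def"
m.post_match  # => "3"
```

After the first two chunks the buffer is "abc1def2", which already
contains a match, but the greedy .* only settles once the source has
run dry.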

> > If we’re using .*? we’re done.  If we’re using .* we wouldn’t
> > have begun matching the "de" against the /de/.  What would be an
> > issue would be how to treat MatchData#post_match.  It’d have to be
> > the remaining data that wasn’t matched at the time of a match, not
> > all the possible data that may come from the source.

> That makes it a lot less transparent: post_match would be, in the def
> example, 4095 bytes from a file.  That seems very "implementation
> dependent".

Yes, precisely.  That’s the whole point.  The caller is responsible for
interpreting the MatchData.
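
A small sketch of what that dependence looks like, assuming a
non-greedy pattern and a matcher that stops at the first match on the
data read so far.  post_match_for is a made-up helper, and it simply
re-runs the match on the growing buffer, which a real implementation
inside the engine would presumably avoid.

```ruby
# Hypothetical helper: feeds data to the regex chunk_size characters at
# a time and returns post_match as seen at the moment of the first match.
def post_match_for(regex, data, chunk_size)
  buffer = +""
  data.chars.each_slice(chunk_size) do |slice|
    buffer << slice.join
    match = regex.match(buffer)
    return match.post_match if match
  end
  nil # the data ran out without a match
end

post_match_for(/abc.*?def/, "abc123def1234567", 4)  # => "123"
post_match_for(/abc.*?def/, "abc123def1234567", 8)  # => "1234567"
```

Same regex, same data, different post_match: it is whatever had been
read but not matched when the match was found.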

        nikolai

-- 
Nikolai Weibull: now available free of charge at http://bitwi.se/!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
