From: "zenspider (Ryan Davis)" <redmine@...>
Date: 2012-11-28T09:31:07+09:00
Subject: [ruby-core:50237] [ruby-trunk - Bug #7442] StringScanner#charpos vs StringScanner#pos


Issue #7442 has been updated by zenspider (Ryan Davis).


Committed revision 37916.

Please beat up on it.
----------------------------------------
Bug #7442: StringScanner#charpos vs StringScanner#pos
https://bugs.ruby-lang.org/issues/7442#change-34059

Author: zenspider (Ryan Davis)
Status: Feedback
Priority: Normal
Assignee: 
Category: ext
Target version: Next Major
ruby -v: 1.9.x


=begin
I talked to Matz at RubyConf and he agreed this was a bug I should file. Sorry I took so long to do so.

As mentioned in #3482, StringScanner#pos is byte-oriented even when scanning multibyte strings. The reasoning was that IO#pos is byte-oriented, so this is to spec and functioning correctly. The problem is that StringScanner isn't _just_ an IO: it also represents a String and the progress of scanning through it. Strings in 1.9+ must respect their encodings and, with a few exceptions, don't even support the idea of naked bytes. I think StringScanner must be able to respect that.

Given that `ss` is a StringScanner instance on a string with a valid encoding, getting the substring of the current progress via `ss.string[0..ss.pos]` can result in a String with _invalid_ encoding. I propose that we add `#charpos` to make it possible to pull out a valid substring. This would also be useful for reporting proper offset or column information on errors when you're using StringScanner as your lexer.

This is the code that I needed to get proper character offsets (and substrings; I needed both for my purposes):

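    # Substring scanned so far, sliced by bytes because pos is a byte offset.
    def string_to_pos
      string.byteslice(0, pos)
    end

    # Scan position counted in characters rather than bytes.
    def charpos
      string_to_pos.length
    end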
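
For anyone who wants to beat up on this, here is a minimal sketch of the difference between the byte and character positions. It assumes the two methods above are defined in a StringScanner subclass; the class name `CharPosScanner` and the sample string are made up for illustration and are not part of the proposal.

```ruby
require 'strscan'

# Hypothetical subclass used only for this illustration.
class CharPosScanner < StringScanner
  # Substring scanned so far, sliced by bytes because pos is a byte offset.
  def string_to_pos
    string.byteslice(0, pos)
  end

  # Scan position counted in characters rather than bytes.
  def charpos
    string_to_pos.length
  end
end

# "\u65E5\u672C\u8A9E" is three UTF-8 characters of three bytes each.
ss = CharPosScanner.new("\u65E5\u672C\u8A9E text")
ss.scan(/\u65E5\u672C/)   # consume the first two characters

ss.pos                    # 6, the byte offset
ss.charpos                # 2, the character offset
ss.string_to_pos          # the two characters scanned so far

# Using the byte offset as a character index picks the wrong span:
ss.string[0, ss.pos]      # six characters, not the two that were scanned
```

The point is that `pos` and `charpos` only agree while everything scanned so far is single-byte, so a lexer reporting columns from `pos` would drift as soon as it passes a multibyte character.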

=end


-- 
http://bugs.ruby-lang.org/