From: shyouhei@...
Date: 2016-11-25T06:37:04+00:00
Subject: [ruby-core:78302] [Ruby trunk Feature#12831][Assigned] /\X/ (extended grapheme cluster) can't pass unicode.org's GraphemeBreakTest

Issue #12831 has been updated by Shyouhei Urabe.

Tracker changed from Bug to Feature
Status changed from Open to Assigned
Assignee set to Yui NARUSE

----------------------------------------
Feature #12831: /\X/ (extended grapheme cluster) can't pass unicode.org's GraphemeBreakTest
https://bugs.ruby-lang.org/issues/12831#change-61664

* Author: Fumiaki Matsushima
* Status: Assigned
* Priority: Normal
* Assignee: Yui NARUSE
----------------------------------------
I'm trying to replace Rails's grapheme implementation (http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Unicode.html#method-i-unpack_graphemes) with Ruby's extended grapheme cluster (/\X/).
https://github.com/rails/rails/issues/26743

I noticed that Ruby's grapheme cluster can't pass unicode.org's GraphemeBreakTest.

The following test script fails on Ruby 2.2 and 2.3:

~~~ ruby
require 'rubygems'
require 'open-uri'
require 'test/unit'

UNICODE_VERSION =
  if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("2.3.0")
    "8.0.0"
  else
    "7.0.0"
  end

class TestGrapheme < Test::Unit::TestCase
  # https://github.com/rails/rails/blob/v5.0.0.1/activesupport/test/multibyte_grapheme_break_conformance_test.rb#L37
  def test_breaks
    each_line_of_break_tests do |*cols|
      *clusters, comment = *cols
      string = clusters.map { |c| c.pack("U*") }.join
      assert_equal clusters, string.scan(/\X/).map(&:codepoints), comment
    end
  end

  def each_line_of_break_tests(&block)
    lines = 0
    max_test_lines = 0
    # Don't limit below 21, because that's the header of the testfile
    URI.parse("http://www.unicode.org/Public/#{UNICODE_VERSION}/ucd/auxiliary/GraphemeBreakTest.txt").open do |f|
      until f.eof? || (max_test_lines > 21 && lines > max_test_lines)
        lines += 1
        # chomp (not chomp!) so a final line without a newline is not turned into nil
        line = f.gets.chomp
        next if line.empty? || line.start_with?("#")

        cols, comment = line.split("#")
        # Cluster breaks are represented by ÷
        clusters = cols.split("÷").map { |e| e.strip }.reject { |e| e.empty? }
        clusters = clusters.map do |cluster|
          # Codepoints within each cluster are separated by ×
          codepoints = cluster.split("×").map { |e| e.strip }.reject { |e| e.empty? }
          # codepoints are in hex in the test suite, pack wants them as integers
          codepoints.map { |codepoint| codepoint.to_i(16) }
        end

        # The tests contain a solitary U+D800 character, which Ruby does not allow to stand
        # alone in a UTF-8 string. So we'll just skip it.
        next if clusters.flatten.include?(0xd800)

        clusters << comment.strip

        yield(*clusters)
      end
    end
  end
end
~~~

https://gist.github.com/mtsmfm/38f46882c3d4ccde35c269594fc24ebc

I found an issue on Onigmo (https://github.com/k-takata/Onigmo/issues/46), but I couldn't find one on bugs.ruby-lang.org, so I created this ticket.

I'm unfamiliar with grapheme clusters, so please tell me if I get something wrong.
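
For anyone who wants to try the parsing step of the script above without downloading the test file, here is a minimal sketch. The helper name `parse_break_test_line` and the sample line are made up for illustration; the sample only imitates the file's layout (÷ marks a break, × marks no break, text after # is a comment) and is not copied from GraphemeBreakTest.txt.

~~~ ruby
# Hypothetical helper (not in the original script): turn one data line in the
# GraphemeBreakTest.txt layout into an array of clusters, where each cluster
# is an array of integer codepoints.
def parse_break_test_line(line)
  cols, _comment = line.split("#", 2)
  # ÷ separates clusters; × separates codepoints inside one cluster.
  cols.split("÷").map(&:strip).reject(&:empty?).map do |cluster|
    cluster.split("×").map(&:strip).reject(&:empty?).map { |cp| cp.to_i(16) }
  end
end

# Illustrative line: SPACE followed by COMBINING DIAERESIS, then another SPACE.
sample = "÷ 0020 × 0308 ÷ 0020 ÷\t# SPACE, COMBINING DIAERESIS, SPACE"

expected = parse_break_test_line(sample)
string   = expected.map { |cluster| cluster.pack("U*") }.join
actual   = string.scan(/\X/).map(&:codepoints)

p expected # => [[32, 776], [32]]
p actual   # => [[32, 776], [32]]
~~~

This simple case should agree on any Ruby version, because a base character followed by a combining mark is kept together even by a simplified definition of \X. The failures reported in this ticket come from other lines in the test file.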