From: daniel@...42.com
Date: 2019-08-08T20:10:19+00:00
Subject: [ruby-core:94206] [Ruby master Feature#13241] Method(s) to access Unicode properties for characters/strings

Issue #13241 has been updated by Dan0042 (Daniel DeLorme).

I had a go at this, and a naive implementation is quite simple. The only real issue is where to store the list of Unicode properties.

```ruby
class String
  def unicode_properties(*categs)
    # Build the category => { property => regexp } table once and cache it.
    @@props ||= Hash.new.tap do |hash|
      categ = nil
      #downloaded from https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/UnicodeProps.txt
      txt = File.read(File.expand_path('../UnicodeProps.txt',__FILE__))
      # "* Name" lines start a category; indented lines list its properties.
      txt.scan(/^\* (\S+)|^ +(\S.*)/) do |c,prop|
        hash[categ=c.to_sym] = {} if c
        # Skip property names that this regexp engine does not accept.
        hash[categ][prop.to_sym] = /\p{#{prop}}/ rescue next if prop
      end
    end
    # By default, check every category except the (long) list of ages.
    categs = @@props.keys - [:DerivedAges] if categs.empty?
    result = []
    categs.each do |categ|
      @@props[categ]&.each do |prop,rx|
        result << prop if self =~ rx
      end
    end
    result
  end
end

"��".unicode_properties
#=> [:Alpha, :Graph, :Lower, :Print, :Word, :Alnum, :Any, :Assigned, :L, :LC, :Ll, :Latin, :Alphabetic, :Cased, :Changes_When_Casefolded, :Changes_When_Casemapped, :Changes_When_Titlecased, :Changes_When_Uppercased, :Grapheme_Base, :ID_Continue, :ID_Start, :Lowercase, :XID_Continue, :XID_Start, :CWCF, :CWCM, :CWT, :CWU, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Latn, :In_Latin_Extended_A]

"��".unicode_properties(:DerivedAges)
#=> [:"Age=1.1", :"Age=10.0", :"Age=2.0", :"Age=2.1", :"Age=3.0", :"Age=3.1", :"Age=3.2", :"Age=4.0", :"Age=4.1", :"Age=5.0", :"Age=5.1", :"Age=5.2", :"Age=6.0", :"Age=6.1", :"Age=6.2", :"Age=6.3", :"Age=7.0", :"Age=8.0", :"Age=9.0"]

"���".unicode_properties
#=> [:Alpha, :Graph, :Print, :Word, :Alnum, :Any, :Assigned, :L, :Lo, :Hiragana, :Alphabetic, :Grapheme_Base, :ID_Continue, :ID_Start, :XID_Continue, :XID_Start, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Hira, :In_Hiragana]
```

----------------------------------------
Feature #13241: Method(s) to access Unicode properties for characters/strings
https://bugs.ruby-lang.org/issues/13241#change-80499

* Author: duerst (Martin Dürst)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
[This is currently an exploratory proposal.]

Onigmo allows Unicode properties in regular expressions. With this, it's e.g. possible to check whether a string contains some Hiragana:

```
"ABC あ DEF" =~ /\p{hiragana}/
```

However, it is currently impossible to ask for e.g. the script of a character. I propose to add a method (or some methods) to String to be able to get such properties. Various (to some extent conflicting) examples:

```
"Aあア".script => :latin # returns script of first character only
"Aあア".script => [:latin, :hiragana, :katakana] # returns array of property values
"Aあア".property(:script) => :latin # returns specified property of first character only
"Aあア".property(:script) => [:latin, :hiragana, :katakana] # returns array of specified properties' values
"Aあア".properties([:script, :general_category]) => [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]] # returns arrays of property values, one array per character
```

The interface is still in flux, comments welcome!

Implementation depends on #13240.

In Python, such functionality (however, quite limited in property coverage, and not directly on String) is available in the standard library (see https://docs.python.org/3/library/unicodedata.html).
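
For illustration only, the "one value per character" variant of `script` from the examples above can be approximated today with the `\p{...}` matching that the proposal mentions. This is a minimal sketch, not the proposed implementation: `SCRIPT_REGEXPS` and `scripts_of` are hypothetical names, and the list of scripts is a hand-picked assumption limited to the three scripts used in the examples.

```ruby
# Hypothetical helper (not an existing Ruby API and not part of the proposal):
# looks up the script of each character by trying a short, hand-picked list
# of \p{...} script properties in order.
SCRIPT_REGEXPS = {
  latin:    /\p{Latin}/,
  hiragana: /\p{Hiragana}/,
  katakana: /\p{Katakana}/,
}.freeze

def scripts_of(str)
  str.each_char.map do |ch|
    # First script whose property matches this character, or nil if the
    # character belongs to none of the listed scripts.
    found = SCRIPT_REGEXPS.find { |_name, rx| rx.match?(ch) }
    found && found.first
  end
end

# "A", HIRAGANA LETTER A (U+3042), KATAKANA LETTER A (U+30A2)
p scripts_of("A\u3042\u30A2")  #=> [:latin, :hiragana, :katakana]
```

A linear scan over regexps like this costs one match attempt per listed script per character, which is one reason a built-in method with direct access to the property tables would be preferable.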