[ruby-core:94206] [Ruby master Feature#13241] Method(s) to access Unicode properties for characters/strings
From: daniel@...42.com
Date: 2019-08-08 20:10:19 UTC
List: ruby-core #94206
Issue #13241 has been updated by Dan0042 (Daniel DeLorme).
I had a go at this, and a naive implementation is quite simple. The only real issue is where to store the list of Unicode properties.
```ruby
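class String
  # Returns the names of all Unicode properties (optionally limited to the
  # given categories) that match at least one character of the string.
  def unicode_properties(*categs)
    # Build the {category => {property => Regexp}} table once and cache it.
    @@props ||= Hash.new.tap do |hash|
      categ = nil
      #downloaded from https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/UnicodeProps.txt
      txt = File.read(File.expand_path('../UnicodeProps.txt',__FILE__))
      # "* Name" lines start a category; lines indented by one or more
      # spaces name a property within the current category.
      txt.scan(/^\* (\S+)|^ +(\S.*)/) do |c,prop|
        hash[categ=c.to_sym] = {} if c
        # Skip property names that the regexp engine rejects.
        hash[categ][prop.to_sym] = /\p{#{prop}}/ rescue next if prop
      end
    end
    # By default, report everything except the (very long) Age list.
    categs = @@props.keys - [:DerivedAges] if categs.empty?
    result = []
    categs.each do |categ|
      @@props[categ]&.each do |prop,rx|
        result << prop if self =~ rx
      end
    end
    result
  end
end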
"ſ".unicode_properties #=> [:Alpha, :Graph, :Lower, :Print, :Word, :Alnum, :Any, :Assigned, :L, :LC, :Ll, :Latin, :Alphabetic, :Cased, :Changes_When_Casefolded, :Changes_When_Casemapped, :Changes_When_Titlecased, :Changes_When_Uppercased, :Grapheme_Base, :ID_Continue, :ID_Start, :Lowercase, :XID_Continue, :XID_Start, :CWCF, :CWCM, :CWT, :CWU, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Latn, :In_Latin_Extended_A]
"ſ".unicode_properties(:DerivedAges) #=> [:"Age=1.1", :"Age=10.0", :"Age=2.0", :"Age=2.1", :"Age=3.0", :"Age=3.1", :"Age=3.2", :"Age=4.0", :"Age=4.1", :"Age=5.0", :"Age=5.1", :"Age=5.2", :"Age=6.0", :"Age=6.1", :"Age=6.2", :"Age=6.3", :"Age=7.0", :"Age=8.0", :"Age=9.0"]
"あ".unicode_properties #=> [:Alpha, :Graph, :Print, :Word, :Alnum, :Any, :Assigned, :L, :Lo, :Hiragana, :Alphabetic, :Grapheme_Base, :ID_Continue, :ID_Start, :XID_Continue, :XID_Start, :Gr_Base, :IDC, :IDS, :XIDC, :XIDS, :Hira, :In_Hiragana]
```
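On the question of where to store the list, one option is a dedicated module that builds and caches the table itself instead of a class variable on String. The following is only a minimal sketch under stated assumptions: the module name `UnicodeProps`, its methods `table` and `of`, and the short hard-coded `NAMES` list are all hypothetical, chosen so the example runs without the downloaded UnicodeProps.txt.

```ruby
# Hypothetical module; the name UnicodeProps and the short property list
# are assumptions for illustration, not part of the proposal.
module UnicodeProps
  # A small hand-picked subset of property names accepted by \p{...}.
  NAMES = %w[Alpha Lower Upper Digit Latin Hiragana Katakana Han].freeze

  # Build the name => Regexp table lazily, once, and cache it on the module.
  def self.table
    @table ||= NAMES.each_with_object({}) do |name, hash|
      begin
        hash[name.to_sym] = Regexp.new("\\p{#{name}}")
      rescue RegexpError
        # Skip names that this regexp engine does not recognize.
      end
    end.freeze
  end

  # Return the properties matched by at least one character of str.
  def self.of(str)
    table.select { |_name, rx| rx.match?(str) }.keys
  end
end

p UnicodeProps.of("あ") #=> [:Alpha, :Hiragana]
```

Keeping the table on a module avoids adding state to String, at the cost of a less convenient call site.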
----------------------------------------
Feature #13241: Method(s) to access Unicode properties for characters/strings
https://bugs.ruby-lang.org/issues/13241#change-80499
* Author: duerst (Martin Dürst)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
----------------------------------------
[This is currently an exploratory proposal.]
Onigmo allows Unicode properties in regular expressions. With this, it is possible, for example, to check whether a string contains some Hiragana:
```
"ABC あ DEF" =~ /\p{hiragana}/
```
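As a small illustration of what the regular-expression route already offers, the sketch below uses only existing String and Regexp methods. The variable names are arbitrary.

```ruby
text = "ABC あ DEF"

# =~ returns the character index of the first match, or nil if none.
index = text =~ /\p{Hiragana}/
p index #=> 4

# scan collects every matching character.
p text.scan(/\p{Hiragana}/) #=> ["あ"]

# match? gives a plain true/false answer.
p "ABC DEF".match?(/\p{Hiragana}/) #=> false
```

All of these answer "does the string contain a character with this property?", not "which property value does this character have?".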
However, there is currently no way to ask for a property value, e.g. the script of a character. I propose adding a method (or some methods) to String to retrieve such properties. Various (to some extent conflicting) examples:
```
# returns script of first character only
"Aあア".script #=> :latin
# returns array of property values
"Aあア".script #=> [:latin, :hiragana, :katakana]
# returns specified property of first character only
"Aあア".property(:script) #=> :latin
# returns array of specified properties' values
"Aあア".property(:script) #=> [:latin, :hiragana, :katakana]
# returns arrays of property values, one array per character
"Aあア".properties([:script, :general_category]) #=> [[:latin, :Lu], [:hiragana, :Lo], [:katakana, :Lo]]
```
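Until such a method exists, the array-returning variant can be approximated by testing each character against a list of script regular expressions. This is only a rough sketch, not the proposed implementation: the module `ScriptLookup`, its methods, and the four-entry `SCRIPTS` table are hypothetical, and a real version would need to cover every script.

```ruby
# Hypothetical helper; SCRIPTS is a small hand-picked subset of scripts.
module ScriptLookup
  SCRIPTS = {
    latin: /\p{Latin}/,
    hiragana: /\p{Hiragana}/,
    katakana: /\p{Katakana}/,
    han: /\p{Han}/,
  }.freeze

  # Script of a single character, or nil if no listed script matches.
  def self.script_of(char)
    SCRIPTS.each { |name, rx| return name if rx.match?(char) }
    nil
  end

  # One script per character, like the array-returning variant above.
  def self.scripts(str)
    str.each_char.map { |c| script_of(c) }
  end
end

p ScriptLookup.scripts("Aあア") #=> [:latin, :hiragana, :katakana]
```

A linear scan over regular expressions is slow compared with a direct table lookup, which is one reason to want built-in access to the property data.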
The interface is still in flux, comments welcome!
Implementation depends on #13240.
In Python, such functionality is available in the standard library, although it is quite limited in property coverage and not available directly on strings (see https://docs.python.org/3/library/unicodedata.html).