A library for handling Unicode in Ruby, announced in [ruby-talk:197946].
It can be downloaded from <URL:ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2>.
Usage looks something like this.
Unicode strings can be obtained by applying the + unary operator to native strings, e.g. +"Hello" (where the native string is encoded in the default encoding).
  % irb -I. -runicode -Ku
  irb(main):001:0> ustr = +"π is pi"
  => +"π is pi"
Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.
  irb(main):002:0> str = ustr.to_s
  => "π is pi"
  irb(main):003:0> str.encoding
  => Unicode::Encoding::UTF8
Individual characters can be indexed from Unicode strings, returning a Unicode::Character object.
  irb(main):004:0> ustr[0]
  => U+03C0 GREEK SMALL LETTER PI
Case conversion is handled as with native strings.
  irb(main):005:0> ustr.upcase
  => +"Π IS PI"
Normalization is accomplished with the ~ unary operator.
  irb(main):006:0> ustr = +"mí"
  => +"mí"
  irb(main):007:0> ustr.to_a
  => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
  irb(main):008:0> (~ustr).each_char { |ch| p ch }
  U+006D LATIN SMALL LETTER M
  U+0069 LATIN SMALL LETTER I
  U+0301 COMBINING ACUTE ACCENT
  => +"mí"
Really quite interesting.
Ruby M17N is designed with the main goal of handling multiple encodings without conversion (as far as possible). That works for cases like the C locale model, one encoding per program; but when several encodings are mixed in one program, I have been thinking that in the end you probably have to convert to a unified internal character set (a Universal Character Set, UCS) to process them.
Or rather, the truth is that I just haven't paid much attention to the conversion side of things. That is where Ruby differs from other languages (Perl, Python, and so on), whose basic stance is "convert to Unicode".
Still, for practical use conversion has to happen somewhere, and I have been assuming it would surely be done at the IO boundary.
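(Purely as an illustration, not something from the thread: a minimal sketch of what "convert at the IO boundary" could look like, written with the external:internal encoding notation that Ruby 1.9 later adopted. The file names are made up.)

  # Read an EUC-JP file, transcoding each line to UTF-8 as it crosses the
  # IO boundary into the program.
  File.open("legacy.txt", "r:EUC-JP:UTF-8") do |f|
    f.each_line do |line|
      p line.encoding   # => #<Encoding:UTF-8> -- inside, everything is UTF-8
    end
  end

  # On output, transcode back to the legacy external encoding.
  File.open("out.txt", "w:EUC-JP") do |f|
    f.write("日本語のテキスト")   # UTF-8 literal, written out as EUC-JP bytes
  end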
However, a strong argument for automatic conversion (coercing) has come up. [ruby-talk:198475]
I have shied away from automatic conversion for a number of reasons, but this time the proposal is fairly concrete:
  #
  # NOTES:
  # a) String#recode!(new_encoding) replaces current
  #    internal byte representation with new byte sequence,
  #    that is recoded current. must raise IncompatibleCharError, if
  #    can't convert char to destination encoding
  # b) downgrading string from some stated encoding to "none" tag must
  #    be done only explicitly.
  #    it is not an option for implicit conversion
  # c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
  #    set once and only once per application run.
  #    Intent: we want all strings which aren't raw bytes to be in one
  #    single predefined encoding,
  #    so all operations on string must return string in conformant encoding.
  #    Desired encoding is value of $APPLICATION_UNIVERSAL_ENCODING.
  #    If $APPLICATION_UNIVERSAL_ENCODING is nil, we go in "democracy
  #    mode", see below.
  #
  def coerce_encodings(str1, str2)
    enc1 = str1.encoding
    enc2 = str2.encoding

    # simple case, same encodings, will return fast in most cases
    return if enc1 == enc2

    # another simple but rare case, totally incompatible encodings, as
    # they represent incompatible charsets
    if fully_incompatible_charsets?(enc1, enc2)
      raise(IncompatibleCharError,
            "incompatible charsets %s and %s", enc1, enc2)
    end

    # uncertainity, handling "none" and preset encoding
    if enc1 == "none" || enc2 == "none"
      raise(UnknownIntentEncodingError,
            "can't implicitly coerce encodings %s and %s, use explicit conversion",
            enc1, enc2)
    end

    # Tirany mode:
    # we want all strings which aren't raw bytes to be in one single
    # predefined encoding
    if $APPLICATION_UNIVERSAL_ENCODING
      str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
      str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
      return
    end

    # Democracy mode:
    # first try to perform non-loss conversion from one encoding to another:
    # 1) direct conversion, without loss, to another encoding, e.g. UTF8 + UTF16
    if exists_direct_non_loss_conversion?(enc1, enc2)
      if exists_direct_non_loss_conversion?(enc2, enc1)
        # performance hint if both available
        if str1.byte_length < str2.byte_length
          str1.recode!(enc2)
        else
          str2.recode!(enc1)
        end
      else
        str1.recode!(enc2)
      end
      return
    end

    if exists_direct_non_loss_conversion?(enc2, enc1)
      str2.recode!(enc1)
      return
    end

    # 2) non-loss conversion to superset
    # (I see no reason to raise exception on KOI8R + CP1251,
    # returning string in Unicode will be OK)
    if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
      str1.recode!(superset_encoding)
      str2.recode!(superset_encoding)
      return
    end

    # A case for incomplete compatibility:
    # Check if subset of enc1 is also subset of enc2,
    # so some strings in enc1 can be safely recoded to enc2,
    # e.g. two pure ASCII strings, whatever ASCII-compatible encoding
    # they have
    if exists_partial_loss_conversion?(enc1, enc2)
      if exists_partial_loss_conversion?(enc2, enc1)
        # performance hint if both available
        if str1.byte_length < str2.byte_length
          str1.recode!(enc2)
        else
          str2.recode!(enc1)
        end
      else
        str1.recode!(enc2)
      end
      return
    end

    # the last thing we can try
    str2.recode!(enc1)
  end
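(To make the intent concrete, here is a sketch of my own, not part of the proposal, of how such a coercion hook might be wired into a binary string operation. recode!, the predicate helpers, and this String#+ override are all hypothetical, so this is pseudocode for the idea rather than working code.)

  # Hypothetical wiring: String#+ coerces both operands to a common encoding
  # before touching the raw bytes. None of this is an actual Ruby API.
  class String
    alias_method :raw_concat, :+

    def +(other)
      coerce_encodings(self, other)   # may recode! self, other, or both
      raw_concat(other)               # both sides now share one encoding
    end
  end

With $APPLICATION_UNIVERSAL_ENCODING set ("Tirany mode" in the comments above), every result would come back in that single encoding; with it unset ("Democracy mode"), one operand gets recoded into the other's encoding, or both into a common superset.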
Hmm, interesting (I keep saying that).
Certainly, I think that covers most of the usual application models, so looked at that way this approach may not be so bad. What scares me a little, though, is string contents being swapped out behind my back.
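(A hypothetical example of my own to make that worry concrete, borrowing today-style String#encode just to set the scene and assuming the proposal's recode! semantics.)

  # Under the proposed implicit coercion, an innocent-looking operation could
  # mutate its operands' internal byte representation as a side effect.
  sjis = "社員名簿".encode("Shift_JIS")   # pretend this was read from a legacy file
  utf8 = "補足テキスト (UTF-8)"

  line = sjis + utf8
  # Proposed behaviour: coercion quietly calls something like
  # sjis.recode!("UTF-8") so the operands become compatible; the caller never
  # asked for a conversion, yet sjis's bytes and byte length have changed.
  # (Current Ruby would instead raise Encoding::CompatibilityError here.)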