RedHanded

Futurism: Unicode In Ruby

by why in inspect

When asked about the future of Unicode in Ruby 1.9/2.0, Matz replied to Ruby-Core with the following laundry list of features he expects in Ruby’s multibyte character support:

  • characters are represented by single-character strings, so that "abc"[0] returns "a" instead of the Fixnum 97.
  • all string methods are aware of multibyte characters.
  • new method String#encoding gives character encoding name (e.g. "utf-8").
  • new method IO#encoding gives character encoding name for reading data.
  • new method IO#encoding= sets the character encoding for reading data.
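To make the list concrete, here is a minimal sketch of how these behaviours look in a Ruby that ships them (1.9 or later is assumed). One caveat: IO#encoding and IO#encoding= are the names from Matz's message; the sketch uses IO#external_encoding and IO#set_encoding, which are the names the released API uses for the same idea. The pipe is only a stand-in for any IO object.

```ruby
# Assumes Ruby 1.9 or later, where these behaviours are built in.
s = "caf\u00e9"              # four characters, five bytes in UTF-8

first = "abc"[0]             # a one-character string, not an integer
chars = s.length             # String methods count characters...
bytes = s.bytesize           # ...while bytesize still counts bytes
name  = s.encoding.name      # encoding name as a plain string

# IO side: a pipe stands in for any IO object.
reader, writer = IO.pipe
reader.set_encoding("UTF-8") # released counterpart of the proposed IO#encoding=
io_name = reader.external_encoding.name

writer.write(s)
writer.close
data = reader.read           # the string read back is tagged as UTF-8
reader.close
```

Here `first` is "a", `chars` is 4 and `bytes` is 5, which is the split between character-aware methods and raw bytes that the list is asking for.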

A library which emulates this could be built, based on Ruby’s current iconv lib. Anybody want to take a stab at it?
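For anyone tempted, a rough sketch of the shape such an emulation might take. Everything here is an assumption made for illustration: the class name CharString is hypothetical, it handles UTF-8 only, and it decodes with String#unpack("U*") rather than iconv, purely to keep the example short and self-contained.

```ruby
# Hypothetical wrapper, not an existing library: treats a UTF-8 byte
# string as a sequence of characters instead of a sequence of bytes.
class CharString
  def initialize(bytes)
    # unpack("U*") decodes UTF-8 into an array of code points.
    @codepoints = bytes.unpack("U*")
  end

  # Number of characters, not bytes.
  def length
    @codepoints.length
  end

  # Returns a one-character string, or nil when the index is out of range.
  def [](index)
    cp = @codepoints[index]
    cp && [cp].pack("U")
  end

  # Fixed answer, since this sketch only understands UTF-8.
  def encoding
    "utf-8"
  end
end

word   = CharString.new("\u65e5\u672c\u8a9e") # three characters, nine bytes
middle = word[1]                              # the second character
```

A real library would swap the unpack call for an iconv conversion so that encodings other than UTF-8 could be wrapped the same way.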

said on

Do we have a time frame for the release of Ruby 2? Has Matz finished with the 1.8 release yet, I wonder?

said on

By the way, Why, how is chapter 6 of the poignant guide shaping up? Christmas has been and gone, you know…

said on

What exactly can I say to placate you? I can’t have you in despair.

said on

I would.

said on

Hows about ‘chapter 6 is being uploaded now.’ That’d do it.

said on

What about different charsets? Are Ruby strings going to be stored as Unicode (I assume not)? If not, will Ruby have a pluggable charset handler or some such? Will there be a String#charset? What about String#language? (I vaguely remember that each string instance in Parrot will be tagged with charset, encoding, and language.)

said on

Isn’t charset == encoding? Could you elaborate on ruby-talk, maybe?

said on

Charset and encoding are two different concepts. Unicode is a charset (supposedly the be-all and end-all of charsets). Unicode can be encoded as UTF-8, UTF-16, etc.
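The distinction is easy to see with one character in two encodings. A small sketch, assuming a Ruby with String#encode (1.9 or later): the code point stays the same while the bytes differ.

```ruby
# One Unicode character, U+00E9 (e with acute accent).
e_acute = "\u00e9"
as_utf16 = e_acute.encode("UTF-16BE")

utf8_bytes  = e_acute.bytes    # [195, 169]
utf16_bytes = as_utf16.bytes   # [0, 233]

# Same charset entry (code point 233) under both encodings.
same_char = e_acute.codepoints == as_utf16.codepoints
```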

said on

An encoding implies a character set, so that isn’t really necessary.

said on

Just a thought, how about a String#convert_char that takes a block with 3 parameters, the encoding of the original char, the desired target encoding, and the character itself?

The block converts the string one character at a time, each time returning the converted character in the target encoding. The results of this block are concatenated to form the target string. This could be used internally to support the UTF-8 encoding.

However, there’s still the issue of the byte order mark…
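The convert_char idea above can be sketched roughly as follows. This is only an illustration under assumptions: convert_char is written as a standalone hypothetical helper rather than a real String method, it relies on the per-string encodings of Ruby 1.9 or later, and it does nothing about byte order marks.

```ruby
# Hypothetical helper, not part of Ruby's String API.
# Yields (source encoding, target encoding, character) for each character
# and concatenates whatever the block returns.
def convert_char(str, target)
  source = str.encoding
  dest   = Encoding.find(target)
  str.each_char.map { |ch| yield(source, dest, ch) }.join
end

latin1 = "caf\u00e9".encode("ISO-8859-1")

# The block does the actual per-character conversion.
utf8 = convert_char(latin1, "UTF-8") do |from, to, ch|
  ch.encode(to, from)
end
```

Because the block sees both encodings, it could also substitute or drop characters that have no equivalent in the target encoding.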
