Character encoding in Ruby

About character encoding: why do we need it?

Classic character sets such as ASCII (often loosely called ANSI) only support a small set of standard English characters. For non-English characters an encoding is necessary. Its purpose is to extend ANSI/ASCII by using more than one byte to represent non-ASCII characters, such as Chinese text or the copyright sign.

Different encoding standards

UTF-8 is supposed to be the standard encoding, but for Chinese characters GBK/GB2312 is still widely used because it is more compact. UTF-8 uses 1 to 4 bytes to represent a character (the original design allowed up to 6). Most Chinese characters need 3 bytes of storage in UTF-8, while GBK takes only 2 bytes.
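The size difference can be checked directly. This is a minimal sketch that assumes Ruby 1.9 or later, where strings carry an encoding and String#encode is available; byte_sizes is a hypothetical helper name used only for this illustration.

```ruby
# Compare how many bytes the same text takes in UTF-8 and in GBK.
# byte_sizes is a hypothetical helper, not part of any library.
def byte_sizes(text)
  {
    "UTF-8" => text.encode("UTF-8").bytesize,
    "GBK"   => text.encode("GBK").bytesize
  }
end

chinese = "\u4e2d\u6587"   # the two characters of "中文", written as escapes

p byte_sizes(chinese)      # 6 bytes in UTF-8, 4 bytes in GBK
p byte_sizes("abc")        # plain ASCII text takes 3 bytes in both
```

For ASCII-only text the two encodings are the same size; the saving only shows up for Chinese characters.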

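Given a bit stream like ‘111000111000’, GBK might treat it as 11|10|00|11|10|00, while UTF-8 might treat it as 111|000|111|000. This is a simplified picture, since both encodings really work on whole bytes, but the idea is the same: the same bits mean different characters under different encodings. Knowing the right string encoding is a must to avoid garbled display.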
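The effect of reading bytes with the wrong encoding can be reproduced in a few lines. This sketch assumes Ruby 1.9 or later; decode_as is a hypothetical helper that only relabels the bytes and does not convert them.

```ruby
# Interpret the same bytes under a given encoding without changing them.
# decode_as is a hypothetical helper name.
def decode_as(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name)
end

gbk_bytes = "\u4e2d\u6587".encode("GBK")   # "中文" stored as GBK bytes

as_gbk  = decode_as(gbk_bytes, "GBK")
as_utf8 = decode_as(gbk_bytes, "UTF-8")

puts as_gbk.encode("UTF-8")     # readable again once the right encoding is known
puts as_utf8.valid_encoding?    # false: these bytes are not well-formed UTF-8
```

When the wrong guess happens to produce valid byte sequences, no error is raised and the text simply displays as the wrong characters.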

Unicode

To write a character by its actual code point in code, developers usually add a U/u prefix in front of the hex value. This is what the Unicode representation looks like.

For example: “中文” and “\\u4e2d\\u6587” are the same thing; the latter just uses the Unicode representation.
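In Ruby source the difference between the two forms comes down to whether the backslash is escaped. The sketch below assumes Ruby 1.9 or later, where double-quoted strings understand \u escapes; code_points_hex is a hypothetical helper name.

```ruby
real    = "\u4e2d\u6587"     # \u escapes are decoded: two Chinese characters
escaped = "\\u4e2d\\u6587"   # escaped backslashes: twelve plain ASCII characters

# List the code points of a UTF-8 string as four-digit hex values.
# code_points_hex is a hypothetical helper name.
def code_points_hex(text)
  text.unpack("U*").map { |cp| "%04x" % cp }
end

p code_points_hex(real)   # ["4e2d", "6587"]
p real.length             # 2
p escaped.length          # 12
```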

Conversion

Ruby 1.9 can convert Unicode to UTF-8 with a single Iconv call:

Iconv.iconv("utf-8","unicode",escaped)
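Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so on newer Rubies the same kind of conversion is usually done with String#encode. The sketch below assumes the input is raw UTF-16 big-endian bytes (one common meaning of "unicode" as an encoding name), not text containing \u escapes; utf16be_to_utf8 is a hypothetical helper name.

```ruby
# Convert raw UTF-16BE bytes to a UTF-8 string with String#encode.
# utf16be_to_utf8 is a hypothetical helper name.
def utf16be_to_utf8(bytes)
  bytes.dup.force_encoding("UTF-16BE").encode("UTF-8")
end

raw = [0x4e2d, 0x6587].pack("n*")   # two 16-bit big-endian code units
puts utf16be_to_utf8(raw)           # prints the two characters of "中文"
```

Text that literally contains backslash-u sequences is a different case and needs one of the approaches for escaped strings.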

In Ruby 1.8, there are a few solutions.

Solution 1) Use the JSON library

   require 'json'

   escaped = "\\u4e2d\\u6587"
   # Wrap the escaped text in a JSON array so the parser decodes the \u sequences.
   JSON.parse( %Q{["#{escaped}"]} )[0].should == "中文"

Solution 2) Manually convert

   escaped = "\\u4e2d\\u6587"
   unicode_utf8(escaped).should == "中文"

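   require 'cgi'

   # Converts "\uXXXX" escapes to UTF-8 by building each byte by hand.
   def unicode_utf8(unicode_string)
     unicode_string.gsub(/\\u[0-9a-fA-F]{4}/) do |escape|
       bits = escape.sub(/\\u/, "").hex.to_s(2)
       if bits.length < 8
         # ASCII range: a single byte holds the whole code point.
         CGI.unescape("%%%02x" % bits.to_i(2))
       else
         # Split into 6-bit groups from the right; the leftover bits go into the lead byte.
         groups = bits.reverse.scan(/\w{1,6}/).reverse.map { |g| g.reverse }
         # The lead byte only has room for 7 - groups.length bits; add an empty group if needed.
         groups.unshift("") if groups.first.length > 7 - groups.length
         bytes = groups.each_with_index.map do |group, i|
           byte = i.zero? ? "1" * groups.length + "0" * (8 - groups.length - group.length) + group : "10" + group
           "%%%02x" % byte.to_i(2)
         end
         CGI.unescape(bytes.join)
       end
     end
   end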
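For comparison, the same conversion can be sketched much more briefly by letting Array#pack build the UTF-8 bytes. This is an alternative to the manual bit manipulation, not the approach used above; unescape_unicode is a hypothetical helper name, and surrogate pairs (characters outside the Basic Multilingual Plane) are not handled.

```ruby
# Replace each \uXXXX escape with the UTF-8 character for that code point.
# unescape_unicode is a hypothetical helper name.
def unescape_unicode(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

puts unescape_unicode("\\u4e2d\\u6587")   # prints the two characters of "中文"
```

The "U" directive of pack encodes an integer code point as UTF-8, which is exactly the step the manual version spells out bit by bit.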

Encoding in JSON

JSON doesn’t have a HEAD section, so there is nowhere to set a charset meta tag. Using Unicode escapes is recommended; otherwise the client may not know how to display the text. With the JSON library for Ruby, this can be done by simply turning on the ascii_only option.

    json_string = JSON.fast_generate(@sut,
      :ascii_only => true
      )
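A self-contained version of the same idea, using a plain hash in place of @sut. This sketch uses JSON.generate rather than fast_generate and assumes a version of the json library that accepts the ascii_only option; to_ascii_json is a hypothetical helper name.

```ruby
require 'json'

# Serialize data so that every non-ASCII character becomes a \uXXXX escape.
# to_ascii_json is a hypothetical helper name.
def to_ascii_json(data)
  JSON.generate(data, :ascii_only => true)
end

json_string = to_ascii_json({ "title" => "\u4e2d\u6587" })

puts json_string                 # the title appears as \u escapes
puts json_string.ascii_only?     # true: safe to send without a charset
puts JSON.parse(json_string)["title"] == "\u4e2d\u6587"   # true: parsing restores the text
```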

The other way (without JSON) to get the Unicode representation from UTF-8 in Ruby 1.8:

p "\\u"+@sut.title.unpack("U*").map{|c|"%04x" %c}.join("\\u")
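The one-liner works because unpack("U*") decodes a UTF-8 string into its integer code points, which are then formatted as four-digit hex. A standalone sketch of the two steps follows; utf8_to_escapes is a hypothetical helper name, and code points above U+FFFF would need surrogate pairs, which this sketch does not produce.

```ruby
# Turn a UTF-8 string into \uXXXX escapes, one per code point.
# utf8_to_escapes is a hypothetical helper name.
def utf8_to_escapes(text)
  text.unpack("U*").map { |cp| "\\u%04x" % cp }.join
end

p "\u4e2d\u6587".unpack("U*")           # [20013, 25991], i.e. 0x4e2d and 0x6587
puts utf8_to_escapes("\u4e2d\u6587")    # \u4e2d\u6587
```

Putting the prefix inside the format string, instead of joining on it, also means an empty input yields an empty string rather than a stray prefix.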

Complete code demo:

  it "should convert different encoding" do
    @sut.title = "中文"
    unicoded_title = "\\u4e2d\\u6587"

    utf8_to_unicode(@sut.title).should == unicoded_title

    json_string = JSON.fast_generate(@sut,
      :ascii_only => true
      )

    JSON.parse(json_string)['title'].should == @sut.title

    JSON.parse( %Q{["#{unicoded_title}"]} )[0].should == @sut.title

    unicode_to_utf8(unicoded_title).should == @sut.title
  end

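  # Converts "\uXXXX" escapes to UTF-8 by building each byte by hand (needs require 'cgi').
  def unicode_to_utf8(unicode_string)
    unicode_string.gsub(/\\u[0-9a-fA-F]{4}/) do |escape|
      bits = escape.sub(/\\u/, "").hex.to_s(2)
      if bits.length < 8
        # ASCII range: a single byte holds the whole code point.
        CGI.unescape("%%%02x" % bits.to_i(2))
      else
        # Split into 6-bit groups from the right; the leftover bits go into the lead byte.
        groups = bits.reverse.scan(/\w{1,6}/).reverse.map { |g| g.reverse }
        # The lead byte only has room for 7 - groups.length bits; add an empty group if needed.
        groups.unshift("") if groups.first.length > 7 - groups.length
        bytes = groups.each_with_index.map do |group, i|
          byte = i.zero? ? "1" * groups.length + "0" * (8 - groups.length - group.length) + group : "10" + group
          "%%%02x" % byte.to_i(2)
        end
        CGI.unescape(bytes.join)
      end
    end
  end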

  def utf8_to_unicode(string) # :nodoc:
    '\\u' + string.unpack("U*").map { |c| "%04x" % c }.join('\\u')
  end

About GBK
When fetching a GBK-encoded web page in Ruby using net/http, the characters sometimes come out garbled. The same happens with cURL.
Switching to wget to save the page to a file and then parsing the file works fine.
I don’t know why wget is better at dealing with different encodings.
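One possible explanation, offered as a guess and not something confirmed here, is that the body arrives as raw bytes with no encoding attached, so nothing converts it from GBK before it is displayed. Under that assumption, and on Ruby 1.9 or later, the bytes can be labeled as GBK and transcoded explicitly. The sketch below uses a stand-in string instead of a real HTTP response; body_to_utf8 is a hypothetical helper name.

```ruby
# Label raw bytes with their real encoding, then transcode to UTF-8.
# body_to_utf8 is a hypothetical helper name.
def body_to_utf8(raw_body, source_encoding)
  raw_body.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Stand-in for a response body: GBK bytes tagged as binary,
# which is how raw bytes without charset information look in Ruby.
raw_body = "\u4e2d\u6587".encode("GBK").force_encoding("ASCII-8BIT")

puts body_to_utf8(raw_body, "GBK")   # prints the two characters of "中文"
```

The source encoding still has to be known in advance, for example from the Content-Type header or the page's meta tag.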
