Convert charset from GB2312 to Unicode in Ruby

Every Monday, a radio station in Singapore uploads a new episode of their Movie Review program, one of my favorite podcasts. Unfortunately, they only update the HTML page; the podcast feed usually gets updated days later.

I can't wait that long, so I wrote a Ruby program that scrapes their HTML page, extracts the MP3 URL, and generates an up-to-date podcast feed for my iTunes.

mc = MovieCafe.new
mc.get_mp3_list

# Newest episode listed on the HTML page
newitem = mc.mp3_list.first

# Update the feed only if the page lists an episode the feed does not have yet
if newitem.mp3_url != mc.rss.items.first.enclosure.url
  # item = RSS::Rss::Channel::Item.new
  item = mc.rss.items.first
  item.title = newitem.title
  item.enclosure.url = newitem.mp3_url

  item.description = newitem.description
  item.pubDate = Time.now

  mc.rss.items.push(item)

  mc.rss.channel.lastBuildDate = Time.now
end

puts mc.generate_feed
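
The MovieCafe class itself is not shown here. As a rough illustration of the "get the MP3 URL" step only, here is a minimal sketch that pulls MP3 links out of a chunk of HTML with a regular expression. The helper name `extract_mp3_urls` and the sample markup are made up for this example and are not the real page or the real implementation.

```ruby
# Hypothetical helper (not part of the original program):
# collect every href value that ends in .mp3 from raw HTML.
def extract_mp3_urls(html)
  # scan returns one single-element array per match, so flatten the result.
  html.scan(/href\s*=\s*["']([^"']+\.mp3)["']/i).flatten.uniq
end

# Made-up sample markup standing in for the station's page.
sample_html = <<~HTML
  <a href="http://example.com/audio/review-01.mp3">Episode 1</a>
  <a href="http://example.com/about.html">About</a>
  <a href='http://example.com/audio/review-02.mp3'>Episode 2</a>
HTML

puts extract_mp3_urls(sample_html)
```

A regular expression is enough for a page with simple, regular links. A real HTML parser would be more robust if the markup changes.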

The problem I ran into was that the HTML page's charset is GB2312, while the podcast feed is UTF-8, so I had to convert the charset. Eventually, I got it working:

require 'iconv'

title = Iconv.new("UTF-8", "gbk").iconv(title)
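
Iconv was dropped from the standard library in Ruby 2.0 (it lives on as a gem), so on a newer Ruby the same conversion can be sketched with the built-in `String#encode`. The variable names and the sample bytes below are assumptions for illustration; in the real program the bytes would come from the scraped page.

```ruby
# Assumption: gbk_title stands in for a title scraped from the GB2312/GBK page.
# The four bytes below are the GBK encoding of two Chinese characters (U+4E2D U+6587).
gbk_title = "\xD6\xD0\xCE\xC4".dup.force_encoding("GBK")

# force_encoding only relabels the bytes; encode actually transcodes them
# and returns a new UTF-8 string.
utf8_title = gbk_title.encode("UTF-8")

puts utf8_title
puts utf8_title.encoding
```

The `force_encoding` call matters: without it Ruby treats the raw bytes as (invalid) UTF-8 and has nothing to convert from.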

Here is my podcast feed for the Movie Review channel; it is at least two days ahead of the official one.

One thought on “Convert charset from GB2312 to Unicode in Ruby”

  1. Sometimes the file contains characters outside the standard GB2312 and GBK sets; in that case, try GB18030.

    I ran into this problem when moving my code from a Chinese Windows OS to an English Linux platform. Switching to GB18030 and using the new RSS parser, Faster-RSS-Simple, solved the problem.
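
The GB18030 suggestion above can be sketched roughly as follows with Ruby's built-in `String#encode`. The helper name `to_utf8` and the sample bytes are hypothetical. GB18030 is a superset of GBK, so text that decodes as GBK decodes the same way here, and the replacement options keep a stray bad byte from raising an exception.

```ruby
# Hypothetical helper (not from the original post): decode raw bytes as
# GB18030 and transcode to UTF-8, substituting "?" for anything that
# cannot be decoded or has no UTF-8 equivalent.
def to_utf8(raw, source_encoding = "GB18030")
  raw.dup.force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Assumed sample input: the same two GBK-encoded characters used earlier
# (U+4E2D U+6587), then the same bytes followed by a stray 0xFF byte.
puts to_utf8("\xD6\xD0\xCE\xC4")
puts to_utf8("\xD6\xD0\xCE\xC4\xFF")
```

Replacing bad bytes silently is a trade-off: the feed always gets generated, but a damaged title will show placeholder characters instead of failing loudly.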
