Jul 13 2012

Oh, the Files You'll Transcode!

I didn’t know much about file encodings prior to this week. I opened a file in Ruby for reading or writing, and the encoding defaulted to UTF-8. I was happily ignorant about directly interfacing with this layer of data storage. That is, I was happily ignorant until a project I am working on started accepting file uploads from users.

What is an Encoding?

Yehuda Katz does a great job at summarizing what an encoding is:

On disk, all Strings are stored as a sequence of bytes. An encoding simply specifies how to take those bytes and convert them into “codepoints”. In some languages, such as English, a “codepoint” is exactly equivalent to “a character”. In most other languages, there is not a one-to-one correspondence. For example, a German codepoint might specify that the next codepoint should get an ümlaut.

Taken from http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/.
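To make the byte/codepoint/character distinction concrete, here is a minimal Ruby sketch. The sample strings are my own illustration and are not taken from Yehuda's article; it assumes the source file is UTF-8, which is Ruby's default since 2.0.

```ruby
# Two ways to write "ü": one precomposed codepoint, or "u" plus a combining mark.
precomposed = "\u00FC"   # LATIN SMALL LETTER U WITH DIAERESIS
combining   = "u\u0308"  # "u" followed by COMBINING DIAERESIS

puts precomposed.bytes.length       # 2 (bytes in UTF-8)
puts precomposed.codepoints.length  # 1
puts combining.bytes.length         # 3
puts combining.codepoints.length    # 2

# The same two bytes mean different things under different encodings.
raw       = "\xC3\xBC".b
as_utf8   = raw.dup.force_encoding("UTF-8")       # one character
as_latin1 = raw.dup.force_encoding("ISO-8859-1")  # two characters

puts as_utf8.length    # 1
puts as_latin1.length  # 2
```

Note that force_encoding only relabels the bytes; it never changes them. That is why picking the wrong label produces garbage rather than an error.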

The Ruby Default Encoding

When reading file data from disk (say, after a user uploads a file and you process it), the encoding assigned to the resulting strings is determined by the Encoding.default_external value. When data is written to disk, it is transcoded to Encoding.default_external only if Encoding.default_internal is set; in plain Ruby it is nil, so strings are written as-is. You can get and set this value in your project:

Encoding.default_external #=> #<Encoding:UTF-8>
Encoding.default_external = Encoding::UTF_16 # => #<Encoding:UTF-16 (dummy)>

When You Know the Encoding

Life is easier when someone gives you a file and you know the encoding. If a file is UTF-16BE encoded, you could read it as follows:

File.open(import_file_path, "r:UTF-16BE") do |f|
  input = f.read
end

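This reads the UTF-16BE encoded file without producing “garbage”. Whether the result is transcoded depends on Encoding.default_internal: when it is UTF-8 (Rails sets this by default), the string you get back is UTF-8; when it is nil, as in plain Ruby, the string stays UTF-16BE unless you ask for transcoding with a mode such as "r:UTF-16BE:UTF-8". If you prefer to keep the UTF-16BE data as UTF-16BE within Ruby regardless of the default, you can set the internal encoding for that one file to match the external encoding. For example: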

input = File.open(import_file_path, external_encoding: "UTF-16BE", internal_encoding: "UTF-16BE") do |f|
  f.read
end

input.encoding #=> #<Encoding:UTF-16BE>

See all the modes that can be specified when opening a file at http://www.ruby-doc.org/core-1.9.3/IO.html.
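The difference between external and internal encoding is easiest to see side by side. The following is a small sketch, not a definitive recipe: the sample text and the temp file name are made up for illustration, and it assumes Encoding.default_internal is nil (the plain Ruby default). The "b" flag is included because ASCII-incompatible encodings such as UTF-16 need binary mode on some platforms.

```ruby
require "tempfile"

# Hypothetical sample data; any text with a non-ASCII character will do.
text = "caf\u00E9"

transcoded, native = Tempfile.create("utf16-sample") do |tmp|
  # Write the raw UTF-16BE bytes without any conversion.
  tmp.binmode
  tmp.write(text.encode("UTF-16BE"))
  tmp.flush

  # External UTF-16BE, internal UTF-8: bytes are transcoded while reading.
  as_utf8 = File.open(tmp.path, "rb:UTF-16BE:UTF-8") { |f| f.read }

  # External encoding only: the string is tagged UTF-16BE and left untouched
  # (assuming Encoding.default_internal is nil).
  as_utf16 = File.open(tmp.path, "rb:UTF-16BE") { |f| f.read }

  [as_utf8, as_utf16]
end

puts transcoded.encoding  # UTF-8
puts transcoded == text   # true
puts native.bytesize      # twice the character count when left as UTF-16BE
```

Both strings hold the same text; they differ only in how that text is laid out in memory, which matters as soon as you compare or concatenate them with UTF-8 strings.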

Guessing a File Encoding

BSD has a great utility for detecting the encoding of a given file. You can call file <filename> and it will guess the file encoding:

file <filename> # Big-endian UTF-16 Unicode text

You can shell out to this utility using backticks, but file may not be available on the host system. If you don’t care about being platform agnostic, mimer_plus is a good option (it actually uses the BSD file utility). Fortunately, the CMess Ruby library handles encoding detection for us while remaining platform agnostic. I had trouble with the default CMess gem hosted on Rubygems, but I found the fork at ~~http://github.com/fac/cmess.git~~ to work quite well. Once it is installed, we can call it as follows:

require 'cmess/guess_encoding'

input = File.read(<filename>)
charset = CMess::GuessEncoding::Automatic.guess(input)
charset #=> "UTF-16BE"

Now that we have guessed at the encoding, we can read the file using the proper external encoding value:

File.open(<filename>, "r:#{charset}") do |f|
  input = f.read
end
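Since a guess can be wrong, it may be worth sanity-checking it before relying on it. Below is a minimal sketch using only core Ruby; plausible_encoding? is a hypothetical helper of my own, not part of CMess. It checks that the name is one Ruby recognizes and that the bytes are well-formed under that encoding.

```ruby
# Hypothetical helper: true when Ruby knows `charset` and `bytes` are
# well-formed in it. This is a necessary check, not a sufficient one.
def plausible_encoding?(bytes, charset)
  encoding = Encoding.find(charset)  # raises ArgumentError for unknown names
  bytes.dup.force_encoding(encoding).valid_encoding?
rescue ArgumentError
  false
end

utf16_bytes  = "caf\u00E9".encode("UTF-16BE").b
latin1_bytes = "caf\u00E9".encode("ISO-8859-1").b

puts plausible_encoding?(utf16_bytes, "UTF-16BE")        # true
puts plausible_encoding?(latin1_bytes, "UTF-8")          # false: a lone \xE9 is not valid UTF-8
puts plausible_encoding?(utf16_bytes, "no-such-charset") # false: unknown encoding name
```

Keep in mind that single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so passing this check does not prove the guess is right; it only catches guesses that cannot be right.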

Integration with Paperclip

If you are using the Paperclip gem, you can use its callbacks to process the file when it is uploaded. Here is an example of this implementation, storing the character encoding in an attribute named import_file_encoding:

require 'cmess/guess_encoding'

class Import < ActiveRecord::Base
  has_attached_file :import
  before_import_post_process :set_file_import_encoding

  # Guess the encoding once and cache it on the record.
  def set_file_import_encoding
    input = File.read(import_file_path)
    charset = CMess::GuessEncoding::Automatic.guess(input)
    self.import_file_encoding = charset
  end
end

In subsequent file reads, you will need to explicitly specify the encoding that we calculated earlier. An example may be:

File.open(import_file_path, "r:#{import_file_encoding}") # or
CSV.foreach(import_file_path, encoding: import_file_encoding) # or
CSV.read(import_file_path, encoding: import_file_encoding)

Voila! Now the rest of your code can read from the file, no matter what the encoding is. You won’t have to check for the encoding again, since we have cached this data. Just remember that any subsequent calls will need to specifically set the character encoding if the file is not in UTF-8 format.

Final Thoughts and References

You will probably encounter lots of different encodings in your application, especially if it is used in other parts of the world. You can’t assume that the files you get will be encoded in the format you expect. When in doubt, it’s better to check.

EDIT: Thanks to James Sumners for persuading me that my earlier approach of transcoding the file was non-optimal. Transcoding can be error-prone, and data accuracy is important. It is better to read the file in its native encoding than to attempt a transcoding.

Note: Some encodings are not interchangeable. Unicode encodings such as UTF-8 and UTF-16 can represent the same characters, but many of those characters cannot be represented in older encodings such as US-ASCII or ISO-8859-1. When transcoding to one of those, such characters are lost, as there is no equivalent. Be on guard for this behavior. Most users have come to expect these unrepresentable characters to be replaced with a question mark or some other placeholder.
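Here is a short sketch of what that placeholder behavior looks like with String#encode. The sample text and the choice of US-ASCII as the target are my own, picked only because US-ASCII cannot represent any of the accented characters.

```ruby
text = "na\u00EFve caf\u00E9 \u2603"  # contains ï, é and a snowman

# A strict conversion raises when a character has no equivalent in the target.
strict_error =
  begin
    text.encode("US-ASCII")
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Opting in to replacement swaps each unrepresentable character for "?".
lossy = text.encode("US-ASCII", undef: :replace, replace: "?")

puts strict_error  # Encoding::UndefinedConversionError
puts lossy         # na?ve caf? ?
```

The strict form is the safer default when data accuracy matters, since it fails loudly instead of silently discarding characters.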

  • Ruby 1.9.3 Encoding Class Docs
  • Ruby 1.9.3 IO Class Docs
  • ~~Understanding M17N~~

—Ben Simpson

MojoTech
