I didn’t know much about file encodings prior to this week. I opened a file in Ruby for reading or writing, and the encoding defaulted to UTF-8. I was happily ignorant about directly interfacing with this layer of data storage. That is, I was happily ignorant until a project I am working on started accepting file uploads from users.
What is an Encoding?
Yehuda Katz does a great job at summarizing what an encoding is:
On disk, all Strings are stored as a sequence of bytes. An encoding simply specifies how to take those bytes and convert them into “codepoints”. In some languages, such as English, a “codepoint” is exactly equivalent to “a character”. In most other languages, there is not a one-to-one correspondence. For example, a German codepoint might specify that the next codepoint should get an ümlaut.
Taken from http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/.
The Ruby Default Encoding
When reading file data from a disk (say, after a user uploads a file and you process it), the default external encoding for strings is determined by Encoding.default_external, which you can inspect or change:
Encoding.default_external #=> #<Encoding:UTF-8>
Encoding.default_external = Encoding::UTF_16 #=> #<Encoding:UTF-16 (dummy)>
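To see what the default means in practice, here is a minimal sketch (the sample file and its contents are made up for illustration). It writes a couple of UTF-16BE bytes to a temporary file and reads them back without naming an encoding. Ruby labels the resulting string with Encoding.default_external but leaves the bytes untouched, which is how “garbage” strings come about.

```ruby
require "tempfile"

# Hypothetical sample data: "é" (U+00E9) stored as UTF-16BE is the two bytes 0x00 0xE9.
sample = Tempfile.new("default-external-demo")
sample.close
File.binwrite(sample.path, "\u00e9".encode("UTF-16BE"))

# No encoding is named here, so Ruby labels the data with the default external encoding.
text = File.read(sample.path)

puts text.encoding # whatever Encoding.default_external is on this machine
p text.bytes       # the raw bytes are unchanged: [0, 233]
```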
When You Know the Encoding
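Life is easier when someone gives you a file and you know the encoding. If a file is UTF-16BE encoded, you can name that encoding, followed by the UTF-8 encoding to transcode to, in the mode string when opening it:
input = File.open(import_file_path, "r:UTF-16BE:UTF-8") do |f|
  f.read
end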
Reading this way avoids “garbage” in the resulting string. When the mode string also names an internal encoding, as in "r:UTF-16BE:UTF-8", Ruby transcodes the data to a UTF-8 encoded string as it reads. If you prefer to keep the UTF-16BE data as UTF-16BE within Ruby, set only the external encoding and open the file in binary mode, which Ruby requires when reading an ASCII-incompatible encoding such as UTF-16BE without transcoding:
input = File.open(import_file_path, "rb", external_encoding: "UTF-16BE") do |f|
  f.read
end
input.encoding #=> #<Encoding:UTF-16BE>
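As a rough, self-contained check of the two approaches above, the sketch below writes a small UTF-16BE file (the file name and contents are invented for illustration) and reads it back both ways. It uses mode strings for both reads and opens the file in binary mode for the native read, since Ruby raises an ArgumentError when a text-mode read uses an ASCII-incompatible external encoding without transcoding.

```ruby
require "tempfile"

# Hypothetical sample file containing "héllo" encoded as UTF-16BE.
sample = Tempfile.new("utf16be-demo")
sample.close
File.binwrite(sample.path, "h\u00e9llo".encode("UTF-16BE"))

# Transcode while reading: external encoding UTF-16BE, internal encoding UTF-8.
transcoded = File.open(sample.path, "r:UTF-16BE:UTF-8") { |f| f.read }

# Keep the data in its native encoding: binary mode plus an external encoding only.
native = File.open(sample.path, "rb:UTF-16BE") { |f| f.read }

puts transcoded.encoding                  # UTF-8
puts native.encoding                      # UTF-16BE
puts native.encode("UTF-8") == transcoded # true
```

Both strings hold the same characters; they differ only in how those characters are laid out in memory.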
See all the modes that can be specified when opening a file at http://www.ruby-doc.org/core-1.9.3/IO.html.
Guessing a File Encoding
BSD has a great utility for detecting the encoding of a given file. You can call the file command on it:
file <filename> # Big-endian UTF-16 Unicode text
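From Ruby, the same utility can be invoked through Open3 rather than a raw shell string. The helper below is a hypothetical sketch, not part of any library. It assumes a file implementation that accepts the --brief and --mime-encoding options, and it returns nil when the utility is missing.

```ruby
require "open3"

# Hypothetical helper: asks the `file` utility for a file's encoding.
# Returns a lowercase name such as "utf-16be" or "us-ascii", or nil when
# the utility is unavailable or the path is not a regular file.
def encoding_from_file_command(path)
  return nil unless File.file?(path)

  # Passing arguments separately avoids shell interpolation of the path.
  output, status = Open3.capture2("file", "--brief", "--mime-encoding", path)
  status.success? ? output.strip : nil
rescue Errno::ENOENT
  nil # `file` is not installed on this machine
end
```

Not every answer the utility gives is a Ruby encoding name (it can report "binary", for instance), so the result is best treated as a hint to validate with Encoding.find before use.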
You can shell out to this command using backticks, but the cmess gem lets you guess the encoding from within Ruby:
require 'cmess/guess_encoding'
input = File.read(<filename>)
charset = CMess::GuessEncoding::Automatic.guess(input)
charset #=> "UTF-16BE"
Now that we have guessed at the encoding, we can read the file using the proper external encoding value:
# Binary mode is required for ASCII-incompatible encodings such as UTF-16BE
input = File.open(<filename>, "rb:#{charset}") do |f|
  f.read
end
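When adding a gem is not an option, a much cruder guess is possible by checking for a byte order mark (BOM) at the start of the file. This is a different and far more limited technique than the statistical guessing cmess performs: it only recognizes Unicode files that actually begin with a BOM. The helper name below is hypothetical.

```ruby
# Hypothetical helper: returns the encoding name implied by a leading BOM, or nil.
def encoding_from_bom(path)
  # Longer marks come first so UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
  boms = [
    ["UTF-32BE", [0x00, 0x00, 0xFE, 0xFF]],
    ["UTF-32LE", [0xFF, 0xFE, 0x00, 0x00]],
    ["UTF-8",    [0xEF, 0xBB, 0xBF]],
    ["UTF-16BE", [0xFE, 0xFF]],
    ["UTF-16LE", [0xFF, 0xFE]]
  ]

  # Only the first four bytes are needed; an empty file yields no bytes at all.
  head = (File.binread(path, 4) || "").bytes
  match = boms.find { |_name, bom| head.first(bom.length) == bom }
  match && match.first
end
```

A nil result means only that no BOM was found, not that the file is plain ASCII or UTF-8, so a fuller guesser is still needed for files without one.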
Integration with Paperclip
If you are using the Paperclip gem, you can use its callbacks to process the file after it uploads. Here is an example of this implementation, storing the character encoding in an attribute named import_file_encoding:
require 'cmess/guess_encoding'

class Import
  has_attached_file :import
  before_import_post_process :set_file_import_encoding

  def set_file_import_encoding
    input = File.read(import_file_path)
    charset = CMess::GuessEncoding::Automatic.guess(input)
    self.import_file_encoding = charset
  end
end
In subsequent file reads, you will need to explicitly specify the encoding that we calculated earlier. An example may be:
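# Read in the stored encoding and hand UTF-8 strings to the rest of the code
File.open(import_file_path, "r:#{import_file_encoding}:UTF-8") # or
CSV.foreach(import_file_path, encoding: "#{import_file_encoding}:UTF-8") # or
CSV.read(import_file_path, encoding: "#{import_file_encoding}:UTF-8")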
Voila! Now the rest of your code can read from the file, no matter what the encoding is. You won’t have to check for the encoding again, since we have cached this data. Just remember that any subsequent calls will need to specifically set the character encoding if the file is not in UTF-8 format.
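As a concrete illustration of reading with a stored encoding, here is a small sketch that runs outside of Rails. The local variables import_file_path and import_file_encoding stand in for the model attributes, and the CSV contents are invented sample data.

```ruby
require "csv"
require "tempfile"

# Hypothetical stand-ins for the model attributes used in this article.
import_file = Tempfile.new(["import", ".csv"])
import_file.close
import_file_path = import_file.path
import_file_encoding = "UTF-16BE" # the value cached at upload time

# Sample upload: a two-row CSV stored as UTF-16BE.
File.binwrite(import_file_path, "name,city\nJos\u00e9,Z\u00fcrich\n".encode("UTF-16BE"))

# "EXTERNAL:INTERNAL" reads the file in its stored encoding and hands CSV UTF-8 strings.
rows = CSV.read(import_file_path, encoding: "#{import_file_encoding}:UTF-8")

p rows # [["name", "city"], ["José", "Zürich"]]
```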
Final Thoughts and References
You will probably encounter lots of different encodings in your application, especially if it is used in other parts of the world. You can’t assume that the files you get will be encoded in the format you expect. When in doubt, it’s better to check.
EDIT: Thanks to James Sumners for persuading me that my earlier approach of transcoding the file was not optimal. Transcoding can be error prone, and data accuracy is important. It is better to read the file in its native encoding than to attempt to transcode it.
Note: Some encodings are not interchangeable. For example, UTF-8 can represent many characters that have no equivalent in ASCII. When transcoding, these characters would be lost. Be on guard for this behavior. Most users have come to expect such unrepresentable characters to be replaced with a question mark or some other placeholder.
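A minimal sketch of that lossy behavior, using a UTF-8 string transcoded to US-ASCII as the example pair (chosen here purely for illustration). By default String#encode raises when a character has no equivalent in the target encoding; the undef and replace options substitute a placeholder instead.

```ruby
text = "Z\u00fcrich \u2603" # "Zürich ☃": neither ü nor the snowman exists in US-ASCII

# Without options, an unrepresentable character raises an error.
begin
  text.encode("US-ASCII")
  strict_error = nil
rescue Encoding::UndefinedConversionError => e
  strict_error = e.class
end

# With undef: :replace, each unrepresentable character becomes the placeholder.
lossy = text.encode("US-ASCII", undef: :replace, replace: "?")

puts lossy # Z?rich ?
```

The replacement cannot be undone: once the placeholder is written, the original characters are gone.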
—Ben Simpson