In this example, the Windows-1252 chart shows that the byte 99 represents the “™” character. If ™ makes sense here, you could assume the input was in Windows-1252 and move on.
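The table lookup can also be approximated in Ruby itself. This is only a rough sketch: the candidate list is an assumption chosen for illustration, and `unknown` and `guesses` are hypothetical names. It reads the same stray byte (hex 99) under each candidate encoding and converts the result to UTF-8 so you can judge which character fits.

```ruby
# The stray byte from the example, built from its numeric value.
unknown = [0x99].pack("C*")

# Hypothetical shortlist of encodings to try; adjust to your input's likely sources.
candidates = ["Windows-1252", "ISO-8859-1"]

# Relabel a copy of the byte with each candidate, then translate to UTF-8 for display.
guesses = candidates.to_h do |name|
  [name, unknown.dup.force_encoding(name).encode("UTF-8")]
end

guesses.each { |name, char| puts "#{name}: #{char.inspect}" }
```

Under Windows-1252 the byte reads as ™, while under ISO-8859-1 it is an invisible control character, which is a hint that Windows-1252 is the likelier source.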
Did someone paste it in from Word? It could be Windows-1252. Did it come from a file or did you pull it from an older website? It might be ISO-8859-1.

I’ve also found it helpful to search for encoding tables, like the ones on those linked Wikipedia pages. On those tables, you can look up the characters referenced by the unknown numbers, and see if they make sense in context.
Unfortunately, when you replace characters with encode, you might lose information. You have no idea which bytes were replaced by ?. But if you need your data to be in that new encoding, losing data can be better than things being broken.

So far, you’ve seen three key string methods to help you understand encodings:

encode, which translates a string to another encoding (converting characters to their equivalent in the new encoding)

bytes, which will show you the bytes that make up a string

force_encoding, which will show you what those bytes would look like interpreted by a different encoding

The major difference between encode and force_encoding is that encode might change bytes, and force_encoding won’t.

A three-step process for fixing encoding bugs

You can fix most encoding issues with three steps:

1. Discover which encoding your string is actually in.

But just because a string says it’s some encoding, doesn’t mean it actually is:

irb(main):078:0> "hi \x99 !".encoding
=> #<Encoding:UTF-8>

That’s not right: if it were really UTF-8, it wouldn’t have that weird backslashed number in it.

So how do you figure out the right encoding for your string?

A lot of older software will stick to a single default encoding, so you can research where the input came from.
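One way to confirm that a string's label is wrong, before you start researching where it came from, is `String#valid_encoding?`. The article's text here does not mention that method, so treat this as a supplementary sketch; `suspect` is a hypothetical name.

```ruby
# Build the example string from raw bytes, then label it UTF-8.
# force_encoding only changes the label; it never touches the bytes.
suspect = [104, 105, 32, 0x99, 32, 33].pack("C*").force_encoding("UTF-8")

label = suspect.encoding          # the label the string claims
valid = suspect.valid_encoding?   # whether the bytes actually fit that label

puts "claims #{label}, valid: #{valid}"
```

Here the string claims UTF-8 but the check fails, because the byte 0x99 cannot stand on its own in UTF-8. A failed check tells you the label is wrong; it does not tell you which encoding is right.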
You can work around this error if you pass extra options into encode:

irb(main):064:0> "hi∑".encode("Windows-1252", invalid: :replace, undef: :replace)
=> "hi?"

The invalid and undef options replace characters that can’t be translated with a different character. By default, that replacement character is ?.
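If ? is too easy to confuse with real question marks in your data, `encode` also accepts a `replace:` option that sets the substitute string. A small sketch, with `text`, `default_mark`, and `custom_mark` as illustrative names and `*` as an arbitrary choice of marker:

```ruby
text = "hi\u2211"  # "hi∑", which has no Windows-1252 equivalent for ∑

# Default behaviour: untranslatable characters become "?".
default_mark = text.encode("Windows-1252", invalid: :replace, undef: :replace)

# Same conversion, but with an explicit replacement string.
custom_mark = text.encode("Windows-1252", invalid: :replace, undef: :replace, replace: "*")

puts default_mark  # hi?
puts custom_mark   # hi*
```

Either way the original character is gone after the conversion, so keep the source string around if you may need it later.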
bytes =>
# What would that string look like interpreted as ISO-8859-5 instead?

Changing the encoding changed how the string printed, without changing the bytes.

And not all strings can be represented in all encodings:

irb(main):006:0> "hi∑".encode("Windows-1252")
Encoding::UndefinedConversionError: U+2211 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252

Most encodings are small, and can’t handle every possible character. You’ll see that error when a character in one encoding doesn’t exist in another, or when Ruby can’t figure out how to translate a character between two encodings.
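If you would rather test for this failure than let it reach your exception tracker, you can rescue the error. The helper below is a hypothetical sketch (`encodable?` is not a built-in method); it reports whether a string survives conversion to a target encoding.

```ruby
# Hypothetical helper: true if every character in text exists in target.
def encodable?(text, target)
  text.encode(target)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

puts encodable?("hi", "Windows-1252")        # plain ASCII converts cleanly
puts encodable?("hi\u2122", "Windows-1252")  # ™ exists in Windows-1252
puts encodable?("hi\u2211", "Windows-1252")  # ∑ does not
```

Rescuing both error classes is a design choice: `UndefinedConversionError` covers characters missing from the target, and `InvalidByteSequenceError` covers input whose bytes are not valid in its own declared encoding.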
So, when you have a bad encoding, how do you figure out what broke? And how can you fix it?

What is an encoding?

If you can imagine what encoding does to a string, these bugs are easier to fix.

You can think of a string as an array of bytes, or small numbers:

irb(main):001:0> "hello!".bytes
=> [104, 101, 108, 108, 111, 33]

In this encoding, 104 means h, 33 means !, and so on.

It gets trickier when you use characters that are less common in English:

irb(main):002:0> "hellṏ!".bytes
=> [104, 101, 108, 108, 225, 185, 143, 33]

Now it’s harder to tell which number represents which character. Instead of one byte, ṏ is represented by the group of bytes [225, 185, 143]. But there’s still a relationship between bytes and characters. And a string’s encoding defines that relationship.

Take a look at what a single set of bytes looks like when you try different encodings:

# Try an ISO-8859-1 string with a special character!
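To see one set of bytes read under two encodings, here is a minimal sketch. The byte values and variable names are chosen for illustration and are not necessarily the ones the original example used.

```ruby
# Four raw bytes, labelled as ISO-8859-1 (Western European).
str = [104, 105, 32, 161].pack("C*").force_encoding("ISO-8859-1")
as_latin1 = str.encode("UTF-8")      # translate to UTF-8 so it prints anywhere
original_bytes = str.bytes

# Relabel the very same bytes as ISO-8859-5 (Cyrillic).
str.force_encoding("ISO-8859-5")
as_cyrillic = str.encode("UTF-8")
same_bytes = (str.bytes == original_bytes)

puts as_latin1    # hi ¡
puts as_cyrillic  # hi Ё
puts same_bytes   # true
```

Byte 161 is ¡ in ISO-8859-1 and Ё in ISO-8859-5. `force_encoding` swapped only the label, so the bytes stayed identical, while each `encode` call produced new bytes for the UTF-8 copies.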
You only really think about a string’s encoding when it breaks. When you check your exception tracker and see

Encoding::InvalidByteSequenceError: "\xFE" on UTF-8

Or maybe “they’re” starts showing up as “they’re”.