2/12/2023

Unicode encoding in java

Validating Files for Unicode/UTF-8 Character Sets with Java

I was using Apache Tika to extract text from files containing all sorts of content, and to write that text to files using UTF-8 encoding. After botching it a few times I finally forced myself to learn a bit about encoding systems and Java’s representation of Unicode code points. I also learned that using the word “character” all too often leads to confusion!

From what I have read, Java’s support for international character sets was written about the time the Unicode committee finalized a standard that allowed a character to be represented in 16 bits. Then, after Sun’s people cast that 16-bit assumption into concrete deep within the core classes, the Unicode committee had a change of heart, and now it takes 32 bits (a Java int) to represent a character properly. This information is all well documented at the Sun (oops, I mean Oracle) Java API web site, but it took several reads before it started to sink in for me.

What I learned, and what really bothers me still, is that a Java String may contain parts (avoiding the “C” word here) that cannot be represented using a single Java Character type! In other words, the String.charAt(int) method gives back something that must be handled with extreme care, because it may be only half of a surrogate pair. So iterating over the little parts that make up a String is not as simple as I once thought. And I have not yet accustomed myself to using the codePointAt(int) method.

The task I had to accomplish was to check whether files contain valid UTF-8 byte sequences. Some tools were giving garbage results and I needed to figure out why. The basic java.io.InputStreamReader will happily translate byte streams to character streams, but it does not report encoding/decoding errors; it silently substitutes a replacement character for malformed input.
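Here is a minimal sketch of one way to do that check, and not necessarily the code I ended up with. It assumes the whole file fits in memory and uses a java.nio.charset.CharsetDecoder configured with CodingErrorAction.REPORT, so that malformed bytes raise an exception instead of being replaced silently. The class name Utf8Check and its method names are made up for illustration. The second helper shows the code point view of a String, as opposed to the char view that charAt(int) gives.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input rather than
        // substituting U+FFFD the way InputStreamReader does by default.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the code points of a String, one int per code point. */
    static int[] codePointsOf(String s) {
        // A surrogate pair (two chars) comes back as a single int here.
        return s.codePoints().toArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: each argument is a path to a file small enough to load whole.
        for (String name : args) {
            byte[] content = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + (isValidUtf8(content) ? "valid" : "INVALID") + " UTF-8");
        }
    }
}
```

For large files the same decoder could be fed buffer by buffer instead of reading everything at once, at the cost of more bookkeeping.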