Encoding and Charset

2022-04-02

 Often we can see in files this encoding, for example one of the frequently used encoding value we often see in messaging systems using xml is encoding="utf-16" or encoding="utf-8"

When we transfer the messages over the wire, the receiving system(i.e. your application may be expecting xml message in a specific encoding) should have an understanding how to process the content, this is done based on the  encoding value found in the message.


  

  
    
      
    
  
  
    
      
        ....
      
    
  

          
  
To understand the "encoding" attribute, you have to understand the difference between bytes and characters.

Think of bytes as numbers between 0 and 255, whereas characters are things like "a", "1" and "Ä". The set of all characters that are available is called a character set.

Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.

Most encodings are based on an old character set and encoding called ASCII which is a single byte per character (actually, only 7 bits) and contains 128 characters including a lot of the common characters used in US English.

For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.

In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).

However, once you start needing more characters than the basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding as 128 characters is not enough to fit all the characters in. Some encodings offer one byte (256 characters) or up to six bytes.

Over time a lot of encodings have been created. In the Windows world, there is CP1252, or ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively.

One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.

For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.

Think of XML as not a sequence of characters but a sequence of bytes.

Imagine the system receiving the XML sees the bytes 195, 162. How does it know what characters these are?

In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used in the XML.

Since most common encodings are compatible with ASCII, as far as basic alphabetic characters and symbols go, in these cases, the declaration itself can get away with using only the ASCII characters to say what the encoding is. In other cases, the parser must try and figure out the encoding of the declaration. Since it knows the declaration begins with <?xml it is a lot easier to do this.

Finally, the version attribute specifies the XML version, of which there are two at the moment (see Wikipedia XML versions. There are slight differences between the versions, so an XML parser needs to know what it is dealing with. In most cases (for English speakers anyway), version 1.0 is sufficient.

0 comments: