Progress 4GL and the Unicode Byte Order Mark (BOM)
Developers who use Progress 4GL to process Unicode text should understand the purpose of the Unicode Byte Order Mark (BOM), when to prepend it to text, and when to remove it. This page gives an overview of the BOM so that Progress developers can decide how to treat it.
The BOM character
The Unicode Standard designates two code points to help distinguish big-endian data from little-endian data. "Endianness" is not a problem for UTF-8, since it is a serialized byte stream. However, to process data encoded in UTF-16 or UTF-32, an application must first determine whether the data has the same endianness as the architecture the application runs on.
Unicode designated the character U+FEFF as the "Byte Order Mark" (BOM) and reserved U+FFFE as a noncharacter that should never appear in valid text. If an application reads U+FFFE, it can therefore presume that the data is in the opposite endianness to the architecture and that the data should be byte swapped. (For UTF-32 data, all four bytes of each code unit must be reversed, not just each pair.)
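The detection and byte swap described above can be sketched as follows. This is an illustration in Python rather than 4GL, and the function names are hypothetical. It assumes the data is already known to be UTF-16 and has an even number of bytes; a UTF-32 little-endian file also begins with FF FE, so this check alone cannot tell the two apart.

```python
import sys

def detect_utf16_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the first two bytes."""
    if data[:2] == b"\xfe\xff":   # U+FEFF stored high byte first
        return "big"
    if data[:2] == b"\xff\xfe":   # U+FEFF stored low byte first
        return "little"
    return "unknown"

def needs_swap(data: bytes) -> bool:
    """True when the data's byte order differs from this machine's."""
    order = detect_utf16_order(data)
    return order != "unknown" and order != sys.byteorder

def swap_utf16_bytes(data: bytes) -> bytes:
    """Swap every pair of bytes (assumes an even length)."""
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)

little = b"\xff\xfeA\x00"            # BOM + "A", little-endian
print(detect_utf16_order(little))    # little
print(swap_utf16_bytes(little))      # b'\xfe\xff\x00A'
```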
Plain text Unicode data
Protocols and file formats that contain text can specify the placement of this character in any way they like. For plain text, which has no protocol or structure, the convention is that the BOM may be the first character in the file. An application that finds either U+FFFE or U+FEFF as the first character can presume it is a BOM and, having noted the endianness of the data, remove it, since it carries no content.
Unfortunately, the BOM character U+FEFF was overloaded with another meaning: Zero Width Non-Breaking Space (ZWNBSP). Therefore, there are times when it does not represent a BOM and should not be removed. (If U+FEFF is detected in plain text anywhere other than the first position, it is never a BOM and always a ZWNBSP.)
Even though UTF-8 does not need a BOM to indicate endianness, Microsoft Notepad began prepending one to its UTF-8 text files. Strictly speaking, what Notepad writes is U+FEFF encoded as the UTF-8 byte sequence EF BB BF (or in 4GL: CHR(15711167)). There is some value in using the BOM as a file signature, indicating that the plain text file is encoded as UTF-8 rather than some other code page. That particular 3-byte sequence is unlikely to begin human-readable text in any other code page, although there is a small possibility that it does. Because Microsoft did it, and there is so much Notepad data out there, the UTF-8 BOM became a de facto standard and then a de jure standard, although the BOM remains optional.
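As a quick check of the byte values quoted above, the following Python sketch (Python is used here purely for illustration; it is not 4GL) encodes U+FEFF as UTF-8 and shows that the three bytes, read as one big-endian integer, give the number passed to CHR.

```python
import codecs

# Encode the single character U+FEFF as UTF-8.
bom_utf8 = "\ufeff".encode("utf-8")

print(bom_utf8.hex(" "))                # ef bb bf
print(int.from_bytes(bom_utf8, "big"))  # 15711167
print(bom_utf8 == codecs.BOM_UTF8)      # True
```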
Becoming a BOMbardier
Sorry, I couldn't resist the pun. The title should really be "Working with the BOM" or something similar. When working with plain text, you need to consider recognizing the BOM and (for UTF-16 or UTF-32) the endianness of the data, removing the BOM when importing data, and potentially generating it when exporting data.
The reason for removing the BOM is that you do not want to be splitting and concatenating strings within your application while removing and adding BOMs as you go along. That is error prone (remember that the BOM can be confused with ZWNBSP) and a performance drain. There is also the problem that comparing two strings that are identical except for the presence of a BOM will fail, although semantically they are equivalent.
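The comparison problem is easy to reproduce. The sketch below is in Python rather than 4GL, and strip_leading_bom is a hypothetical helper name. It removes at most one U+FEFF from the first position and leaves any later occurrence alone, since that would be a ZWNBSP.

```python
BOM = "\ufeff"

def strip_leading_bom(text: str) -> str:
    """Remove one leading U+FEFF; later occurrences are left untouched."""
    return text[1:] if text.startswith(BOM) else text

with_bom = BOM + "customer"
without_bom = "customer"

print(with_bom == without_bom)                     # False
print(strip_leading_bom(with_bom) == without_bom)  # True
```

Doing this once, at the point where data enters the application, avoids having to repeat the check before every comparison or concatenation.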
Progress, UTF-8 and the BOM
Progress 4GL does not do anything special with the BOM character. It is treated like any other character for reading, writing, conversion, etc. The BOM will be passed along with the rest of the text. Conversion operations will convert the BOM among the Unicode encodings UTF-8 and UTF-16 (called UCS2 in 4GL). The BOM will become a "?" (Question Mark, not the Unknown value) if the conversion is to a code page other than Unicode.
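The effect described above can be mimicked outside Progress. The Python sketch below is only an analogy and does not exercise Progress's own conversion code: it re-encodes a string that starts with U+FEFF into two Unicode encodings, then into ISO 8859-1 with Python's "replace" error handler, which also happens to substitute a question mark. The sample text is made up.

```python
text = "\ufeffOrder 42"

# Between Unicode encodings the BOM character survives, re-encoded each time.
print(text.encode("utf-8")[:3].hex(" "))      # ef bb bf
print(text.encode("utf-16-be")[:2].hex(" "))  # fe ff

# ISO 8859-1 has no U+FEFF; with errors="replace" Python substitutes "?".
print(text.encode("iso-8859-1", errors="replace"))  # b'?Order 42'
```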
When importing plain text files, you should decide whether you need to filter out BOM characters. Base the decision on the possible sources of the files and whether those sources may produce files with BOM characters. Most Progress applications, and certainly Progress utility files (e.g. dump files), use UTF-8 encoding and do not include a BOM character, so nothing needs to be done. However, files from other applications (e.g. Notepad) may have a BOM (even if they are in UTF-8 encoding) and you should consider deleting the first character.
When generating files, if the encoding is UTF-8, there is no need to generate a BOM, unless you are exporting to an application that expects a BOM as a file signature indicating the file is encoded in UTF-8 instead of another code page. It is rare that a BOM is *required*. Most of the natural languages that are represented in Progress applications do not use the ZWNBSP character, so a BOM will not cause any confusion by appearing in the first position in the file.
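One way to make an import tolerant of both kinds of UTF-8 file is sketched below in Python (not 4GL). The names read_utf8_text and demo are hypothetical. Python's "utf-8-sig" codec skips a leading EF BB BF if present and otherwise behaves like plain UTF-8, so files with and without a BOM yield the same text.

```python
import os
import tempfile

def read_utf8_text(path: str) -> str:
    """Read UTF-8 text, dropping a leading BOM if one is present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def demo() -> tuple[str, str]:
    """Write the same text with and without a BOM, then read both back."""
    results = []
    for prefix in (b"\xef\xbb\xbf", b""):
        fd, path = tempfile.mkstemp(suffix=".txt")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(prefix + "héllo".encode("utf-8"))
            results.append(read_utf8_text(path))
        finally:
            os.remove(path)
    return results[0], results[1]

print(demo())  # ('héllo', 'héllo')
```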
Progress, the BOM, UTF-16 and UTF-32
Life is more complicated with the other Unicode encodings. You will need to identify the places where you import/export Unicode text, whether it is plain text or a higher level protocol (e.g. XML), and how you will decide whether or not to remove or prepend a BOM. For internal processing, text should never have BOMs, since then it is very difficult to concatenate strings while knowing precisely whether BOMs should be added or removed. For Progress applications, the general answer is to remove the BOM. However...
UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32
There is always a "but". If your application is told only that the plain text is UTF-16 encoded, it does not know whether the text is big-endian or little-endian, or whether it has a BOM. To eliminate the ambiguity, there is a naming convention: if the Unicode encoding name ends in "LE", the data IS little-endian and does NOT have a BOM. Similarly, "BE" means the data IS big-endian and does NOT have a BOM. So when deciding what to do about the BOM, take this naming convention into account: rely on BE and LE for endianness, and do not remove a leading U+FEFF from such data (in this case it should only be there if it represents a ZWNBSP character, not a BOM).
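Python's codecs follow the same naming convention, which makes the difference easy to see. The sketch below is Python, not 4GL, and the sample bytes are made up for the illustration. It decodes the same four bytes under the labels "UTF-16" and "UTF-16LE".

```python
data = b"\xff\xfeA\x00"   # FF FE followed by "A" in little-endian UTF-16

# Labelled "UTF-16": the leading U+FEFF is read as a BOM and consumed.
print(repr(data.decode("utf-16")))     # 'A'

# Labelled "UTF-16LE": the byte order is already known, so U+FEFF
# is kept as content (a ZWNBSP).
print(repr(data.decode("utf-16-le")))  # '\ufeffA'

# The same holds when encoding: the LE and BE forms add no BOM.
print("A".encode("utf-16-le"))         # b'A\x00'
print("A".encode("utf-16-be"))         # b'\x00A'
```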
Remember, I said Plain Text
Not all text is plain text. Protocols such as HTTP can provide information about the encoding of the text being transferred. File formats (e.g. Microsoft Word) can also have ways to specify the file's encoding. Or they may have a convention of only being little-endian or big-endian. Markup languages are not plain text since they have structure and ways to indicate encoding and endianness.
XML, for example, makes use of the BOM character as a file signature to indicate the encoding and endianness. The BOM cannot be confused with ZWNBSP in XML, since it is located in a particular position where other text cannot appear, just ahead of the file declaration: