Tip! Progress Unicode Database vs. the Competition

Progress databases are more efficient at using Unicode than most other databases. Here is information you can use for promoting Progress and making a case for Progress requiring less resources than the competition. This tip explains why Progress is more efficient and gives an example calculation of the benefit. It is a technical argument so it is more likely to influence technologists than business managers.

If you are already using Unicode with Progress, or considering doing so, you should also answer for yourself the question:
Are you a man or a machine?
If you are uncertain as to how this is relevant to Progress, check out the XenPUC web page.

If you want to know more about the Unicode Character Standard, you can look at the Progress Internationalization Guide, or some of the presentations and other resources listed on the XenCraft Resources web page. XenCraft can also provide you with Unicode and other technical training or consulting services at locations around the world.

Progress Software RDBMS makes efficient use of Unicode

The Unicode Character Standard has grown to 95,000 characters in version 3.2. The standard has room for more than a million characters to be encoded. To encode so many characters, Unicode defines several formats, UTF-8, UTF-16, UTF-32. Progress uses UTF-8, as do most other databases. To accommodate so many characters, UTF-8 represents characters using 1, 2, 3, or 4 bytes. (Until recently, 3 bytes was the maximum number of bytes used for the largest characters.)

Now we come to the tricky part. Oracle, and other databases that support the SQL standard, define fields of the CHARACTER data type to be fixed width.
(To eliminate confusion between fields with the CHARACTER data type, and the individual characters used in the field, I will refer to the former as CHAR fields and use the term "character" for the units of data in strings.)

For example, when defining a CHAR field, the size of the field must be declared. Let's pick a moderately sized field of 30 characters. Every record that is created, then allocates storage space for 30 characters.

The Progress database, of course, uses variable width fields, and only allocates as much space as the string stored in the field actually requires.

For single-byte code pages (e.g. ISO 8859-1, Windows 1252), Progress databases have some storage advantage. If the field on average has 20 characters, the gain is 10 characters per record, and at one byte per character, that becomes 10 bytes per record. If the database had 1 million records, the savings would be 10 million bytes. Of course, there may be other CHAR fields in the record and so the savings can be even larger.

	Oracle	Progress
CHAR Storage type	Fixed Width	Variable length
Field Width (Characters)	30	any
Average width (Characters)	20	20
Average Storage used (Characters)	30	20
Bytes per Character	1	1
Average Storage used (Bytes)	30	20
Average Storage Savings per record (bytes)	0	10
Storage Savings in 1 million record database	0	10,000,000
Storage Savings if each record has 3 character fields	0	30,000,000

However, suppose we change from a single-byte code page to Unicode UTF-8. The Oracle (or other SQL language) database will now allocate enough space for each Unicode character. Older versions will use 3 bytes per character. Newer versions will use 4 byte per character. A 30 character CHAR field will therefore allocate as much as 120 bytes in each record, even if the field has a string with fewer than 30 characters. These databases also allocate the full 4 bytes per character, even if the actual character data is any of 1, 2, or 3 bytes instead of 4 bytes. We can repeat the calculations we made above using the assumption that we had only English letters, digits, and punctuation (i.e. just ASCII characters). These characters require only 1 byte each in UTF-8. This of course will make a best case for Progress, but it is not so unfair when you consider that most European languages have a high percentage of English letters in their words. (e.g. "résumé" has 4 "English" letters and only 2 "international" characters.) Most European characters will use only 2 bytes in UTF-8. Chinese, Japanese, and Korean characters are likely to take 3 bytes. Only rarely-used characters will take the full 4 bytes. So, for example, "résumé", requires 2 bytes for each "é" and only one byte each for the "r", "s", "u", and "m", for a total of 8 bytes. In comparison, 6 bytes are needed for "resume".

So repeating the calculations, the savings due to the varying field width of 10 characters per record with one field of character data, now saves 40 bytes. (10 unused characters in the field per record at 4 bytes each.). There is however also a savings of 3 bytes per character on the data that is actually stored! The total savings is 100 bytes per CHAR field per record!

	Oracle	Progress
CHAR Storage type	Fixed Width	Variable length
Field Width (Characters)	30	any
Average width (Characters)	20	20
Average Storage used (Characters)	30	20
Bytes per Character	4	1
Average Storage used (Bytes)	120	20
Average Storage Savings per record (Bytes)	0	100
Storage Savings in 1 million record database (Bytes)	0	100,000,000
Storage Savings if each record has 3 character fields (Bytes)	0	300,000,000

Often databases have more than 3 CHAR fields, so the savings can be much greater. Of course, saving storage space reduces your hardware costs, but more significant is the impact on performance, when you consider the records are now spread over many more blocks reducing the effectiveness of caching and resulting in more frequent direct and indirect I/O.

Do you need to optimize your international application or development processes? XenCraft can help you increase the efficiency of your development, localization, and deployment processes for international or Unicode-based applications, as well as making the application itself more efficient.