Switching an existing extended ASCII based system to UTF8

Started by Arno de Greef, January 14, 2024, 09:51:37 AM

Arno de Greef

Hello,

I'm new at UTF8 encoding.

My company uses an existing extended ASCII based system written in PxPlus. For the most part the system is still character based. The system runs on Linux and users log on using a terminal emulator. A smaller part of the system consists of Nomads panels running through WinDx. Our system does not have the 'U8' parameter set.

Up until now, users of the character based screens have only been able to use the characters in the extended ASCII set. In the WinDx screens users can enter UTF8 characters, but all programs handle data as extended ASCII only. On a few occasions this produces unwanted results, but we have come to accept that. Now our users would like to be able to use a few more characters in a limited number of Nomads screens. We are not planning on UTF8 encoded data being used in character based screens. All in all, I'm reading up on UTF8.

On the internet I found articles explaining that the US ASCII characters (ASCII 000 through 127) remain the same in UTF8, whereas the extended ASCII characters and all others are encoded differently. From the PxPlus manual I understand that when using UTF8 the 'U8' parameter should be set system wide; switching the parameter on and off where needed will probably have unwanted results. This makes me wonder what impact switching this parameter on system wide will have on existing programs and data.

As a test I set the 'U8' parameter so that it also returns an error on incorrect UTF8 data (value 2). I then ran the *UPB utility (string search and replace), which produced quite a few errors. Something similar happened when interpreting the value of a FID function.

As another test I ran this bit of code :
a$=$C38AC38A$
for b$ from a$
print b$
next b$
With the 'U8' parameter set I expected the program to interpret the value of a$ as UTF8 and use $C38A$ as the separator, returning two empty lines. The program, however, used $8A$ as the separator, returning a representation of $C3$ twice. This tells me non US ASCII characters will no longer be suitable separators in strings where UTF8 characters might appear.
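
The byte-level effect of that test can be reproduced outside PxPlus. The following is only a sketch in Python, used as a stand-in because the behavior depends on the bytes and not on the language; it makes no claim about how PxPlus implements the loop. The helper name is_valid_utf8 is made up for this illustration.

```python
# Python stand-in for the test above; a bytes object models the raw string.
data = bytes.fromhex("C38AC38A")   # UTF-8 encoding of U+00CA twice

# Splitting on the single byte $8A$ cuts each two-byte character in half.
fields = data.split(b"\x8a")
print(fields)                      # [b'\xc3', b'\xc3', b'']

def is_valid_utf8(chunk: bytes) -> bool:
    """Hypothetical helper: True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Each leftover $C3$ is a lead byte without its continuation byte.
print([is_valid_utf8(f) for f in fields])   # [False, False, True]
```

Python's split also returns a trailing empty element, which the PxPlus loop did not print; the two $C3$ fragments are the part that matches.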

These examples got me a bit worried we might be missing something.
For example :
- do we need to convert all data in our database to UTF8 ?
- could input statements start returning extended ASCII characters encoded differently than before ?
- will existing programs struggle to find extended ASCII characters in strings if they by mistake turn out to contain UTF8 content ? (for example, returning the position of that character when it is actually part of a UTF8 character)

All thoughts are very welcome !

Cheers,
Arno de Greef

Mike King

#1
While PxPlus does support UTF8 there are a significant number of things you need to consider and potentially change.  Here are a few of the major issues.

Field Separator:

One of the largest issues you will face is that the default field separator used by PxPlus is hex 8A.  This dates back to the original Business Basic from MAI and has been maintained over the years to avoid clients having to rebuild their existing files and to avoid existing logic that looks for $8A$ having to change.

The first thing you should do if you are going to use UTF-8 is to change the default field separator to a character in the range of $00$ thru $1F$.  Ideally avoid tab ($09$), LF ($0A$), CR ($0D$) and ESCAPE ($1B$) as these could cause long term problems when accessing text.
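
The reason a separator in that range is safe is that every byte of a multi-byte UTF-8 sequence has its high bit set, so bytes $00$ thru $7F$ only ever appear as themselves. A small sketch in Python (not PxPlus) shows the difference; SEP_NEW is an arbitrary value picked for illustration, not a recommendation, and separator_can_collide is a made-up helper name.

```python
SEP_OLD = 0x8A   # default field separator described above
SEP_NEW = 0x1F   # hypothetical replacement from the $00$ thru $1F$ range

def separator_can_collide(sep: int, text: str) -> bool:
    """True if the separator byte occurs inside the UTF-8 encoding of text."""
    return sep in text.encode("utf-8")

sample = "\u00ca gar\u00e7on \u0160"   # text with several accented characters

print(separator_can_collide(SEP_OLD, sample))  # True: U+00CA encodes as $C38A$
print(separator_can_collide(SEP_NEW, sample))  # False
```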

Data Files:

You will also need to convert ALL existing data files which may have text requiring UTF-8.  Basically, this is any file which might contain accented characters.  Files that contain only standard ASCII ($00$ thru $7F$) won't need migration; however, their field separators will need to be reloaded.
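
As a rough illustration of what that conversion means per field, here is a Python sketch. It assumes the existing "extended ASCII" data is ISO-8859-1 (Latin-1); if your system uses a different code page, the source_encoding would differ. Both function names are invented for this example, and the functions are meant for individual field values, not whole records that still contain $8A$ separators.

```python
def needs_conversion(raw: bytes) -> bool:
    """True if the field holds any byte outside standard ASCII ($00$ thru $7F$)."""
    return any(b > 0x7F for b in raw)

def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode one field value from the assumed legacy code page to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

legacy = b"gar\xe7on"              # "garçon" as stored in Latin-1

print(needs_conversion(b"plain"))  # False
print(needs_conversion(legacy))    # True
print(to_utf8(legacy))             # b'gar\xc3\xa7on'
```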

Different length data:

When using UTF-8, data length can vary.  In extended ASCII, 6 characters always take 6 bytes.  Consider the word "garçon": it requires 6 bytes in extended ASCII but 7 in UTF-8, because the "ç" is encoded as two bytes.  So, if you used this as a key to a file you would need to make sure you allocated 7 bytes for it, and if there were multiple accented characters the key size would need to be longer still.
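
The byte counts can be checked with a few lines of Python (again only a stand-in for illustration, assuming Latin-1 as the extended ASCII code page):

```python
word = "gar\u00e7on"                # "garçon"

print(len(word))                    # 6 characters
print(len(word.encode("latin-1")))  # 6 bytes in extended ASCII
print(len(word.encode("utf-8")))    # 7 bytes in UTF-8
```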

Also, if your code replaces portions of a string, you need to consider that the displayed length of a string may not be the same as its length in memory.  This can cause problems with existing code.

Substrings also pose a challenge.  For example, suppose you decided that the first 4 bytes of a name were to be used as some form of code.  If the name had a UTF-8 sequence that started at byte 4 and continued thru byte 5 or beyond, taking that substring would result in an invalid UTF-8 sequence.
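
One possible way to guard against this is to back up to a character boundary before cutting. The sketch below is in Python, not PxPlus; truncate_utf8 is a hypothetical helper and it assumes the input is already valid UTF-8.

```python
def truncate_utf8(raw: bytes, max_bytes: int) -> bytes:
    """Cut raw to at most max_bytes without splitting a multi-byte character."""
    if len(raw) <= max_bytes:
        return raw
    end = max_bytes
    # Continuation bytes have the form 10xxxxxx; if the first dropped byte is
    # one, the cut is mid-character, so back up to the start of that character.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]

name = "gar\u00e7on".encode("utf-8")   # "ç" occupies bytes 4 and 5

print(name[:4])                 # b'gar\xc3'  (naive cut, invalid UTF-8)
print(truncate_utf8(name, 4))   # b'gar'
print(truncate_utf8(name, 5))   # b'gar\xc3\xa7'
```

The result may be shorter than the requested size, so any code that relies on a fixed-length prefix would still need to pad or otherwise handle that case.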


In short, it is doable and all the tools you need are in PxPlus, but migrating an application that wasn't originally designed to handle multi-byte characters is a challenge and involves more than just a character set change and a system option setting.

Mike King
President - BBSysco Consulting
eMail: mike.king@bbsysco.com