Hello,
I'm new to UTF-8 encoding.
My company uses an existing extended-ASCII-based system written in PxPlus. For the most part the system is still character based. It runs on Linux, and users log on using a terminal emulator. A smaller part of the system consists of Nomads panels running through WinDx. Our system does not have the 'U8' parameter set.
Up until now, users of the character-based screens have only been able to use the characters in the extended ASCII set. In the WinDx screens users can enter UTF-8 characters, but all programs handle data as extended ASCII only. On a few occasions this produces unwanted results, but we have come to accept that. Now our users would like to be able to use a few more characters in a limited number of Nomads screens. We are not planning on UTF-8 encoded data being used in character-based screens. All in all, I'm reading up on UTF-8.
On the internet I found articles explaining that the US-ASCII characters (codes 0 through 127) remain the same in UTF-8, whereas the extended ASCII characters and all others are encoded differently. From the PxPlus manual I understand that when using UTF-8 the 'U8' parameter should be set system wide; switching the parameter on and off where needed will probably have unwanted results. This makes me wonder what impact switching this parameter on system wide will have on existing programs and data.
As a test I set the 'U8' parameter to a value that includes returning an error in case of incorrect UTF-8 data (value 2). I then ran the *UPB utility (string search and replace), which resulted in quite a few errors. Something similar happened when interpreting the value of a FID function.
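To make the encoding difference concrete for myself, here is a small Python sketch (not PxPlus). It assumes our "extended ASCII" is ISO-8859-1 (Latin-1), which may not match the code page we actually use; the helper names are made up for this illustration.

```python
# Assumption: "extended ASCII" means ISO-8859-1 (Latin-1); the real code page may differ.

def latin1_hex(ch):
    """Hex bytes of a character in Latin-1 (always one byte per character)."""
    return ch.encode("latin-1").hex().upper()

def utf8_hex(ch):
    """Hex bytes of the same character in UTF-8 (one byte for US-ASCII, more otherwise)."""
    return ch.encode("utf-8").hex().upper()

# "A" is US-ASCII and is identical in both encodings;
# the accented characters become two bytes in UTF-8.
for ch in ("A", "é", "Ê"):
    print(ch, "Latin-1:", latin1_hex(ch), "UTF-8:", utf8_hex(ch))
```

Under that assumption, any stored byte above 127 is not valid UTF-8 on its own, which is the kind of data I would expect to trigger errors once strict checking is on.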
As another test I ran this bit of code:
a$=$C38AC38A$
for b$ from a$
print b$
next b$
With the 'U8' parameter set I expected the program to interpret the value of a$ as UTF-8 and use $C38A$ as the separator, returning two empty lines. The program, however, used $8A$ as the separator, returning a representation of $C3$ twice. This tells me that non-US-ASCII characters will no longer be suitable separators in strings where UTF-8 characters might appear.
These examples got me a bit worried that we might be missing something.
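The two outcomes can be reproduced outside PxPlus. The Python sketch below only mimics the behaviour: it assumes the separator is taken from the end of the string, as my test suggests, and both function names are hypothetical.

```python
data = bytes.fromhex("C38AC38A")  # same bytes as a$ in the test

def split_bytewise(buf):
    """Use the last BYTE as the separator (what I observed)."""
    sep = buf[-1:]               # b"\x8a"
    return buf.split(sep)[:-1]   # drop the empty piece after the final separator

def split_charwise(buf):
    """Decode as UTF-8 first and use the last CHARACTER as the separator (what I expected)."""
    text = buf.decode("utf-8")   # two characters, each encoded as C3 8A
    sep = text[-1]
    return text.split(sep)[:-1]

print(split_bytewise(data))  # two pieces, each the single byte C3
print(split_charwise(data))  # two empty pieces
```

The byte-wise version cuts every two-byte character in half, leaving a lone $C3$ each time, which is not valid UTF-8 by itself.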
For example:
- do we need to convert all data in our database to UTF-8?
- could input statements start returning extended ASCII characters encoded differently than before?
- will existing programs struggle to find extended ASCII characters in strings that by mistake turn out to contain UTF-8 content? (for example, by returning the position of that character when it is really part of a UTF-8 character)
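To illustrate the last question, here is a byte-level Python sketch of what I am afraid of. It again assumes Latin-1 as our extended ASCII set and says nothing about how PxPlus search functions actually behave with 'U8' set.

```python
# Assumption: Latin-1 stands in for our extended ASCII code page.
utf8_data = "café".encode("utf-8")      # bytes 63 61 66 C3 A9
latin1_data = "café".encode("latin-1")  # bytes 63 61 66 E9

e_acute = "é".encode("latin-1")   # the single byte E9
a_tilde = "Ã".encode("latin-1")   # the single byte C3

# Searching UTF-8 content for the Latin-1 byte of "é" finds nothing ...
missed = utf8_data.find(e_acute)
# ... while searching for "Ã" gives a false hit on the first byte of the UTF-8 "é".
false_hit = utf8_data.find(a_tilde)
# In genuine Latin-1 content the search for "é" works as it always did.
found = latin1_data.find(e_acute)

print(missed, false_hit, found)
```

So a byte-wise search can both miss a character that is present and report one that is not.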
All thoughts are very welcome!
Cheers,
Arno de Greef