Converting UTF-8 data from web service

Loren Doornek · September 28, 2020, 06:07:01 PM

When pulling in orders from a popular shopping web site, some of the addresses contain characters that can't be printed on a mailing label. Our client has asked us to convert these characters based on a given translation table. For example, every character that looks like the letter A with an accent mark should be converted to a plain A. It sounds easy enough, but finding the oddball characters is definitely a challenge.

Note that the data is received as JSON with charset=UTF-8, according to the HTTP headers.

For example, suppose the customer entered the following in their order: |àáâãä|ÀÁÂÃÄ|ÿ|Ⓐ|

This all looks fine in a web browser. But, the raw JSON data in PXPlus looks like this: |Ã Ã¡Ã¢Ã£Ã¤|Ã€ÃÃ,ÃƒÃ,,|Ã¿|â'¶|

That doesn't display correctly here, so the actual HTA() of that string in PXPlus is: $7C C3A0C3A1C3A2C3A3C3A4 7C C380C381C382C383C384 7C C3BF 7C E292B6 7C$

Most of those characters are 2-byte characters, but the last one is 3 bytes. Typically, we just do a CVS(x$,16) on these strings to remove non-printable characters, but the client wants them converted to something similar instead. I've tried various CVS() functions such as CVS(x$,"ASCII:UTF8"), but had no luck making sense of the data so that I can convert it using some table. I haven't figured out how to tell 1-byte, 2-byte, and 3-byte characters apart.

Searching for each specific hex sequence and replacing it seems like a brute force approach (eg: X$=SUB(X$,$C3A0$,"A"),X$=SUB(X$,$C3A1$,"A"),X$=SUB(X$,$C3A2$,"A")...), and not something I really want to do!

Any suggestions on how to accomplish this?

Ken Sproul · September 28, 2020, 06:14:39 PM

Loren,

Try cvs(x$,"UTF8:ASCII"). Works for me on pxplus v15.10.

Devon Austen · September 29, 2020, 09:36:44 AM

The CVS conversion from UTF-8 to ASCII that Ken suggested tries to convert each UTF-8 character to an ASCII equivalent; if it can't find one, it converts the character to a question mark. So with the example data you would get:

|àáâãä|ÀÁÂÃÄ|ÿ|?|

All the symbols except the circled A have an ASCII equivalent. If your customer needs every accented a converted to a plain a, then you probably need to write logic that finds the accented characters and converts them to the unaccented ones.

I would start by converting to ASCII and then I would do a TRANSLATE to translate accented to normal letters.

Info on TRANSLATE directive format 4: https://manual.pvxplus.com/?directives/translate.htm#Mark14

TRANSLATE text$,$E00161$

The above would convert the à to a. The good thing about TRANSLATE is that you can define as many translations as you need in a single hex string:

!lowercase accented a to a
convHex$+=$E00161E10161E20161E30161E40161$
!uppercase accented A to A
convHex$+=$C00141C10141C20141C30141C40141$
!lowercase y with accents
convHex$+=$FF0179$
TRANSLATE text$,convHex$

This would still leave the Ⓐ converted to a ? but I think that is accurate as converting it to an A isn't necessarily correct. You could always do a sub on the UTF-8 code for Ⓐ before converting to ASCII if you wanted that one to be an A.
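If you do decide to map the circled letters, the UTF-8 hex values to search for can be generated rather than looked up one by one. The following is only a sketch in Python (not PxPlus), and circled_letter_pairs is a made-up helper name. It assumes you only care about the circled capitals Ⓐ to Ⓩ (U+24B6 to U+24CF) and relies on Unicode compatibility normalization (NFKD) to find the plain letter. The first pair it produces is E292B6 and A, which matches the E292B6 in the HTA() dump above.

```python
import unicodedata


def circled_letter_pairs():
    """Return (UTF-8 hex, plain letter) pairs for the circled capitals U+24B6..U+24CF.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for code in range(0x24B6, 0x24CF + 1):
        symbol = chr(code)
        # NFKD replaces a compatibility character with its plain equivalent.
        plain = unicodedata.normalize("NFKD", symbol)
        pairs.append((symbol.encode("utf-8").hex().upper(), plain))
    return pairs


# Print the first few pairs; each hex value is what a SUB() on the raw
# UTF-8 data would need to search for before the CVS() conversion.
for utf8_hex, letter in circled_letter_pairs()[:3]:
    print(utf8_hex, letter)
```

Each pair could then feed one SUB() call on the raw data before the CVS() conversion, in the same style as the X$=SUB(X$,...) calls in the original question.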

You would also want to make sure you include all of the possible accented characters in your TRANSLATE; I only included the ones in the sample data.
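One possible way to get the full list of source values is to generate the hex table with a small script and paste the result into the program. This is a rough Python sketch, not PxPlus, and build_translate_hex is a hypothetical name. It rests on two assumptions that are not confirmed anywhere in this thread: that the converted string uses Latin-1 byte values for the accented letters (consistent with the E0, C0 and FF values in the example), and that each table entry is laid out as source byte, length byte, replacement, as in $E00161$.

```python
import unicodedata


def build_translate_hex(first=0xC0, last=0xFF):
    """Build a hex table of 'source byte, length 01, base letter' entries.

    Hypothetical helper. Assumes Latin-1 byte values and the entry layout
    seen in the $E00161$ example.
    """
    parts = []
    for code in range(first, last + 1):
        # NFD splits an accented letter into its base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", chr(code))
        base = decomposed[0]
        # Skip characters with no accent decomposition (for example the
        # multiplication sign, or letters such as AE and sharp s).
        if len(decomposed) > 1 and base.isascii() and base.isalpha():
            parts.append(f"{code:02X}01{ord(base):02X}")
    return "".join(parts)


print(build_translate_hex())
```

Characters such as Æ, Ø and ß have no accent decomposition, so this sketch leaves them out; they would need hand-written entries if the client's table covers them.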

To answer your question about why some characters are 2 bytes and some are 3 bytes before conversion, you have to read up on UTF-8 Unicode encoding (https://en.wikipedia.org/wiki/UTF-8). The short answer is that UTF-8 encodes the common English letters the same as ASCII. Symbols beyond that are encoded in 2, 3 or 4 bytes. You can look around and play with this using a site like https://www.utf8-chartable.de/ where you can use the drop-down list to look at different Unicode blocks of symbols. The blocks at the top of the list use fewer bytes and the blocks at the bottom use more bytes to encode. For example, you can find the Ⓐ in the U+2460 .. U+24FF Enclosed Alphanumerics Unicode block and see its hex code there.
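To make the byte-length rule concrete: in UTF-8 the first byte of each character tells you how many bytes the character occupies (0xxxxxxx is 1 byte, 110xxxxx is 2, 1110xxxx is 3, 11110xxx is 4). Here is a minimal Python sketch (again not PxPlus, with made-up function names) that walks the hex string from the original post and splits it into characters using only that rule.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if lead_byte < 0x80:           # 0xxxxxxx: plain ASCII
        return 1
    if 0xC0 <= lead_byte < 0xE0:   # 110xxxxx
        return 2
    if 0xE0 <= lead_byte < 0xF0:   # 1110xxxx
        return 3
    if 0xF0 <= lead_byte < 0xF8:   # 11110xxx
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:02X}")


def split_utf8(data):
    """Split raw UTF-8 bytes into one chunk of bytes per character."""
    chunks = []
    pos = 0
    while pos < len(data):
        size = utf8_sequence_length(data[pos])
        chunks.append(data[pos:pos + size])
        pos += size
    return chunks


# The HTA() value from the original post.
raw = bytes.fromhex(
    "7C C3A0C3A1C3A2C3A3C3A4 7C C380C381C382C383C384 7C C3BF 7C E292B6 7C"
)
for chunk in split_utf8(raw):
    print(chunk.hex().upper(), len(chunk))
```

The same test on the first byte could be used in any language to step through the raw string one character at a time instead of searching for every possible hex sequence.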

Loren Doornek · September 29, 2020, 07:23:53 PM

Well, that was too simple - all I had to do was switch the position of UTF8 and ASCII in the CVS function. I'm a victim of over-thinking the problem! Thanks for getting me turned around, Ken!

And thanks for the mock up of the TRANSLATE code, Devon. I had planned to use translate for the very reason that you mentioned (it allows multiple values to be translated), but just couldn't figure out how to get the source values.
