When pulling in orders from a popular shopping website, some of the addresses may contain characters that can't be printed on a mailing label. Our client has asked us to convert these characters based on a given translation table. For example, any character that looks like the letter A with an accent mark should be converted to a plain A. It sounds easy enough, but finding the oddball characters is definitely a challenge.
Note that the data is received in a JSON file using charset=UTF-8, according to the HTTP headers.
For example, suppose the customer entered the following in their order: |àáâãä|ÀÁÂÃÄ|ÿ|Ⓐ|
This all looks fine in a web browser. But the raw JSON data in PxPlus looks like this: |à áâãä|ÀÃÂÃÄ|ÿ|â’¶|
That doesn't display correctly here, so the actual HTA() of that string in PxPlus is: $7C C3A0C3A1C3A2C3A3C3A4 7C C380C381C382C383C384 7C C3BF 7C E292B6 7C$ (spaces added for readability).
Most of those characters are two-byte characters, but the last one is three bytes. Typically, we just do a CVS(x$,16) on these strings to remove non-printable characters, but the client wants us to convert them to something similar instead. I've tried various CVS() conversions such as CVS(x$,"ASCII:UTF8"), but had no luck making enough sense of the data to convert it using some table. At first I also couldn't see how 1-byte, 2-byte, and 3-byte characters are differentiated.
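From what I've since read about UTF-8, the sequence length is encoded entirely in the lead byte: $00$-$7F$ is a plain one-byte ASCII character, $C2$-$DF$ starts a two-byte sequence, $E0$-$EF$ a three-byte sequence, $F0$-$F4$ a four-byte sequence, and $80$-$BF$ only ever appears as a continuation byte. Here's a quick, untested PxPlus sketch (variable names are mine) that walks the string one sequence at a time and dumps each one, assuming X$ holds the raw data:

    LET I=1
    WHILE I<=LEN(X$)
        LET B=ASC(X$(I,1)),N=1 ! numeric value of the lead byte
        IF B>=192 THEN LET N=2 ! $C2$-$DF$ leads a 2-byte sequence
        IF B>=224 THEN LET N=3 ! $E0$-$EF$ leads a 3-byte sequence
        IF B>=240 THEN LET N=4 ! $F0$-$F4$ leads a 4-byte sequence
        PRINT HTA(X$(I,N))," ", ! e.g. C3A0, C380, E292B6 ...
        LET I=I+N
    WEND
    PRINT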
Searching for each specific hex sequence and replacing it seems like a brute-force approach (e.g., X$=SUB(X$,$C3A0$,"A"),X$=SUB(X$,$C3A1$,"A"),X$=SUB(X$,$C3A2$,"A")...), and not something I really want to do!
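What I'd rather do is drive it from the client's table using that same lead-byte scan. Here's a rough, untested sketch (MAP_FROM$/MAP_TO$ and the "?" fallback are placeholders I made up, not the real table):

    DIM MAP_FROM$(6),MAP_TO$(6) ! placeholder entries; the real table comes from the client
    LET MAP_FROM$(1)=$C3A0$,MAP_TO$(1)="A" ! a-grave
    LET MAP_FROM$(2)=$C3A1$,MAP_TO$(2)="A" ! a-acute
    LET MAP_FROM$(3)=$C380$,MAP_TO$(3)="A" ! A-grave
    LET MAP_FROM$(4)=$C381$,MAP_TO$(4)="A" ! A-acute
    LET MAP_FROM$(5)=$C3BF$,MAP_TO$(5)="y" ! y-diaeresis
    LET MAP_FROM$(6)=$E292B6$,MAP_TO$(6)="A" ! circled A, the 3-byte one
    LET OUT$="",I=1
    WHILE I<=LEN(X$)
        LET B=ASC(X$(I,1)),N=1
        IF B>=192 THEN LET N=2
        IF B>=224 THEN LET N=3
        IF B>=240 THEN LET N=4
        LET SEQ$=X$(I,N),REP$=SEQ$ ! 1-byte characters pass through unchanged
        IF N>1 THEN LET REP$="?" ! fallback for unknown multi-byte sequences (my choice)
        FOR K=1 TO 6
            IF SEQ$=MAP_FROM$(K) THEN LET REP$=MAP_TO$(K)
        NEXT K
        LET OUT$=OUT$+REP$,I=I+N
    WEND

With this sketch, anything not in the table comes through as "?", so the real table would need an entry for every sequence the client cares about, which still feels like one entry per character, just reorganized.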
Any suggestions on a cleaner way to accomplish this, or a built-in conversion I'm missing?