utfconv

The following routines allow simple conversion between UTF-8, UTF-16, and UTF-32.

They are intended for use at the line/string level rather than on whole files.

Note that no special handling of the BOM (Byte Order Mark, aka the "ZERO WIDTH NO-BREAK SPACE" prefix character) is performed by any of these routines. My recommendation is that any such handling be performed at the file read/write level. For ease of reference, the following byte order marks are in common use (copy and paste these definitions as needed):

constant
    UTF8    = "\#EF\#BB\#BF",
    UTF16BE = "\#FE\#FF",
    UTF16LE = "\#FF\#FE",
    UTF32BE = "\#00\#00\#FE\#FF",
    UTF32LE = "\#FF\#FE\#00\#00"
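
For example, here is a minimal sketch of handling the BOM at the file read level (strip_utf8_bom is a hypothetical helper, not part of the distribution; it assumes the whole file has already been read into a string, and that the UTF8 constant above has been pasted in):

function strip_utf8_bom(string s)
-- remove a leading UTF-8 BOM, if present
    if length(s)>=3 and s[1..3]=UTF8 then
        s = s[4..$]
    end if
    return s
end function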

I would hesitantly suggest that the following, more legacy byte order marks simply be detected and an error message shown (as sketched below), rather than blindly rushing in to support them, only to find out at some later date that some minor error in their handling has quietly led to severe data corruption. Besides, the phix distribution does not contain any routines for converting any of these:

constant
    UTF7      = "\#2B\#2F\#76", -- (the fourth byte is one of 38|39|2B|2F, or "38 2D")
    UTF1      = "\#F7\#64\#4C",
    UTFEBCDIC = "\#DD\#73\#66\#73",
    SCSU      = "\#0E\#FE\#FF",
    BOCU1     = "\#FB\#EE\#28",
    GB18030   = "\#84\#31\#95\#33"
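
A minimal sketch of that detect-and-refuse approach (check_legacy_bom is a hypothetical helper; it assumes the constants above have been pasted in, and that raw holds at least the first few bytes of the file) might be:

procedure check_legacy_bom(string raw, string filename)
-- refuse to process files beginning with an unsupported legacy BOM
    constant legacy = {{UTF7,"UTF-7"},{UTF1,"UTF-1"},
                       {UTFEBCDIC,"UTF-EBCDIC"},{SCSU,"SCSU"},
                       {BOCU1,"BOCU-1"},{GB18030,"GB18030"}}
    for i=1 to length(legacy) do
        string bom = legacy[i][1], name = legacy[i][2]
        if length(raw)>=length(bom) and raw[1..length(bom)]=bom then
            crash("%s: unsupported %s byte order mark",{filename,name})
        end if
    end for
end procedure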

Also note that these routines have no comprehension of any difference between LE and BE encodings: that is down to how the calling application reads/writes or peeks/pokes. Quite clearly, by the time you pass a value/character/unicode point to these routines it should be, well, a value, in the proper and expected endian-ness of the machine the program is running on, rather than (say) a sequence of (optionally) byte-swapped elements, which could only ever serve to make everything far harder than it needs to be.
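For instance, a minimal sketch of performing that byte-combining at the read level for UTF-16LE input (utf16le_bytes_to_values is a hypothetical helper; raw is assumed to hold the raw bytes as read from file):

function utf16le_bytes_to_values(string raw)
-- combine little-endian byte pairs into machine-order values,
-- ready to be passed to utf16_to_utf32()
    sequence utf16 = repeat(0, floor(length(raw)/2))
    for i=1 to length(utf16) do
        utf16[i] = raw[2*i-1] + raw[2*i]*#100 -- low byte first
    end for
    return utf16
end function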

The implementation of these routines can be found in builtins\utfconv.e (an autoinclude), and test\t62utf.exw contains several tests, which should of course be extended if any glitches are found, and which is run as part of 'p -test'.

utf8_to_utf32 - convert a UTF-8 string to a UTF-32 sequence
utf32_to_utf8 - convert a UTF-32 sequence to a UTF-8 string
utf16_to_utf32 - convert a UTF-16 sequence to a UTF-32 sequence
utf32_to_utf16 - convert a UTF-32 sequence to a UTF-16 sequence

The routines utf16_to_utf8 and utf8_to_utf16 are simple nested wrappers of the above routines.

The above routines are not compatible with Euphoria.
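
A short usage example (assuming the source file containing it is itself saved as UTF-8):

sequence s32 = utf8_to_utf32("aß")  -- {97,223} (2 chars from 3 bytes)
string s8 = utf32_to_utf8(s32)      -- back to "aß" (3 bytes)
sequence s16 = utf32_to_utf16(s32)  -- {97,223} (both below #10000)
?equal(utf16_to_utf32(s16), s32)    -- 1 (round-trip preserved)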

You may of course pass basic Latin/ASCII (#00..#7F) strings as if they were UTF-8. However, if you have encodings from ISO-8859, Windows-1252, or some other code page, for characters with cedilla, umlaut, etc (ie any single-byte characters in the range #80..#FF), they will almost certainly cause problems. Use an editor which is UTF-8 compliant (such as Edix), rather than one that uses some legacy code page, and you should avoid such difficulties, or at least immediately see what needs to be changed.

My commiserations if you have some database or similar chock full of such non-Latin/ASCII characters from a legacy application; these routines are unlikely to help you. In such cases I would suggest performing the appropriate substitutions to the correct unicode points directly (there will be at most 128 of them, but it is up to you to find out what they are), and perhaps running the result back through utf32_to_utf8, as sketched below. Alternatively, you can try using the standard iconv utility (which can be installed on Windows as part of MinGW), eg "iconv -f WINDOWS-1252 -t UTF-8 filename.txt".
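
By way of illustration, a minimal sketch of that direct-substitution approach for Windows-1252 (cp1252_to_utf8 is a hypothetical helper; only a few of the #80..#9F mappings are shown, and bytes #A0..#FF are assumed to match their unicode points, as they do in ISO-8859-1):

constant cp1252 = {{#80,#20AC},              -- euro sign
                   {#85,#2026},              -- ellipsis
                   {#91,#2018},{#92,#2019},  -- single quotes
                   {#93,#201C},{#94,#201D},  -- double quotes
                   {#96,#2013},{#97,#2014}}  -- en/em dashes

function cp1252_to_utf8(string s)
-- map Windows-1252 bytes to unicode points, then re-encode as UTF-8
    sequence s32 = repeat(0, length(s))
    for i=1 to length(s) do
        integer ch = s[i]
        if ch>=#80 and ch<=#9F then
            -- only this range differs from the unicode points;
            -- unmapped bytes are left as-is (extend the table as required)
            for j=1 to length(cp1252) do
                if ch=cp1252[j][1] then
                    ch = cp1252[j][2]
                    exit
                end if
            end for
        end if
        s32[i] = ch
    end for
    return utf32_to_utf8(s32)
end function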

< elapsed | Index | utf8_to_utf32 >
