Discussion:
UTF-8 in OpenWatcom
Leif Ekblad
2012-03-07 20:39:20 UTC
I've redesigned my font package and will use UTF-8 throughout my OS and
applications. When I added RDOS support to libc, I chose not to support
wide-character strings because I would eventually support only UTF-8. Today,
many applications, including web pages, use UTF-8, so I think this is a
reasonable choice.

However, there seem to be several issues with UTF-8 in OW:

1. The compiler cannot handle source files that MS tools have tagged as UTF-8
(the BOM sequence 0xEF, 0xBB, 0xBF gives a compilation error)

2. The Watcom editor (as well as the IDE and debugger) does not handle UTF-8.
It doesn't display source files correctly, and it is not possible to insert
UTF-8 encoded strings into the code with the editor

3. Because the UTF-8 BOM sequence cannot be used, source files cannot be
edited directly with Notepad either, as Notepad thinks they are plain text
rather than UTF-8.

4. I'm not sure about resource files yet. The resource records are encoded in
UTF-16, but I don't know if the resource compiler correctly translates UTF-8
encoded strings to UTF-16 resources.

Has anybody looked into this before? Is it reasonable to add UTF-8 support
to the Win32 tools?

Leif Ekblad
Peter C. Chapin
2012-03-08 15:09:22 UTC
Post by Leif Ekblad
Has anybody looked into this before? Is it reasonable to add UTF-8 support
to the Win32 tools?
I doubt it has been looked at in detail. It is reasonable, but certainly
non-trivial, I would imagine. Of course that doesn't mean it
shouldn't be done!

It seems to me that the change would require more than merely
understanding the BOM, etc. Accepting UTF-8 encoded files means, or at
least suggests, we are allowing fairly arbitrary Unicode text. I'm not
sure what the implications of that are but they might be far reaching.

I'd have to review the recent standards... is it now permitted to use
Unicode characters directly in identifier names in, for example, C and
C++ programs? I bet the compilers don't know how to handle that right now.

Peter
Leif Ekblad
2012-03-08 22:33:16 UTC
I doubt if it has been looked at in detail. It is reasonable but certainly
non-trivial I would imagine. Of course that doesn't mean it shouldn't be
done!
At a minimum I need a resource compiler that understands UTF-8 strings for
the RDOS target. I'll look into this tomorrow. The resources for Win32 (which
RDOS currently uses) are Unicode, but I'm not sure the resource compiler
understands UTF-8. I think it doesn't.
It seems to me that the change would require more than merely
understanding the BOM, etc. Accepting UTF-8 encoded files means, or at
least suggests, we are allowing fairly arbitrary Unicode text. I'm not
sure what the implications of that are but they might be far reaching.
I'd have to review the recent standards... is it now permitted to use
Unicode characters directly in identifier names in, for example, C and C++
programs? I bet the compilers don't know how to handle that right now.
I wouldn't take it that far. It is fine by me if the compiler ignores the
BOM. It would be even better, of course, if the editor displayed UTF-8
strings correctly and allowed inserting them, but that is not
necessary, as I can use Notepad if the compiler ignores the BOM. It would be
even better if the debugger could display the strings in the source code
correctly as well, but that is not required either.

Leif Ekblad
Peter C. Chapin
2012-03-09 23:56:48 UTC
Post by Leif Ekblad
I wouldn't take it that far. It is fine by me if the compiler ignores the
BOM. It would be even better, of course, if the editor displayed UTF-8
strings correctly and allowed inserting them, but that is not
necessary, as I can use Notepad if the compiler ignores the BOM. It would be
even better if the debugger could display the strings in the source code
correctly as well, but that is not required either.
Well, it makes sense to tackle a subject such as this one small step at
a time. Doing everything would be a big job but that shouldn't stop us
(you) from doing something!

Peter
Leif Ekblad
2012-03-10 11:36:17 UTC
Well, it makes sense to tackle a subject such as this one small step at a
time. Doing everything would be a big job but that shouldn't stop us (you)
from doing something!
Fine. I have checked the string resources on the RDOS platform, and they
actually work unmodified on RDOS, as the RDOS PE loader just ignores the high
byte stored. That means the issue is less critical for me, but I would still
want some kind of solution.

I think I know where this code is located. In sdk\rc\rc\c\leadbyte.c there is
a function called "NativeDBStringToUnicode", which uses the Windows function
MultiByteToWideChar with code page CP_ACP, which means it assumes the source
string is in the default Windows code page (ANSI). It is possible to change
this parameter to CP_UTF8, and then UTF-8 will probably work.
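For what it's worth, the effect of passing CP_UTF8 instead of CP_ACP can be
sketched portably. This is only an illustration of the conversion that
MultiByteToWideChar(CP_UTF8, ...) performs on well-formed input (BMP
characters only, no validation), not the Windows implementation and not the
Open Watcom code:

```c
#include <stddef.h>

/* Minimal UTF-8 -> UCS-2 conversion, BMP-only and without error
   handling; a sketch of what MultiByteToWideChar does with CP_UTF8
   for well-formed input. Returns the number of 16-bit units written. */
static size_t utf8_to_ucs2(const unsigned char *src, size_t len,
                           unsigned short *dst)
{
    size_t i = 0, out = 0;
    while (i < len) {
        unsigned char b = src[i];
        if (b < 0x80) {                      /* 1 byte: U+0000..U+007F */
            dst[out++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {     /* 2 bytes: U+0080..U+07FF */
            dst[out++] = (unsigned short)(((b & 0x1F) << 6)
                                          | (src[i + 1] & 0x3F));
            i += 2;
        } else {                             /* 3 bytes: U+0800..U+FFFF */
            dst[out++] = (unsigned short)(((b & 0x0F) << 12)
                                          | ((src[i + 1] & 0x3F) << 6)
                                          | (src[i + 2] & 0x3F));
            i += 3;
        }
    }
    return out;
}
```

With this, the two UTF-8 bytes 0xC3 0xA5 come out as the single UCS-2 unit
0x00E5 (å), which is what the UTF-16 resource records need.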

The primary issue is whether a change from CP_ACP to CP_UTF8 would break
something on Win32. If somebody relies on resource files being encoded in
ANSI, changing this flag to UTF-8 will obviously break those. On the other
hand, they could fix it by converting the resource scripts to UTF-8, which is
easy to do with, for instance, Notepad.

So should I change this for all targets, or separate out the RDOS target and
change it only for RDOS?

Or maybe a better solution would be to check the first three bytes of the
current source file, and if they are the BOM, use UTF-8 instead of ANSI.
That should not break anything.
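As a sketch (a hypothetical helper, not existing Open Watcom code): read the
first three bytes, and if they are 0xEF 0xBB 0xBF select the UTF-8 code page,
otherwise rewind and keep ANSI. The CP_ACP (0) and CP_UTF8 (65001) values
from the Win32 headers are repeated here so the sketch is self-contained:

```c
#include <stdio.h>

/* Code page values as defined in the Win32 headers, duplicated so the
   sketch compiles stand-alone. */
#define MY_CP_ACP   0
#define MY_CP_UTF8  65001

/* Sniff a UTF-8 BOM at the start of an open file and pick the code
   page accordingly. On a miss the stream is rewound so the scanner
   still sees every byte. */
static unsigned pick_code_page(FILE *fp)
{
    unsigned char bom[3];
    unsigned cp = MY_CP_ACP;

    if (fread(bom, 1, 3, fp) == 3 &&
        bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
        cp = MY_CP_UTF8;          /* BOM found: treat input as UTF-8 */
    } else {
        fseek(fp, 0, SEEK_SET);   /* no BOM: rewind, assume ANSI */
    }
    return cp;
}
```

In the BOM case the three marker bytes are consumed, so the rest of the
tool never sees them and the "compilation error" problem from the first
post goes away as a side effect.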

Leif Ekblad
Paul S. Person
2012-03-10 18:30:27 UTC
Post by Leif Ekblad
Well, it makes sense to tackle a subject such as this one small step at a
time. Doing everything would be a big job but that shouldn't stop us (you)
from doing something!
Fine. I have checked the string resources on the RDOS platform, and they
actually work unmodified on RDOS, as the RDOS PE loader just ignores the high
byte stored. That means the issue is less critical for me, but I would still
want some kind of solution.
I think I know where this code is located. In sdk\rc\rc\c\leadbyte.c there is
a function called "NativeDBStringToUnicode", which uses the Windows function
MultiByteToWideChar with code page CP_ACP, which means it assumes the source
string is in the default Windows code page (ANSI). It is possible to change
this parameter to CP_UTF8, and then UTF-8 will probably work.
The primary issue is whether a change from CP_ACP to CP_UTF8 would break
something on Win32. If somebody relies on resource files being encoded in
ANSI, changing this flag to UTF-8 will obviously break those. On the other
hand, they could fix it by converting the resource scripts to UTF-8, which is
easy to do with, for instance, Notepad.
So should I change this for all targets, or separate out the RDOS target and
change it only for RDOS?
Or maybe a better solution would be to check the first three bytes of the
current source file, and if they are the BOM, use UTF-8 instead of ANSI.
That should not break anything.
I may be wrong, but I thought that UTF-8 readers generally treated
ANSI as UTF-8 (that is, if the initial marker is missing, the text
is still processed). ANSI text, IIRC, uses only the 7-bit character
set, and so is identical to UTF-8.
--
"Nature must be explained in
her own terms through
the experience of our senses."
Leif Ekblad
2012-03-10 21:12:15 UTC
Post by Paul S. Person
I may be wrong, but I thought that UTF-8 readers generally treated
ANSI as UTF-8 (that is, if the initial marker is missing, the text
is still processed). ANSI text, IIRC, uses only the 7-bit character
set, and so is identical to UTF-8.
As long as the text is in English there is no problem, as all of those
characters are 7-bit only. The problems appear for some other languages, like
Swedish, which has åäö; ANSI maps those to 8-bit characters, and they become
two bytes each in UTF-8.
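To make the byte difference concrete, here are those three letters in both
encodings (byte values from the Windows-1252/Latin-1 table and the UTF-8
two-byte pattern 110xxxxx 10xxxxxx):

```c
/* "åäö" under Windows-1252/Latin-1: one byte per letter, numerically
   equal to the Unicode code point (U+00E5, U+00E4, U+00F6). */
static const unsigned char ansi_aao[] = { 0xE5, 0xE4, 0xF6 };

/* The same letters under UTF-8: code points in U+0080..U+07FF take two
   bytes each, so the string doubles in size. */
static const unsigned char utf8_aao[] = { 0xC3, 0xA5, 0xC3, 0xA4, 0xC3, 0xB6 };
```

Each UTF-8 pair is derived from the code point as 0xC0 | (cp >> 6) followed
by 0x80 | (cp & 0x3F), which is why a tool that assumes ANSI sees two
unexpected bytes where one letter used to be.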

Leif Ekblad
Peter C. Chapin
2012-03-11 22:14:34 UTC
Post by Leif Ekblad
Or maybe a better solution would be to check the first three bytes of the
current source file, and if they are the BOM, use UTF-8 instead of ANSI.
That should not break anything.
One of the purposes of the BOM is to identify the encoding of a file. So
checking for it as you described here sounds like the best option.

Peter
Leif Ekblad
2012-03-13 21:59:51 UTC
Half of the investigation is done. I've confirmed that it is the right file,
and that changing to CP_UTF8 indeed makes it interpret the file as UTF-8
rather than "ANSI". I'm not sure where to check whether the file starts with
a BOM, but I'll figure that out eventually.

Leif Ekblad
Post by Peter C. Chapin
Post by Leif Ekblad
Or maybe a better solution would be to check the first three bytes of the
current source file, and if they are the BOM, use UTF-8 instead of ANSI.
That should not break anything.
One of the purposes of the BOM is to identify the encoding of a file. So
checking for it as you described here sounds like the best option.
Peter
Leif Ekblad
2012-03-14 20:36:55 UTC
I've implemented this in change #37181. I couldn't figure out a suitable
method to detect a BOM, so I added a new resource compiler option (-zu)
that turns on UTF-8 source file encoding. Then it is just a matter of using
the UTF-8 flag when converting to Unicode. Additionally, I made the IDE turn
on this flag automatically for the RDOS target, and changed the resource
reader to convert from UCS-2 to UTF-8.
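The last step, going from UCS-2 back to UTF-8, is just the mirror image of
what the resource compiler does. A BMP-only sketch of the idea, without
surrogate or error handling (an illustration, not the committed code):

```c
#include <stddef.h>

/* Convert a buffer of UCS-2 code units to UTF-8 bytes. BMP only, no
   surrogate pairs or validation. Returns the number of bytes written. */
static size_t ucs2_to_utf8(const unsigned short *src, size_t len,
                           unsigned char *dst)
{
    size_t i, out = 0;
    for (i = 0; i < len; i++) {
        unsigned short u = src[i];
        if (u < 0x80) {                       /* ASCII: 1 byte */
            dst[out++] = (unsigned char)u;
        } else if (u < 0x800) {               /* U+0080..U+07FF: 2 bytes */
            dst[out++] = (unsigned char)(0xC0 | (u >> 6));
            dst[out++] = (unsigned char)(0x80 | (u & 0x3F));
        } else {                              /* U+0800..U+FFFF: 3 bytes */
            dst[out++] = (unsigned char)(0xE0 | (u >> 12));
            dst[out++] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

So the UCS-2 unit 0x00E5 (å) becomes the byte pair 0xC3 0xA5 again, and
plain ASCII passes through one byte per character.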

Leif Ekblad
Peter C. Chapin
2012-03-15 16:39:18 UTC
Post by Leif Ekblad
I've implemented this in change #37181. I couldn't figure out a suitable
method to detect a BOM, so I added a new resource compiler option (-zu)
that turns on UTF-8 source file encoding. Then it is just a matter of using
the UTF-8 flag when converting to Unicode. Additionally, I made the IDE turn
on this flag automatically for the RDOS target, and changed the resource
reader to convert from UCS-2 to UTF-8.
Okay, great. Sounds good to me. Thanks for your contribution!

Peter
