
> practically changing the execution character set isn't trivial, and it's not necessarily even a desirable feature

I think converting the character set should be unnecessary. Execution shouldn't care about the character set beyond ASCII, apart from whatever handling the specific program you are writing does itself. It should not need to be a subset of Unicode, either; you should be able to use any character set that is a superset of ASCII, provided that bytes in the ASCII range always mean ASCII characters and bytes outside the ASCII range always mean non-ASCII characters. UTF-8 has this property and therefore may be used, but it is not the only character encoding with this property.
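
To illustrate that property, here is a minimal sketch in C of code that only interprets ASCII bytes and passes everything else through untouched. The function names are hypothetical, chosen only for this example. Because it never looks inside bytes 0x80 and above, it should behave the same for UTF-8, the ISO-8859 family, or any other encoding with the property described above.

```c
#include <stddef.h>

/* Hypothetical helper: uppercase ASCII letters in place.
   Bytes >= 0x80 are never modified, so multi-byte sequences
   in any ASCII-compatible encoding survive unchanged. */
void ascii_upper(unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0x61 && s[i] <= 0x7A)      /* 'a'..'z' in ASCII */
            s[i] = (unsigned char)(s[i] - 0x20);
    }
}

/* Hypothetical helper: count the bytes that are outside the ASCII range.
   The program does not need to know which character set they belong to. */
size_t count_non_ascii(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0x80)
            count++;
    }
    return count;
}
```

Neither function decodes anything; the non-ASCII bytes are opaque data as far as the program is concerned, which is the point.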

The C preprocessor is limited in its capabilities, so it would be helpful to be able to add extra steps both before and after the preprocessor runs. These could transform character encodings, but would be useful for other purposes too. (With GCC, I think this could be done with -no-integrated-cpp and -wrapper; I don't know how to do it with Clang.)
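
As one possible example of such an extra step, here is a rough sketch in C of a transformation that rewrites every non-ASCII byte as a three-digit octal escape, so that the compiler only ever sees ASCII. The function name is hypothetical, and how it would be hooked into the build is left out. It is only safe to apply to the contents of string and character literals; a real step would have to find those first.

```c
#include <stddef.h>

/* Hypothetical helper: copy `in` to `out`, replacing each byte >= 0x80
   with a three-digit octal escape such as \303.
   Returns the number of characters written (not counting the NUL),
   or (size_t)-1 if `out` (capacity `cap`) is too small. */
size_t escape_high_bytes(const unsigned char *in, size_t n,
                         char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            if (o + 1 >= cap)       /* need room for the byte and the NUL */
                return (size_t)-1;
            out[o++] = (char)in[i];
        } else {
            if (o + 4 >= cap)       /* need room for 4 chars and the NUL */
                return (size_t)-1;
            out[o++] = '\\';
            out[o++] = (char)('0' + (in[i] >> 6));
            out[o++] = (char)('0' + ((in[i] >> 3) & 7));
            out[o++] = (char)('0' + (in[i] & 7));
        }
    }
    if (o >= cap)                   /* no room for the terminating NUL */
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Octal escapes are used rather than hex because a C octal escape stops after at most three digits, so the result stays unambiguous even when the next character is itself a digit.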

(GCC converts input to UTF-8 during preprocessing, but at least the version of GCC that I have does not actually care whether it is valid UTF-8. That is the case for C; it may not be for C++, which I have not tried. This is fortunate, since it means that you can implement your own character code handling.)

In the case of C++, as described there, you can use user-defined literals. The language shouldn't require the contents of user-defined literals to be UTF-8 (or Unicode at all). If you can do whatever calculation you want on them at compile time, then you can treat them as UTF-8 if you want to, but that should be your choice. (Personally, I do not use C++, so I do not know all of the details of how it works, and I may have made a mistake.)

(There are several reasons you might deliberately not want UTF-8. One of them is security issues with the complicated text rendering involved with Unicode. Another might be how character widths work. And there are many other possibilities, too. You might also prefer to put all non-ASCII text in a separate file; the #embed directive can be used if you want to embed it into the program anyway, I suppose.)

> Even with full -fexec-charset support, it still makes sense to provide compile-time translation from Unicode to target strings.

Maybe, but I think that this compile-time translation should be done separately, as described above, and should be programmable so that it is not limited to Unicode. It should not be required; I think it would be sensible for the default to be passing the text through directly without conversion, regardless of what the character set is.

> For example, Commodore PETSCII makes vastly more sense as an execution character set than a source character set.

I agree, but that is because Commodore PETSCII is not a superset of ASCII in which the ASCII byte values keep their ASCII meanings. The reason for this has nothing to do with Unicode.


