2

I created this minimal working C++ example to compare the bytes (by their hex representation) in a std::string and a std::wstring when both are initialized with German non-ASCII characters.

#include <iostream>
#include <iomanip>
#include <string>

int main(int, char**) {
    std::wstring wstr = L"äöüß";
    std::string str = "äöüß";

    for ( unsigned char c : str ) {
        std::cout << std::setw(2) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    for ( wchar_t c : wstr ) {
        std::cout << std::setw(4) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    return 0;
}

The output of this snippet is

c3 a4 c3 b6 c3 bc c3 9f 
00c3 00a4 00c3 00b6 00c3 00bc 00c3 0178

I ran this on a PC running Windows 10 Pro 64-bit, compiled with MSVC 2019 Community Edition, version 16.8.1, using CMake as the build system with the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.0.0)
project(wstring VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

add_executable(wstring main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I read that std::string is based on the char type, which is a single byte. The output of my snippet indicates that str (the std::string variable) is UTF-8 encoded. I also read that Microsoft compilers use 2-byte wchar_ts to make up std::wstrings (instead of the 4-byte wchar_ts used by e.g. GNU gcc) and therefore would expect wstr (the std::wstring variable) to be (any kind of) UTF-16 encoded. But I cannot figure out why the "ß" (Latin sharp s) is encoded as 0x00c30178; I had expected 0x00df instead. Could somebody please tell me:

  • Why this is happening?
  • How can I end up with UTF-16 encoded std::wstrings (Big Endian would be fine, I do not mind a BOM)? Do I perhaps need to tell the compiler somehow?
  • What kind of encoding is this?

EDIT 1

Changed the title, as it did not fit the questions properly (and actually UTF-8 and UTF-16 are different encodings, so I knew the answer myself already...)

EDIT 2

forgot to mention: I use the amd64 target of the mentioned compiler

EDIT 3

When adding the /utf-8 flag, as pointed out in the comments by dxiv (see the SO post he linked), I get the desired output

c3 a4 c3 b6 c3 bc c3 9f
00e4 00f6 00fc 00df

which looks like UTF-16-BE (no BOM) to me. Since I had trouble with the correct order of CMake commands, here is my current CMakeLists.txt file. It is important to put the add_compile_options command before the add_executable command (I added the NOTICE message for convenience)

cmake_minimum_required(VERSION 3.0.0)
project(enctest VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

if (MSVC)
  message(NOTICE "compiling with MSVC")
  add_compile_options(/utf-8)
endif()

add_executable(enctest main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I find the if-endif form much more readable than the generator-expression syntax, but writing add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>") instead would work as well.

Note: For Qt projects there is a nice switch for the .pro file (see this Qt Forum post)

win32 {
    QMAKE_CXXFLAGS += /utf-8
}

Still the first part of my question is open: What encoding is 0x00c30178 for "ß" (latin sharp s)?

  • How does your editor save the file? Have you looked at your own source file in e.g. a hex editor to see? Nov 30, 2020 at 20:35
  • @Someprogrammerdude just did so, Notepad++ tells me that main.cpp is UTF-8 encoded, HxD shows me C3 A4 C3 B6 C3 BC C3 9F for both strings. I use Visual Studio Code with CMake Tools extension to create the project and edit it. But I get the same result using Qt Creator.
    – mjhalwa
    Nov 30, 2020 at 20:49
  • @Martin Is that UTF-8 with or without a BOM, and do you use /source-charset:utf-8?
    – dxiv
    Nov 30, 2020 at 20:55
  • @dxiv as far as I know, UTF-8 does not contain BOMs, as these are only required to indicate the endianness when character types are made up of more than 1 byte. Anyway, the file does not start with a BOM but with 0x23, which is "#". Regarding /source-charset: no, unless CMake sets it automatically. I use the CMakeLists.txt from the question. How can I set this with CMake?
    – mjhalwa
    Nov 30, 2020 at 21:00
  • @Martin VS uses the BOM to identify the encoding of the source file. Without a BOM, it "assumes the source file is encoded using the current user code page" which is not what you want here. See also Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819.
    – dxiv
    Nov 30, 2020 at 21:04

2 Answers

5

As clarified in the comments, the source .cpp file is UTF-8 encoded. Without a BOM, and without an explicit /source-charset:utf-8 switch, the Visual C++ compiler defaults to assuming the source file is saved in the active codepage encoding. From the Set Source Character Set documentation:

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you specify a character set name or code page by using the /source-charset option.

The UTF-8 encoding of äöüß is C3 A4 C3 B6 C3 BC C3 9F, and therefore the line:

    std::wstring wstr = L"äöüß";

is seen by the compiler as:

    std::wstring wstr = L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";

Assuming the active codepage to be the usual Windows-1252, the (extended) characters map as:

    win-1252    char    unicode

      \xC3       Ã       U+00C3
      \xA4       ¤       U+00A4
      \xB6       ¶       U+00B6
      \xBC       ¼       U+00BC
      \x9F       Ÿ       U+0178

Therefore L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F" gets translated to:

    std::wstring wstr = L"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178";
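
The mapping in the table above can be reproduced outside the compiler. The following is only a minimal sketch of that idea, not what MSVC does internally: the helper names cp1252_to_unicode, decode_as_cp1252 and check_against_question are made up for this illustration, and passing the five unassigned Windows-1252 bytes through unchanged is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a library function): maps one Windows-1252 byte
// to its Unicode code point. Only the range 0x80..0x9F differs from Latin-1.
char16_t cp1252_to_unicode(unsigned char byte) {
    // Code points for bytes 0x80..0x9F. The five bytes that Windows-1252
    // leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through
    // as C1 control codes, which is an assumption of this sketch.
    static const char16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if (byte >= 0x80 && byte <= 0x9F) {
        return high[byte - 0x80];
    }
    return static_cast<char16_t>(byte);
}

// Decodes raw bytes as if they were Windows-1252 text, one byte per
// character, which is how the UTF-8 source bytes were misread.
std::u16string decode_as_cp1252(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(cp1252_to_unicode(b));
    }
    return out;
}

// Feeds the UTF-8 bytes of the four characters from the question through
// the decoder and compares against the code units the question printed.
void check_against_question() {
    const std::u16string got =
        decode_as_cp1252("\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F");
    assert(got == u"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178");
}
```

Calling check_against_question() from a main function passes silently, which matches the 00c3 00a4 00c3 00b6 00c3 00bc 00c3 0178 line in the question: the trailing 0178 is simply byte 9F looked up in Windows-1252.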

To avoid such (mis)translation, Visual C++ needs to be told that the source file is encoded as UTF-8 by passing an explicit /source-charset:utf-8 (or /utf-8) compiler switch. For CMake based projects, this can be done using add_compile_options as shown at Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819.
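
If changing compiler flags is not an option, one alternative that should not depend on the source file encoding at all is to spell the characters as universal character names, the same \u notation used above. This is a small sketch under the assumption that the compiler's wide execution character set is Unicode (true for MSVC, GCC and Clang as far as I know); the function names umlauts_wide and check_umlauts are made up for the example.

```cpp
#include <cassert>
#include <string>

// Universal character names state the code points directly, so the literal
// contains only ASCII bytes and cannot be misread via the active code page.
// The four characters are a-umlaut, o-umlaut, u-umlaut and sharp s.
std::wstring umlauts_wide() {
    return L"\u00E4\u00F6\u00FC\u00DF";
}

// All four code points are below U+10000, so each one occupies a single
// wchar_t regardless of whether wchar_t is 2 or 4 bytes wide.
void check_umlauts() {
    const std::wstring w = umlauts_wide();
    assert(w.size() == 4);
    assert(w[0] == 0x00E4);
    assert(w[3] == 0x00DF);
}
```

The trade-off is readability: the literal is no longer legible as text, which is why the /utf-8 switch is usually the more comfortable fix.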

0

therefore would expect wstr (the std::wstring variable) to be (any kind of) UTF-16 encoded

std::wstring does not specify an encoding. It is a sequence of "wide characters", and what kind of wide characters those are is implementation-defined.

There are conversion facets defined in the standard library for converting to/from different encodings.
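
To illustrate what such a conversion does, here is a hand-written sketch rather than the standard facets themselves (std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17, so I avoid them here). The names utf8_to_utf16 and check_conversion are hypothetical, and the code assumes well-formed UTF-8 input; it performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts well-formed UTF-8 to UTF-16 code units.
// Malformed input is not detected and produces unspecified code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        // The lead byte determines the sequence length and the top bits.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP are split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Converts the UTF-8 bytes from the question and checks the code units.
void check_conversion() {
    const std::u16string got =
        utf8_to_utf16("\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F");
    assert(got == u"\u00E4\u00F6\u00FC\u00DF");
}
```

Applied to the bytes c3 a4 c3 b6 c3 bc c3 9f from the question, this yields the code units 00e4 00f6 00fc 00df. I return std::u16string instead of std::wstring on purpose, because char16_t is 16 bits on every platform whereas the width of wchar_t differs between compilers.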

  • I read that as well, but still the compiler should create at least some valid encoding from which I can convert then my wstring to UTF-16? Otherwise defining hardcoded text in my program is less compatible than an external file in any valid encoding.
    – mjhalwa
    Nov 30, 2020 at 20:53
  • That's what "implementation defined" means - but the point I was making is that different compilers may do different things. Nov 30, 2020 at 22:23
