How is const std::wstring encoded and how to change to UTF-16

Question

I created this minimal working C++ example to compare the bytes (as hex) of a std::string and a std::wstring, each initialized from a string literal containing German non-ASCII characters.

#include <iostream>
#include <iomanip>
#include <string>

int main(int, char**) {
    std::wstring wstr = L"äöüß";
    std::string str = "äöüß";

    for ( unsigned char c : str ) {
        std::cout << std::setw(2) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    for ( wchar_t c : wstr ) {
        std::cout << std::setw(4) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    return 0;
}

The output of this snippet is

c3 a4 c3 b6 c3 bc c3 9f 
00c3 00a4 00c3 00b6 00c3 00bc 00c3 0178

I ran this on a PC running Windows 10 Pro 64-bit, compiled with MSVC 2019 Community Edition, version 16.8.1, using CMake as the build system with the following CMakeLists.txt

cmake_minimum_required(VERSION 3.0.0)
project(wstring VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

add_executable(wstring main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I read that std::string is based on the char type, which is a single byte. The output of my snippet indicates that str (the std::string variable) is UTF-8 encoded. I also read that Microsoft compilers use a 2-byte wchar_t for std::wstring (instead of the 4-byte wchar_t used by e.g. GNU gcc), so I would expect wstr (the std::wstring variable) to be UTF-16 encoded (of any kind). But I cannot figure out why "ß" (Latin sharp s) is encoded as 0x00c3 0x0178; I had expected 0x00df instead. Could somebody please tell me:

Why this is happening?
How can I end up with UTF-16 encoded std::wstrings (big endian would be fine, I do not mind a BOM)? Do I perhaps need to tell the compiler somehow?
What kind of encoding is this?

EDIT 1

Changed the title, as it did not fit the questions properly (and UTF-8 and UTF-16 are actually different encodings, so I already knew that answer myself...)

EDIT 2

forgot to mention: I use the amd64 target of the mentioned compiler

EDIT 3

After adding the /utf-8 flag, as pointed out in the comments by dxiv (see his linked SO post), I get the desired output

c3 a4 c3 b6 c3 bc c3 9f
00e4 00f6 00fc 00df

which looks like UTF-16-BE (no BOM) to me. As I had issues with the correct order of the CMake commands, this is my current CMakeLists.txt file. It is important to put the add_compile_options command before the add_executable command (I added the NOTICE message for convenience)

cmake_minimum_required(VERSION 3.0.0)
project(enctest VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

if (MSVC)
  message(NOTICE "compiling with MSVC")
  add_compile_options(/utf-8)
endif()

add_executable(enctest main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I find the if-endif way more readable, than the generator-syntax one, but writing add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>") instead would work as well.

Note: For Qt projects there is a nice switch for the .pro file (see this Qt Forum post)

win32 {
    QMAKE_CXXFLAGS += /utf-8
}

The first part of my question is still open: What encoding is 0x00c3 0x0178 for "ß" (Latin sharp s)?

How does your editor save the file? Have you looked at your own source file in e.g. a hex editor to see? — Some programmer dude, Nov 30, 2020 at 20:35
@Someprogrammerdude just did so, Notepad++ tells me that main.cpp is UTF-8 encoded, HxD shows me C3 A4 C3 B6 C3 BC C3 9F for both strings. I use Visual Studio Code with CMake Tools extension to create the project and edit it. But I get the same result using Qt Creator. — mjhalwa, Nov 30, 2020 at 20:49
@Martin Is that UTF-8 with or without a BOM, and do you use /source-charset:utf-8? — dxiv, Nov 30, 2020 at 20:55
@dxiv As far as I know, UTF-8 files do not need a BOM, since a BOM is only required to indicate endianness when code units are wider than 1 byte. In any case, the file does not start with a BOM but with 0x23, which is "#". Regarding /source-charset: no, unless CMake sets it automatically. I use the CMakeLists.txt shown above. How can I set this with CMake? — mjhalwa, Nov 30, 2020 at 21:00
@Martin VS uses the BOM to identify the encoding of the source file. Without a BOM, it "assumes the source file is encoded using the current user code page" which is not what you want here. See also Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819. — dxiv, Nov 30, 2020 at 21:04

dxiv · Accepted Answer · 2020-12-01 16:41:34Z

As clarified in the comments, the source .cpp file is UTF-8 encoded. Without a BOM, and without an explicit /source-charset:utf-8 switch, the Visual C++ compiler defaults to assuming the source file is saved in the active codepage encoding. From the Set Source Character Set documentation:

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you specify a character set name or code page by using the /source-charset option.

The UTF-8 encoding of äöüß is C3 A4 C3 B6 C3 BC C3 9F, and therefore the line:

    std::wstring wstr = L"äöüß";

is seen by the compiler as:

    std::wstring wstr = L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";

Assuming the active codepage to be the usual Windows-1252, the (extended) characters map as:

    win-1252    char    unicode

      \xC3       Ã       U+00C3
      \xA4       ¤       U+00A4
      \xB6       ¶       U+00B6
      \xBC       ¼       U+00BC
      \x9F       Ÿ       U+0178

Therefore L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F" gets translated to:

    std::wstring wstr = L"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178";
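
The mapping above can be reproduced with a small sketch that decodes the UTF-8 bytes one at a time as Windows-1252. The helper name decode_as_cp1252 and the table name are hypothetical, and the handling of the five bytes that Windows-1252 leaves undefined is an assumption; this only illustrates the misreading and is not what the compiler literally runs.

```cpp
#include <cassert>
#include <string>

// Unicode code points for the Windows-1252 bytes 0x80..0x9F. The five bytes
// that Windows-1252 leaves undefined (81, 8D, 8F, 90, 9D) are mapped to the
// C1 control code with the same value (an assumption for this sketch).
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Hypothetical helper: treats every byte as one Windows-1252 character.
std::u16string decode_as_cp1252(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        if (b >= 0x80 && b <= 0x9F) {
            out.push_back(kCp1252High[b - 0x80]);     // table lookup
        } else {
            out.push_back(static_cast<char16_t>(b));  // identical to Latin-1
        }
    }
    return out;
}

// Usage: the UTF-8 bytes of the four characters, misread byte by byte,
// give the eight code units seen in the question's output.
void check_misreading() {
    const std::u16string got =
        decode_as_cp1252("\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F");
    assert(got == u"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178");
}
```

The last byte is the interesting one: 0x9F falls into the 0x80..0x9F block where Windows-1252 differs from Latin-1, which is why it becomes U+0178 rather than U+009F.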

To avoid such (mis)translation, Visual C++ needs to be told that the source file is encoded as UTF-8 by passing an explicit /source-charset:utf-8 (or /utf-8) compiler switch. For CMake based projects, this can be done using add_compile_options as shown at Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819.
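
If changing compiler switches is not an option, one alternative is to keep the source file pure ASCII and spell the characters as universal character names. The following is a minimal sketch of that idea; the function names are hypothetical. It also shows a u"..." literal, whose char16_t code units are UTF-16 regardless of the size of wchar_t.

```cpp
#include <cassert>
#include <string>

// Universal character names are plain ASCII, so these literals do not depend
// on how the compiler decodes the source file.
std::wstring wide_umlauts() {
    return L"\u00e4\u00f6\u00fc\u00df";  // a-umlaut, o-umlaut, u-umlaut, sharp s
}

// A u"..." literal consists of char16_t code units in UTF-16.
std::u16string utf16_umlauts() {
    return u"\u00e4\u00f6\u00fc\u00df";
}

// Usage: both strings hold four code units, ending in U+00DF.
void check_literals() {
    const std::wstring w = wide_umlauts();
    const std::u16string u = utf16_umlauts();
    assert(w.size() == 4 && w[0] == 0x00e4 && w[3] == 0x00df);
    assert(u.size() == 4 && u[0] == 0x00e4 && u[3] == 0x00df);
}
```

This trades readability of the literal for independence from the source encoding, so it is mostly useful for a handful of short strings.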

Marshall Clow · Accepted Answer · 2020-11-30 20:50:09Z

therefore would expect wstr (the std::wstring variable) to be (any kind of) UTF-16 encoded

std::wstring does not specify an encoding. It is a sequence of "wide characters", for some kind of wide characters (which are implementation defined).

There are conversion facets defined in the standard library for converting to/from different encodings.
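
As a rough sketch of what such a conversion can look like, the snippet below uses std::wstring_convert with the std::codecvt_utf8 facet to turn UTF-8 bytes into a wide string. Note that these facilities are deprecated since C++17, so compilers may emit warnings; the function name utf8_to_wide is hypothetical. The result is UCS-2 where wchar_t is 16 bits and UCS-4 where it is 32 bits, which makes no difference for the characters in the question.

```cpp
#include <cassert>
#include <codecvt>  // deprecated since C++17
#include <locale>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to a wide string with the
// standard codecvt_utf8 facet. Throws std::range_error on invalid input.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Usage: the UTF-8 bytes from the question decode to the expected four
// code points, independent of the source file encoding.
void check_conversion() {
    const std::wstring w = utf8_to_wide("\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F");
    assert(w == L"\u00e4\u00f6\u00fc\u00df");
}
```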

I read that as well, but still the compiler should create at least some valid encoding from which I can convert then my wstring to UTF-16? Otherwise defining hardcoded text in my program is less compatible than an external file in any valid encoding. – mjhalwa, Nov 30, 2020 at 20:53
That's what "implementation defined" means - but the point I was making is that different compilers may do different things. – Marshall Clow, Nov 30, 2020 at 22:23
