I created this minimal working C++ example to compare the bytes (by their hex representation) of a std::string and a std::wstring when each is initialized from a literal containing German non-ASCII characters.
#include <iostream>
#include <iomanip>
#include <string>

int main(int, char**) {
    std::wstring wstr = L"äöüß";
    std::string str = "äöüß";
    for ( unsigned char c : str ) {
        std::cout << std::setw(2) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;
    for ( wchar_t c : wstr ) {
        std::cout << std::setw(4) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;
    return 0;
}
The output of this snippet is
c3 a4 c3 b6 c3 bc c3 9f
00c3 00a4 00c3 00b6 00c3 00bc 00c3 0178
I ran this on a PC running Windows 10 Pro 64-bit, compiled with MSVC 2019 Community Edition (version 16.8.1), using CMake as the build system with the following CMakeLists.txt:
cmake_minimum_required(VERSION 3.0.0)
project(wstring VERSION 0.1.0)
set(CMAKE_CXX_STANDARD 17)
include(CTest)
enable_testing()
add_executable(wstring main.cpp)
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)
I read that std::string is based on the char type, which is a single byte. The output of my snippet indicates that str (the std::string variable) is UTF-8 encoded. I also read that Microsoft compilers use a 2-byte wchar_t for std::wstring (instead of the 4-byte wchar_t used by e.g. GNU gcc), so I would expect wstr (the std::wstring variable) to be UTF-16 encoded (of any kind). But I cannot figure out why the "ß" (Latin sharp s) is encoded as 0x00c3 0x0178; I had expected 0x00df instead. Could somebody please tell me:
- Why is this happening?
- How can I end up with UTF-16 encoded std::wstrings (big endian would be fine, and I do not mind a BOM)? Do I perhaps need to tell the compiler somehow?
- What kind of encoding is this?
EDIT 1
Changed the title, as it did not fit the questions properly (and UTF-8 and UTF-16 are in fact different encodings, so I myself knew the answer already...)
EDIT 2
Forgot to mention: I use the amd64 target of the mentioned compiler.
EDIT 3
When I add the /utf-8 flag, as pointed out in the comments by dxiv (see his linked SO post), I get the desired output
c3 a4 c3 b6 c3 bc c3 9f
00e4 00f6 00fc 00df
which looks like UTF-16-BE (no BOM) to me. As I had issues with the correct order of the cmake commands, this is my current CMakeLists.txt file. It is important to put the add_compile_options command before the add_executable command (I added the notice for convenience):
cmake_minimum_required(VERSION 3.0.0)
project(enctest VERSION 0.1.0)
set(CMAKE_CXX_STANDARD 17)
include(CTest)
enable_testing()
if (MSVC)
    message(NOTICE "compiling with MSVC")
    add_compile_options(/utf-8)
endif()
add_executable(enctest main.cpp)
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)
I find the if-endif way more readable than the generator-expression one, but writing add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>") instead would work as well.
Note: For Qt projects there is a nice switch for the .pro file (see this Qt Forum post):
win32 {
    QMAKE_CXXFLAGS += /utf-8
}
Still, the first part of my question is open: what encoding is 0x00c3 0x0178 for "ß" (Latin sharp s)?
main.cpp is UTF-8 encoded; HxD shows me C3 A4 C3 B6 C3 BC C3 9F for both strings. I use Visual Studio Code with the CMake Tools extension to create and edit the project, but I get the same result using Qt Creator.
/source-charset:utf-8 ?
0x23 which is "#". Regarding the source-charset: not if cmake does not set it automatically. I use the CMakeLists.txt . How may I set this with cmake?