[libcamera-devel] [PATCH 1/6] utils: Add function to convert string to UCS-2

Fri Jan 15 15:19:53 CET 2021

Hi Paul,

  I've read around a bit, but character encoding seems to be a very
  complex subject, so I mostly have minor comments here.

On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but

From what I've read, even referring to Unicode might be misleading, as
it covers a number of different encodings. Does the EXIF specification
mention Unicode or any other more specific standard ?

> are recommended to be in UCS-2. Add a function in utils to help with
> this.
>
> Signed-off-by: Paul Elder <paul.elder at ideasonboard.com>
> ---
>  include/libcamera/internal/utils.h |  2 ++
>  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> index f08134af..aa9cc236 100644
> --- a/include/libcamera/internal/utils.h
> +++ b/include/libcamera/internal/utils.h
> @@ -35,6 +35,8 @@ const char *basename(const char *path);
>  char *secure_getenv(const char *name);
>  std::string dirname(const std::string &path);
>
> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> +
>  template<typename T>
>  std::vector<typename T::key_type> map_keys(const T &map)
>  {
> diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> index e90375ae..89cb0f73 100644
> --- a/src/libcamera/utils.cpp
> +++ b/src/libcamera/utils.cpp
> @@ -17,6 +17,7 @@
>  #include <string.h>
>  #include <sys/stat.h>
>  #include <sys/types.h>
> +#include <uchar.h>
>  #include <unistd.h>
>
>  /**
> @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
>  	return path.substr(0, pos + 1);
>  }
>
> +/**
> + * \brief Convert string to byte array of UCS-2

a string to a byte array of UCS-2 encoded code points

But I wonder: the encoding used to represent the characters in the
string depends on the locale, I assume, doesn't it ?

> + * \param[in] str String to convert

The string to convert

> + * \param[in] le Little-endian (false for Big-endian)

The desired endianness of the converted byte array.

An enum would not hurt, but it's not strictly required.
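
To show what I mean by the enum, here is a minimal sketch. ByteOrder and
appendChar16 are made-up names for illustration, not existing libcamera
API, so take it as a suggestion only:

```cpp
#include <cstdint>
#include <vector>

/* Hypothetical names, for illustration only. */
enum class ByteOrder {
	LittleEndian,
	BigEndian,
};

/* Append one 16-bit code unit to out in the requested byte order. */
void appendChar16(std::vector<uint8_t> &out, char16_t c, ByteOrder order)
{
	uint8_t lo = static_cast<uint8_t>(c & 0xff);
	uint8_t hi = static_cast<uint8_t>(c >> 8);

	if (order == ByteOrder::LittleEndian) {
		out.push_back(lo);
		out.push_back(hi);
	} else {
		out.push_back(hi);
		out.push_back(lo);
	}
}
```

A call site would then read appendChar16(ret, c16, ByteOrder::BigEndian),
which is easier to follow than a bare true/false.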

> + *
> + * \return Byte array of UCS-2 representation of \a str, without null-terminator

While the distinction between UTF-16 and UCS-2 is still not clear to me,
and I get the two are actually converging over time, the documentation
of std::mbrtoc16 explicitly mentions UTF-16.

I guess it again depends on the encoding of \a str (which again
depends on the selected locale ?)
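
As far as I understand it, the two agree for code points up to U+FFFF
(one 16-bit unit each), while UTF-16 encodes anything above that as a
surrogate pair, which UCS-2 cannot represent. A minimal sketch of the
difference (encodeUtf16 is a hypothetical helper, not part of the patch,
and it does not validate its input):

```cpp
#include <cstdint>
#include <vector>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-16.
 * No validation is done for the surrogate range or values > U+10FFFF.
 */
std::vector<char16_t> encodeUtf16(char32_t cp)
{
	/* Basic Multilingual Plane: identical in UCS-2 and UTF-16. */
	if (cp <= 0xffff)
		return { static_cast<char16_t>(cp) };

	/* Above the BMP: UTF-16 needs a high and a low surrogate. */
	cp -= 0x10000;
	return {
		static_cast<char16_t>(0xd800 + (cp >> 10)),
		static_cast<char16_t>(0xdc00 + (cp & 0x3ff)),
	};
}
```

So if the input can contain characters outside the BMP, I would expect
the output to be UTF-16 rather than strict UCS-2.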

> + */
> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)

I wonder why we use snake_case in utils ? Maybe to mimic the STL ?

> +{
> +	std::mbstate_t state{};
> +	char16_t c16;
> +	const char *ptr = &str[0], *end = &str[0] + str.size();

One variable per line and maybe
        const char *end = &str.back()
> +
> +	std::vector<uint8_t> ret;

I would reserve str.size() * 2, even if I get that not every char in
str necessarily gets expanded to two bytes.

> +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {

std::mbrtoc16 ?
How come the compiler does not complain ?
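
My understanding is that <uchar.h> declares the function in the global
namespace, while <cuchar> is the header that provides std::mbrtoc16.
The conversion uses the LC_CTYPE category of the current C locale, and a
return value of (size_t)-3 means a second UTF-16 unit was stored without
consuming any input. A rough sketch of the loop with the std:: spelling,
under those assumptions (toChar16 is a hypothetical name, and like the
patch it stops at the first null character):

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>
#include <vector>

/* Hypothetical helper: convert str to 16-bit units per the C locale. */
std::vector<char16_t> toChar16(const std::string &str)
{
	std::mbstate_t state{};
	std::vector<char16_t> ret;
	const char *ptr = str.data();
	const char *end = ptr + str.size();

	while (true) {
		char16_t c16;
		/* + 1 so that the terminating null character is seen. */
		std::size_t rc = std::mbrtoc16(&c16, ptr, end - ptr + 1, &state);

		/* 0: null reached, -1: invalid input, -2: truncated input. */
		if (rc == 0 ||
		    rc == static_cast<std::size_t>(-1) ||
		    rc == static_cast<std::size_t>(-2))
			break;

		ret.push_back(c16);

		/* -3: unit stored from earlier input, nothing consumed. */
		if (rc != static_cast<std::size_t>(-3))
			ptr += rc;
	}

	return ret;
}
```

With plain ASCII input this gives one unit per character whatever the
locale is. Anything else depends on the locale set by the caller.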

> +		if (rc == static_cast<size_t>(-2) ||
> +		    rc == static_cast<size_t>(-1))
> +			break;
> +
> +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));

I think you can drop the & 0xff: as ret is a vector of uint8_t, c16 gets
converted automatically, doesn't it ?
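
A quick sketch of what I mean, assuming I'm reading the conversion rules
right: converting to uint8_t keeps only the low 8 bits, so the mask is
redundant. I've used static_cast here to make the narrowing explicit
(the implicit conversion should give the same result, though it may
trigger -Wconversion). splitBigEndian is a made-up name:

```cpp
#include <cstdint>
#include <vector>

/* Hypothetical helper: split one 16-bit unit into big-endian bytes. */
std::vector<uint8_t> splitBigEndian(char16_t c16)
{
	std::vector<uint8_t> ret;

	/* No "& 0xff": the conversion to uint8_t drops the high bits. */
	ret.push_back(static_cast<uint8_t>(c16 >> 8));
	ret.push_back(static_cast<uint8_t>(c16));

	return ret;
}
```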

> +
> +		if (rc > 0)
> +			ptr += rc;
> +	}
> +
> +	return ret;
> +}
> +
>  /**
>   * \fn std::vector<typename T::key_type> map_keys(const T &map)
>   * \brief Retrieve the keys of a std::map<>
> --
> 2.27.0