[libcamera-devel] [PATCH 1/6] utils: Add function to convert string to UCS-2

Fri Jan 15 18:01:35 CET 2021

Hi Paul,

Thank you for the patch.

On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> Hi Paul,
> 
>   I read a few things around, but character encoding seems a very
>   complex subject, so I mostly have minor comments here
> 
> On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> 
> From what I've read, even referring to Unicode might be mis-leading as
> it includes a number of different encodings. Do the EXIF specification
> mention Unicode or any other more specific standard ?

The Exif specification is fairly bad in that regard. It only says
"unicode", while what it means (or at least what is used in practice
today, I don't know if that's what was meant originally) is UCS-2. The
endianness is also not mentioned, and we believe it should match the
endianness of the TIFF container, but that's not documented.

> > are recommended to be in UCS-2. Add a function in utils to help with
> > this.

Unicode is a standard that contains (among other things) a set of
characters with their respective code points (a number that identifies
the character) and a set of character encodings. Encodings define how
to encode a code point into bytes. UCS-2 is one particular encoding.
Saying "can be in Unicode but are recommended to be in UCS-2" is thus
not quite correct.

> > Signed-off-by: Paul Elder <paul.elder at ideasonboard.com>
> > ---
> >  include/libcamera/internal/utils.h |  2 ++
> >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > index f08134af..aa9cc236 100644
> > --- a/include/libcamera/internal/utils.h
> > +++ b/include/libcamera/internal/utils.h
> > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> >  char *secure_getenv(const char *name);
> >  std::string dirname(const std::string &path);
> >
> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > +

I'd move this to the EXIF class in the HAL. There's hardly any chance
that libcamera will need to convert a string to UCS-2.

> >  template<typename T>
> >  std::vector<typename T::key_type> map_keys(const T &map)
> >  {
> > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > index e90375ae..89cb0f73 100644
> > --- a/src/libcamera/utils.cpp
> > +++ b/src/libcamera/utils.cpp
> > @@ -17,6 +17,7 @@
> >  #include <string.h>
> >  #include <sys/stat.h>
> >  #include <sys/types.h>
> > +#include <uchar.h>
> >  #include <unistd.h>
> >
> >  /**
> > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> >  	return path.substr(0, pos + 1);
> >  }
> >
> > +/**
> > + * \brief Convert string to byte array of UCS-2
> 
> a string to a byte array of UCS-2 encoded code point
> 
> But I wonder, the encoding used to represent the characters in the
> string I assume depends on some locale, do they ?

std::string has no concept of encoding. It's only an array of char. This
function assumes that the input is encoded in UTF-8, and outputs a
UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string
doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
is valid UTF-16 text.

It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
to use UTF-16.

> > + * \param[in] str String to convert
> 
> The string to convert
> 
> > + * \param[in] le Little-endian (false for Big-endian)
> 
> The desired byte-endianess of the converted byte array.
> 
> An enum would not hurt, but it's not strictly required.
> 
> > + *
> > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> 
> While it is still not clear to me the distinction between UTF-16 and
> UCS-2 and I get the two are actually converging over time, the
> documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> 
> I guess it again depends on the encoding of \a str (which again
> depends on the selected locale ?)
> 
> > + */
> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> 
> I wonder why we use snake_case in utils ? maybe to mimic STL ?

For some functions, we reuse names defined elsewhere (for instance our
custom implementation of secure_getenv() for platforms that don't
provide it). For other functions, it's to mimick STL indeed (for
instance set_overlap, there's a std::set_union() and
std::set_intersection()). Then we have duration_to_timespec() and
time_point_to_string() that should probably be in camelCase.

For this function, I'd use cameraCase. The function should likely be
named utf8ToUtf16 or something similar, to make it clear what the source
encoding is.

If we want to keep this function a generator UTF-8 to UTF-16 convertor,
I'd return a std::u16string instead of a std::vector<uint8_t>. This
would mean handling the endianness conversion in the caller. The
alternative is to turn this into a UTF-8 to Exif-Unicode function, in
which case you can keep the le argument, but I would then rename it to
endianness and use the Exif endianness macros as values instead of
making it a bool.

> > +{
> > +	std::mbstate_t state{};
> > +	char16_t c16;
> > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> 
> One variable per line and maybe
>         const char *end = &str.back()

back() isn't the same.

	const char *ptr = str.data();
	const char *end = ptr + str.size();

> > +
> > +	std::vector<uint8_t> ret;
> 
> I would reserve str.size() * 2
> 
> Even if I get it's not necessarly that every char in str gets expanded
> to two bytes
> 
> > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> 
> std::mbrtoc16 ?
> How come the compiler does not complain ?

With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
mbrtoc16() is fine, but the state variable above should be changed.

The C and C++ libraries are allowed to, but not required to, define
functions and types in both the std namespace and the unqualified
namespace. It's unsafe to rely on this though.

> > +		if (rc == static_cast<size_t>(-2) ||
> > +		    rc == static_cast<size_t>(-1))
> > +			break;
> > +
> > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> 
> I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> automatically converted, does it ?
> 
> > +
> > +		if (rc > 0)
> > +			ptr += rc;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> >   * \brief Retrieve the keys of a std::map<>

-- 
Regards,

Laurent Pinchart