[libcamera-devel] [PATCH 1/6] utils: Add function to convert string to UCS-2

Mon Jan 18 10:17:11 CET 2021

Hi Laurent,

On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:
> Hi Paul,
>
> Thank you for the patch.
>
> On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> > Hi Paul,
> >
> >   I read a few things around, but character encoding seems a very
> >   complex subject, so I mostly have minor comments here
> >
> > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> >
> > From what I've read, even referring to Unicode might be mis-leading as
> > it includes a number of different encodings. Do the EXIF specification
> > mention Unicode or any other more specific standard ?
>
> The Exif specification is fairly bad in that regard. It only says
> "unicode", while what it means (or at least what is used in practice
> today, I don't know if that's what was meant originally) is UCS-2. The
> endianness is also not mentioned, and we believe it should match the
> endianness of the TIFF container, but that's not documented.
>
>
> > > are recommended to be in UCS-2. Add a function in utils to help with
> > > this.
>
> Unicode is a standard that contains (among other things) a set of
> characters with their respective code points (a number that identifies
> the character) and a set of character encodings. Encodings define how
> to encode a code point into bytes. UCS-2 is one particular encoding.
> Saying "can be in Unicode but are recommended to be in UCS-2" is thus
> not quite correct.
>
> > > Signed-off-by: Paul Elder <paul.elder at ideasonboard.com>
> > > ---
> > >  include/libcamera/internal/utils.h |  2 ++
> > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> > >  2 files changed, 32 insertions(+)
> > >
> > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > > index f08134af..aa9cc236 100644
> > > --- a/include/libcamera/internal/utils.h
> > > +++ b/include/libcamera/internal/utils.h
> > > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> > >  char *secure_getenv(const char *name);
> > >  std::string dirname(const std::string &path);
> > >
> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > > +
>
> I'd move this to the EXIF class in the HAL. There's hardly any chance
> that libcamera will need to convert a string to UCS-2.
>
> > >  template<typename T>
> > >  std::vector<typename T::key_type> map_keys(const T &map)
> > >  {
> > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > > index e90375ae..89cb0f73 100644
> > > --- a/src/libcamera/utils.cpp
> > > +++ b/src/libcamera/utils.cpp
> > > @@ -17,6 +17,7 @@
> > >  #include <string.h>
> > >  #include <sys/stat.h>
> > >  #include <sys/types.h>
> > > +#include <uchar.h>
> > >  #include <unistd.h>
> > >
> > >  /**
> > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> > >  	return path.substr(0, pos + 1);
> > >  }
> > >
> > > +/**
> > > + * \brief Convert string to byte array of UCS-2
> >
> > a string to a byte array of UCS-2 encoded code point
> >
> > But I wonder, the encoding used to represent the characters in the
> > string I assume depends on some locale, do they ?
>
> std::string has no concept of encoding. It's only an array of char. This
> function assumes that the input is encoded in UTF-8, and outputs a
> UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string

Not sure I get the "concept of encoding". Each char is an 8-bit value
which represents a code point. Different encodings might associate the code
point to a different symbol.

Just after the first statement you said "UTF-8" is assumed, so there's
some assumed encoding.

Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16"
it sound a bit more clear.

> doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
> is valid UTF-16 text.
>
> It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
> to use UTF-16.
>
> > > + * \param[in] str String to convert
> >
> > The string to convert
> >
> > > + * \param[in] le Little-endian (false for Big-endian)
> >
> > The desired byte-endianess of the converted byte array.
> >
> > An enum would not hurt, but it's not strictly required.
> >
> > > + *
> > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> >
> > While it is still not clear to me the distinction between UTF-16 and
> > UCS-2 and I get the two are actually converging over time, the
> > documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> >
> > I guess it again depends on the encoding of \a str (which again
> > depends on the selected locale ?)
> >
> > > + */
> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> >
> > I wonder why we use snake_case in utils ? maybe to mimic STL ?
>
> For some functions, we reuse names defined elsewhere (for instance our
> custom implementation of secure_getenv() for platforms that don't
> provide it). For other functions, it's to mimick STL indeed (for
> instance set_overlap, there's a std::set_union() and
> std::set_intersection()). Then we have duration_to_timespec() and
> time_point_to_string() that should probably be in camelCase.
>
> For this function, I'd use cameraCase. The function should likely be
> named utf8ToUtf16 or something similar, to make it clear what the source
> encoding is.
>
> If we want to keep this function a generator UTF-8 to UTF-16 convertor,
> I'd return a std::u16string instead of a std::vector<uint8_t>. This
> would mean handling the endianness conversion in the caller. The
> alternative is to turn this into a UTF-8 to Exif-Unicode function, in
> which case you can keep the le argument, but I would then rename it to
> endianness and use the Exif endianness macros as values instead of
> making it a bool.
>
> > > +{
> > > +	std::mbstate_t state{};
> > > +	char16_t c16;
> > > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> >
> > One variable per line and maybe
> >         const char *end = &str.back()
>
> back() isn't the same.
>

Ups
std::string::back() : equivalent to operator[](size() - 1)

> 	const char *ptr = str.data();
> 	const char *end = ptr + str.size();
>
> > > +
> > > +	std::vector<uint8_t> ret;
> >
> > I would reserve str.size() * 2
> >
> > Even if I get it's not necessarly that every char in str gets expanded
> > to two bytes
> >
> > > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> >
> > std::mbrtoc16 ?
> > How come the compiler does not complain ?
>
> With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
> should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
> mbrtoc16() is fine, but the state variable above should be changed.

Uh, I assumed the C++ header having seen std::mbstate_t

>
> The C and C++ libraries are allowed to, but not required to, define
> functions and types in both the std namespace and the unqualified
> namespace. It's unsafe to rely on this though.
>
> > > +		if (rc == static_cast<size_t>(-2) ||
> > > +		    rc == static_cast<size_t>(-1))
> > > +			break;
> > > +
> > > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> >
> > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> > automatically converted, does it ?
> >
> > > +
> > > +		if (rc > 0)
> > > +			ptr += rc;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /**
> > >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> > >   * \brief Retrieve the keys of a std::map<>
>
> --
> Regards,
>
> Laurent Pinchart