[libcamera-devel] [PATCH 1/6] utils: Add function to convert string to UCS-2

Mon Jan 18 10:26:04 CET 2021

Hi Jacopo,

On Mon, Jan 18, 2021 at 10:17:11AM +0100, Jacopo Mondi wrote:
> On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:
> > On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> > > Hi Paul,
> > >
> > >   I read a few things around, but character encoding seems a very
> > >   complex subject, so I mostly have minor comments here
> > >
> > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> > >
> > > From what I've read, even referring to Unicode might be mis-leading as
> > > it includes a number of different encodings. Do the EXIF specification
> > > mention Unicode or any other more specific standard ?
> >
> > The Exif specification is fairly bad in that regard. It only says
> > "unicode", while what it means (or at least what is used in practice
> > today, I don't know if that's what was meant originally) is UCS-2. The
> > endianness is also not mentioned, and we believe it should match the
> > endianness of the TIFF container, but that's not documented.
> >
> > > > are recommended to be in UCS-2. Add a function in utils to help with
> > > > this.
> >
> > Unicode is a standard that contains (among other things) a set of
> > characters with their respective code points (a number that identifies
> > the character) and a set of character encodings. Encodings define how
> > to encode a code point into bytes. UCS-2 is one particular encoding.
> > Saying "can be in Unicode but are recommended to be in UCS-2" is thus
> > not quite correct.
> >
> > > > Signed-off-by: Paul Elder <paul.elder at ideasonboard.com>
> > > > ---
> > > >  include/libcamera/internal/utils.h |  2 ++
> > > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> > > >  2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > > > index f08134af..aa9cc236 100644
> > > > --- a/include/libcamera/internal/utils.h
> > > > +++ b/include/libcamera/internal/utils.h
> > > > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> > > >  char *secure_getenv(const char *name);
> > > >  std::string dirname(const std::string &path);
> > > >
> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > > > +
> >
> > I'd move this to the EXIF class in the HAL. There's hardly any chance
> > that libcamera will need to convert a string to UCS-2.
> >
> > > >  template<typename T>
> > > >  std::vector<typename T::key_type> map_keys(const T &map)
> > > >  {
> > > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > > > index e90375ae..89cb0f73 100644
> > > > --- a/src/libcamera/utils.cpp
> > > > +++ b/src/libcamera/utils.cpp
> > > > @@ -17,6 +17,7 @@
> > > >  #include <string.h>
> > > >  #include <sys/stat.h>
> > > >  #include <sys/types.h>
> > > > +#include <uchar.h>
> > > >  #include <unistd.h>
> > > >
> > > >  /**
> > > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> > > >  	return path.substr(0, pos + 1);
> > > >  }
> > > >
> > > > +/**
> > > > + * \brief Convert string to byte array of UCS-2
> > >
> > > a string to a byte array of UCS-2 encoded code point
> > >
> > > But I wonder, the encoding used to represent the characters in the
> > > string I assume depends on some locale, do they ?
> >
> > std::string has no concept of encoding. It's only an array of char. This
> > function assumes that the input is encoded in UTF-8, and outputs a
> > UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string
> 
> Not sure I get the "concept of encoding". Each char is an 8-bit value
> which represents a code point. Different encodings might associate the code
> point to a different symbol.

In unicode a code point is a 32-bit value. The encoding then defines how
the code point is encoded in bytes (in a variable number of bytes for
UTF-8, a single byte that can only represent a subset of unicode for
ISO-8859-*, a variable number of u16 for UTF-16, a single u16 that can
only represent a subdev for unicode for UCS-2, ...). An std::string
doesn't specify any encoding. As it stores a string as a null-terminated
array of char, any char-based encoding is usable (UTF-8, ISO-8859-*,
...) but not "wide" char-based encodings (UTF-16, UCS-2, ...).

> Just after the first statement you said "UTF-8" is assumed, so there's
> some assumed encoding.
> 
> Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16"
> it sound a bit more clear.
> 
> > doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
> > is valid UTF-16 text.
> >
> > It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
> > to use UTF-16.
> >
> > > > + * \param[in] str String to convert
> > >
> > > The string to convert
> > >
> > > > + * \param[in] le Little-endian (false for Big-endian)
> > >
> > > The desired byte-endianess of the converted byte array.
> > >
> > > An enum would not hurt, but it's not strictly required.
> > >
> > > > + *
> > > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> > >
> > > While it is still not clear to me the distinction between UTF-16 and
> > > UCS-2 and I get the two are actually converging over time, the
> > > documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> > >
> > > I guess it again depends on the encoding of \a str (which again
> > > depends on the selected locale ?)
> > >
> > > > + */
> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> > >
> > > I wonder why we use snake_case in utils ? maybe to mimic STL ?
> >
> > For some functions, we reuse names defined elsewhere (for instance our
> > custom implementation of secure_getenv() for platforms that don't
> > provide it). For other functions, it's to mimick STL indeed (for
> > instance set_overlap, there's a std::set_union() and
> > std::set_intersection()). Then we have duration_to_timespec() and
> > time_point_to_string() that should probably be in camelCase.
> >
> > For this function, I'd use cameraCase. The function should likely be
> > named utf8ToUtf16 or something similar, to make it clear what the source
> > encoding is.
> >
> > If we want to keep this function a generator UTF-8 to UTF-16 convertor,
> > I'd return a std::u16string instead of a std::vector<uint8_t>. This
> > would mean handling the endianness conversion in the caller. The
> > alternative is to turn this into a UTF-8 to Exif-Unicode function, in
> > which case you can keep the le argument, but I would then rename it to
> > endianness and use the Exif endianness macros as values instead of
> > making it a bool.
> >
> > > > +{
> > > > +	std::mbstate_t state{};
> > > > +	char16_t c16;
> > > > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> > >
> > > One variable per line and maybe
> > >         const char *end = &str.back()
> >
> > back() isn't the same.
> 
> Ups
> std::string::back() : equivalent to operator[](size() - 1)
> 
> > 	const char *ptr = str.data();
> > 	const char *end = ptr + str.size();
> >
> > > > +
> > > > +	std::vector<uint8_t> ret;
> > >
> > > I would reserve str.size() * 2
> > >
> > > Even if I get it's not necessarly that every char in str gets expanded
> > > to two bytes
> > >
> > > > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> > >
> > > std::mbrtoc16 ?
> > > How come the compiler does not complain ?
> >
> > With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
> > should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
> > mbrtoc16() is fine, but the state variable above should be changed.
> 
> Uh, I assumed the C++ header having seen std::mbstate_t
> 
> > The C and C++ libraries are allowed to, but not required to, define
> > functions and types in both the std namespace and the unqualified
> > namespace. It's unsafe to rely on this though.
> >
> > > > +		if (rc == static_cast<size_t>(-2) ||
> > > > +		    rc == static_cast<size_t>(-1))
> > > > +			break;
> > > > +
> > > > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > > > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> > >
> > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> > > automatically converted, does it ?
> > >
> > > > +
> > > > +		if (rc > 0)
> > > > +			ptr += rc;
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> > > >   * \brief Retrieve the keys of a std::map<>

-- 
Regards,

Laurent Pinchart