[libcamera-devel] softISP for libcamera

Wed Dec 6 19:15:06 CET 2023

Hi Pavel,

On 12/5/23 22:49, Pavel Machek wrote:
> Hi!
> 
>>> (I'd really like ability to collect statistics without doing the
>>> debayering,
>>
>> That is actually already a part of my refactoring, the stats
>> collection is now done separately (on a line by line basis
>> to keep the input data hot in the cache for debayering.
> 
> #define SWISP_LINARO_START_LINE_STATS()                 \
>         uint8_t r, g1, g2, b;                           \
> 
> ...
> 
> void SwIspLinaro::IspWorker::statsBGGR10PLine0(const uint8_t *src0)
> void SwIspLinaro::IspWorker::statsGBRG10PLine0(const uint8_t *src0)
> 
> ...
> 
> This is quite evil code. Could we have struct for statistics and
> inline functions instead?

I thought about this, but with a struct with say the R + G + B
sums used for white-balancing, the compiler will take the address
of the pointer passed to that struct and then deref that.

Where we really want these accumulators to be in registers
in the inner loop.

I guess the compiler might be smart enough to replace
a struct which is only used locally and passed around by reference
to static inline helpers with putting all the struct members into
registers and directly access them but I would rather not count
on this.

> There must be a better way then this.
> 
> What about
> 
> void inline statsBayer10PLine0(const uint8_t *src0, int bayer_type)
> 
> ... have a switch for BGGR/GBRG, ... variants passed as bayer_type. Then
> 
> void SwIspLinaro::IspWorker::statsBGGR10PLine0(const uint8_t *src0)
> {
>     statsBayer10PLine0(src0, 0);
> }
> 
> and rely on optimizer doing the right thing? Having 4 copies is
> ... not fun. Especially because I'll need to do one more copy for 8
> bit support...

Note that with the 3:

SWISP_LINARO_START_LINE_STATS()
SWISP_LINARO_ACCUMULATE_LINE_STATS()
SWISP_LINARO_FINISH_LINE_STATS()

helpers at least for the stats most of the work is done in
these 3 macros and those macros work for your 8 bits per
pixel bayer fmt unmodified. What remeans is actually reading
the r + g + b values from the bayer-grid which we need
separate code for for each variant regardless.

(using offsets is not a good idea because ideally we
want to read the bytes in memory order).

So in a way these ugly macros actually make handling different
bayer fmts easy, at least for the statistics gathering.

For the debayering itself having to type out everything
manually admittedly is a bit painful, but note how
with the 10bits / pixel packed fmt this allows skipping
the byte with the least-significant bits in an effecient
manner. Where as Andrey's generic debayer code was
full of offset calculations for each separate pixel
surrounding the currently processed pixel.

I agree having to type all this out this way is not
ideal, but it leads to significantly faster code and
we only need to write all this once. I plan to "hide"
all of this in a separate debayer class and then we
only need to write all this once and after that
I expect this to not need a lot of maintenance
(speaking from my experience with libv4lconvert).

I realize my approach leads to pretty verbose
code, but we only need to write + review this
once (per x-bits per pixel variant) and I would
rather focus on all the other work which needs
to be done then try to come up with a clever
solution for this.

With that said, if you find a clever solution
which is both equally readable and as fast as
my verbose solution I won't object against that.

A possible candidate for inspiration here is:

https://git.linuxtv.org/v4l-utils.git/tree/lib/libv4lconvert/bayer.c

which at least for 8 bits per pixel has some nice
pretty generic code. Although I would still prefer
separate functions for the inner loop for
blue first vs not blue first, removing the
if for this so that we don't mess up the branch
predictor and so that chances are better that
the entire inner-loop will fiy in the u-op cache.

Regards,

Hans