[libcamera-devel] softISP for libcamera

Wed Dec 6 22:12:37 CET 2023

Hi!

> > This is quite evil code. Could we have struct for statistics and
> > inline functions instead?
> 
> I thought about this, but with a struct with say the R + G + B
> sums used for white-balancing, the compiler will take the address
> of the pointer passed to that struct and then deref that.
> 
> Where we really want these accumulators to be in registers
> in the inner loop.
> 
> I guess the compiler might be smart enough to replace
> a struct which is only used locally and passed around by reference
> to static inline helpers with putting all the struct members into
> registers and directly access them but I would rather not count
> on this.

Any modern compiler should be smart enough to do that for static
inlines. And it is our best chance to avoid copy&paste or macro hell. 

> SWISP_LINARO_START_LINE_STATS()
> SWISP_LINARO_ACCUMULATE_LINE_STATS()
> SWISP_LINARO_FINISH_LINE_STATS()
> 
> helpers at least for the stats most of the work is done in
> these 3 macros and those macros work for your 8 bits per
> pixel bayer fmt unmodified. What remeans is actually reading
> the r + g + b values from the bayer-grid which we need
> separate code for for each variant regardless.
> 
> (using offsets is not a good idea because ideally we
> want to read the bytes in memory order).

Compiler is free to reorder the accesses, anyway. And while I see
there may be some benefit in accessing cache-lines in order, order of
access of bytes within cacheline should not really matter.

> So in a way these ugly macros actually make handling different
> bayer fmts easy, at least for the statistics gathering.
> 
> For the debayering itself having to type out everything
> manually admittedly is a bit painful, but note how
> with the 10bits / pixel packed fmt this allows skipping
> the byte with the least-significant bits in an effecient
> manner. Where as Andrey's generic debayer code was
> full of offset calculations for each separate pixel
> surrounding the currently processed pixel.

I'm not saying Andrey's code was perfect, but it should be possible to
get same performance while keeping the code readable.

> With that said, if you find a clever solution
> which is both equally readable and as fast as
> my verbose solution I won't object against that.

I believe I can do that. Lets say +-5% performance should be acceptable?

> which at least for 8 bits per pixel has some nice
> pretty generic code. Although I would still prefer
> separate functions for the inner loop for
> blue first vs not blue first, removing the
> if for this so that we don't mess up the branch
> predictor and so that chances are better that
> the entire inner-loop will fiy in the u-op cache.

Once we get sane code, making compiler generate 4 copies of it
depending on usage with slightly different inner loops should not
really be hard. Just make sure function is static inlines (or even
__always inline__or something), and value is constant at the caller.

What scares me is that you have 8 copies of the debayer code, at the
moment. I'd need at 8 more for 8-bit variants. And then there are
10-bit unpacked... just to cover Librem 5 and PinePhone. I believe
12-bit is in use, too.

I'd rather clean it up before adding the 8-bit variants...

Tell me when you believe basic architecture is stable enough to do
this...

Best regards,
     							Pavel
-- 
People of Russia, stop Putin before his war on Ukraine escalates.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
URL: <https://lists.libcamera.org/pipermail/libcamera-devel/attachments/20231206/6b42e87b/attachment.sig>