SIMD black level correction -- Re: Soft ISP TODO list

Tue Apr 23 13:11:56 CEST 2024

Hi!

> > > > > > - If we split the processing to pre-bayer and post-bayer parts, we should
> > > > > >   probably work with uint16_t or floats, which may have an impact on performance.
> > > > > > 
> > > > > > - Pavel couldn't get a better performance by using SIMD CPU instructions for
> > > > > >   debayering.  Applying a CCM matrix may be a different matter.  Anyway, SIMD on
> > > > > >   CPU is hard to use and may differ between architectures, so the question is
> > > > > >   whether it's worth investing in it.
> > > > 
> > > > Good question :-)
> > > 
> > > Oh, so the good news is that you write the SIMD code once with gcc
> > > intrinsics, and gcc does its magic. You don't have to know assembly for
> > > that, but it certainly helps to look at the assembly to check that it
> > > looks reasonable.
> > 
> > There are also potentially interesting helper libraries such as
> > https://github.com/vectorclass/version2 (I haven't checked the license
> > compatibility).
> 
> Ok, so I played with black level correction a bit and got a pleasant
> surprise: [Ignore the wrong function name].
> 
> void debayer8(uint8_t *dst, const uint8_t *src)
> {
>         for (int x = 0; x < (int)WIDTH; x++)  {
>                 uint8_t v = src[x];
>                 if (v < 16)
>                         dst[x] = 0;
>                 else
>                         dst[x] = v-16;
>         }
> }
> 
> gcc translates it to vector code automatically, and the result is only
> 10% slower than plain memcpy. The test was done on a thinkpad x60. If I
> disable vector instructions, the result takes 4x as long as plain memcpy.
> I'm quite impressed both by the vector unit and by gcc :-).

Ok, the disassembly below was for a different function than the one the
benchmark was running, due to inlining, but you get the idea. The code is at

https://gitlab.com/tui/tui/-/blob/master/cam/blacklevel.c?ref_type=heads

if someone wants to play.
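
For reference, the loop from the quoted snippet can be written as a
standalone function along the following lines. This is only a sketch and
not a copy of blacklevel.c: the name black_level8, the BLACK_LEVEL macro
and the explicit length parameter are assumptions made for illustration
(the quoted code uses a fixed WIDTH and a hard-coded 16).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant; the quoted snippet hard-codes 16. */
#define BLACK_LEVEL 16

/*
 * Subtract BLACK_LEVEL from each 8-bit sample, clamping at zero.
 * "restrict" promises the compiler that dst and src do not overlap,
 * which removes one obstacle to auto-vectorization.
 */
void black_level8(uint8_t *restrict dst, const uint8_t *restrict src,
		  size_t n)
{
	for (size_t x = 0; x < n; x++) {
		uint8_t v = src[x];

		/* Same behaviour as the if/else in the quoted code. */
		dst[x] = v < BLACK_LEVEL ? 0 : (uint8_t)(v - BLACK_LEVEL);
	}
}
```

Building once with "gcc -O3" and once with "gcc -O3 -fno-tree-vectorize"
should be one way to compare the vectorized and scalar versions, and
"-fopt-info-vec" asks gcc to report which loops it vectorized.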

I tried a matrix multiply, and while I do get a small improvement
from "tree-vectorize", it only goes from 1.05 sec to 0.94 sec. A further
improvement to about 0.8 sec is possible with "fma". But this is still
many times slower than memcpy(), so I'm not sure we can get good
performance there.

The matmult.c code is in the same directory.
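
To make the discussion concrete, here is a minimal sketch of the kind of
per-pixel 3x3 matrix multiply that applying a CCM involves. It is not the
contents of matmult.c: the function name apply_ccm, the float sample
type, the interleaved RGB layout and the row-major matrix are all
assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Apply a 3x3 colour correction matrix to n interleaved RGB pixels.
 * Assumed layout: src/dst hold r,g,b,r,g,b,...; ccm is row-major,
 * so output red is ccm[0]*r + ccm[1]*g + ccm[2]*b, and so on.
 */
void apply_ccm(float *restrict dst, const float *restrict src,
	       const float *restrict ccm, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const float r = src[3 * i + 0];
		const float g = src[3 * i + 1];
		const float b = src[3 * i + 2];

		dst[3 * i + 0] = ccm[0] * r + ccm[1] * g + ccm[2] * b;
		dst[3 * i + 1] = ccm[3] * r + ccm[4] * g + ccm[5] * b;
		dst[3 * i + 2] = ccm[6] * r + ccm[7] * g + ccm[8] * b;
	}
}
```

Each pixel costs nine multiplies and six additions, compared with a
single subtraction per byte for black level correction, so some gap to
memcpy() is to be expected. On x86, the "fma" mentioned above presumably
corresponds to gcc's -mfma option, which lets the compiler fuse the
multiply/add pairs; that mapping is an assumption on my part.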

Best regards,
								Pavel
-- 
People of Russia, stop Putin before his war on Ukraine escalates.