[libcamera-devel] [PATCH 4/5] android: camera_worker: Notify fence wait failures

Mon Sep 6 18:36:43 CEST 2021

Hi Jacopo,

On Tue, Sep 7, 2021 at 1:19 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
>
> Hi Hiro,
>
> On Mon, Sep 06, 2021 at 11:55:15PM +0900, Hirokazu Honda wrote:
> > Hi Jacopo, thank you for the patch.
> >
> > On Mon, Sep 6, 2021 at 11:01 PM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > >
> > > The CameraDevice class stores one CaptureRequestDescriptor instance
> > > for each capture request queued to the Camera. Such descriptors are
> > > cleared when the Camera completes the Request in the
> > > CameraDevice::requestComplete() slot.
> > >
> > > The CameraDevice class relies on an internal worker to off-load
> > > waiting on synchronization fences, and does unconditionally store
> > > the descriptor in the descriptors_ class member map to handle its
> > > lifetime.
> > >
> > > If waiting on a fence fails, the created descriptor remains in the
> > > descriptors_ map until the next call to stop() or close(), as
> > > requestComplete() will never be called, and the camera
> > > service is never notified about the queueing failure.
> > >
> > > Fix that by allowing the CameraWorker to notify to the CameraDevice if
> > > waiting on a fence has failed, by installing a new requestQueueFailed
> > > Signal.
> > >
> > > The Signal is emitted by the CameraWorker internal worker if waiting on
> > > a fence has failed. The CameraDevice is augmented with a slot connected
> > > to the Signal that removes the descriptor from the map and notifies the
> > > camera framework about the failure.
> > >
> > > Signed-off-by: Jacopo Mondi <jacopo at jmondi.org>
> > > ---
> > >  src/android/camera_device.cpp | 22 +++++++++++++++++++++-
> > >  src/android/camera_device.h   |  1 +
> > >  src/android/camera_worker.cpp | 10 ++++++----
> > >  src/android/camera_worker.h   |  9 +++++++--
> > >  4 files changed, 35 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > index a0ea138d9499..0ce9b699bfae 100644
> > > --- a/src/android/camera_device.cpp
> > > +++ b/src/android/camera_device.cpp
> > > @@ -244,7 +244,7 @@ CameraDevice::Camera3RequestDescriptor::Camera3RequestDescriptor(
> > >          * Create the CaptureRequest, stored as a unique_ptr<> to tie its
> > >          * lifetime to the descriptor.
> > >          */
> > > -       request_ = std::make_unique<CaptureRequest>(camera);
> > > +       request_ = std::make_unique<CaptureRequest>(camera, camera3Request);
> > >  }
> > >
> > >  /*
> > > @@ -264,6 +264,7 @@ CameraDevice::CameraDevice(unsigned int id, std::shared_ptr<Camera> camera)
> > >         : id_(id), state_(State::Stopped), camera_(std::move(camera)),
> > >           facing_(CAMERA_FACING_FRONT), orientation_(0)
> > >  {
> > > +       worker_.requestQueueFailed.connect(this, &CameraDevice::requestQueueFailed);
> > >         camera_->requestCompleted.connect(this, &CameraDevice::requestComplete);
> > >
> > >         maker_ = "libcamera";
> > > @@ -1060,6 +1061,25 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > >         return 0;
> > >  }
> > >
> > > +void CameraDevice::requestQueueFailed(CaptureRequest *request)
> > > +{
> > > +       /*
> > > +        * Remove the descriptor from the map if the worker has failed
> > > +        * to process it and notify the camera service about the failure.
> > > +        */
> > > +       MutexLocker descriptorsLock(descriptorsMutex_);
> > > +       auto it = descriptors_.find(request->cookie());
> > > +       if (it == descriptors_.end()) {
> > > +               LOG(HAL, Fatal)
> >
> > If I understand correctly, this must not happen because
>
> Do you mean the it == end() with "this" ?
>
> It should not happen as requestQueueFail is called by the worker if it
> fails to wait on a fence and the descriptor is already part of
> descriptors.
>
> Actually, there's a small window for this to race now that I look at
> it again
>
> --------------------------------------------------------------------------
> CameraDevice                            CameraWorker
> --------------------------------------------------------------------------
>
> worker_.queueRequest()          ->
>
>                                         worker.processRequest
>                                         fence.wait() == error
>                                 <-      requestQueueFailed.emit()
> requestQueueFailed()
>         it == descriptors_.end()
>         abort();
>
> descriptors_[req] = desc;
> --------------------------------------------------------------------------
>
> While the intended order of execution is
>
> CameraDevice                            CameraWorker
> --------------------------------------------------------------------------
>
> worker_.queueRequest()
> descriptors_[req] = desc;
>                                 ->
>                                         worker.processRequest
>                                         fence.wait() == error
>                                 <-      requestQueueFailed.emit()
> requestQueueFailed()
>         it == descriptors_.end()
>         abort();
>
> --------------------------------------------------------------------------
>
> I'll make sure to add the descriptor to descriptors_ before calling
> worker_.queuRequest().
>
> > requestQueueFailed() is invoked by worker::stop() and it is executed
>
> Why do you say it is invoked by stop() ? The signal is emitted if
> waiting on a fence fails as shown...
>

I don't probably understand when fence waiting error happens.
Could you explain when it happens?

It looks like, in the next patch, descriptors_ must be empty in some places.
How is it guaranteed that requestQueueFailed() is executed before
descriptors_ is empty?

-Hiro
> > before descriptors_.clear() always.
> > Correct?
> >
> > Reviewed-by: Hirokazu Honda <hiroh at chromium.org>
>
> > >   */
> > > -CaptureRequest::CaptureRequest(libcamera::Camera *camera)
> > > -       : camera_(camera)
> > > +CaptureRequest::CaptureRequest(libcamera::Camera *camera,
> > > +                              const camera3_capture_request_t *camera3Request)
> > > +       : camera_(camera), camera3Request_(camera3Request)
> > >  {
> > >         request_ = camera_->createRequest(reinterpret_cast<uint64_t>(this));
> > >  }
> > > @@ -77,7 +78,7 @@ void CameraWorker::queueRequest(CaptureRequest *request)
> > >  {
> > >         /* Async process the request on the worker which runs its own thread. */
> > >         worker_.invokeMethod(&Worker::processRequest, ConnectionTypeQueued,
> > > -                            request);
> > > +                            this, request);
> > >  }
> > >
> > >  /*
> > > @@ -109,7 +110,7 @@ int CameraWorker::Worker::waitFence(int fence)
> > >         return -errno;
> > >  }
> > >
> > > -void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > +void CameraWorker::Worker::processRequest(CameraWorker *context, CaptureRequest *request)
> > >  {
> > >         /* Wait on all fences before queuing the Request. */
> > >         for (int fence : request->fences()) {
> > > @@ -121,6 +122,7 @@ void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > >                 if (ret < 0) {
> > >                         LOG(HAL, Error) << "Failed waiting for fence: "
> > >                                         << fence << ": " << strerror(-ret);
> > > +                       context->requestQueueFailed.emit(request);
>
> .... Here
>
> Is your question about stop() related to dispatching messages before
> the thread has actually been stopped ? Isn't that been fixed some time
> ago by overriding CameraWorker::exec() ?
>
> Thanks
>    j
>