[libcamera-devel] [PATCH 4/5] android: camera_worker: Notify fence wait failures

Tue Sep 7 15:52:44 CEST 2021

Hi Hiro,

On Tue, Sep 07, 2021 at 05:32:29PM +0900, Hirokazu Honda wrote:
> Hi Jacopo,
>
> On Tue, Sep 7, 2021 at 2:04 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
> >
> > Hi Hiro,
> >
> > On Tue, Sep 07, 2021 at 01:36:43AM +0900, Hirokazu Honda wrote:
> > > Hi Jacopo,
> > >
> > > On Tue, Sep 7, 2021 at 1:19 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > > >
> > > > Hi Hiro,
> > > >
> > > > On Mon, Sep 06, 2021 at 11:55:15PM +0900, Hirokazu Honda wrote:
> > > > > Hi Jacopo, thank you for the patch.
> > > > >
> > > > > On Mon, Sep 6, 2021 at 11:01 PM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > > > > >
> > > > > > The CameraDevice class stores one CaptureRequestDescriptor instance
> > > > > > for each capture request queued to the Camera. Such descriptors are
> > > > > > cleared when the Camera completes the Request in the
> > > > > > CameraDevice::requestComplete() slot.
> > > > > >
> > > > > > The CameraDevice class relies on an internal worker to off-load
> > > > > > waiting on synchronization fences, and does unconditionally store
> > > > > > the descriptor in the descriptors_ class member map to handle its
> > > > > > lifetime.
> > > > > >
> > > > > > If waiting on a fence fails, the created descriptor remains in the
> > > > > > descriptors_ map until the next call to stop() or close(), as
> > > > > > requestComplete() will never be called, and the camera
> > > > > > service is never notified about the queueing failure.
> > > > > >
> > > > > > Fix that by allowing the CameraWorker to notify to the CameraDevice if
> > > > > > waiting on a fence has failed, by installing a new requestQueueFailed
> > > > > > Signal.
> > > > > >
> > > > > > The Signal is emitted by the CameraWorker internal worker if waiting on
> > > > > > a fence has failed. The CameraDevice is augmented with a slot connected
> > > > > > to the Signal that removes the descriptor from the map and notifies the
> > > > > > camera framework about the failure.
> > > > > >
> > > > > > Signed-off-by: Jacopo Mondi <jacopo at jmondi.org>
> > > > > > ---
> > > > > >  src/android/camera_device.cpp | 22 +++++++++++++++++++++-
> > > > > >  src/android/camera_device.h   |  1 +
> > > > > >  src/android/camera_worker.cpp | 10 ++++++----
> > > > > >  src/android/camera_worker.h   |  9 +++++++--
> > > > > >  4 files changed, 35 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > > > > index a0ea138d9499..0ce9b699bfae 100644
> > > > > > --- a/src/android/camera_device.cpp
> > > > > > +++ b/src/android/camera_device.cpp
> > > > > > @@ -244,7 +244,7 @@ CameraDevice::Camera3RequestDescriptor::Camera3RequestDescriptor(
> > > > > >          * Create the CaptureRequest, stored as a unique_ptr<> to tie its
> > > > > >          * lifetime to the descriptor.
> > > > > >          */
> > > > > > -       request_ = std::make_unique<CaptureRequest>(camera);
> > > > > > +       request_ = std::make_unique<CaptureRequest>(camera, camera3Request);
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > @@ -264,6 +264,7 @@ CameraDevice::CameraDevice(unsigned int id, std::shared_ptr<Camera> camera)
> > > > > >         : id_(id), state_(State::Stopped), camera_(std::move(camera)),
> > > > > >           facing_(CAMERA_FACING_FRONT), orientation_(0)
> > > > > >  {
> > > > > > +       worker_.requestQueueFailed.connect(this, &CameraDevice::requestQueueFailed);
> > > > > >         camera_->requestCompleted.connect(this, &CameraDevice::requestComplete);
> > > > > >
> > > > > >         maker_ = "libcamera";
> > > > > > @@ -1060,6 +1061,25 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > > > >         return 0;
> > > > > >  }
> > > > > >
> > > > > > +void CameraDevice::requestQueueFailed(CaptureRequest *request)
> > > > > > +{
> > > > > > +       /*
> > > > > > +        * Remove the descriptor from the map if the worker has failed
> > > > > > +        * to process it and notify the camera service about the failure.
> > > > > > +        */
> > > > > > +       MutexLocker descriptorsLock(descriptorsMutex_);
> > > > > > +       auto it = descriptors_.find(request->cookie());
> > > > > > +       if (it == descriptors_.end()) {
> > > > > > +               LOG(HAL, Fatal)
> > > > >
> > > > > If I understand correctly, this must not happen because
> > > >
> > > > Do you mean the it == end() with "this" ?
> > > >
> > > > It should not happen as requestQueueFail is called by the worker if it
> > > > fails to wait on a fence and the descriptor is already part of
> > > > descriptors.
> > > >
> > > > Actually, there's a small window for this to race now that I look at
> > > > it again
> > > >
> > > > --------------------------------------------------------------------------
> > > > CameraDevice                            CameraWorker
> > > > --------------------------------------------------------------------------
> > > >
> > > > worker_.queueRequest()          ->
> > > >
> > > >                                         worker.processRequest
> > > >                                         fence.wait() == error
> > > >                                 <-      requestQueueFailed.emit()
> > > > requestQueueFailed()
> > > >         it == descriptors_.end()
> > > >         abort();
> > > >
> > > > descriptors_[req] = desc;
> > > > --------------------------------------------------------------------------
> > > >
> > > > While the intended order of execution is
> > > >
> > > > CameraDevice                            CameraWorker
> > > > --------------------------------------------------------------------------
> > > >
> > > > worker_.queueRequest()
> > > > descriptors_[req] = desc;
> > > >                                 ->
> > > >                                         worker.processRequest
> > > >                                         fence.wait() == error
> > > >                                 <-      requestQueueFailed.emit()
> > > > requestQueueFailed()
> > > >         it == descriptors_.end()
> > > >         abort();
> > > >
> > > > --------------------------------------------------------------------------
> > > >
> > > > I'll make sure to add the descriptor to descriptors_ before calling
> > > > worker_.queuRequest().
> > > >
> > > > > requestQueueFailed() is invoked by worker::stop() and it is executed
> > > >
> > > > Why do you say it is invoked by stop() ? The signal is emitted if
> > > > waiting on a fence fails as shown...
> > > >
> > >
> > > I don't probably understand when fence waiting error happens.
> > > Could you explain when it happens?
> >
> > I wish that doesn't happen, but sometimes things break
> >
> > [1:44:39.146219593] [4470] ERROR HAL camera_worker.cpp:122 Failed waiting for fence: 80: Timer expired
> >
> > don't know why, we have an arbitrary timeout there, that should
> > probably be better characterized. But it guess protecting against
> > poll() errors on a fence is not crazy.
> >
> > >
> > > It looks like, in the next patch, descriptors_ must be empty in some places.
> >
> > Not in some places. after the camera has been stopped it must be
> > empty, in CameraDevice::stop() and CameraDevice::close(), for 2 reasons:
> >
> > 1) It validates the pipeline correctly returns all buffers at
> > Camera::stop() time. If that doesn't happen we'll be missing error
> > notifications to the camera service, I feel it's better to validate
> > that early and ASSERT() during development than chaising errors across
> > the IPC boundary
> >
> > 2) It validates that, if 1) holds, the HAL implementation
> > correctly returns all requests to the camera service. As soon as I
> > added the ASSERT() I stumbled upon a crash, due to the fence wait being
> > failed as shown above. That lead to this patch.
> >
> > > How is it guaranteed that requestQueueFailed() is executed before
> > > descriptors_ is empty?
> >
> > CameraDevice::processCaptureRequest          CameraWorker
> >      A) worker_.queueRequest(req);
> >      B) descriptors_[req] = move(desc)
> >
> >                                         ->      queueRequest()
> >                                                    processReques()
> >
> >                                                    wait(fence) == error
> >                                                    requestQueueFailed.emit()
> >                                         <-
> >     C)  descriptor.erase(req)
> >
> >
> > As said, A) and B) should probably be inverted to make sure the
> > descriptor is queued before calling worker_.queueRequest() (which can
> > lead to emitting requestQueueFailed if waiting on a fence fails).
> >
> > In this version there's a tiny window where the control could be given
> > to the CameraWorker and returned to the CameraDevice after the
> > requestQueueFailed slot has been executed. In that case we would have
> > A, C, B that would lead to the Fatal assertion being triggered.
> >
> > If we guarantee that (A, B) are executed atomically, or simply do B
> > before A, then when the slot is called the descriptor is guaranteed to
> > be in map.
> >
> > I might be missing some races though. Can you point me to the part
> > that doesn't work for you ?
> >
>
> I think, when CameraDevice::stop() and CameraDevice::flush() is
> called, there may be a pending request waiting on fence.
> So considering from the ASSERT, Worker::stop() causes
> requestQueueFailed() for the waiting request.
> But if I remember/understand correctly, Worker::stop() flushes pending
> requests in a worker queue, that is, requests for which
> processRequest() has not been called.
> Therefore descriptors_ might not be empty thanks to the request
> waiting on fence.
> I probably miss something here.

It is my understanding that since commit: 7de2daf653bc ("android:
camera_worker: Process all queued requests when stopping") all the
messages pending in the queue (the async processRequest() calls) are
dispatched to the Worker which processes them until its message queue
its empty. This, in my understanding means that all the requests
posted to the CameraWorker with queueRequest() are processed and
queued to Camera (if there are no isses waiting on the fence) or
deleted with the new requestQueueFailed signal if waiting on the fence
fails.

If Request are queued to the Camera, Camera::stop() will complete them
(and their descriptors deleted), if instead requestQueueFailed is
emitted, the descriptor is now deleted by the associated slot in
CameraDevice.

Now, that's my understanding and it's based on the assumptions that:

1) All pending messages are delivered (which should be guaranteed by 7de2daf653bc
2) The async processRequest() function is invoked for each message
dispatched after the thread has stopped
3) All the processRequest() async calls either succesfully wait on
the fence and queue the Request to the Camera, or signal a failure in
waiting and emit the requestQueueFailed signal
4) All of this happens before worker_.stop() returns

Explicitly Cc-ed Laurent to confirm or disprove my understanding.

Anyway, I went probably too far with this patch as it introduces
something that makes me feel more confortable (ASSERT no requests are
left in-flight when stopping the camera) but it is not a strict
requirement nor it fixes any immediate bug, so if it proves to be
controversial I'll be happy to drop and move forward with more
pressing issues.

Thanks
   j

>
> -Hiro
> > Thanks
> >   j
> >
> >
> >
> > > -Hiro
> > > > > before descriptors_.clear() always.
> > > > > Correct?
> > > > >
> > > > > Reviewed-by: Hirokazu Honda <hiroh at chromium.org>
> > > >
> > > > > >   */
> > > > > > -CaptureRequest::CaptureRequest(libcamera::Camera *camera)
> > > > > > -       : camera_(camera)
> > > > > > +CaptureRequest::CaptureRequest(libcamera::Camera *camera,
> > > > > > +                              const camera3_capture_request_t *camera3Request)
> > > > > > +       : camera_(camera), camera3Request_(camera3Request)
> > > > > >  {
> > > > > >         request_ = camera_->createRequest(reinterpret_cast<uint64_t>(this));
> > > > > >  }
> > > > > > @@ -77,7 +78,7 @@ void CameraWorker::queueRequest(CaptureRequest *request)
> > > > > >  {
> > > > > >         /* Async process the request on the worker which runs its own thread. */
> > > > > >         worker_.invokeMethod(&Worker::processRequest, ConnectionTypeQueued,
> > > > > > -                            request);
> > > > > > +                            this, request);
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > @@ -109,7 +110,7 @@ int CameraWorker::Worker::waitFence(int fence)
> > > > > >         return -errno;
> > > > > >  }
> > > > > >
> > > > > > -void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > > > +void CameraWorker::Worker::processRequest(CameraWorker *context, CaptureRequest *request)
> > > > > >  {
> > > > > >         /* Wait on all fences before queuing the Request. */
> > > > > >         for (int fence : request->fences()) {
> > > > > > @@ -121,6 +122,7 @@ void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > > >                 if (ret < 0) {
> > > > > >                         LOG(HAL, Error) << "Failed waiting for fence: "
> > > > > >                                         << fence << ": " << strerror(-ret);
> > > > > > +                       context->requestQueueFailed.emit(request);
> > > >
> > > > .... Here
> > > >
> > > > Is your question about stop() related to dispatching messages before
> > > > the thread has actually been stopped ? Isn't that been fixed some time
> > > > ago by overriding CameraWorker::exec() ?
> > > >
> > > > Thanks
> > > >    j
> > > >