[libcamera-devel] [PATCH 4/5] android: camera_worker: Notify fence wait failures

Hirokazu Honda hiroh at chromium.org
Tue Sep 7 10:32:29 CEST 2021


Hi Jacopo,

On Tue, Sep 7, 2021 at 2:04 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
>
> Hi Hiro,
>
> On Tue, Sep 07, 2021 at 01:36:43AM +0900, Hirokazu Honda wrote:
> > Hi Jacopo,
> >
> > On Tue, Sep 7, 2021 at 1:19 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > >
> > > Hi Hiro,
> > >
> > > On Mon, Sep 06, 2021 at 11:55:15PM +0900, Hirokazu Honda wrote:
> > > > Hi Jacopo, thank you for the patch.
> > > >
> > > > On Mon, Sep 6, 2021 at 11:01 PM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > > > >
> > > > > The CameraDevice class stores one CaptureRequestDescriptor instance
> > > > > for each capture request queued to the Camera. Such descriptors are
> > > > > cleared when the Camera completes the Request in the
> > > > > CameraDevice::requestComplete() slot.
> > > > >
> > > > > The CameraDevice class relies on an internal worker to off-load
> > > > > waiting on synchronization fences, and does unconditionally store
> > > > > the descriptor in the descriptors_ class member map to handle its
> > > > > lifetime.
> > > > >
> > > > > If waiting on a fence fails, the created descriptor remains in the
> > > > > descriptors_ map until the next call to stop() or close(), as
> > > > > requestComplete() will never be called, and the camera
> > > > > service is never notified about the queueing failure.
> > > > >
> > > > > Fix that by allowing the CameraWorker to notify to the CameraDevice if
> > > > > waiting on a fence has failed, by installing a new requestQueueFailed
> > > > > Signal.
> > > > >
> > > > > The Signal is emitted by the CameraWorker internal worker if waiting on
> > > > > a fence has failed. The CameraDevice is augmented with a slot connected
> > > > > to the Signal that removes the descriptor from the map and notifies the
> > > > > camera framework about the failure.
> > > > >
> > > > > Signed-off-by: Jacopo Mondi <jacopo at jmondi.org>
> > > > > ---
> > > > >  src/android/camera_device.cpp | 22 +++++++++++++++++++++-
> > > > >  src/android/camera_device.h   |  1 +
> > > > >  src/android/camera_worker.cpp | 10 ++++++----
> > > > >  src/android/camera_worker.h   |  9 +++++++--
> > > > >  4 files changed, 35 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > > > index a0ea138d9499..0ce9b699bfae 100644
> > > > > --- a/src/android/camera_device.cpp
> > > > > +++ b/src/android/camera_device.cpp
> > > > > @@ -244,7 +244,7 @@ CameraDevice::Camera3RequestDescriptor::Camera3RequestDescriptor(
> > > > >          * Create the CaptureRequest, stored as a unique_ptr<> to tie its
> > > > >          * lifetime to the descriptor.
> > > > >          */
> > > > > -       request_ = std::make_unique<CaptureRequest>(camera);
> > > > > +       request_ = std::make_unique<CaptureRequest>(camera, camera3Request);
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -264,6 +264,7 @@ CameraDevice::CameraDevice(unsigned int id, std::shared_ptr<Camera> camera)
> > > > >         : id_(id), state_(State::Stopped), camera_(std::move(camera)),
> > > > >           facing_(CAMERA_FACING_FRONT), orientation_(0)
> > > > >  {
> > > > > +       worker_.requestQueueFailed.connect(this, &CameraDevice::requestQueueFailed);
> > > > >         camera_->requestCompleted.connect(this, &CameraDevice::requestComplete);
> > > > >
> > > > >         maker_ = "libcamera";
> > > > > @@ -1060,6 +1061,25 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > +void CameraDevice::requestQueueFailed(CaptureRequest *request)
> > > > > +{
> > > > > +       /*
> > > > > +        * Remove the descriptor from the map if the worker has failed
> > > > > +        * to process it and notify the camera service about the failure.
> > > > > +        */
> > > > > +       MutexLocker descriptorsLock(descriptorsMutex_);
> > > > > +       auto it = descriptors_.find(request->cookie());
> > > > > +       if (it == descriptors_.end()) {
> > > > > +               LOG(HAL, Fatal)
> > > >
> > > > If I understand correctly, this must not happen because
> > >
> > > Do you mean the it == end() with "this" ?
> > >
> > > It should not happen as requestQueueFail is called by the worker if it
> > > fails to wait on a fence and the descriptor is already part of
> > > descriptors.
> > >
> > > Actually, there's a small window for this to race now that I look at
> > > it again
> > >
> > > --------------------------------------------------------------------------
> > > CameraDevice                            CameraWorker
> > > --------------------------------------------------------------------------
> > >
> > > worker_.queueRequest()          ->
> > >
> > >                                         worker.processRequest
> > >                                         fence.wait() == error
> > >                                 <-      requestQueueFailed.emit()
> > > requestQueueFailed()
> > >         it == descriptors_.end()
> > >         abort();
> > >
> > > descriptors_[req] = desc;
> > > --------------------------------------------------------------------------
> > >
> > > While the intended order of execution is
> > >
> > > CameraDevice                            CameraWorker
> > > --------------------------------------------------------------------------
> > >
> > > worker_.queueRequest()
> > > descriptors_[req] = desc;
> > >                                 ->
> > >                                         worker.processRequest
> > >                                         fence.wait() == error
> > >                                 <-      requestQueueFailed.emit()
> > > requestQueueFailed()
> > >         it == descriptors_.end()
> > >         abort();
> > >
> > > --------------------------------------------------------------------------
> > >
> > > I'll make sure to add the descriptor to descriptors_ before calling
> > > worker_.queuRequest().
> > >
> > > > requestQueueFailed() is invoked by worker::stop() and it is executed
> > >
> > > Why do you say it is invoked by stop() ? The signal is emitted if
> > > waiting on a fence fails as shown...
> > >
> >
> > I don't probably understand when fence waiting error happens.
> > Could you explain when it happens?
>
> I wish that doesn't happen, but sometimes things break
>
> [1:44:39.146219593] [4470] ERROR HAL camera_worker.cpp:122 Failed waiting for fence: 80: Timer expired
>
> don't know why, we have an arbitrary timeout there, that should
> probably be better characterized. But it guess protecting against
> poll() errors on a fence is not crazy.
>
> >
> > It looks like, in the next patch, descriptors_ must be empty in some places.
>
> Not in some places. after the camera has been stopped it must be
> empty, in CameraDevice::stop() and CameraDevice::close(), for 2 reasons:
>
> 1) It validates the pipeline correctly returns all buffers at
> Camera::stop() time. If that doesn't happen we'll be missing error
> notifications to the camera service, I feel it's better to validate
> that early and ASSERT() during development than chaising errors across
> the IPC boundary
>
> 2) It validates that, if 1) holds, the HAL implementation
> correctly returns all requests to the camera service. As soon as I
> added the ASSERT() I stumbled upon a crash, due to the fence wait being
> failed as shown above. That lead to this patch.
>
> > How is it guaranteed that requestQueueFailed() is executed before
> > descriptors_ is empty?
>
> CameraDevice::processCaptureRequest          CameraWorker
>      A) worker_.queueRequest(req);
>      B) descriptors_[req] = move(desc)
>
>                                         ->      queueRequest()
>                                                    processReques()
>
>                                                    wait(fence) == error
>                                                    requestQueueFailed.emit()
>                                         <-
>     C)  descriptor.erase(req)
>
>
> As said, A) and B) should probably be inverted to make sure the
> descriptor is queued before calling worker_.queueRequest() (which can
> lead to emitting requestQueueFailed if waiting on a fence fails).
>
> In this version there's a tiny window where the control could be given
> to the CameraWorker and returned to the CameraDevice after the
> requestQueueFailed slot has been executed. In that case we would have
> A, C, B that would lead to the Fatal assertion being triggered.
>
> If we guarantee that (A, B) are executed atomically, or simply do B
> before A, then when the slot is called the descriptor is guaranteed to
> be in map.
>
> I might be missing some races though. Can you point me to the part
> that doesn't work for you ?
>

I think, when CameraDevice::stop() and CameraDevice::flush() is
called, there may be a pending request waiting on fence.
So considering from the ASSERT, Worker::stop() causes
requestQueueFailed() for the waiting request.
But if I remember/understand correctly, Worker::stop() flushes pending
requests in a worker queue, that is, requests for which
processRequest() has not been called.
Therefore descriptors_ might not be empty thanks to the request
waiting on fence.
I probably miss something here.

-Hiro
> Thanks
>   j
>
>
>
> > -Hiro
> > > > before descriptors_.clear() always.
> > > > Correct?
> > > >
> > > > Reviewed-by: Hirokazu Honda <hiroh at chromium.org>
> > >
> > > > >   */
> > > > > -CaptureRequest::CaptureRequest(libcamera::Camera *camera)
> > > > > -       : camera_(camera)
> > > > > +CaptureRequest::CaptureRequest(libcamera::Camera *camera,
> > > > > +                              const camera3_capture_request_t *camera3Request)
> > > > > +       : camera_(camera), camera3Request_(camera3Request)
> > > > >  {
> > > > >         request_ = camera_->createRequest(reinterpret_cast<uint64_t>(this));
> > > > >  }
> > > > > @@ -77,7 +78,7 @@ void CameraWorker::queueRequest(CaptureRequest *request)
> > > > >  {
> > > > >         /* Async process the request on the worker which runs its own thread. */
> > > > >         worker_.invokeMethod(&Worker::processRequest, ConnectionTypeQueued,
> > > > > -                            request);
> > > > > +                            this, request);
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -109,7 +110,7 @@ int CameraWorker::Worker::waitFence(int fence)
> > > > >         return -errno;
> > > > >  }
> > > > >
> > > > > -void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > > +void CameraWorker::Worker::processRequest(CameraWorker *context, CaptureRequest *request)
> > > > >  {
> > > > >         /* Wait on all fences before queuing the Request. */
> > > > >         for (int fence : request->fences()) {
> > > > > @@ -121,6 +122,7 @@ void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > >                 if (ret < 0) {
> > > > >                         LOG(HAL, Error) << "Failed waiting for fence: "
> > > > >                                         << fence << ": " << strerror(-ret);
> > > > > +                       context->requestQueueFailed.emit(request);
> > >
> > > .... Here
> > >
> > > Is your question about stop() related to dispatching messages before
> > > the thread has actually been stopped ? Isn't that been fixed some time
> > > ago by overriding CameraWorker::exec() ?
> > >
> > > Thanks
> > >    j
> > >


More information about the libcamera-devel mailing list