[libcamera-devel] [PATCH 4/5] android: camera_worker: Notify fence wait failures
Jacopo Mondi
jacopo at jmondi.org
Mon Sep 6 19:05:42 CEST 2021
Hi Hiro,
On Tue, Sep 07, 2021 at 01:36:43AM +0900, Hirokazu Honda wrote:
> Hi Jacopo,
>
> On Tue, Sep 7, 2021 at 1:19 AM Jacopo Mondi <jacopo at jmondi.org> wrote:
> >
> > Hi Hiro,
> >
> > On Mon, Sep 06, 2021 at 11:55:15PM +0900, Hirokazu Honda wrote:
> > > Hi Jacopo, thank you for the patch.
> > >
> > > On Mon, Sep 6, 2021 at 11:01 PM Jacopo Mondi <jacopo at jmondi.org> wrote:
> > > >
> > > > The CameraDevice class stores one CaptureRequestDescriptor instance
> > > > for each capture request queued to the Camera. Such descriptors are
> > > > cleared when the Camera completes the Request in the
> > > > CameraDevice::requestComplete() slot.
> > > >
> > > > The CameraDevice class relies on an internal worker to off-load
> > > > waiting on synchronization fences, and does unconditionally store
> > > > the descriptor in the descriptors_ class member map to handle its
> > > > lifetime.
> > > >
> > > > If waiting on a fence fails, the created descriptor remains in the
> > > > descriptors_ map until the next call to stop() or close(), as
> > > > requestComplete() will never be called, and the camera
> > > > service is never notified about the queueing failure.
> > > >
> > > > Fix that by allowing the CameraWorker to notify to the CameraDevice if
> > > > waiting on a fence has failed, by installing a new requestQueueFailed
> > > > Signal.
> > > >
> > > > The Signal is emitted by the CameraWorker internal worker if waiting on
> > > > a fence has failed. The CameraDevice is augmented with a slot connected
> > > > to the Signal that removes the descriptor from the map and notifies the
> > > > camera framework about the failure.
> > > >
> > > > Signed-off-by: Jacopo Mondi <jacopo at jmondi.org>
> > > > ---
> > > > src/android/camera_device.cpp | 22 +++++++++++++++++++++-
> > > > src/android/camera_device.h | 1 +
> > > > src/android/camera_worker.cpp | 10 ++++++----
> > > > src/android/camera_worker.h | 9 +++++++--
> > > > 4 files changed, 35 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > > index a0ea138d9499..0ce9b699bfae 100644
> > > > --- a/src/android/camera_device.cpp
> > > > +++ b/src/android/camera_device.cpp
> > > > @@ -244,7 +244,7 @@ CameraDevice::Camera3RequestDescriptor::Camera3RequestDescriptor(
> > > > * Create the CaptureRequest, stored as a unique_ptr<> to tie its
> > > > * lifetime to the descriptor.
> > > > */
> > > > - request_ = std::make_unique<CaptureRequest>(camera);
> > > > + request_ = std::make_unique<CaptureRequest>(camera, camera3Request);
> > > > }
> > > >
> > > > /*
> > > > @@ -264,6 +264,7 @@ CameraDevice::CameraDevice(unsigned int id, std::shared_ptr<Camera> camera)
> > > > : id_(id), state_(State::Stopped), camera_(std::move(camera)),
> > > > facing_(CAMERA_FACING_FRONT), orientation_(0)
> > > > {
> > > > + worker_.requestQueueFailed.connect(this, &CameraDevice::requestQueueFailed);
> > > > camera_->requestCompleted.connect(this, &CameraDevice::requestComplete);
> > > >
> > > > maker_ = "libcamera";
> > > > @@ -1060,6 +1061,25 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > > return 0;
> > > > }
> > > >
> > > > +void CameraDevice::requestQueueFailed(CaptureRequest *request)
> > > > +{
> > > > + /*
> > > > + * Remove the descriptor from the map if the worker has failed
> > > > + * to process it and notify the camera service about the failure.
> > > > + */
> > > > + MutexLocker descriptorsLock(descriptorsMutex_);
> > > > + auto it = descriptors_.find(request->cookie());
> > > > + if (it == descriptors_.end()) {
> > > > + LOG(HAL, Fatal)
> > >
> > > If I understand correctly, this must not happen because
> >
> > Do you mean the it == end() with "this" ?
> >
> > It should not happen as requestQueueFail is called by the worker if it
> > fails to wait on a fence and the descriptor is already part of
> > descriptors.
> >
> > Actually, there's a small window for this to race now that I look at
> > it again
> >
> > --------------------------------------------------------------------------
> > CameraDevice CameraWorker
> > --------------------------------------------------------------------------
> >
> > worker_.queueRequest() ->
> >
> > worker.processRequest
> > fence.wait() == error
> > <- requestQueueFailed.emit()
> > requestQueueFailed()
> > it == descriptors_.end()
> > abort();
> >
> > descriptors_[req] = desc;
> > --------------------------------------------------------------------------
> >
> > While the intended order of execution is
> >
> > CameraDevice CameraWorker
> > --------------------------------------------------------------------------
> >
> > worker_.queueRequest()
> > descriptors_[req] = desc;
> > ->
> > worker.processRequest
> > fence.wait() == error
> > <- requestQueueFailed.emit()
> > requestQueueFailed()
> > it == descriptors_.end()
> > abort();
> >
> > --------------------------------------------------------------------------
> >
> > I'll make sure to add the descriptor to descriptors_ before calling
> > worker_.queuRequest().
> >
> > > requestQueueFailed() is invoked by worker::stop() and it is executed
> >
> > Why do you say it is invoked by stop() ? The signal is emitted if
> > waiting on a fence fails as shown...
> >
>
> I don't probably understand when fence waiting error happens.
> Could you explain when it happens?
I wish that doesn't happen, but sometimes things break
[1:44:39.146219593] [4470] ERROR HAL camera_worker.cpp:122 Failed waiting for fence: 80: Timer expired
don't know why, we have an arbitrary timeout there, that should
probably be better characterized. But it guess protecting against
poll() errors on a fence is not crazy.
>
> It looks like, in the next patch, descriptors_ must be empty in some places.
Not in some places. after the camera has been stopped it must be
empty, in CameraDevice::stop() and CameraDevice::close(), for 2 reasons:
1) It validates the pipeline correctly returns all buffers at
Camera::stop() time. If that doesn't happen we'll be missing error
notifications to the camera service, I feel it's better to validate
that early and ASSERT() during development than chaising errors across
the IPC boundary
2) It validates that, if 1) holds, the HAL implementation
correctly returns all requests to the camera service. As soon as I
added the ASSERT() I stumbled upon a crash, due to the fence wait being
failed as shown above. That lead to this patch.
> How is it guaranteed that requestQueueFailed() is executed before
> descriptors_ is empty?
CameraDevice::processCaptureRequest CameraWorker
A) worker_.queueRequest(req);
B) descriptors_[req] = move(desc)
-> queueRequest()
processReques()
wait(fence) == error
requestQueueFailed.emit()
<-
C) descriptor.erase(req)
As said, A) and B) should probably be inverted to make sure the
descriptor is queued before calling worker_.queueRequest() (which can
lead to emitting requestQueueFailed if waiting on a fence fails).
In this version there's a tiny window where the control could be given
to the CameraWorker and returned to the CameraDevice after the
requestQueueFailed slot has been executed. In that case we would have
A, C, B that would lead to the Fatal assertion being triggered.
If we guarantee that (A, B) are executed atomically, or simply do B
before A, then when the slot is called the descriptor is guaranteed to
be in map.
I might be missing some races though. Can you point me to the part
that doesn't work for you ?
Thanks
j
> -Hiro
> > > before descriptors_.clear() always.
> > > Correct?
> > >
> > > Reviewed-by: Hirokazu Honda <hiroh at chromium.org>
> >
> > > > */
> > > > -CaptureRequest::CaptureRequest(libcamera::Camera *camera)
> > > > - : camera_(camera)
> > > > +CaptureRequest::CaptureRequest(libcamera::Camera *camera,
> > > > + const camera3_capture_request_t *camera3Request)
> > > > + : camera_(camera), camera3Request_(camera3Request)
> > > > {
> > > > request_ = camera_->createRequest(reinterpret_cast<uint64_t>(this));
> > > > }
> > > > @@ -77,7 +78,7 @@ void CameraWorker::queueRequest(CaptureRequest *request)
> > > > {
> > > > /* Async process the request on the worker which runs its own thread. */
> > > > worker_.invokeMethod(&Worker::processRequest, ConnectionTypeQueued,
> > > > - request);
> > > > + this, request);
> > > > }
> > > >
> > > > /*
> > > > @@ -109,7 +110,7 @@ int CameraWorker::Worker::waitFence(int fence)
> > > > return -errno;
> > > > }
> > > >
> > > > -void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > +void CameraWorker::Worker::processRequest(CameraWorker *context, CaptureRequest *request)
> > > > {
> > > > /* Wait on all fences before queuing the Request. */
> > > > for (int fence : request->fences()) {
> > > > @@ -121,6 +122,7 @@ void CameraWorker::Worker::processRequest(CaptureRequest *request)
> > > > if (ret < 0) {
> > > > LOG(HAL, Error) << "Failed waiting for fence: "
> > > > << fence << ": " << strerror(-ret);
> > > > + context->requestQueueFailed.emit(request);
> >
> > .... Here
> >
> > Is your question about stop() related to dispatching messages before
> > the thread has actually been stopped ? Isn't that been fixed some time
> > ago by overriding CameraWorker::exec() ?
> >
> > Thanks
> > j
> >
More information about the libcamera-devel
mailing list