[libcamera-devel] [PATCH v3 2/3] android: CameraDevice: Prevent race due to concurrent HAL calls

Hirokazu Honda hiroh at chromium.org
Mon May 10 08:48:41 CEST 2021


Hi Jacopo and Laurent, thank you for reviewing.

On Tue, May 4, 2021 at 1:02 AM Jacopo Mondi <jacopo at jmondi.org> wrote:

> Hi Laurent,
>
> On Mon, May 03, 2021 at 05:55:38PM +0300, Laurent Pinchart wrote:
> > Hello,
> >
> > On Mon, May 03, 2021 at 04:41:04PM +0200, Jacopo Mondi wrote:
> > > On Mon, Apr 26, 2021 at 05:38:29PM +0900, Hirokazu Honda wrote:
> > > > HAL API functions might be called by different threads in Android
> > > > Camera HAL3 API, while process_capture_request() must be called
> > > > by a single thread. Furthermore, close() and flush() can be
> > > > called any time. Therefore, races to variables can occur among
> > > > the HAL API functions.
> > > > This prevents the races by enclosing HAL calls by mutex. It
> > > > might not be the best in terms of the performance, but the
> > > > simplicity is good as a first step and also further development.
> > > >
> > > > Signed-off-by: Hirokazu Honda <hiroh at chromium.org>
> > > > ---
> > > >  src/android/camera_device.cpp | 59
> ++++++++++++++++++++---------------
> > > >  src/android/camera_device.h   |  7 +++--
> > > >  2 files changed, 39 insertions(+), 27 deletions(-)
> > > >
> > > > diff --git a/src/android/camera_device.cpp
> b/src/android/camera_device.cpp
> > > > index b3def04b..96c8d4a9 100644
> > > > --- a/src/android/camera_device.cpp
> > > > +++ b/src/android/camera_device.cpp
> > > > @@ -743,7 +743,7 @@ int CameraDevice::open(const hw_module_t
> *hardwareModule)
> > > >
> > > >  void CameraDevice::close()
> > > >  {
> > > > - streams_.clear();
> > > > + std::scoped_lock<std::mutex> halLock(halMutex_);
> > >
> > > Have you considered locking the wrappers functions in camera_ops.cpp ?
> > > Wouldn't it be simpler ? However locking with such a coarse-grained
> > > granularity I feel contradicts a bit the Camera3 API flush()
> > > documentation, which implies that flush() and
> > > process_capture_request() can race
> > >
> > >      * Due to
> > >      * concurrency issues, it is possible that from the HAL's point of
> view, a
> > >      * process_capture_request() call may be started after flush has
> been invoked but has not
> > >      * returned yet. If such a call happens before flush() returns,
> the HAL should treat the new
> > >      * capture request like other in-flight pending requests (see #4
> below).
> > >
> > > Locking at function level rules out this use case: if we lock each
> > > function it's not possible for process_capture_request() to be started
> > > during flush() execution as the lock hold by flush will be released
> > > only at exit time.
> > >
> > > If serializing operations with this granularity is fine (and I really
> > > with it is, as it would really simplify life for use)
> >
> > I think it is as an initial step, but likely not in the longer term. It
> > defeats the point of having the camera service issue
> > process_capture_request() and stop()/flush() calls concurrently. If
>
> Yes, that was my concern, although the documentation says it's enough
> to implement the full success/full fail behavior which I think it's
> what's happening here.
>
> serializing operations was good enough in every case, it would be done
> > in the camera service itself, and HAL implementations wouldn't need to
> > deal with this.
>
> Also true, indeed the framework allows to implement a finer granularity
> in locking the operations
>
>
Since ChromeOS camera services sealizes the calls, ChromeOS doesn't
require this serialization here.
But shall we assume that a HAL client serializes the call? Since this is a
HAL implementation, I think we should align the HAL specification; no
serialization is ensured.


> >
> > > the only racing
> > > condition we have to protect from is between the request completion
> > > signal emitted by libcamera, and the camera framework calling into the
> > > Camera HAL using the camera3 API as flush(), close() and
> > > process_capture_request() will always be serialized.
> > >
> > > Assuming the above is ok with the Camera3 API, isn't it enough to
> > >
> > > - flush() to call stop() and stop the Camera:
> > >         - All queued requests at the time Camera::stop() is called
> > >           will be returned one by one with an error state. There's no
> > >           reason to cancel the descriptors at stop() time, those
> > >           requests will be completed
> > >
>

That's correct. Well, I would do so just in case to ensure all requests are
returned before flush().
But it should be unnecessary.


> > >         - A process_capture_request() called after flush() has stopped
> > >           the camera will return immeditely as the camera is stopped
> > >           and the capture request is returned with an error
> > >           (we might want a flag to keep track of this, otherwise we'll
> > >           hit a Camera state machine error, much like we do with
> > >           running_)
> > >
>

Hmm, in my understanding, the process_capture_request() should re-start the
camera as it is a usual request.
Why do you think so?


> > >         - A close() called after flush() will be a no op
> >
> > So far it seems fine to me.
>

The critical difference between close() and flush() is whether a camera is
held or not, isn't it?
That is, Camera::release() is called or not.
I agree almost all is done by the preceding flush(), so close() called
after flush() will just call Camera::release().
How do you think?


> >
> > >         - Only open() will allow the CameraDevice to move back from
> > >           stopped_ state
> >
> > That's correct for close(). For flush(), we can to make sure that a
> > configureStreams() call will reopen the camera.
>
> Right, it's probably enough to just reset the state of the new flag in
> configureStreams() and return immediately from
> processCaptureRequest().
>
> My main point should have been that processCaptureRequest() cannot be
> called after a flush() without going through configureStreams() ?
>
> Maybe not true however, considering:
>
>      * flush() should only return when there are no more outstanding
> buffers or
>      * requests left in the HAL. The framework may call configure_streams
> (as
>      * the HAL state is now quiesced) or may issue new requests.
>
> Which seems to suggest after a flush, processCaptureRequest() can be
> called again.
>
>
Yeah, from this, I think we should re-open camera in the first
process_capture_request() after flush() has been called.


> >
> > > - There's a tiny tiny window where a request can complete after flush
> > >   has been called but the camera has not been stopped yet. In
> >
> > It's not that tiny, it can certainly be hit in practice.
> >
> > >   requestComplete() we'll have to inspect the stopped_ flag state and
> > >   force the request to return with error even if it did not.
> >
> > Why can't we complete the request without any error in that case ?
>
> Because flush() has been called already (but the camera has not
> stopped yet?)
>
> Although the documentation reports
>
>      * 1. For captures that are too late for the HAL to cancel/stop, and
> will be
>      *    completed normally by the HAL; i.e. the HAL can send
> shutter/notify and
>      *    process_capture_result and buffers as normal.
>
> It's then probably fine to let those requests just complete
> successfully (even better, it's easier)
>
>
Camera::stop() is a blocking function and calls requestComplete() with all
the pending requests?
In that case, a return request is marked with CANCELLED or if the buffer is
already complete, marked with COMPLETE.
Well, ideally, so that we should not execute a post-processing during flush
and a thread should be separated for the post processing, we should
introduce a way of detecting that flush is being called in
requestComplete().


> >
> > > - Once we move JPEG encoding to a worker thread, stop() will have to
> > >   forcefully stop it. Requests interrupted during encoding will be
> > >   returned with error.
> > >
> > > A simple lock around stopped_ should protect it from concurrent
> > > accesses between stop() and requestComplete().
> >
> > The issue is that the flush() function must wait for all requests to
> > complete. This could be implemented with a condition variable, allowing
> > to avoid serializing CameraDevice::requestComplete() with the API calls
> > from the camera service. I think that would be better indeed.
>
> Ah yes, I overlooked this part
>
>      * flush() should only return when there are no more outstanding
> buffers or
>      * requests left in the HAL.
>
> Actually with a condition variable it might also get easier, as
>
>         flush() {
>                 {
>                         /* mutex lock *
>                         flushing_ = true;
>                 }
>
>                 stop()
>
>                 flushed.wait()
>                 {
>                         /* mutex lock *
>                         flushing_ = false;
>                 }
>         }
>
>
The difficult point is with this other HAL calls (e.g.
process_capture_request() and stop()) can be called during stop().
We can protect it with flushing_. But I think it should be simplified by
guarding two mutexes as this patch does.


> Would allow to held the flushing_ state until all the pending requests
> have been returned as a consequence of stop(). Once that's done we can
> reset the flushing_ flag, and the next processCaptureRequest() can
> continue as normal ?
>
>
Yes, I did so in this patch series though without flushing_, but with a new
mutex.


> Do we have a list of CTS or cros_camera_test that exercizes flush() to
> be used as validation ?
>
>
I think CTS doesn't detect the issue.


> I've tried with this series applied:
>
> # cros_camera_test
> --gtest_filter="Camera3FrameTest/Camera3FlushRequestsTest*"
> [  PASSED  ] 4 tests.
> [  FAILED  ] 8 tests, listed below
>
> Which is the same result I get on the most recent master..
>
>
You need 1954 and 1967 to pass the test.

Thanks,
-Hiro


> Thanks
>    j
>
> > > Have you considered this option and ruled it out for reasons I'm
> > > failing to consider ? It feels simpler than decoupling requests and
> > > descriptors, maybe you tried and found out it's not :)
> > >
> > > >   stop();
> > > >
> > > > @@ -758,12 +758,17 @@ void CameraDevice::stop()
> > > >   worker_.stop();
> > > >   camera_->stop();
> > > >
> > > > - descriptors_.clear();
> > > > + {
> > > > +         std::scoped_lock<std::mutex> lock(mutex_);
> > > > +         descriptors_.clear();
> > > > + }
> > > > +
> > > >   running_ = false;
> > > >  }
> > > >
> > > >  void CameraDevice::setCallbacks(const camera3_callback_ops_t
> *callbacks)
> > > >  {
> > > > + std::scoped_lock<std::mutex> halLock(halMutex_);
> > > >   callbacks_ = callbacks;
> > > >  }
> > > >
> > > > @@ -1643,8 +1648,7 @@ PixelFormat CameraDevice::toPixelFormat(int
> format) const
> > > >   */
> > > >  int CameraDevice::configureStreams(camera3_stream_configuration_t
> *stream_list)
> > > >  {
> > > > - /* Before any configuration attempt, stop the camera. */
> > > > - stop();
> > > > + std::scoped_lock<std::mutex> halLock(halMutex_);
> > > >
> > > >   if (stream_list->num_streams == 0) {
> > > >           LOG(HAL, Error) << "No streams in configuration";
> > > > @@ -1656,6 +1660,9 @@ int
> CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > >           return -EINVAL;
> > > >  #endif
> > > >
> > > > + /* Before any configuration attempt, stop the camera. */
> > > > + stop();
> > > > +
> > > >   /*
> > > >    * Generate an empty configuration, and construct a
> StreamConfiguration
> > > >    * for each camera3_stream to add to it.
> > > > @@ -1671,6 +1678,8 @@ int
> CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > >    * ensure the required entries are available without further
> > > >    * reallocation.
> > > >    */
> > > > +
> > > > + std::scoped_lock<std::mutex> lock(mutex_);
> > > >   streams_.clear();
> > > >   streams_.reserve(stream_list->num_streams);
> > > >
> > > > @@ -1907,6 +1916,8 @@ int
> CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > >   if (!isValidRequest(camera3Request))
> > > >           return -EINVAL;
> > > >
> > > > + std::scoped_lock<std::mutex> halLock(halMutex_);
> > > > +
> > > >   /* Start the camera if that's the first request we handle. */
> > > >   if (!running_) {
> > > >           worker_.start();
> > > > @@ -2022,29 +2033,26 @@ int
> CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > >
> > > >  void CameraDevice::requestComplete(Request *request)
> > > >  {
> > > > + std::scoped_lock<std::mutex> lock(mutex_);
> > > > +
> > > >   const Request::BufferMap &buffers = request->buffers();
> > > >   camera3_buffer_status status = CAMERA3_BUFFER_STATUS_OK;
> > > >   std::unique_ptr<CameraMetadata> resultMetadata;
> > > >
> > > > - decltype(descriptors_)::node_type node;
> > > > - std::unique_ptr<CaptureRequest> captureRequest;
> > > > - {
> > > > -         std::scoped_lock<std::mutex> lock(mutex_);
> > > > -         auto requestIt = requests_.find(request->cookie());
> > > > -         if (requestIt == requests_.end()) {
> > > > -                 LOG(HAL, Fatal)
> > > > -                         << "Unknown request: " <<
> request->cookie();
> > > > -                 return;
> > > > -         }
> > > > -         captureRequest = std::move(requestIt->second);
> > > > -         requests_.erase(requestIt);
> > > > + auto requestIt = requests_.find(request->cookie());
> > > > + if (requestIt == requests_.end()) {
> > > > +         LOG(HAL, Fatal)
> > > > +                 << "Unknown request: " << request->cookie();
> > > > +         return;
> > > > + }
> > > > + auto captureRequest = std::move(requestIt->second);
> > > > + requests_.erase(requestIt);
> > > >
> > > > -         auto it = descriptors_.find(request->cookie());
> > > > -         if (it == descriptors_.end())
> > > > -                 return;
> > > > + auto it = descriptors_.find(request->cookie());
> > > > + if (it == descriptors_.end())
> > > > +         return;
> > > >
> > > > -         node = descriptors_.extract(it);
> > > > - }
> > > > + auto node = descriptors_.extract(it);
> > > >   Camera3RequestDescriptor &descriptor = node.mapped();
> > > >
> > > >   if (request->status() != Request::RequestComplete) {
> > > > @@ -2149,11 +2157,12 @@ void CameraDevice::notifyError(uint32_t
> frameNumber, camera3_stream_t *stream)
> > > >   camera3_notify_msg_t notify = {};
> > > >
> > > >   /*
> > > > -  * \todo Report and identify the stream number or configuration to
> > > > -  * clarify the stream that failed.
> > > > +  * \todo Report and identify the stream number or configuration
> > > > +  * to clarify the stream that failed.
> > > >    */
> > > > - LOG(HAL, Error) << "Error occurred on frame " << frameNumber << "
> ("
> > > > -                 << toPixelFormat(stream->format).toString() << ")";
> > > > + LOG(HAL, Error)
> > > > +         << "Error occurred on frame " << descriptor.frameNumber_
> << " ("
> > > > +         <<
> toPixelFormat(descriptor.buffers_[0].stream->format).toString() << ")";
> > > >
> > > >   notify.type = CAMERA3_MSG_ERROR;
> > > >   notify.message.error.error_stream = stream;
> > > > diff --git a/src/android/camera_device.h
> b/src/android/camera_device.h
> > > > index 95d77890..325fb967 100644
> > > > --- a/src/android/camera_device.h
> > > > +++ b/src/android/camera_device.h
> > > > @@ -125,9 +125,12 @@ private:
> > > >
> > > >   std::vector<Camera3StreamConfiguration> streamConfigurations_;
> > > >   std::map<int, libcamera::PixelFormat> formatsMap_;
> > > > - std::vector<CameraStream> streams_;
> > > >
> > > > - std::mutex mutex_; /* Protect descriptors_ and requests_. */
> > > > + /* Avoid races by concurrent Camera HAL API calls. */
> > > > + std::mutex halMutex_;
> > > > +
> > > > + std::mutex mutex_; /* Protect streams_, descriptors_ and
> requests_. */
> > > > + std::vector<CameraStream> streams_;
> > > >   std::map<uint64_t, Camera3RequestDescriptor> descriptors_;
> > > >   std::map<uint64_t, std::unique_ptr<CaptureRequest>> requests_;
> > > >
> >
> > --
> > Regards,
> >
> > Laurent Pinchart
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libcamera.org/pipermail/libcamera-devel/attachments/20210510/f1aef5fa/attachment-0001.htm>


More information about the libcamera-devel mailing list