[libcamera-devel] [PATCH v3 2/3] android: CameraDevice: Prevent race due to concurrent HAL calls

Jacopo Mondi jacopo at jmondi.org
Mon May 3 18:02:51 CEST 2021


Hi Laurent,

On Mon, May 03, 2021 at 05:55:38PM +0300, Laurent Pinchart wrote:
> Hello,
>
> On Mon, May 03, 2021 at 04:41:04PM +0200, Jacopo Mondi wrote:
> > On Mon, Apr 26, 2021 at 05:38:29PM +0900, Hirokazu Honda wrote:
> > > HAL API functions might be called by different threads in Android
> > > Camera HAL3 API, while process_capture_request() must be called
> > > by a single thread. Furthermore, close() and flush() can be
> > > called any time. Therefore, races to variables can occur among
> > > the HAL API functions.
> > > This prevents the races by enclosing HAL calls by mutex. It
> > > might not be the best in terms of the performance, but the
> > > simplicity is good as a first step and also further development.
> > >
> > > Signed-off-by: Hirokazu Honda <hiroh at chromium.org>
> > > ---
> > >  src/android/camera_device.cpp | 59 ++++++++++++++++++++---------------
> > >  src/android/camera_device.h   |  7 +++--
> > >  2 files changed, 39 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > index b3def04b..96c8d4a9 100644
> > > --- a/src/android/camera_device.cpp
> > > +++ b/src/android/camera_device.cpp
> > > @@ -743,7 +743,7 @@ int CameraDevice::open(const hw_module_t *hardwareModule)
> > >
> > >  void CameraDevice::close()
> > >  {
> > > -	streams_.clear();
> > > +	std::scoped_lock<std::mutex> halLock(halMutex_);
> >
> > Have you considered locking the wrappers functions in camera_ops.cpp ?
> > Wouldn't it be simpler ? However locking with such a coarse-grained
> > granularity I feel contradicts a bit the Camera3 API flush()
> > documentation, which implies that flush() and
> > process_capture_request() can race
> >
> >      * Due to
> >      * concurrency issues, it is possible that from the HAL's point of view, a
> >      * process_capture_request() call may be started after flush has been invoked but has not
> >      * returned yet. If such a call happens before flush() returns, the HAL should treat the new
> >      * capture request like other in-flight pending requests (see #4 below).
> >
> > Locking at function level rules out this use case: if we lock each
> > function it's not possible for process_capture_request() to be started
> > during flush() execution as the lock hold by flush will be released
> > only at exit time.
> >
> > If serializing operations with this granularity is fine (and I really
> > with it is, as it would really simplify life for use)
>
> I think it is as an initial step, but likely not in the longer term. It
> defeats the point of having the camera service issue
> process_capture_request() and stop()/flush() calls concurrently. If

Yes, that was my concern, although the documentation says it's enough
to implement the full success/full fail behavior which I think it's
what's happening here.

> serializing operations was good enough in every case, it would be done
> in the camera service itself, and HAL implementations wouldn't need to
> deal with this.

Also true, indeed the framework allows to implement a finer granularity
in locking the operations

>
> > the only racing
> > condition we have to protect from is between the request completion
> > signal emitted by libcamera, and the camera framework calling into the
> > Camera HAL using the camera3 API as flush(), close() and
> > process_capture_request() will always be serialized.
> >
> > Assuming the above is ok with the Camera3 API, isn't it enough to
> >
> > - flush() to call stop() and stop the Camera:
> >         - All queued requests at the time Camera::stop() is called
> >           will be returned one by one with an error state. There's no
> >           reason to cancel the descriptors at stop() time, those
> >           requests will be completed
> >
> >         - A process_capture_request() called after flush() has stopped
> >           the camera will return immeditely as the camera is stopped
> >           and the capture request is returned with an error
> >           (we might want a flag to keep track of this, otherwise we'll
> >           hit a Camera state machine error, much like we do with
> >           running_)
> >
> >         - A close() called after flush() will be a no op
>
> So far it seems fine to me.
>
> >         - Only open() will allow the CameraDevice to move back from
> >           stopped_ state
>
> That's correct for close(). For flush(), we can to make sure that a
> configureStreams() call will reopen the camera.

Right, it's probably enough to just reset the state of the new flag in
configureStreams() and return immediately from
processCaptureRequest().

My main point should have been that processCaptureRequest() cannot be
called after a flush() without going through configureStreams() ?

Maybe not true however, considering:

     * flush() should only return when there are no more outstanding buffers or
     * requests left in the HAL. The framework may call configure_streams (as
     * the HAL state is now quiesced) or may issue new requests.

Which seems to suggest after a flush, processCaptureRequest() can be
called again.

>
> > - There's a tiny tiny window where a request can complete after flush
> >   has been called but the camera has not been stopped yet. In
>
> It's not that tiny, it can certainly be hit in practice.
>
> >   requestComplete() we'll have to inspect the stopped_ flag state and
> >   force the request to return with error even if it did not.
>
> Why can't we complete the request without any error in that case ?

Because flush() has been called already (but the camera has not
stopped yet?)

Although the documentation reports

     * 1. For captures that are too late for the HAL to cancel/stop, and will be
     *    completed normally by the HAL; i.e. the HAL can send shutter/notify and
     *    process_capture_result and buffers as normal.

It's then probably fine to let those requests just complete
successfully (even better, it's easier)

>
> > - Once we move JPEG encoding to a worker thread, stop() will have to
> >   forcefully stop it. Requests interrupted during encoding will be
> >   returned with error.
> >
> > A simple lock around stopped_ should protect it from concurrent
> > accesses between stop() and requestComplete().
>
> The issue is that the flush() function must wait for all requests to
> complete. This could be implemented with a condition variable, allowing
> to avoid serializing CameraDevice::requestComplete() with the API calls
> from the camera service. I think that would be better indeed.

Ah yes, I overlooked this part

     * flush() should only return when there are no more outstanding buffers or
     * requests left in the HAL.

Actually with a condition variable it might also get easier, as

        flush() {
                {
                        /* mutex lock *
                        flushing_ = true;
                }

                stop()

                flushed.wait()
                {
                        /* mutex lock *
                        flushing_ = false;
                }
        }

Would allow to held the flushing_ state until all the pending requests
have been returned as a consequence of stop(). Once that's done we can
reset the flushing_ flag, and the next processCaptureRequest() can
continue as normal ?

Do we have a list of CTS or cros_camera_test that exercizes flush() to
be used as validation ?

I've tried with this series applied:

# cros_camera_test --gtest_filter="Camera3FrameTest/Camera3FlushRequestsTest*"
[  PASSED  ] 4 tests.
[  FAILED  ] 8 tests, listed below

Which is the same result I get on the most recent master..

Thanks
   j

> > Have you considered this option and ruled it out for reasons I'm
> > failing to consider ? It feels simpler than decoupling requests and
> > descriptors, maybe you tried and found out it's not :)
> >
> > >  	stop();
> > >
> > > @@ -758,12 +758,17 @@ void CameraDevice::stop()
> > >  	worker_.stop();
> > >  	camera_->stop();
> > >
> > > -	descriptors_.clear();
> > > +	{
> > > +		std::scoped_lock<std::mutex> lock(mutex_);
> > > +		descriptors_.clear();
> > > +	}
> > > +
> > >  	running_ = false;
> > >  }
> > >
> > >  void CameraDevice::setCallbacks(const camera3_callback_ops_t *callbacks)
> > >  {
> > > +	std::scoped_lock<std::mutex> halLock(halMutex_);
> > >  	callbacks_ = callbacks;
> > >  }
> > >
> > > @@ -1643,8 +1648,7 @@ PixelFormat CameraDevice::toPixelFormat(int format) const
> > >   */
> > >  int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > >  {
> > > -	/* Before any configuration attempt, stop the camera. */
> > > -	stop();
> > > +	std::scoped_lock<std::mutex> halLock(halMutex_);
> > >
> > >  	if (stream_list->num_streams == 0) {
> > >  		LOG(HAL, Error) << "No streams in configuration";
> > > @@ -1656,6 +1660,9 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > >  		return -EINVAL;
> > >  #endif
> > >
> > > +	/* Before any configuration attempt, stop the camera. */
> > > +	stop();
> > > +
> > >  	/*
> > >  	 * Generate an empty configuration, and construct a StreamConfiguration
> > >  	 * for each camera3_stream to add to it.
> > > @@ -1671,6 +1678,8 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > >  	 * ensure the required entries are available without further
> > >  	 * reallocation.
> > >  	 */
> > > +
> > > +	std::scoped_lock<std::mutex> lock(mutex_);
> > >  	streams_.clear();
> > >  	streams_.reserve(stream_list->num_streams);
> > >
> > > @@ -1907,6 +1916,8 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > >  	if (!isValidRequest(camera3Request))
> > >  		return -EINVAL;
> > >
> > > +	std::scoped_lock<std::mutex> halLock(halMutex_);
> > > +
> > >  	/* Start the camera if that's the first request we handle. */
> > >  	if (!running_) {
> > >  		worker_.start();
> > > @@ -2022,29 +2033,26 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > >
> > >  void CameraDevice::requestComplete(Request *request)
> > >  {
> > > +	std::scoped_lock<std::mutex> lock(mutex_);
> > > +
> > >  	const Request::BufferMap &buffers = request->buffers();
> > >  	camera3_buffer_status status = CAMERA3_BUFFER_STATUS_OK;
> > >  	std::unique_ptr<CameraMetadata> resultMetadata;
> > >
> > > -	decltype(descriptors_)::node_type node;
> > > -	std::unique_ptr<CaptureRequest> captureRequest;
> > > -	{
> > > -		std::scoped_lock<std::mutex> lock(mutex_);
> > > -		auto requestIt = requests_.find(request->cookie());
> > > -		if (requestIt == requests_.end()) {
> > > -			LOG(HAL, Fatal)
> > > -				<< "Unknown request: " << request->cookie();
> > > -			return;
> > > -		}
> > > -		captureRequest = std::move(requestIt->second);
> > > -		requests_.erase(requestIt);
> > > +	auto requestIt = requests_.find(request->cookie());
> > > +	if (requestIt == requests_.end()) {
> > > +		LOG(HAL, Fatal)
> > > +			<< "Unknown request: " << request->cookie();
> > > +		return;
> > > +	}
> > > +	auto captureRequest = std::move(requestIt->second);
> > > +	requests_.erase(requestIt);
> > >
> > > -		auto it = descriptors_.find(request->cookie());
> > > -		if (it == descriptors_.end())
> > > -			return;
> > > +	auto it = descriptors_.find(request->cookie());
> > > +	if (it == descriptors_.end())
> > > +		return;
> > >
> > > -		node = descriptors_.extract(it);
> > > -	}
> > > +	auto node = descriptors_.extract(it);
> > >  	Camera3RequestDescriptor &descriptor = node.mapped();
> > >
> > >  	if (request->status() != Request::RequestComplete) {
> > > @@ -2149,11 +2157,12 @@ void CameraDevice::notifyError(uint32_t frameNumber, camera3_stream_t *stream)
> > >  	camera3_notify_msg_t notify = {};
> > >
> > >  	/*
> > > -	 * \todo Report and identify the stream number or configuration to
> > > -	 * clarify the stream that failed.
> > > +	 * \todo Report and identify the stream number or configuration
> > > +	 * to clarify the stream that failed.
> > >  	 */
> > > -	LOG(HAL, Error) << "Error occurred on frame " << frameNumber << " ("
> > > -			<< toPixelFormat(stream->format).toString() << ")";
> > > +	LOG(HAL, Error)
> > > +		<< "Error occurred on frame " << descriptor.frameNumber_ << " ("
> > > +		<< toPixelFormat(descriptor.buffers_[0].stream->format).toString() << ")";
> > >
> > >  	notify.type = CAMERA3_MSG_ERROR;
> > >  	notify.message.error.error_stream = stream;
> > > diff --git a/src/android/camera_device.h b/src/android/camera_device.h
> > > index 95d77890..325fb967 100644
> > > --- a/src/android/camera_device.h
> > > +++ b/src/android/camera_device.h
> > > @@ -125,9 +125,12 @@ private:
> > >
> > >  	std::vector<Camera3StreamConfiguration> streamConfigurations_;
> > >  	std::map<int, libcamera::PixelFormat> formatsMap_;
> > > -	std::vector<CameraStream> streams_;
> > >
> > > -	std::mutex mutex_; /* Protect descriptors_ and requests_. */
> > > +	/* Avoid races by concurrent Camera HAL API calls. */
> > > +	std::mutex halMutex_;
> > > +
> > > +	std::mutex mutex_; /* Protect streams_, descriptors_ and requests_. */
> > > +	std::vector<CameraStream> streams_;
> > >  	std::map<uint64_t, Camera3RequestDescriptor> descriptors_;
> > >  	std::map<uint64_t, std::unique_ptr<CaptureRequest>> requests_;
> > >
>
> --
> Regards,
>
> Laurent Pinchart


More information about the libcamera-devel mailing list