[libcamera-devel] [PATCH v3 8/8] android: Implement flush() camera operation

Laurent Pinchart laurent.pinchart at ideasonboard.com
Thu May 27 04:26:51 CEST 2021


Hi Jacopo,

(expanding the CC list to finalize the race conditions discussion)

On Mon, May 24, 2021 at 09:47:55AM +0200, Jacopo Mondi wrote:
> On Sun, May 23, 2021 at 09:50:46PM +0300, Laurent Pinchart wrote:
> > On Sun, May 23, 2021 at 04:22:51PM +0200, Jacopo Mondi wrote:
> > > On Sat, May 22, 2021 at 11:55:36AM +0200, Niklas Söderlund wrote:
> > > > On 2021-05-21 17:42:27 +0200, Jacopo Mondi wrote:
> > > > > Implement the flush() camera operation in the CameraDevice class
> > > > > and make it available to the camera framework by implementing the
> > > > > operation wrapper in camera_ops.cpp.
> > > > >
> > > > > The flush() implementation stops the Camera and the worker thread and
> > > > > waits for all in-flight requests to be returned. Stopping the Camera
> > > > > forces all Requests already queued to be returned immediately in error
> > > > > state. As flush() has to wait until all of them have been returned, make it
> > > > > wait on a newly introduced condition variable which is notified by the
> > > > > request completion handler when the queue of pending requests has been
> > > > > exhausted.
> > > > >
> > > > > As flush() can race with processCaptureRequest() protect the requests
> > > > > queueing by introducing a new CameraState::CameraFlushing state that
> > > > > processCaptureRequest() inspects before queuing the Request to the
> > > > > Camera. If flush() has been called while processCaptureRequest() was
> > > > > executing, return the current Request immediately in error state.
> > > > >
> > > > > Protect potentially concurrent calls to close() and configureStreams()
> >
> > Can this happen ? Quoting camera3.h,
> >
> >  * 12. Alternatively, the framework may call camera3_device_t->common->close()
> >  *    to end the camera session. This may be called at any time when no other
> >  *    calls from the framework are active, although the call may block until all
> >  *    in-flight captures have completed (all results returned, all buffers
> >  *    filled). After the close call returns, no more calls to the
> >  *    camera3_callback_ops_t functions are allowed from the HAL. Once the
> >  *    close() call is underway, the framework may not call any other HAL device
> >  *    functions.
> >
> > The important part is "when no other calss from the framework are
> > active". I don't think we need to handle close() racing with anything
> > else than process_capture_request().
> 
> I've been discussing this with Hiro during v1, as initially I didn't
> consider close() and configureStreams().
> 
> https://patchwork.libcamera.org/patch/12248/#16884
> 
> I initially only considered processCaptureRequest() as a potential
> race, but got suggested differently by the cros camera team.

Let's try to get to the bottom of this.

Section S2 ("Startup and general expected operation sequence") states:

 * 12. Alternatively, the framework may call camera3_device_t->common->close()
 *    to end the camera session. This may be called at any time when no other
 *    calls from the framework are active, although the call may block until all
 *    in-flight captures have completed (all results returned, all buffers
 *    filled). After the close call returns, no more calls to the
 *    camera3_callback_ops_t functions are allowed from the HAL. Once the
 *    close() call is underway, the framework may not call any other HAL device
 *    functions.

There can be in-flight requests when .close() is called, but it can't be
called concurrently with any other call. There's thus no race condition
to protect against.

The .configure_streams() documentation states:

     * Preconditions:
     *
     * The framework will only call this method when no captures are being
     * processed. That is, all results have been returned to the framework, and
     * all in-flight input and output buffers have been returned and their
     * release sync fences have been signaled by the HAL. The framework will not
     * submit new requests for capture while the configure_streams() call is
     * underway.

This clearly forbids calling .configure_streams() and
.process_capture_request() concurrently.

The .flush() documentation states:

     * Flush all currently in-process captures and all buffers in the pipeline
     * on the given device. The framework will use this to dump all state as
     * quickly as possible in order to prepare for a configure_streams() call.

I interpret this as at least a very strong hint that .flush() and
.configure_streams() can't be called concurrently :-)

If anyone disagrees, I'd like compelling evidence that those races can
occur.

> > > > > by inspecting the CameraState, and force a wait for any flush() call
> > > > > to complete before proceeding.
> > > > >
> > > > > Signed-off-by: Jacopo Mondi <jacopo at jmondi.org>
> > > > > ---
> > > > >  src/android/camera_device.cpp | 90 +++++++++++++++++++++++++++++++++--
> > > > >  src/android/camera_device.h   |  9 +++-
> > > > >  src/android/camera_ops.cpp    |  8 +++-
> > > > >  3 files changed, 100 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > > > index 3fce14035718..899afaa49439 100644
> > > > > --- a/src/android/camera_device.cpp
> > > > > +++ b/src/android/camera_device.cpp
> > > > > @@ -750,16 +750,65 @@ int CameraDevice::open(const hw_module_t *hardwareModule)
> > > > >
> > > > >  void CameraDevice::close()
> > > > >  {
> > > > > -	streams_.clear();
> > > > > +	MutexLocker cameraLock(cameraMutex_);
> >
> > I'd add a blank line here.
> >
> > > > > +	if (state_ == CameraFlushing) {
> >
> > As mentioned above, I don't think you need to protect against close()
> > and flush() racing each other.
> >
> > > > > +		flushed_.wait(cameraLock, [&] { return state_ != CameraStopped; });
> > > > > +		camera_->release();
> > > > >
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +	streams_.clear();
> > > > >  	stop();
> > > > >
> > > > >  	camera_->release();
> > > > >  }
> > > > >
> > > > > -void CameraDevice::stop()
> > > > > +/*
> > > > > + * Flush is similar to stop() but sets the camera state to 'flushing' and wait
> >
> > s/wait/waits/
> >
> > > > > + * until all the in-flight requests have been returned before setting the
> > > > > + * camera state to stopped.
> > > > > + *
> > > > > + * Once flushing is done it unlocks concurrent calls to camera close() and
> > > > > + * configureStreams().
> > > > > + */
> > > > > +void CameraDevice::flush()
> > > > >  {
> > > > > +	{
> > > > > +		MutexLocker cameraLock(cameraMutex_);
> > > > > +
> > > > > +		if (state_ != CameraRunning)
> > > > > +			return;
> > > > > +
> > > > > +		worker_.stop();
> > > > > +		camera_->stop();
> > > > > +		state_ = CameraFlushing;
> > > > > +	}
> > > > > +
> > > > > +	/*
> > > > > +	 * Now wait for all the in-flight requests to be completed before
> > > > > +	 * continuing. Stopping the Camera guarantees that all in-flight
> > > > > +	 * requests are completed in error state.
> >
> > Do we need to wait ? Camera::stop() guarantees that all requests
> > complete synchronously with the stop() call.
> 
> I didn't get the API that way... I thought after stop we would receive
> a sequence of failed requests... Actually I don't see anything that
> suggests that in camera.cpp or pipeline_handler.cpp apart from an assertion
> in Camera::stop()

The camera::stop() documentation states

 * This method stops capturing and processing requests immediately. All pending
 * requests are cancelled and complete synchronously in an error state.

Is this ambiguous ?

> > Partly answering myself here, we'll have to wait for post-processing
> > tasks to complete once we'll process them in a separate thread, but that
> > will likely be handled by Thread::wait(). I don't think you need a
> > condition variable here. I'm I'm not mistaken, this should simplify the
> > implementation.
> 
> If Camera::stop() is synchronous we don't need to wait indeed
> 
> > > > > +	 */
> > > > > +	{
> > > > > +		MutexLocker requestsLock(requestsMutex_);
> > > > > +		flushing_.wait(requestsLock, [&] { return descriptors_.empty(); });
> > > > > +	}
> > > >
> > > > I'm still uneasy about releasing the cameraMutex_ for this section. In
> > > > patch 6/8 you add it to protect the state_ variable but here it's
> > >
> > > I'm not changing state_ without the mutex acquired, am I ?
> > >
> > > > ignored. I see the ASSERT() added to stop() but the patter of taking the
> > > > lock checking state_, releasing the lock and do some work, retake the
> > > > lock and update state_ feels like a bad idea. Maybe I'm missing
> > >
> > > How so, apart from the fact it feels a bit unusual, I concur ?
> > >
> > > If I keep the held the mutex for the whole duration of flush no other
> > > concurrent method can proceed until all the queued requests have not
> > > been completed. While flush waits for the flushing_ condition to be
> > > signaled, processCaptureRequest() can proceed and immediately return
> > > the newly queued requests in error state by detecting state_ ==
> > > CameraFlushing which signals that flush in is progress.
> > > Otherwise it would have had to wait for flush to end. But then we're back
> > > to a situation where we could serialize all calls and that's it, we
> > > would be done with a single mutex to be held for the whole duration of
> > > all operations.
> > >
> > > If it only was for close() or configureStreams() we could have locked
> > > for the whole duration of flush(), as they anyway wait for flush to
> > > complete before proceeding (by waiting on the flushed_ condition here
> > > below signaled).
> > >
> > > > something and this is not a real problem, if so maybe we can capture
> > > > that in the comment here?
> > > >
> > > > > +
> > > > > +	/*
> > > > > +	 * Set state to stopped and unlock close() or configureStreams() that
> > > > > +	 * might be waiting for flush to be completed.
> > > > > +	 */
> > > > >  	MutexLocker cameraLock(cameraMutex_);
> > > > > +	state_ = CameraStopped;
> > > > > +	flushed_.notify_one();
> >
> > You should drop the lock before calling notify_one(). Otherwise you'll
> > wake up the task waiting on flushed_, which will try to lock
> > cameraMutex_, which will block immediately. The scheduler will have to
> > reschedule this task for the function to return and the lock to be
> > released before the waiter can proceed. That works, but isn't very
> > efficient.
> 
> Weird, the cpp reference shows example about notify_one where the
> caller always has the mutex held locked, but I see your point and
> seems correct..

I'm looking at
https://en.cppreference.com/w/cpp/thread/condition_variable and
https://en.cppreference.com/w/cpp/thread/condition_variable/notify_one
and both calls to notify_one() in the example are made without the lock
held, aren't they ?

> >
> > 	{
> > 		MutexLocker cameraLock(cameraMutex_);
> > 		state_ = CameraStopped;
> > 	}
> >
> > 	flushed_.notify_one();
> >
> 
> So I could change to this one, if I don't have to drop this part
> completely if we consider close() and configureStreams() not as
> possible races...
> 
> > > > > +}
> > > > > +
> > > > > +/* Calls to stop() must be protected by cameraMutex_ being held by the caller. */
> > > > > +void CameraDevice::stop()
> > > > > +{
> > > > > +	ASSERT(state_ != CameraFlushing);
> > > > > +
> > > > >  	if (state_ == CameraStopped)
> > > > >  		return;
> > > > >
> > > > > @@ -1581,8 +1630,18 @@ PixelFormat CameraDevice::toPixelFormat(int format) const
> > > > >   */
> > > > >  int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > > >  {
> > > > > -	/* Before any configuration attempt, stop the camera. */
> > > > > -	stop();
> > > > > +	{
> > > > > +		/*
> > > > > +		 * If a flush is in progress, wait for it to complete and to
> > > > > +		 * stop the camera, otherwise before any new configuration
> > > > > +		 * attempt we have to stop the camera explictely.
> > > > > +		 */
> >
> > Same here, I don't think flush() and configure_streams() can race each
> > other. I believe the only possible race to be between flush() and
> > process_capture_request().
> 
> Ditto.
> 
> > > > > +		MutexLocker cameraLock(cameraMutex_);
> > > > > +		if (state_ == CameraFlushing)
> > > > > +			flushed_.wait(cameraLock, [&] { return state_ != CameraStopped; });
> > > > > +		else
> > > > > +			stop();
> > > > > +	}
> > > > >
> > > > >  	if (stream_list->num_streams == 0) {
> > > > >  		LOG(HAL, Error) << "No streams in configuration";
> > > > > @@ -1950,6 +2009,25 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > > >  	if (ret)
> > > > >  		return ret;
> > > > >
> > > > > +	/*
> > > > > +	 * Just before queuing the request, make sure flush() has not
> > > > > +	 * been called after this function has been executed. In that
> > > > > +	 * case, immediately return the request with errors.
> > > > > +	 */
> > > > > +	MutexLocker cameraLock(cameraMutex_);
> > > > > +	if (state_ == CameraFlushing || state_ == CameraStopped) {
> > > > > +		for (camera3_stream_buffer_t &buffer : descriptor.buffers_) {
> > > > > +			buffer.status = CAMERA3_BUFFER_STATUS_ERROR;
> > > > > +			buffer.release_fence = buffer.acquire_fence;
> > > > > +		}
> > > > > +
> > > > > +		notifyError(descriptor.frameNumber_,
> > > > > +			    descriptor.buffers_[0].stream,
> >
> > As commented on a previous patch, I think you should pass nullptr for
> > the stream here.
> 
> The "S6. Error management:" section of the camera3.h header does not
> mention that, not the ?

Indeed, that section doesn't mention the camera3_error_msg::error_stream
field at all. The field is documented in the structure as

    /**
     * Pointer to the stream that had a failure. NULL if the stream isn't
     * applicable to the error.
     */

The question is thus when the stream is applicable to the error. The
documentation of enum camera3_error_msg_code mentions error_stream in
the CAMERA3_MSG_ERROR_BUFFER case only. The other errors are related to
the device, the request or the result metadata, which are not specific
to a stream.

> where does you suggestion come from ? I don't find any reference in
> the review of [1/8]

([PATCH v3 1/8] android: Rework request completion notification'
(YKqV6Iik2sN3XUEf at pendragon.ideasonboard.com)

> > > > > +			    CAMERA3_MSG_ERROR_REQUEST);
> > > > > +
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > >  	worker_.queueRequest(descriptor.request_.get());
> > > > >
> > > > >  	{
> > > > > @@ -1979,6 +2057,10 @@ void CameraDevice::requestComplete(Request *request)
> > > > >  			return;
> > > > >  		}
> > > > >
> > > > > +		/* Release flush if all the pending requests have been completed. */
> > > > > +		if (descriptors_.empty())
> > > > > +			flushing_.notify_one();
> >
> > This will never happen, as you can only get here if descriptors_.find()
> > has found the descriptor. Did you mean to do this after the extract()
> > call below ?
> 
> Ugh. This works only because Camera::stop() is synchronous then ?

I believe so.

> > > > > +
> > > > >  		node = descriptors_.extract(it);
> > > > >  	}
> > > > >  	Camera3RequestDescriptor &descriptor = node.mapped();
> > > > > diff --git a/src/android/camera_device.h b/src/android/camera_device.h
> > > > > index 7cf8e8370387..e1b3bf7d30f2 100644
> > > > > --- a/src/android/camera_device.h
> > > > > +++ b/src/android/camera_device.h
> > > > > @@ -7,6 +7,7 @@
> > > > >  #ifndef __ANDROID_CAMERA_DEVICE_H__
> > > > >  #define __ANDROID_CAMERA_DEVICE_H__
> > > > >
> > > > > +#include <condition_variable>
> > > > >  #include <map>
> > > > >  #include <memory>
> > > > >  #include <mutex>
> > > > > @@ -42,6 +43,7 @@ public:
> > > > >
> > > > >  	int open(const hw_module_t *hardwareModule);
> > > > >  	void close();
> > > > > +	void flush();
> > > > >
> > > > >  	unsigned int id() const { return id_; }
> > > > >  	camera3_device_t *camera3Device() { return &camera3Device_; }
> > > > > @@ -92,6 +94,7 @@ private:
> > > > >  	enum State {
> > > > >  		CameraStopped,
> > > > >  		CameraRunning,
> > > > > +		CameraFlushing,
> > > > >  	};
> > > > >
> > > > >  	void stop();
> > > > > @@ -120,8 +123,9 @@ private:
> > > > >
> > > > >  	CameraWorker worker_;
> > > > >
> > > > > -	libcamera::Mutex cameraMutex_; /* Protects access to the camera state. */
> > > > > +	libcamera::Mutex cameraMutex_; /* Protects the camera state and flushed_. */
> > > > >  	State state_;
> > > > > +	std::condition_variable flushed_;
> > > > >
> > > > >  	std::shared_ptr<libcamera::Camera> camera_;
> > > > >  	std::unique_ptr<libcamera::CameraConfiguration> config_;
> > > > > @@ -134,8 +138,9 @@ private:
> > > > >  	std::map<int, libcamera::PixelFormat> formatsMap_;
> > > > >  	std::vector<CameraStream> streams_;
> > > > >
> > > > > -	libcamera::Mutex requestsMutex_; /* Protects descriptors_. */
> > > > > +	libcamera::Mutex requestsMutex_; /* Protects descriptors_ and flushing_. */
> > > > >  	std::map<uint64_t, Camera3RequestDescriptor> descriptors_;
> > > > > +	std::condition_variable flushing_;
> > > > >
> > > > >  	std::string maker_;
> > > > >  	std::string model_;
> > > > > diff --git a/src/android/camera_ops.cpp b/src/android/camera_ops.cpp
> > > > > index 696e80436821..8a3cfa175ff5 100644
> > > > > --- a/src/android/camera_ops.cpp
> > > > > +++ b/src/android/camera_ops.cpp
> > > > > @@ -66,8 +66,14 @@ static void hal_dev_dump([[maybe_unused]] const struct camera3_device *dev,
> > > > >  {
> > > > >  }
> > > > >
> > > > > -static int hal_dev_flush([[maybe_unused]] const struct camera3_device *dev)
> > > > > +static int hal_dev_flush(const struct camera3_device *dev)
> > > > >  {
> > > > > +	if (!dev)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	CameraDevice *camera = reinterpret_cast<CameraDevice *>(dev->priv);
> > > > > +	camera->flush();
> > > > > +
> > > > >  	return 0;
> > > > >  }
> > > > >

-- 
Regards,

Laurent Pinchart


More information about the libcamera-devel mailing list