[libcamera-devel] [PATCH v2 1/2] android: CameraDevice: Prevent race due to concurrent HAL calls

Mon Apr 26 04:48:02 CEST 2021

Hi Laurent, thanks for the comments.

On Mon, Apr 26, 2021 at 9:30 AM Laurent Pinchart
<laurent.pinchart at ideasonboard.com> wrote:
>
> Hi Hiro,
>
> Thank you for the patch.
>
> On Fri, Apr 23, 2021 at 01:07:37PM +0900, Hirokazu Honda wrote:
> > HAL API functions might be called by different threads in Android
> > Camera HAL3 API, while process_capture_request() must be called
> > by a single thread.
>
> I'm not sure to follow you here. Do you mean that the HAL API requires
> the camera service to call process_capture_request() from a single
> thread ? Or do you mean that the libcamera HAL implementation assumes
> that all calls come from the same thread ?
>

I mean process_capture_request() only. The document says.
      * ... Only
      * one call to process_capture_request() will be made at a time by the
      * framework, and the calls will all be from the same thread.

> > Furthermore, close() and flush() can be
> > called any time. Therefore, races to variables can occur among
> > the HAL API functions.
>
> I'd like to have a bit more details about calling contexts, and record
> it in the commit message, or, actually better, in a comment at the top
> of the file, as this is very important. There's unfortunately little
> information about call contexts in the HAL documentation (camera3.h).
> Here's what I've found:
>
>      * flush() may be called concurrently to process_capture_request(), with the expectation that
>      * process_capture_request will return quickly and the request submitted in that
>      * process_capture_request call is treated like all other in-flight requests.  Due to
>      * concurrency issues, it is possible that from the HAL's point of view, a
>      * process_capture_request() call may be started after flush has been invoked but has not
>      * returned yet. If such a call happens before flush() returns, the HAL should treat the new
>      * capture request like other in-flight pending requests (see #4 below).
>
> This means that flush() can be called from a different thread than
> process_capture_request(). I think this makes sense, and we need to
> handle that.
>
>      * process_capture_request:
>      *
>      * Send a new capture request to the HAL. The HAL should not return from
>      * this call until it is ready to accept the next request to process. Only
>      * one call to process_capture_request() will be made at a time by the
>      * framework, and the calls will all be from the same thread. The next call
>      * to process_capture_request() will be made as soon as a new request and
>      * its associated buffers are available. In a normal preview scenario, this
>      * means the function will be called again by the framework almost
>      * instantly.
>
> There's only one call to process_capture_request() at a time, so we
> don't need to serialize process_capture_request() calls.
>
> That's pretty much all I've found. Is there any other documentation
> you're aware of, information coming from discussions with the Android
> camera team, or from reading the Android camera service implementation ?
>

>From Android camera team, flush() and close() can be called at any time.
Yes, there is no race among process_capture_request() calls. However,
the HAL functions must be serialized due to the any time of flush()
and close() call.

> > This prevents the races by enclosing HAL calls by mutex. It
> > might not be the best in terms of the performance, but the
> > simplicity is good as a first step and also further development.
> >
> > Signed-off-by: Hirokazu Honda <hiroh at chromium.org>
> > ---
> >  src/android/camera_device.cpp | 70 +++++++++++++++++++++--------------
> >  src/android/camera_device.h   |  9 ++++-
> >  2 files changed, 50 insertions(+), 29 deletions(-)
> >
> > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > index a71aee2f..c74dc0e7 100644
> > --- a/src/android/camera_device.cpp
> > +++ b/src/android/camera_device.cpp
> > @@ -748,27 +748,40 @@ int CameraDevice::open(const hw_module_t *hardwareModule)
> >
> >  void CameraDevice::close()
> >  {
> > -     streams_.clear();
> > -
> > -     stop();
> > +     std::scoped_lock<std::mutex> lock(mutex_);
> >
> > -     camera_->release();
> > +     stop(/*releaseCamera=*/true);
>
> The camera is only released here. Can't we keep the camera_->release()
> call here, to avoid the bool parameter to stop() ?
>

Thanks, it should be better. I will move in the next patch.

> >  }
> >
> > -void CameraDevice::stop()
> > +void CameraDevice::stop(bool releaseCamera)
> >  {
> > -     if (!running_)
> > +     streams_.clear();
> > +
> > +     if (!running_) {
> > +             if (releaseCamera) {
> > +                     descriptors_.clear();
> > +                     camera_->release();
> > +             }
> >               return;
> > +     }
> > +
> > +     stopping_ = true;
> >
> >       worker_.stop();
> >       camera_->stop();
> >
> >       descriptors_.clear();
> > +
> >       running_ = false;
> > +     stopping_ = false;
> > +
> > +     if (releaseCamera)
> > +             camera_->release();
> >  }
> >
> >  void CameraDevice::setCallbacks(const camera3_callback_ops_t *callbacks)
> >  {
> > +     std::scoped_lock<std::mutex> lock(mutex_);
> >       callbacks_ = callbacks;
>
> This function is called by the HAL .initialize() operation. I really
> don't think it could race with anything else. While it's not explicitly
> documented, I don't see how it would be reasonable for the camera
> service to call .initialize() from one thread, and other HAL functions
> from different threads before .initialize() returns.
>

Thanks. Yeah, I am sure this function will not be concurrently called
with other functions.
However, callbacks_ will be called in stop(), I wonder it is worth
having a seliralization here also thinking of consistency.
How do you think?

> >  }
> >
> > @@ -1581,6 +1594,8 @@ std::unique_ptr<CameraMetadata> CameraDevice::requestTemplateVideo()
> >   */
> >  const camera_metadata_t *CameraDevice::constructDefaultRequestSettings(int type)
> >  {
> > +     std::scoped_lock<std::mutex> lock(mutex_);
> > +
>
> This function shouldn't affect the state of the camera, does it need to
> be serialized ?
>

Thanks. Even thinking of the consistency, having a serialization in
the function is not reasonable.
I will remove mutex_ in this function.

> >       auto it = requestTemplates_.find(type);
> >       if (it != requestTemplates_.end())
> >               return it->second->get();
> > @@ -1648,8 +1663,7 @@ PixelFormat CameraDevice::toPixelFormat(int format) const
> >   */
> >  int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> >  {
> > -     /* Before any configuration attempt, stop the camera. */
> > -     stop();
> > +     std::scoped_lock<std::mutex> lock(mutex_);
>
> Same comment here, is this something that can race with other calls ?
>
> This being said, I'm not opposed to locking the mutex here and in
> CameraDevice::setCallbacks(). Locks are best documented as what data
> they protect, and using them consistently with their documentation makes
> it easier to check correct locking (this includes making it easier for
> static code checkers to catch issues). As taking the lock here should
> not cause performance issues (especially given that there should be no
> concurrent call to this function), we can keep it for consistency.
>

I think flush() and close() can be called during configureStreams() call.
I think it is mandatory to have a lock at least somewhere in
process_capture_request(). If we don't enclose
process_capture_request() by lock, handling flush() and concurrent
process_capture_request() might become a bit complicated (though I can
still do it, I actually implemented so initially but discarded because
of the complexity).

> >
> >       if (stream_list->num_streams == 0) {
> >               LOG(HAL, Error) << "No streams in configuration";
> > @@ -1661,6 +1675,9 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> >               return -EINVAL;
> >  #endif
> >
> > +     /* Before any configuration attempt, stop the camera. */
> > +     stop();
> > +
> >       /*
> >        * Generate an empty configuration, and construct a StreamConfiguration
> >        * for each camera3_stream to add to it.
> > @@ -1672,11 +1689,10 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> >       }
> >
> >       /*
> > -      * Clear and remove any existing configuration from previous calls, and
> > -      * ensure the required entries are available without further
> > +      * Ensure the required entries are available without further
> >        * reallocation.
> >        */
> > -     streams_.clear();
> > +     ASSERT(streams_.empty());
> >       streams_.reserve(stream_list->num_streams);
> >
> >       std::vector<Camera3StreamConfig> streamConfigs;
> > @@ -1912,6 +1928,8 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> >       if (!isValidRequest(camera3Request))
> >               return -EINVAL;
> >
> > +     std::scoped_lock<std::mutex> lock(mutex_);
> > +
> >       /* Start the camera if that's the first request we handle. */
> >       if (!running_) {
> >               worker_.start();
> > @@ -2015,33 +2033,31 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> >
> >       worker_.queueRequest(descriptor.request_.get());
> >
> > -     {
> > -             std::scoped_lock<std::mutex> lock(mutex_);
> > -             descriptors_[descriptor.request_->cookie()] = std::move(descriptor);
> > -     }
> > +     descriptors_[descriptor.request_->cookie()] = std::move(descriptor);
> >
> >       return 0;
> >  }
> >
> >  void CameraDevice::requestComplete(Request *request)
> >  {
> > +     if (stopping_)
> > +             return;
> > +
> > +     std::scoped_lock<std::mutex> lock(mutex_);
>
> The requestComplete signal is emitted from the CameraManager thread, and
> must not block for a large amount of time. We already lock the mutex_ in
> this function, but we currently use the mutex_ only to protect
> descriptors_, which should never block for long. Extending mutex_ to
> serialize the Camera calls is dangerous. Could we implement better
> locking ? I think the first step would be to clearly explain what races
> we need to handle, as mentioned above in the review of the commit
> message. Race conditions are tricky to handle right, proper
> documentation is a key.
>
> I know we already have issues with JPEG compression that is performed
> synchronously in this function. This needs to be fixed, but let's not
> make it worse and introduce another issue that will then need to be
> fixed too.
>

I agree.
Since requestComplete() doesn't have to be serialized with HAL
functions, we don't need to have a lock from begin to end.
My first thought is coarse lock is a good first step as fine lock
might make the code complicated.
I will next try a fine lock in this function.

-Hiro

> > +
> >       const Request::BufferMap &buffers = request->buffers();
> >       camera3_buffer_status status = CAMERA3_BUFFER_STATUS_OK;
> >       std::unique_ptr<CameraMetadata> resultMetadata;
> >
> > -     decltype(descriptors_)::node_type node;
> > -     {
> > -             std::scoped_lock<std::mutex> lock(mutex_);
> > -             auto it = descriptors_.find(request->cookie());
> > -             if (it == descriptors_.end()) {
> > -                     LOG(HAL, Fatal)
> > -                             << "Unknown request: " << request->cookie();
> > -                     status = CAMERA3_BUFFER_STATUS_ERROR;
> > -                     return;
> > -             }
> > -
> > -             node = descriptors_.extract(it);
> > +     auto it = descriptors_.find(request->cookie());
> > +     if (it == descriptors_.end()) {
> > +             LOG(HAL, Debug)
> > +                     << "Unknown request: " << request->cookie();
> > +             status = CAMERA3_BUFFER_STATUS_ERROR;
> > +             return;
> >       }
> > +
> > +     auto node = descriptors_.extract(it);
> >       Camera3RequestDescriptor &descriptor = node.mapped();
> >
> >       if (request->status() != Request::RequestComplete) {
> > diff --git a/src/android/camera_device.h b/src/android/camera_device.h
> > index c63e8e21..08553584 100644
> > --- a/src/android/camera_device.h
> > +++ b/src/android/camera_device.h
> > @@ -88,7 +88,7 @@ private:
> >               int androidFormat;
> >       };
> >
> > -     void stop();
> > +     void stop(bool releaseCamera = false);
> >
> >       int initializeStreamConfigurations();
> >       std::vector<libcamera::Size>
> > @@ -127,7 +127,12 @@ private:
> >       std::map<int, libcamera::PixelFormat> formatsMap_;
> >       std::vector<CameraStream> streams_;
> >
> > -     std::mutex mutex_; /* Protect descriptors_ */
> > +     /*
> > +      * Protect descriptors_ and camera_, and also avoid races by concurrent
> > +      * Camera HAL API calls.
> > +      */
> > +     std::mutex mutex_;
> > +     std::atomic_bool stopping_;
> >       std::map<uint64_t, Camera3RequestDescriptor> descriptors_;
> >
> >       std::string maker_;
>
> --
> Regards,
>
> Laurent Pinchart