[libcamera-devel] [PATCH v2 1/2] android: CameraDevice: Prevent race due to concurrent HAL calls

Hirokazu Honda hiroh at chromium.org
Mon Apr 26 09:51:59 CEST 2021


Hi Laurent, thanks for commenting.

On Mon, Apr 26, 2021 at 3:48 PM Laurent Pinchart
<laurent.pinchart at ideasonboard.com> wrote:
>
> Hi Hiro,
>
> On Mon, Apr 26, 2021 at 11:48:02AM +0900, Hirokazu Honda wrote:
> > On Mon, Apr 26, 2021 at 9:30 AM Laurent Pinchart wrote:
> > > On Fri, Apr 23, 2021 at 01:07:37PM +0900, Hirokazu Honda wrote:
> > > > HAL API functions might be called by different threads in Android
> > > > Camera HAL3 API, while process_capture_request() must be called
> > > > by a single thread.
> > >
> > > I'm not sure to follow you here. Do you mean that the HAL API requires
> > > the camera service to call process_capture_request() from a single
> > > thread ? Or do you mean that the libcamera HAL implementation assumes
> > > that all calls come from the same thread ?
> >
> > I mean process_capture_request() only. The document says.
> >       * ... Only
> >       * one call to process_capture_request() will be made at a time by the
> >       * framework, and the calls will all be from the same thread.
> >
> >
> > > > Furthermore, close() and flush() can be
> > > > called any time. Therefore, races to variables can occur among
> > > > the HAL API functions.
> > >
> > > I'd like to have a bit more details about calling contexts, and record
> > > it in the commit message, or, actually better, in a comment at the top
> > > of the file, as this is very important. There's unfortunately little
> > > information about call contexts in the HAL documentation (camera3.h).
> > > Here's what I've found:
> > >
> > >      * flush() may be called concurrently to process_capture_request(), with the expectation that
> > >      * process_capture_request will return quickly and the request submitted in that
> > >      * process_capture_request call is treated like all other in-flight requests.  Due to
> > >      * concurrency issues, it is possible that from the HAL's point of view, a
> > >      * process_capture_request() call may be started after flush has been invoked but has not
> > >      * returned yet. If such a call happens before flush() returns, the HAL should treat the new
> > >      * capture request like other in-flight pending requests (see #4 below).
> > >
> > > This means that flush() can be called from a different thread than
> > > process_capture_request(). I think this makes sense, and we need to
> > > handle that.
> > >
> > >      * process_capture_request:
> > >      *
> > >      * Send a new capture request to the HAL. The HAL should not return from
> > >      * this call until it is ready to accept the next request to process. Only
> > >      * one call to process_capture_request() will be made at a time by the
> > >      * framework, and the calls will all be from the same thread. The next call
> > >      * to process_capture_request() will be made as soon as a new request and
> > >      * its associated buffers are available. In a normal preview scenario, this
> > >      * means the function will be called again by the framework almost
> > >      * instantly.
> > >
> > > There's only one call to process_capture_request() at a time, so we
> > > don't need to serialize process_capture_request() calls.
> > >
> > > That's pretty much all I've found. Is there any other documentation
> > > you're aware of, information coming from discussions with the Android
> > > camera team, or from reading the Android camera service implementation ?
> >
> > From Android camera team, flush() and close() can be called at any time.
> > Yes, there is no race among process_capture_request() calls. However,
> > the HAL functions must be serialized due to the any time of flush()
> > and close() call.
>
> flush() and close() can race with process_capture_request(). But can
> they race each other ? Can they race with configure_streams() ?
>

+Ricky Liang tells me flush() and close() can be called at any time.
I have no idea if flush() and close() race each other. But we should
protect other functions with them.
Ricky, is there any document stating this?

> > > > This prevents the races by enclosing HAL calls by mutex. It
> > > > might not be the best in terms of the performance, but the
> > > > simplicity is good as a first step and also further development.
> > > >
> > > > Signed-off-by: Hirokazu Honda <hiroh at chromium.org>
> > > > ---
> > > >  src/android/camera_device.cpp | 70 +++++++++++++++++++++--------------
> > > >  src/android/camera_device.h   |  9 ++++-
> > > >  2 files changed, 50 insertions(+), 29 deletions(-)
> > > >
> > > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp
> > > > index a71aee2f..c74dc0e7 100644
> > > > --- a/src/android/camera_device.cpp
> > > > +++ b/src/android/camera_device.cpp
> > > > @@ -748,27 +748,40 @@ int CameraDevice::open(const hw_module_t *hardwareModule)
> > > >
> > > >  void CameraDevice::close()
> > > >  {
> > > > -     streams_.clear();
> > > > -
> > > > -     stop();
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > > >
> > > > -     camera_->release();
> > > > +     stop(/*releaseCamera=*/true);
> > >
> > > The camera is only released here. Can't we keep the camera_->release()
> > > call here, to avoid the bool parameter to stop() ?
> >
> > Thanks, it should be better. I will move in the next patch.
> >
> > > >  }
> > > >
> > > > -void CameraDevice::stop()
> > > > +void CameraDevice::stop(bool releaseCamera)
> > > >  {
> > > > -     if (!running_)
> > > > +     streams_.clear();
> > > > +
> > > > +     if (!running_) {
> > > > +             if (releaseCamera) {
> > > > +                     descriptors_.clear();
> > > > +                     camera_->release();
> > > > +             }
> > > >               return;
> > > > +     }
> > > > +
> > > > +     stopping_ = true;
> > > >
> > > >       worker_.stop();
> > > >       camera_->stop();
> > > >
> > > >       descriptors_.clear();
> > > > +
> > > >       running_ = false;
> > > > +     stopping_ = false;
> > > > +
> > > > +     if (releaseCamera)
> > > > +             camera_->release();
> > > >  }
> > > >
> > > >  void CameraDevice::setCallbacks(const camera3_callback_ops_t *callbacks)
> > > >  {
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > > >       callbacks_ = callbacks;
> > >
> > > This function is called by the HAL .initialize() operation. I really
> > > don't think it could race with anything else. While it's not explicitly
> > > documented, I don't see how it would be reasonable for the camera
> > > service to call .initialize() from one thread, and other HAL functions
> > > from different threads before .initialize() returns.
> >
> > Thanks. Yeah, I am sure this function will not be concurrently called
> > with other functions.
> > However, callbacks_ will be called in stop(), I wonder it is worth
> > having a seliralization here also thinking of consistency.
> > How do you think?
>
> As commented below, even if we don't need to serialize these calls with
> a lock, doing so shouldn't hurt performances, and should make locking
> verification easier, so I'm not opposed to it.
>
> > > >  }
> > > >
> > > > @@ -1581,6 +1594,8 @@ std::unique_ptr<CameraMetadata> CameraDevice::requestTemplateVideo()
> > > >   */
> > > >  const camera_metadata_t *CameraDevice::constructDefaultRequestSettings(int type)
> > > >  {
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > > > +
> > >
> > > This function shouldn't affect the state of the camera, does it need to
> > > be serialized ?
> >
> > Thanks. Even thinking of the consistency, having a serialization in
> > the function is not reasonable.
> > I will remove mutex_ in this function.
> >
> > > >       auto it = requestTemplates_.find(type);
> > > >       if (it != requestTemplates_.end())
> > > >               return it->second->get();
> > > > @@ -1648,8 +1663,7 @@ PixelFormat CameraDevice::toPixelFormat(int format) const
> > > >   */
> > > >  int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > >  {
> > > > -     /* Before any configuration attempt, stop the camera. */
> > > > -     stop();
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > >
> > > Same comment here, is this something that can race with other calls ?
> > >
> > > This being said, I'm not opposed to locking the mutex here and in
> > > CameraDevice::setCallbacks(). Locks are best documented as what data
> > > they protect, and using them consistently with their documentation makes
> > > it easier to check correct locking (this includes making it easier for
> > > static code checkers to catch issues). As taking the lock here should
> > > not cause performance issues (especially given that there should be no
> > > concurrent call to this function), we can keep it for consistency.
> >
> > I think flush() and close() can be called during configureStreams() call.
>
> That's interesting. I would have expected those to be serialized. Is
> there any documentation that explains this ?
>
> > I think it is mandatory to have a lock at least somewhere in
> > process_capture_request(). If we don't enclose
> > process_capture_request() by lock, handling flush() and concurrent
> > process_capture_request() might become a bit complicated (though I can
> > still do it, I actually implemented so initially but discarded because
> > of the complexity).
> >
> > > >
> > > >       if (stream_list->num_streams == 0) {
> > > >               LOG(HAL, Error) << "No streams in configuration";
> > > > @@ -1661,6 +1675,9 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > >               return -EINVAL;
> > > >  #endif
> > > >
> > > > +     /* Before any configuration attempt, stop the camera. */
> > > > +     stop();
> > > > +
> > > >       /*
> > > >        * Generate an empty configuration, and construct a StreamConfiguration
> > > >        * for each camera3_stream to add to it.
> > > > @@ -1672,11 +1689,10 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)
> > > >       }
> > > >
> > > >       /*
> > > > -      * Clear and remove any existing configuration from previous calls, and
> > > > -      * ensure the required entries are available without further
> > > > +      * Ensure the required entries are available without further
> > > >        * reallocation.
> > > >        */
> > > > -     streams_.clear();
> > > > +     ASSERT(streams_.empty());
> > > >       streams_.reserve(stream_list->num_streams);
> > > >
> > > >       std::vector<Camera3StreamConfig> streamConfigs;
> > > > @@ -1912,6 +1928,8 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > >       if (!isValidRequest(camera3Request))
> > > >               return -EINVAL;
> > > >
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > > > +
> > > >       /* Start the camera if that's the first request we handle. */
> > > >       if (!running_) {
> > > >               worker_.start();
> > > > @@ -2015,33 +2033,31 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques
> > > >
> > > >       worker_.queueRequest(descriptor.request_.get());
> > > >
> > > > -     {
> > > > -             std::scoped_lock<std::mutex> lock(mutex_);
> > > > -             descriptors_[descriptor.request_->cookie()] = std::move(descriptor);
> > > > -     }
> > > > +     descriptors_[descriptor.request_->cookie()] = std::move(descriptor);
> > > >
> > > >       return 0;
> > > >  }
> > > >
> > > >  void CameraDevice::requestComplete(Request *request)
> > > >  {
> > > > +     if (stopping_)
> > > > +             return;
> > > > +
> > > > +     std::scoped_lock<std::mutex> lock(mutex_);
> > >
> > > The requestComplete signal is emitted from the CameraManager thread, and
> > > must not block for a large amount of time. We already lock the mutex_ in
> > > this function, but we currently use the mutex_ only to protect
> > > descriptors_, which should never block for long. Extending mutex_ to
> > > serialize the Camera calls is dangerous. Could we implement better
> > > locking ? I think the first step would be to clearly explain what races
> > > we need to handle, as mentioned above in the review of the commit
> > > message. Race conditions are tricky to handle right, proper
> > > documentation is a key.
> > >
> > > I know we already have issues with JPEG compression that is performed
> > > synchronously in this function. This needs to be fixed, but let's not
> > > make it worse and introduce another issue that will then need to be
> > > fixed too.
> >
> > I agree.
> > Since requestComplete() doesn't have to be serialized with HAL
> > functions, we don't need to have a lock from begin to end.
> > My first thought is coarse lock is a good first step as fine lock
> > might make the code complicated.
> > I will next try a fine lock in this function.
>
> Thank you. If it starts causing issues, please feel free to reach out to
> discuss them before spending too much time on the implementation.
>

While I tried improving the grain of lock, I found requestComplete()
can touch an invalid request.
This is caused because descriptors_ can be destroyed in stop(), but
requestComplete() needs to call request->cookie() in order to check
it.
I think the easiest way of avoiding this is to separate request from
descriptor and destroy it always in requestComplete().
Please see the next patch series.

-Hiro

> > > > +
> > > >       const Request::BufferMap &buffers = request->buffers();
> > > >       camera3_buffer_status status = CAMERA3_BUFFER_STATUS_OK;
> > > >       std::unique_ptr<CameraMetadata> resultMetadata;
> > > >
> > > > -     decltype(descriptors_)::node_type node;
> > > > -     {
> > > > -             std::scoped_lock<std::mutex> lock(mutex_);
> > > > -             auto it = descriptors_.find(request->cookie());
> > > > -             if (it == descriptors_.end()) {
> > > > -                     LOG(HAL, Fatal)
> > > > -                             << "Unknown request: " << request->cookie();
> > > > -                     status = CAMERA3_BUFFER_STATUS_ERROR;
> > > > -                     return;
> > > > -             }
> > > > -
> > > > -             node = descriptors_.extract(it);
> > > > +     auto it = descriptors_.find(request->cookie());
> > > > +     if (it == descriptors_.end()) {
> > > > +             LOG(HAL, Debug)
> > > > +                     << "Unknown request: " << request->cookie();
> > > > +             status = CAMERA3_BUFFER_STATUS_ERROR;
> > > > +             return;
> > > >       }
> > > > +
> > > > +     auto node = descriptors_.extract(it);
> > > >       Camera3RequestDescriptor &descriptor = node.mapped();
> > > >
> > > >       if (request->status() != Request::RequestComplete) {
> > > > diff --git a/src/android/camera_device.h b/src/android/camera_device.h
> > > > index c63e8e21..08553584 100644
> > > > --- a/src/android/camera_device.h
> > > > +++ b/src/android/camera_device.h
> > > > @@ -88,7 +88,7 @@ private:
> > > >               int androidFormat;
> > > >       };
> > > >
> > > > -     void stop();
> > > > +     void stop(bool releaseCamera = false);
> > > >
> > > >       int initializeStreamConfigurations();
> > > >       std::vector<libcamera::Size>
> > > > @@ -127,7 +127,12 @@ private:
> > > >       std::map<int, libcamera::PixelFormat> formatsMap_;
> > > >       std::vector<CameraStream> streams_;
> > > >
> > > > -     std::mutex mutex_; /* Protect descriptors_ */
> > > > +     /*
> > > > +      * Protect descriptors_ and camera_, and also avoid races by concurrent
> > > > +      * Camera HAL API calls.
> > > > +      */
> > > > +     std::mutex mutex_;
> > > > +     std::atomic_bool stopping_;
> > > >       std::map<uint64_t, Camera3RequestDescriptor> descriptors_;
> > > >
> > > >       std::string maker_;
>
> --
> Regards,
>
> Laurent Pinchart


More information about the libcamera-devel mailing list