<div dir="ltr"><div dir="ltr">Hi Jacopo and Laurent, thank you for reviewing.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 4, 2021 at 1:02 AM Jacopo Mondi <<a href="mailto:jacopo@jmondi.org">jacopo@jmondi.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Laurent,<br>

<br>

On Mon, May 03, 2021 at 05:55:38PM +0300, Laurent Pinchart wrote:<br>

> Hello,<br>

><br>

> On Mon, May 03, 2021 at 04:41:04PM +0200, Jacopo Mondi wrote:<br>

> > On Mon, Apr 26, 2021 at 05:38:29PM +0900, Hirokazu Honda wrote:<br>

> > > HAL API functions might be called by different threads in Android<br>

> > > Camera HAL3 API, while process_capture_request() must be called<br>

> > > by a single thread. Furthermore, close() and flush() can be<br>

> > > called any time. Therefore, races to variables can occur among<br>

> > > the HAL API functions.<br>

> > > This prevents the races by enclosing HAL calls by mutex. It<br>

> > > might not be the best in terms of the performance, but the<br>

> > > simplicity is good as a first step and also further development.<br>

> > ><br>

> > > Signed-off-by: Hirokazu Honda <<a href="mailto:hiroh@chromium.org" target="_blank">hiroh@chromium.org</a>><br>

> > > ---<br>

> > >  src/android/camera_device.cpp | 59 ++++++++++++++++++++---------------<br>

> > >  src/android/camera_device.h   |  7 +++--<br>

> > >  2 files changed, 39 insertions(+), 27 deletions(-)<br>

> > ><br>

> > > diff --git a/src/android/camera_device.cpp b/src/android/camera_device.cpp<br>

> > > index b3def04b..96c8d4a9 100644<br>

> > > --- a/src/android/camera_device.cpp<br>

> > > +++ b/src/android/camera_device.cpp<br>

> > > @@ -743,7 +743,7 @@ int CameraDevice::open(const hw_module_t *hardwareModule)<br>

> > ><br>

> > >  void CameraDevice::close()<br>

> > >  {<br>

> > > - streams_.clear();<br>

> > > + std::scoped_lock<std::mutex> halLock(halMutex_);<br>

> ><br>

> > Have you considered locking the wrappers functions in camera_ops.cpp ?<br>

> > Wouldn't it be simpler ? However locking with such a coarse-grained<br>

> > granularity I feel contradicts a bit the Camera3 API flush()<br>

> > documentation, which implies that flush() and<br>

> > process_capture_request() can race<br>

> ><br>

> >      * Due to<br>

> >      * concurrency issues, it is possible that from the HAL's point of view, a<br>

> >      * process_capture_request() call may be started after flush has been invoked but has not<br>

> >      * returned yet. If such a call happens before flush() returns, the HAL should treat the new<br>

> >      * capture request like other in-flight pending requests (see #4 below).<br>

> ><br>

> > Locking at function level rules out this use case: if we lock each<br>

> > function it's not possible for process_capture_request() to be started<br>

> > during flush() execution as the lock hold by flush will be released<br>

> > only at exit time.<br>

> ><br>

> > If serializing operations with this granularity is fine (and I really<br>

> > with it is, as it would really simplify life for use)<br>

><br>

> I think it is as an initial step, but likely not in the longer term. It<br>

> defeats the point of having the camera service issue<br>

> process_capture_request() and stop()/flush() calls concurrently. If<br>

<br>

Yes, that was my concern, although the documentation says it's enough<br>

to implement the full success/full fail behavior which I think it's<br>

what's happening here. <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> serializing operations was good enough in every case, it would be done<br>

> in the camera service itself, and HAL implementations wouldn't need to<br>

> deal with this.<br>

<br>

Also true, indeed the framework allows to implement a finer granularity<br>

in locking the operations<br>

<br></blockquote><div><br></div><div><div>Since ChromeOS camera services sealizes the calls, ChromeOS doesn't require this serialization here. </div><div></div></div><div>But shall we assume that a HAL client serializes the call? Since this is a HAL implementation, I think we should align the HAL specification; no serialization is ensured.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

><br>

> > the only racing<br>

> > condition we have to protect from is between the request completion<br>

> > signal emitted by libcamera, and the camera framework calling into the<br>

> > Camera HAL using the camera3 API as flush(), close() and<br>

> > process_capture_request() will always be serialized.<br>

> ><br>

> > Assuming the above is ok with the Camera3 API, isn't it enough to<br>

> ><br>

> > - flush() to call stop() and stop the Camera:<br>

> >         - All queued requests at the time Camera::stop() is called<br>

> >           will be returned one by one with an error state. There's no<br>

> >           reason to cancel the descriptors at stop() time, those<br>

> >           requests will be completed<br>

> ><br></blockquote><div><br></div><div>That's correct. Well, I would do so just in case to ensure all requests are returned before flush().</div><div>But it should be unnecessary.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> >         - A process_capture_request() called after flush() has stopped<br>

> >           the camera will return immeditely as the camera is stopped<br>

> >           and the capture request is returned with an error<br>

> >           (we might want a flag to keep track of this, otherwise we'll<br>

> >           hit a Camera state machine error, much like we do with<br>

> >           running_)<br>

> ><br></blockquote><div><br></div><div>Hmm, in my understanding, the process_capture_request() should re-start the camera as it is a usual request.</div><div>Why do you think so?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> >         - A close() called after flush() will be a no op<br>

><br>

> So far it seems fine to me.<br></blockquote><div><br></div><div>The critical difference between close() and flush() is whether a camera is held or not, isn't it?</div><div>That is, Camera::release() is called or not.</div><div>I agree almost all is done by the preceding flush(), so close() called after flush() will just call Camera::release().</div><div>How do you think?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

><br>

> >         - Only open() will allow the CameraDevice to move back from<br>

> >           stopped_ state<br>

><br>

> That's correct for close(). For flush(), we can to make sure that a<br>

> configureStreams() call will reopen the camera.<br>

<br>

Right, it's probably enough to just reset the state of the new flag in<br>

configureStreams() and return immediately from<br>

processCaptureRequest().<br>

<br>

My main point should have been that processCaptureRequest() cannot be<br>

called after a flush() without going through configureStreams() ?<br>

<br>

Maybe not true however, considering:<br>

<br>

     * flush() should only return when there are no more outstanding buffers or<br>

     * requests left in the HAL. The framework may call configure_streams (as<br>

     * the HAL state is now quiesced) or may issue new requests.<br>

<br>

Which seems to suggest after a flush, processCaptureRequest() can be<br>

called again.<br>

<br></blockquote><div><br></div><div>Yeah, from this, I think we should re-open camera in the first process_capture_request() after flush() has been called.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

><br>

> > - There's a tiny tiny window where a request can complete after flush<br>

> >   has been called but the camera has not been stopped yet. In<br>

><br>

> It's not that tiny, it can certainly be hit in practice.<br>

><br>

> >   requestComplete() we'll have to inspect the stopped_ flag state and<br>

> >   force the request to return with error even if it did not.<br>

><br>

> Why can't we complete the request without any error in that case ?<br>

<br>

Because flush() has been called already (but the camera has not<br>

stopped yet?)<br>

<br>

Although the documentation reports<br>

<br>

     * 1. For captures that are too late for the HAL to cancel/stop, and will be<br>

     *    completed normally by the HAL; i.e. the HAL can send shutter/notify and<br>

     *    process_capture_result and buffers as normal.<br>

<br>

It's then probably fine to let those requests just complete<br>

successfully (even better, it's easier)<br>

<br></blockquote><div><br></div><div>Camera::stop() is a blocking function and calls requestComplete() with all the pending requests?</div><div>In that case, a return request is marked with CANCELLED or if the buffer is already complete, marked with COMPLETE.</div><div>Well, ideally, so that we should not execute a post-processing during flush and a thread should be separated for the post processing, we should introduce a way of detecting that flush is being called in requestComplete().</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

><br>

> > - Once we move JPEG encoding to a worker thread, stop() will have to<br>

> >   forcefully stop it. Requests interrupted during encoding will be<br>

> >   returned with error.<br>

> ><br>

> > A simple lock around stopped_ should protect it from concurrent<br>

> > accesses between stop() and requestComplete().<br>

><br>

> The issue is that the flush() function must wait for all requests to<br>

> complete. This could be implemented with a condition variable, allowing<br>

> to avoid serializing CameraDevice::requestComplete() with the API calls<br>

> from the camera service. I think that would be better indeed.<br>

<br>

Ah yes, I overlooked this part<br>

<br>

     * flush() should only return when there are no more outstanding buffers or<br>

     * requests left in the HAL.<br>

<br>

Actually with a condition variable it might also get easier, as<br>

<br>

        flush() {<br>

                {<br>

                        /* mutex lock *<br>

                        flushing_ = true;<br>

                }<br>

<br>

                stop()<br>

<br>

                flushed.wait()<br>

                {<br>

                        /* mutex lock *<br>

                        flushing_ = false;<br>

                }<br>

        }<br>

<br></blockquote><div><br></div><div>The difficult point is with this other HAL calls (e.g. process_capture_request() and stop()) can be called during stop().</div><div>We can protect it with flushing_. But I think it should be simplified by guarding two mutexes as this patch does.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Would allow to held the flushing_ state until all the pending requests<br>

have been returned as a consequence of stop(). Once that's done we can<br>

reset the flushing_ flag, and the next processCaptureRequest() can<br>

continue as normal ?<br>

<br></blockquote><div><br></div><div>Yes, I did so in this patch series though without flushing_, but with a new mutex.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Do we have a list of CTS or cros_camera_test that exercizes flush() to<br>

be used as validation ?<br>

<br></blockquote><div><br></div><div>I think CTS doesn't detect the issue.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I've tried with this series applied:<br>

<br>

# cros_camera_test --gtest_filter="Camera3FrameTest/Camera3FlushRequestsTest*"<br>

[  PASSED  ] 4 tests.<br>

[  FAILED  ] 8 tests, listed below<br>

<br>

Which is the same result I get on the most recent master..<br>

<br></blockquote><div><br></div><div>You need 1954 and 1967 to pass the test.</div><div><br></div><div>Thanks,</div><div>-Hiro</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Thanks<br>

   j<br>

<br>

> > Have you considered this option and ruled it out for reasons I'm<br>

> > failing to consider ? It feels simpler than decoupling requests and<br>

> > descriptors, maybe you tried and found out it's not :)<br>

> ><br>

> > >   stop();<br>

> > ><br>

> > > @@ -758,12 +758,17 @@ void CameraDevice::stop()<br>

> > >   worker_.stop();<br>

> > >   camera_->stop();<br>

> > ><br>

> > > - descriptors_.clear();<br>

> > > + {<br>

> > > +         std::scoped_lock<std::mutex> lock(mutex_);<br>

> > > +         descriptors_.clear();<br>

> > > + }<br>

> > > +<br>

> > >   running_ = false;<br>

> > >  }<br>

> > ><br>

> > >  void CameraDevice::setCallbacks(const camera3_callback_ops_t *callbacks)<br>

> > >  {<br>

> > > + std::scoped_lock<std::mutex> halLock(halMutex_);<br>

> > >   callbacks_ = callbacks;<br>

> > >  }<br>

> > ><br>

> > > @@ -1643,8 +1648,7 @@ PixelFormat CameraDevice::toPixelFormat(int format) const<br>

> > >   */<br>

> > >  int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)<br>

> > >  {<br>

> > > - /* Before any configuration attempt, stop the camera. */<br>

> > > - stop();<br>

> > > + std::scoped_lock<std::mutex> halLock(halMutex_);<br>

> > ><br>

> > >   if (stream_list->num_streams == 0) {<br>

> > >           LOG(HAL, Error) << "No streams in configuration";<br>

> > > @@ -1656,6 +1660,9 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)<br>

> > >           return -EINVAL;<br>

> > >  #endif<br>

> > ><br>

> > > + /* Before any configuration attempt, stop the camera. */<br>

> > > + stop();<br>

> > > +<br>

> > >   /*<br>

> > >    * Generate an empty configuration, and construct a StreamConfiguration<br>

> > >    * for each camera3_stream to add to it.<br>

> > > @@ -1671,6 +1678,8 @@ int CameraDevice::configureStreams(camera3_stream_configuration_t *stream_list)<br>

> > >    * ensure the required entries are available without further<br>

> > >    * reallocation.<br>

> > >    */<br>

> > > +<br>

> > > + std::scoped_lock<std::mutex> lock(mutex_);<br>

> > >   streams_.clear();<br>

> > >   streams_.reserve(stream_list->num_streams);<br>

> > ><br>

> > > @@ -1907,6 +1916,8 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques<br>

> > >   if (!isValidRequest(camera3Request))<br>

> > >           return -EINVAL;<br>

> > ><br>

> > > + std::scoped_lock<std::mutex> halLock(halMutex_);<br>

> > > +<br>

> > >   /* Start the camera if that's the first request we handle. */<br>

> > >   if (!running_) {<br>

> > >           worker_.start();<br>

> > > @@ -2022,29 +2033,26 @@ int CameraDevice::processCaptureRequest(camera3_capture_request_t *camera3Reques<br>

> > ><br>

> > >  void CameraDevice::requestComplete(Request *request)<br>

> > >  {<br>

> > > + std::scoped_lock<std::mutex> lock(mutex_);<br>

> > > +<br>

> > >   const Request::BufferMap &buffers = request->buffers();<br>

> > >   camera3_buffer_status status = CAMERA3_BUFFER_STATUS_OK;<br>

> > >   std::unique_ptr<CameraMetadata> resultMetadata;<br>

> > ><br>

> > > - decltype(descriptors_)::node_type node;<br>

> > > - std::unique_ptr<CaptureRequest> captureRequest;<br>

> > > - {<br>

> > > -         std::scoped_lock<std::mutex> lock(mutex_);<br>

> > > -         auto requestIt = requests_.find(request->cookie());<br>

> > > -         if (requestIt == requests_.end()) {<br>

> > > -                 LOG(HAL, Fatal)<br>

> > > -                         << "Unknown request: " << request->cookie();<br>

> > > -                 return;<br>

> > > -         }<br>

> > > -         captureRequest = std::move(requestIt->second);<br>

> > > -         requests_.erase(requestIt);<br>

> > > + auto requestIt = requests_.find(request->cookie());<br>

> > > + if (requestIt == requests_.end()) {<br>

> > > +         LOG(HAL, Fatal)<br>

> > > +                 << "Unknown request: " << request->cookie();<br>

> > > +         return;<br>

> > > + }<br>

> > > + auto captureRequest = std::move(requestIt->second);<br>

> > > + requests_.erase(requestIt);<br>

> > ><br>

> > > -         auto it = descriptors_.find(request->cookie());<br>

> > > -         if (it == descriptors_.end())<br>

> > > -                 return;<br>

> > > + auto it = descriptors_.find(request->cookie());<br>

> > > + if (it == descriptors_.end())<br>

> > > +         return;<br>

> > ><br>

> > > -         node = descriptors_.extract(it);<br>

> > > - }<br>

> > > + auto node = descriptors_.extract(it);<br>

> > >   Camera3RequestDescriptor &descriptor = node.mapped();<br>

> > ><br>

> > >   if (request->status() != Request::RequestComplete) {<br>

> > > @@ -2149,11 +2157,12 @@ void CameraDevice::notifyError(uint32_t frameNumber, camera3_stream_t *stream)<br>

> > >   camera3_notify_msg_t notify = {};<br>

> > ><br>

> > >   /*<br>

> > > -  * \todo Report and identify the stream number or configuration to<br>

> > > -  * clarify the stream that failed.<br>

> > > +  * \todo Report and identify the stream number or configuration<br>

> > > +  * to clarify the stream that failed.<br>

> > >    */<br>

> > > - LOG(HAL, Error) << "Error occurred on frame " << frameNumber << " ("<br>

> > > -                 << toPixelFormat(stream->format).toString() << ")";<br>

> > > + LOG(HAL, Error)<br>

> > > +         << "Error occurred on frame " << descriptor.frameNumber_ << " ("<br>

> > > +         << toPixelFormat(descriptor.buffers_[0].stream->format).toString() << ")";<br>

> > ><br>

> > >   notify.type = CAMERA3_MSG_ERROR;<br>

> > >   notify.message.error.error_stream = stream;<br>

> > > diff --git a/src/android/camera_device.h b/src/android/camera_device.h<br>

> > > index 95d77890..325fb967 100644<br>

> > > --- a/src/android/camera_device.h<br>

> > > +++ b/src/android/camera_device.h<br>

> > > @@ -125,9 +125,12 @@ private:<br>

> > ><br>

> > >   std::vector<Camera3StreamConfiguration> streamConfigurations_;<br>

> > >   std::map<int, libcamera::PixelFormat> formatsMap_;<br>

> > > - std::vector<CameraStream> streams_;<br>

> > ><br>

> > > - std::mutex mutex_; /* Protect descriptors_ and requests_. */<br>

> > > + /* Avoid races by concurrent Camera HAL API calls. */<br>

> > > + std::mutex halMutex_;<br>

> > > +<br>

> > > + std::mutex mutex_; /* Protect streams_, descriptors_ and requests_. */<br>

> > > + std::vector<CameraStream> streams_;<br>

> > >   std::map<uint64_t, Camera3RequestDescriptor> descriptors_;<br>

> > >   std::map<uint64_t, std::unique_ptr<CaptureRequest>> requests_;<br>

> > ><br>

><br>

> --<br>

> Regards,<br>

><br>

> Laurent Pinchart<br>

</blockquote></div></div>