SAM 3D
For the first time since I began my initial model exploration for the project that would eventually become Perfect Form, I am excited about a new open source model that has been released. The SAM 3D Body models by Meta demonstrate pretty impressive pose reconstruction abilities from images. The team was kind enough to release the weights and accompanying repository, making for a very convenient onboarding experience for developers curious to try the model. As a proud member of the elite unmonetized $0MRR club, I knew I had to spend some time and GPU cycles profiling this model for my specific application of powerlifting form analysis.
After a quick glance at the code, you’ll realize that the pose estimation relies on four specialized models used in conjunction: an object detection model, a segmentation model, a field-of-view model, and finally the human pose estimation model. The data flow through these models is a bit unexpected: each frame is fed through all of the models sequentially before moving on to the next frame. You will also find that the models are fairly tightly coupled, a familiar pattern in these pose estimation repositories. All of this means that a non-trivial amount of engineering work would be required to productionize this series of models following the same paradigm as before. Of course, we are getting a bit ahead of ourselves here. The pressing question is whether the performance of this series of models is strong enough to justify that future developer effort.
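To make that data flow concrete, here is a minimal sketch of the per-frame loop as I understand it. The model objects and their call signatures are illustrative placeholders, not the actual API of the SAM 3D Body repository:

```python
# Sketch of the per-frame data flow. The four model objects and their
# interfaces are placeholders for illustration only.

def process_video(frames, detector, segmenter, fov_model, pose_model):
    """Feed every frame through all four models before moving on."""
    results = []
    for frame in frames:
        boxes = detector(frame)                  # 1. detect people
        masks = segmenter(frame, boxes)          # 2. segment each person
        intrinsics = fov_model(frame)            # 3. estimate field of view
        poses = pose_model(frame, masks, intrinsics)  # 4. reconstruct 3D pose
        results.append(poses)
    return results
```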
Let’s start with the good. These models seem to achieve much higher accuracy in reconstructing human poses in 3D space than the models currently powering Perfect Form. I actually wanted to record my workouts so that I could visualize the results, which is truthfully a new experience for me. I don’t know how much of this desire was psychological: somehow, looking at the humanoid mesh lifting the weight is infinitely more satisfying than watching myself lift it. As an example, I went to work out on Thanksgiving just because I wanted to see how the model would handle a heavy deadlift set, and I was not disappointed with the results:
Inference is performed on each frame completely independently, so no temporal consistency is enforced anywhere. With that in mind, the estimated pose is generally remarkably smooth. In general, the models do an impressive job estimating the visible keypoint positions. Here is a bench example, where we can see that the model generally produces a very stable pose:
As a reference point, here is the pose that would be extracted from the current Perfect Form models:
You can see there is significant variation in the estimated position of even fully visible keypoints due to the challenging camera angle.
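If you want to put a number on that variation rather than eyeballing it, per-keypoint frame-to-frame displacement is a quick proxy for jitter. A rough sketch, assuming the per-frame poses are stacked into a single NumPy array (the shape here is my assumption, not the repository’s output format):

```python
import numpy as np

def keypoint_jitter(keypoints):
    """Mean frame-to-frame displacement per keypoint.

    keypoints: (n_frames, n_keypoints, 3) array of estimated 3D
    positions. During a static hold, most of this motion is estimation
    noise rather than real movement, so lower is better.
    """
    steps = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1)
    return steps.mean(axis=0)  # one jitter score per keypoint
```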
While we are on the topic of performance, it is worth adding that although the visible keypoints are quite stable, occlusions often produce very chaotic estimates. This video nicely illustrates the two personas of the models:
The models perform really well, stably localizing keypoints in 3D space even while the camera is moving. This is a very impressive feat. Yet around the 10-second mark, you can clearly see the chaos emerge as the right hand gets occluded by the plate. The estimated hand position jumps around wildly until the occlusion is removed. Without any world model, physics model, or temporal information, the models simply have no choice but to guess where the hand could be located while it is not visible.
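One cheap, unvalidated mitigation would be to flag physically implausible frame-to-frame jumps and hold through them. A minimal sketch of the idea, where both the threshold value and the array shape are my assumptions:

```python
import numpy as np

def suppress_jumps(track, max_step=0.15):
    """Clamp implausible frame-to-frame jumps in one keypoint's track.

    track: (n_frames, 3) array of a single keypoint's 3D positions.
    max_step: largest plausible per-frame displacement in meters; this
    value is a guess and would need tuning to the frame rate and lift.
    """
    track = track.copy()
    for t in range(1, len(track)):
        if np.linalg.norm(track[t] - track[t - 1]) > max_step:
            # Hold the last plausible position instead of the wild guess.
            track[t] = track[t - 1]
    return track
```

Holding the last plausible position is crude; linearly interpolating once the keypoint reappears, or running a constant-velocity Kalman filter, would be the natural next steps.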
In summary, I have found the performance of these models to be very impressive. And we have only looked at the raw output produced by following along with the repository. This leaves the door open for targeted fixes to the temporal consistency and occlusion shortcomings, which would push performance even further. Of course, we have only been judging the model output visually. Certain biomechanical metrics are sensitive to even small estimation errors, so there is no guarantee that we would suddenly have access to every metric of interest. But all of the initial analysis suggests a significant jump in metrics derived from pose estimation across the board.
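To make the sensitivity point concrete, consider a joint angle, one of the simplest biomechanical metrics derived from pose. It is computed from just three keypoints, so positional error propagates directly into the angle. A minimal sketch, assuming 3D keypoints as NumPy arrays:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at keypoint b (in degrees) formed by segments b->a and b->c,
    e.g. hip-knee-ankle for knee flexion. Inputs are 3D positions."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

With short limb segments, a centimeter of error on a single keypoint can move an angle like this by several degrees, which is exactly the kind of metric that stays out of reach until the estimates are tight.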
Unfortunately, there is no free lunch here. The models require significantly more computation to produce these impressive pose estimation results. Most of the videos required 30-40 minutes of inference on an RTX 3090, compared to around 2 minutes for the series of models currently deployed to my website. Clearly, requiring 40 minutes of GPU time per video is unrealistic for both the patience of the user and the associated cost. I have some ideas for how I could reduce this to a more manageable figure, but I have not yet validated any of these speed-up opportunities to understand the time savings or the impact on accuracy. When I next have some substantial time to devote to this investigation, I will see how far I can push these models and possibly even upgrade the Perfect Form back end to run them in a serverless fashion. There is a chance that a version of these models would allow the form analysis to cross the invisible threshold in my mind beyond which I would feel good charging for the product I have built. Stay tuned.
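For what it’s worth, the ideas are the usual suspects: reorder the loop so each stage runs over the whole clip at once (keeping each model’s weights resident on the GPU), estimate the camera field of view once per video instead of once per frame (assuming a fixed lens), and drop to half precision. The sketch below only illustrates the restructured data flow, with the same placeholder interfaces as before, and again, none of it is validated:

```python
def process_video_batched(frames, detector, segmenter, fov_model, pose_model):
    """Unvalidated restructuring: run each stage over the whole clip.

    Stage-major ordering avoids cycling all four models through the GPU
    on every frame; true batching would additionally stack frames into a
    single tensor per call. All interfaces are placeholders.
    """
    # Camera intrinsics are constant for a fixed, non-zooming lens,
    # so estimate them once instead of once per frame.
    intrinsics = fov_model(frames[0])
    boxes = [detector(f) for f in frames]                      # stage 1
    masks = [segmenter(f, b) for f, b in zip(frames, boxes)]   # stage 2
    return [pose_model(f, m, intrinsics) for f, m in zip(frames, masks)]
```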