Week 03: February 12, 2025

This week, I set out to find a more accurate model for selfie segmentation and came across a paper describing a model called BBox-Mask-Pose. It seemed promising because it combines pose detection and image segmentation for improved accuracy. I attempted to implement the model myself, but it proved challenging: no pre-built version was available, so I had to train it from scratch. Additionally, I hadn't yet set up my machine for machine learning tasks, so configuring the necessary packages and environments was difficult (though it did give me some valuable experience with Conda).
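For reference, the environment setup that took me so long boils down to a handful of Conda commands. This is a hypothetical sketch: the environment name and package list below are placeholders, not the exact requirements of the BBox-Mask-Pose repository.

```shell
# Hypothetical sketch of a Conda workflow; the env name and packages are
# placeholders, not the actual BBox-Mask-Pose requirements.
conda create -n segmentation python=3.10   # create an isolated environment
conda activate segmentation                # switch into it
pip install torch torchvision              # install ML dependencies inside it
conda env export > environment.yml         # snapshot the env for reproducibility
```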

After investing a significant amount of time without much progress, I raised an issue on the GitHub repository, asking for tips on running the model and for benchmarks on its speed. Unfortunately, the maintainers informed me that the model isn't fast enough for a live webcam feed.

My search continued until I stumbled upon an article by Google researchers on TensorFlow's website discussing image segmentation. The article mentions the Selfie Segmentation model I used last week and another model called BlazePose. The key distinction is that Selfie Segmentation is optimized for close-range tasks like separating a face from a background during video calls, whereas BlazePose is designed for full-body segmentation. This explains why Selfie Segmentation struggled to accurately segment arms and hands.

The article links to a demo site, and on it I found that BlazePose is significantly more accurate. It offers three model sizes (lite, full, and heavy), and even the heavy setting still runs at 60fps on my laptop. Satisfied with the performance, I cloned the demo and started poking around the code. So far I've set it up to run on my machine, but I have yet to make any changes of my own.
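For anyone retracing these steps, the three model sizes are exposed as a config option in the TensorFlow.js pose-detection package the demo builds on. This is a sketch based on my reading of that package's API (`@tensorflow-models/pose-detection`); the option names should be double-checked against the demo's own source before relying on them.

```javascript
// Sketch of configuring BlazePose through the TF.js pose-detection API.
// Option names follow @tensorflow-models/pose-detection; treat the exact
// values as assumptions to verify against the demo's source code.

// Build the detector config for a given BlazePose variant:
// 'lite' | 'full' | 'heavy' (accuracy goes up, speed goes down).
function buildBlazePoseConfig(modelType) {
  return {
    runtime: 'tfjs',          // or 'mediapipe' for the MediaPipe runtime
    modelType,                // which of the three model sizes to load
    enableSegmentation: true, // also return a person-segmentation mask
  };
}

// In the browser demo, the detector is then created and run roughly like:
// const detector = await poseDetection.createDetector(
//   poseDetection.SupportedModels.BlazePose,
//   buildBlazePoseConfig('heavy'),
// );
// const poses = await detector.estimatePoses(video); // video: HTMLVideoElement
// When enableSegmentation is true, each pose carries a segmentation mask.
```

The interesting part for my project is `enableSegmentation`: it means the same detector that tracks full-body pose can also hand back the mask that Selfie Segmentation was failing to get right on arms and hands.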


©Aditi Gupta
New York University
Integrated Design & Media (IDM) Graduate Thesis