Leveraging 2D pose estimators for American Sign Language Recognition
Most deaf children born to hearing parents do not have continuous access to language, which leads to weaker short-term memory than in deaf children born to deaf parents. This deficit has serious consequences for their mental health and employment rate. To address it, prior work has explored CopyCat, a game in which children interact with virtual actors using sign language. While CopyCat has been shown to improve language generation, reception, and repetition, it relies on expensive hardware for sign language recognition. This thesis explores the feasibility of using 2D off-the-shelf camera-based pose estimators such as MediaPipe to complement sign language recognition and move towards a ubiquitous recognition framework. We compare MediaPipe with 3D pose estimators such as Azure Kinect to determine the feasibility of using off-the-shelf cameras. Furthermore, we develop and compare Hidden Markov Models (HMMs) with state-of-the-art recognition models such as Transformers to determine which model is best suited for American Sign Language recognition in a constrained environment. We find that MediaPipe marginally outperforms Kinect in various experimental settings. Additionally, HMMs outperform Transformers by 17.0% on average in recognition accuracy. Given these results, we believe that a widely deployable game using only a 2D camera is feasible.