MAJOR 04 / 10

Hand Sign to Audio

Real-time sign language to speech translator with plug-and-play gesture database. MediaPipe landmark detection, dual online/offline TTS, and virtual audio routing for video calls.

MediaPipe · Python · gTTS · pyttsx3 · SQLite · Tkinter · VB-Cable
150+ Signs Tested · Dual TTS Pipeline · 21 Hand Landmarks · Plug & Play Gestures

The Challenge

Over 70 million deaf individuals worldwide rely on interpreters for daily communication. Existing solutions are either expensive, cloud-dependent, or limited to ASL. This project targets a universal, plug-and-play approach — any sign language can be added by anyone.

  • Language-agnostic — hand signs differ across ISL, ASL, BSL. Need a system where anyone can create their own sign database
  • Offline capability — must work without internet using local TTS engine
  • Desktop distribution — needs to be packaged as a .exe for non-technical users
  • Video call integration — output speech to virtual microphone for Zoom, Teams, Meet

Engineering Thinking

The core insight was that hand signs vary across languages — ISL, ASL, BSL all have different gesture patterns. Training a YOLO model would require massive per-language datasets, making it impractical.

MediaPipe solved this elegantly: it extracts 21 hand landmark coordinates without any pre-trained gesture data. The system becomes plug-and-play — users can record new gestures in minutes, save them to SQLite, and share the database files. A community could build an entire language in days.

We chose Tkinter for the UI because the goal was to distribute it as a standalone .exe file — making it accessible to non-technical deaf users without requiring Python installation.

CHOSEN

MediaPipe + SQLite gesture DB

No training needed. 21-point landmarks as features. Anyone can create/share gesture sets. True plug-and-play.

REJECTED

YOLO-based hand sign detection

Requires massive per-language training datasets. Different sign languages = different models. Not scalable for community use.

REJECTED

Web-based UI (Flask/React)

Requires server hosting. Goal was offline-first desktop app distributable as .exe for non-technical users.

Pipeline Design

┌──────────┐    ┌────────────────┐    ┌──────────────┐    ┌──────────────┐
│ Webcam   │───▶│ MediaPipe      │───▶│ Gesture DB   │───▶│ TTS Engine   │
│ Input    │    │ 21-Point Hand  │    │ (SQLite)     │    │ gTTS/pyttsx3 │
└──────────┘    │ Landmark       │    │              │    └──────┬───────┘
                └────────────────┘    │ Plug & Play  │           │
                                      │ Community    │    ┌──────┴───────┐
                                      │ Gesture DB   │    │ Speaker /    │
                                      └──────────────┘    │ VB-Cable     │
                                                          │ (Virtual Mic)│
                                                          └──────┬───────┘
                                                                 │
                                                          ┌──────┴───────┐
                                                          │ Zoom / Teams │
                                                          │ / Google Meet│
                                                          └──────────────┘

Why These Technologies

MediaPipe
21-point hand landmarks with zero training data needed. Works offline. Real-time detection at 30+ FPS.
SQLite
Portable gesture database. Share .db files to distribute sign language packs. No server needed.
gTTS + pyttsx3
Dual TTS: gTTS for high-quality online audio, pyttsx3 as offline fallback. Automatic failover.
Tkinter
Native desktop UI. Packageable as standalone .exe for non-technical users. No Python install required.
VB-Cable
Virtual audio device routing TTS output to mic input. Enables sign-to-speech in video calls.

Key Systems

MediaPipe Landmark Detection

  • 21-point hand landmark extraction per frame
  • Finger state detection (extended/curled per digit)
  • Debounced gesture matching to prevent false triggers
  • Configurable confidence threshold
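The finger-state step above can be sketched in plain Python. The landmark indices follow MediaPipe's published 21-point hand model; in the real pipeline the (x, y) pairs come from `mp.solutions.hands` each frame, and the function names here are illustrative:

```python
# Per-finger extended/curled detection on MediaPipe's 21 hand landmarks.
# Indices: wrist = 0; each finger's joints run base -> tip (e.g. index = 5-8).
# (tip, pip) landmark indices for the four non-thumb fingers:
FINGERS = {"index": (8, 6), "middle": (12, 10), "ring": (16, 14), "pinky": (20, 18)}

def finger_states(landmarks):
    """landmarks: list of 21 (x, y) tuples in image coords (y grows downward).
    Returns {finger: True if extended}, assuming an upright hand."""
    states = {}
    for name, (tip, pip) in FINGERS.items():
        # A finger is extended when its tip sits above its middle joint.
        states[name] = landmarks[tip][1] < landmarks[pip][1]
    # The thumb extends sideways: compare tip (4) vs joint (3) distance from wrist (0) on x.
    states["thumb"] = abs(landmarks[4][0] - landmarks[0][0]) > abs(landmarks[3][0] - landmarks[0][0])
    return states
```

The resulting per-digit boolean vector is what gets compared against stored gesture templates.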

Plug & Play Gesture Database

  • 100–150 signs created and tested
  • User-trainable: capture gesture → assign phrase → save
  • Shareable .db files — community can build any language
  • Import/export for gesture library distribution
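A minimal sketch of the plug-and-play store: capture a landmark template, assign a phrase, save it to a shareable .db file, and match live frames against it. The schema and matching tolerance here are illustrative assumptions, not the project's exact layout:

```python
import json
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) a gesture database; pass a file path to get a shareable .db."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS gestures (
        name TEXT PRIMARY KEY, phrase TEXT, landmarks TEXT)""")
    return db

def save_gesture(db, name, phrase, landmarks):
    """Capture gesture -> assign phrase -> save."""
    db.execute("INSERT OR REPLACE INTO gestures VALUES (?, ?, ?)",
               (name, phrase, json.dumps(landmarks)))
    db.commit()

def match_gesture(db, landmarks, tolerance=0.05):
    """Return the phrase of the closest stored template within tolerance, else None."""
    best_phrase, best_dist = None, tolerance
    for name, phrase, stored in db.execute("SELECT * FROM gestures"):
        tmpl = json.loads(stored)
        # Mean absolute coordinate difference across all 21 points.
        dist = sum(abs(a - b) for p, q in zip(landmarks, tmpl)
                   for a, b in zip(p, q)) / len(tmpl)
        if dist < best_dist:
            best_phrase, best_dist = phrase, dist
    return best_phrase
```

Because the whole library lives in one SQLite file, distributing a sign language pack is just copying that file.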

Dual TTS Pipeline

  • gTTS (Google) for online high-quality audio
  • pyttsx3 local engine as offline fallback
  • Automatic failover: online → offline on network failure
  • Configurable voice speed, pitch, and language
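The failover shape of the pipeline can be reduced to a small function. The concrete engines are injected so the logic stays testable; the gTTS/pyttsx3 wiring shown in comments is standard usage of those libraries, not code lifted from the project:

```python
def speak(text, online_tts, offline_tts):
    """Try the online engine first; fall back to the offline engine on any failure."""
    try:
        online_tts(text)
        return "online"
    except Exception:          # e.g. gTTSError when the network is down
        offline_tts(text)
        return "offline"

# Typical wiring (requires gtts / pyttsx3 installed):
#   def online(text):
#       from gtts import gTTS
#       gTTS(text=text).save("out.mp3")   # then play out.mp3
#   def offline(text):
#       import pyttsx3
#       eng = pyttsx3.init()
#       eng.say(text)
#       eng.runAndWait()
```

Catching broadly at this boundary is deliberate: any online failure, network or API, should degrade to local speech rather than silence.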

Virtual Audio Routing

  • VB-Cable routes TTS output to virtual mic input
  • Enables sign-to-speech in Zoom, Teams, Meet
  • Screen crop approach for meeting video input
  • Latency optimization for real-time conversation
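VB-Cable exposes a playback device named "CABLE Input"; anything played there reappears on the "CABLE Output" virtual microphone that Zoom/Teams/Meet can select. A sketch of picking that device, with selection kept injectable since the playback library (sounddevice here) is an assumption:

```python
def find_output_device(devices, name="CABLE Input"):
    """Return the index of the first output device whose name matches, else None.
    `devices` is a list of dicts with 'name' and 'max_output_channels' keys."""
    for idx, dev in enumerate(devices):
        if name.lower() in dev["name"].lower() and dev["max_output_channels"] > 0:
            return idx
    return None

# With sounddevice this would be used roughly as:
#   import sounddevice as sd
#   idx = find_output_device(sd.query_devices())
#   sd.default.device = (None, idx)   # (input, output): route TTS playback to VB-Cable
```

Once the default output is the virtual cable, the TTS audio lands directly on the call's mic track with no extra hops.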

What Broke & What I Learned

Challenge 01
YOLO Approach Failed — Dataset Nightmare
Initially we tried building with YOLO, but hand signs have completely different patterns across languages (ISL vs ASL vs BSL), and creating training datasets for each language was impractical. We spent weeks searching for alternatives before discovering MediaPipe's landmark-based approach, which needs zero training data.
Lesson: Sometimes the best model is no model. Feature extraction (landmarks) can outperform trained classifiers when the problem space is too diverse.
Challenge 02
Video Call Integration — Latency Killed It
For video call integration, we tried screen-sharing the entire screen and cropping the video call area to feed into MediaPipe. The approach worked, but the latency was too high — screen capture + crop + detection + TTS created a noticeable delay that broke natural conversation flow.
Lesson: Real-time pipelines need every stage optimized. A multi-step screen capture approach adds unacceptable latency for conversational use.
Challenge 03
Similar Gestures Causing False Matches
With 100+ signs, some gestures had very similar finger positions. The system would oscillate between two similar signs rapidly, causing audio stuttering. Fixed with debounced matching — a gesture must be held stable for a threshold duration before triggering audio output.
Lesson: More gestures = more collisions. Debouncing and confidence thresholds are essential in classification systems with large category counts.
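The debounce fix above can be sketched as a small state machine: a sign fires only after it has been the per-frame match for a hold duration, so two similar signs alternating frame-to-frame can no longer stutter the audio. The frame count is an illustrative default:

```python
class GestureDebouncer:
    """Emit a gesture only after it has been stable for `hold_frames` frames."""

    def __init__(self, hold_frames=15):      # ~0.5 s at 30 FPS
        self.hold_frames = hold_frames
        self.current = None
        self.count = 0

    def update(self, gesture):
        """Feed the per-frame match; returns the gesture once stable, else None."""
        if gesture != self.current:
            # Match changed: restart the hold timer.
            self.current, self.count = gesture, 0
            return None
        self.count += 1
        if self.count == self.hold_frames:   # fire exactly once per hold
            return gesture
        return None
```

A flickering match keeps resetting the counter, so it never reaches the threshold and never triggers TTS.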

What It Achieved

Outcome
150+ Signs Tested
Created and validated 100–150 sign gestures, proving the plug-and-play community model works at scale.
Outcome
Language-Agnostic System
Any sign language can be added by anyone — just record gestures and share the SQLite database file. True community-driven approach.
Outcome
Offline-First Desktop App
Works without internet via pyttsx3 fallback. Distributable as .exe for non-technical users.

Screenshots & Demo

Hand sign detection with MediaPipe landmarks
Real-Time Detection — MediaPipe Landmark Overlay
*Note: These images are AI-generated placeholders and will be replaced with real screenshots soon.
Gesture library database
Gesture Library — Plug & Play Community Database
*Note: These images are AI-generated placeholders and will be replaced with real screenshots soon.