MAJOR 04 / 10

Hand Sign to Audio

Real-time sign language to speech translator with plug-and-play gesture database. MediaPipe landmark detection, dual online/offline TTS, and virtual audio routing for video calls.

MediaPipe · Python · gTTS · pyttsx3 · SQLite · Tkinter · VB-Cable
150+ Signs Tested · Dual TTS Pipeline · 21 Hand Landmarks · Plug & Play Gestures

The Challenge

Over 70 million deaf individuals worldwide rely on interpreters for daily communication. Existing solutions are either expensive, cloud-dependent, or limited to ASL. This project targets a universal, plug-and-play approach — any sign language can be added by anyone.

  • Language-agnostic — hand signs differ across ISL, ASL, BSL. Need a system where anyone can create their own sign database
  • Offline capability — must work without internet using local TTS engine
  • Desktop distribution — needs to be packaged as a .exe for non-technical users
  • Video call integration — output speech to virtual microphone for Zoom, Teams, Meet

Engineering Thinking

The core insight was that hand signs vary across languages — ISL, ASL, BSL all have different gesture patterns. Training a YOLO model would require massive per-language datasets, making it impractical.

MediaPipe solved this elegantly: it extracts 21 hand landmark coordinates without any pre-trained gesture data. The system becomes plug-and-play — users can record new gestures in minutes, save them to SQLite, and share the database files. A community could build an entire language in days.

We chose Tkinter for the UI because the goal was to distribute it as a standalone .exe file — making it accessible to non-technical deaf users without requiring Python installation.

CHOSEN

MediaPipe + SQLite gesture DB

No training needed. 21-point landmarks as features. Anyone can create/share gesture sets. True plug-and-play.

REJECTED

YOLO-based hand sign detection

Requires massive per-language training datasets. Different sign languages = different models. Not scalable for community use.

REJECTED

Web-based UI (Flask/React)

Requires server hosting. Goal was offline-first desktop app distributable as .exe for non-technical users.

Pipeline Design

┌──────────┐    ┌────────────────┐    ┌──────────────┐    ┌──────────────┐
│ Webcam   │───▶│ MediaPipe      │───▶│ Gesture DB   │───▶│ TTS Engine   │
│ Input    │    │ 21-Point Hand  │    │ (SQLite)     │    │ gTTS/pyttsx3 │
└──────────┘    │ Landmark       │    │              │    └──────┬───────┘
                └────────────────┘    │ Plug & Play  │           │
                                      │ Community    │    ┌──────┴───────┐
                                      │ Gesture DB   │    │ Speaker /    │
                                      └──────────────┘    │ VB-Cable     │
                                                          │ (Virtual Mic)│
                                                          └──────┬───────┘
                                                                 │
                                                          ┌──────┴───────┐
                                                          │ Zoom / Teams │
                                                          │ / Google Meet│
                                                          └──────────────┘

Why These Technologies

MediaPipe
21-point hand landmarks with zero training data needed. Works offline. Real-time detection at 30+ FPS.
SQLite
Portable gesture database. Share .db files to distribute sign language packs. No server needed.
gTTS + pyttsx3
Dual TTS: gTTS for high-quality online audio, pyttsx3 as offline fallback. Automatic failover.
Tkinter
Native desktop UI. Packageable as standalone .exe for non-technical users. No Python install required.
VB-Cable
Virtual audio device routing TTS output to mic input. Enables sign-to-speech in video calls.

Key Systems

MediaPipe Landmark Detection

  • 21-point hand landmark extraction per frame
  • Finger state detection (extended/curled per digit)
  • Debounced gesture matching to prevent false triggers
  • Configurable confidence threshold
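The finger-state step above can be sketched in plain Python. The landmark indices follow MediaPipe's published 21-point hand model; in the real pipeline the (x, y) pairs come from `mp.solutions.hands` each frame, and the function names here are illustrative:

```python
# Per-finger extended/curled detection on MediaPipe's 21 hand landmarks.
# Indices: wrist = 0; each finger's joints run base -> tip (e.g. index = 5-8).
# (tip, pip) landmark indices for the four non-thumb fingers:
FINGERS = {"index": (8, 6), "middle": (12, 10), "ring": (16, 14), "pinky": (20, 18)}

def finger_states(landmarks):
    """landmarks: list of 21 (x, y) tuples in image coords (y grows downward).
    Returns {finger: True if extended}, assuming an upright hand."""
    states = {}
    for name, (tip, pip) in FINGERS.items():
        # A finger is extended when its tip sits above its middle joint.
        states[name] = landmarks[tip][1] < landmarks[pip][1]
    # The thumb extends sideways: compare tip (4) vs joint (3) distance from wrist (0) on x.
    states["thumb"] = abs(landmarks[4][0] - landmarks[0][0]) > abs(landmarks[3][0] - landmarks[0][0])
    return states
```

The resulting per-digit boolean vector is what gets compared against stored gesture templates.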

Plug & Play Gesture Database

  • 100–150 signs created and tested
  • User-trainable: capture gesture → assign phrase → save
  • Shareable .db files — community can build any language
  • Import/export for gesture library distribution
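A minimal sketch of the plug-and-play store: capture a landmark template, assign a phrase, save it to a shareable .db file, and match live frames against it. The schema and matching tolerance here are illustrative assumptions, not the project's exact layout:

```python
import json
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) a gesture database; pass a file path to get a shareable .db."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS gestures (
        name TEXT PRIMARY KEY, phrase TEXT, landmarks TEXT)""")
    return db

def save_gesture(db, name, phrase, landmarks):
    """Capture gesture -> assign phrase -> save."""
    db.execute("INSERT OR REPLACE INTO gestures VALUES (?, ?, ?)",
               (name, phrase, json.dumps(landmarks)))
    db.commit()

def match_gesture(db, landmarks, tolerance=0.05):
    """Return the phrase of the closest stored template within tolerance, else None."""
    best_phrase, best_dist = None, tolerance
    for name, phrase, stored in db.execute("SELECT * FROM gestures"):
        tmpl = json.loads(stored)
        # Mean absolute coordinate difference across all 21 points.
        dist = sum(abs(a - b) for p, q in zip(landmarks, tmpl)
                   for a, b in zip(p, q)) / len(tmpl)
        if dist < best_dist:
            best_phrase, best_dist = phrase, dist
    return best_phrase
```

Because the whole library lives in one SQLite file, distributing a sign language pack is just copying that file.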

Dual TTS Pipeline

  • gTTS (Google) for online high-quality audio
  • pyttsx3 local engine as offline fallback
  • Automatic failover: online → offline on network failure
  • Configurable voice speed, pitch, and language
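The failover shape of the pipeline can be reduced to a small function. The concrete engines are injected so the logic stays testable; the gTTS/pyttsx3 wiring shown in comments is standard usage of those libraries, not code lifted from the project:

```python
def speak(text, online_tts, offline_tts):
    """Try the online engine first; fall back to the offline engine on any failure."""
    try:
        online_tts(text)
        return "online"
    except Exception:          # e.g. gTTSError when the network is down
        offline_tts(text)
        return "offline"

# Typical wiring (requires gtts / pyttsx3 installed):
#   def online(text):
#       from gtts import gTTS
#       gTTS(text=text).save("out.mp3")   # then play out.mp3
#   def offline(text):
#       import pyttsx3
#       eng = pyttsx3.init()
#       eng.say(text)
#       eng.runAndWait()
```

Catching broadly at this boundary is deliberate: any online failure, network or API, should degrade to local speech rather than silence.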

Virtual Audio Routing

  • VB-Cable routes TTS output to virtual mic input
  • Enables sign-to-speech in Zoom, Teams, Meet
  • Screen crop approach for meeting video input
  • Latency optimization for real-time conversation
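VB-Cable exposes a playback device named "CABLE Input"; anything played there reappears on the "CABLE Output" virtual microphone that Zoom/Teams/Meet can select. A sketch of picking that device, with selection kept injectable since the playback library (sounddevice here) is an assumption:

```python
def find_output_device(devices, name="CABLE Input"):
    """Return the index of the first output device whose name matches, else None.
    `devices` is a list of dicts with 'name' and 'max_output_channels' keys."""
    for idx, dev in enumerate(devices):
        if name.lower() in dev["name"].lower() and dev["max_output_channels"] > 0:
            return idx
    return None

# With sounddevice this would be used roughly as:
#   import sounddevice as sd
#   idx = find_output_device(sd.query_devices())
#   sd.default.device = (None, idx)   # (input, output): route TTS playback to VB-Cable
```

Once the default output is the virtual cable, the TTS audio lands directly on the call's mic track with no extra hops.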

What Broke & What I Learned

Challenge 01
YOLO Approach Failed — Dataset Nightmare
Initially we tried building with YOLO, but hand signs have completely different patterns across languages (ISL vs ASL vs BSL), and creating training datasets for each language was impractical. We spent weeks searching for alternatives before discovering MediaPipe's landmark-based approach, which needs zero training data.
Lesson: Sometimes the best model is no model. Feature extraction (landmarks) can outperform trained classifiers when the problem space is too diverse.
Challenge 02
Video Call Integration — Latency Killed It
For video call integration, we tried screen-sharing the entire screen and cropping the video call area to feed into MediaPipe. The approach worked, but the latency was too high — screen capture + crop + detection + TTS created a noticeable delay that broke natural conversation flow.
Lesson: Real-time pipelines need every stage optimized. A multi-step screen capture approach adds unacceptable latency for conversational use.
Challenge 03
Similar Gestures Causing False Matches
With 100+ signs, some gestures had very similar finger positions. The system would oscillate between two similar signs rapidly, causing audio stuttering. Fixed with debounced matching — a gesture must be held stable for a threshold duration before triggering audio output.
Lesson: More gestures = more collisions. Debouncing and confidence thresholds are essential in classification systems with large category counts.
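The debounce fix above can be sketched as a small state machine: a sign fires only after it has been the per-frame match for a hold duration, so two similar signs alternating frame-to-frame can no longer stutter the audio. The frame count is an illustrative default:

```python
class GestureDebouncer:
    """Emit a gesture only after it has been stable for `hold_frames` frames."""

    def __init__(self, hold_frames=15):      # ~0.5 s at 30 FPS
        self.hold_frames = hold_frames
        self.current = None
        self.count = 0

    def update(self, gesture):
        """Feed the per-frame match; returns the gesture once stable, else None."""
        if gesture != self.current:
            # Match changed: restart the hold timer.
            self.current, self.count = gesture, 0
            return None
        self.count += 1
        if self.count == self.hold_frames:   # fire exactly once per hold
            return gesture
        return None
```

A flickering match keeps resetting the counter, so it never reaches the threshold and never triggers TTS.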

What It Achieved

Outcome
150+ Signs Tested
Created and validated 100–150 sign gestures, proving the plug-and-play community model works at scale.
Outcome
Language-Agnostic System
Any sign language can be added by anyone — just record gestures and share the SQLite database file. True community-driven approach.
Outcome
Offline-First Desktop App
Works without internet via pyttsx3 fallback. Distributable as .exe for non-technical users.

Screenshots & Demo

Hand sign detection with MediaPipe landmarks
Real-Time Detection — MediaPipe Landmark Overlay
*Note: These images are AI-generated placeholders and will be replaced with real screenshots soon.
Gesture library database
Gesture Library — Plug & Play Community Database
*Note: These images are AI-generated placeholders and will be replaced with real screenshots soon.