MyAgent: Advanced Open-Source Voice Analysis & Transcription Framework
Features • Architecture • Installation • Usage • API • Contribution • License
MyAgent is a comprehensive open-source framework designed for intelligent audio recording, speaker recognition, text transcription, and adaptive voice model learning. It creates a personal, searchable voice archive that allows users to retrieve and analyze their conversations for informed decision-making.
Unlike traditional voice recognition systems, MyAgent is built on a hybrid architecture:
- Local Processing on Powerful Hardware - Leverages state-of-the-art models without cloud dependency
- Mobile Integration via REST API - Ensures universal compatibility with Android and iOS devices
- Adaptive Speaker Recognition - Continuously improves accuracy through reinforcement learning
- Shareable Voice Models - Facilitates recognition across different devices without extensive retraining
Features

Voice Activity Detection (VAD)
- Automatically identifies when human speech is present
- Prevents unnecessary continuous recording
- Implemented with Silero VAD, a neural network trained for precise human voice detection (see the sketch below)
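A minimal sketch of Silero-based detection, presumably along the lines of what vad/vad_silero.py wraps; the torch.hub entry point and helper names are standard Silero VAD usage, while the audio path is illustrative:

```python
# Sketch: find speech segments with Silero VAD (standard torch.hub usage).
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("recording.wav", sampling_rate=16000)
# Returns a list of {"start": ..., "end": ...} sample offsets for speech regions
segments = get_speech_timestamps(wav, model, sampling_rate=16000)
print(segments)
```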
Noise Reduction
- Advanced audio signal filtering for more reliable transcription
- Two algorithms are available:
  - Noisereduce: removes light background noise (see the sketch below)
  - Demucs: separates voices from noise for maximum quality
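For the lighter path, noisereduce operates directly on NumPy arrays; a minimal sketch (the file names are illustrative):

```python
# Sketch: spectral noise reduction with the noisereduce library.
import noisereduce as nr
import soundfile as sf

data, rate = sf.read("noisy.wav")
# Estimates a noise profile from the signal and subtracts it spectrally
reduced = nr.reduce_noise(y=data, sr=rate)
sf.write("clean.wav", reduced, rate)
```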
Speaker Recognition
- Creates unique voice fingerprints for each speaker
- Based on SpeechBrain's ECAPA-TDNN model, capable of distinguishing voices even in noisy environments (see the sketch below)
- Continuous reinforcement: recognition becomes more accurate as users validate or correct results
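A sketch of speaker verification with SpeechBrain's pretrained ECAPA-TDNN model (the import path assumes SpeechBrain 1.x; file names are illustrative):

```python
# Sketch: check whether two recordings share a speaker with ECAPA-TDNN.
from speechbrain.inference.speaker import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
# score is a cosine-similarity tensor; prediction is True for a match
score, prediction = verifier.verify_files("enroll.wav", "test.wav")
print(score, prediction)
```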
Transcription
- Converts audio signals to text using Whisper, OpenAI's model known for accuracy even in noisy environments
- Supports two modes (a local-mode sketch follows this list):
  - Local Mode (Faster-Whisper): optimized for speed and efficiency
  - API Mode (OpenAI Whisper API): a cloud option for users preferring remote processing
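In local mode, Faster-Whisper yields timestamped segments; a minimal sketch (the model size and compute type are illustrative choices):

```python
# Sketch: local transcription with Faster-Whisper.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("recording.wav")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```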
Voice Model Sharing
- Export and share voice models between users
- Enables cross-recognition: a user can send their voice model to a friend for direct recognition without retraining (see the sketch below)
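Because ECAPA-TDNN represents each speaker as an embedding vector, sharing a voice model can amount to serializing that embedding; a hypothetical sketch (the helper names are illustrative, not MyAgent's actual API):

```python
# Hypothetical export/import of a speaker embedding as base64.
import base64
import io

import torch

def export_speaker_model(embedding: torch.Tensor) -> str:
    """Serialize a speaker embedding to a base64 string for sharing."""
    buf = io.BytesIO()
    torch.save(embedding, buf)
    return base64.b64encode(buf.getvalue()).decode("ascii")

def import_speaker_model(encoded: str) -> torch.Tensor:
    """Restore a shared speaker embedding on the receiving device."""
    return torch.load(io.BytesIO(base64.b64decode(encoded)))
```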
Local Server & Mobile API
- Functions as a local server that handles all processing and communicates with mobile devices through a REST API (a minimal route sketch follows this list)
- POST /upload: the mobile app sends an audio file for processing
- GET /transcription/{id}: retrieves the transcription and detected speakers
- POST /train_speaker: adds a new speaker's voice to improve recognition
- GET /export_model: exports and shares a voice model
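A minimal sketch of how the /upload route could look with FastAPI (the in-memory task store and handler body are illustrative assumptions, not the framework's actual implementation in api/routes.py):

```python
# Illustrative FastAPI sketch of the /upload endpoint.
import uuid

from fastapi import FastAPI, UploadFile

app = FastAPI()
tasks: dict[str, dict] = {}  # in-memory task store, for the sketch only

@app.post("/upload")
async def upload(file: UploadFile):
    task_id = f"task_{uuid.uuid4().hex[:8]}"
    tasks[task_id] = {"status": "processing", "audio": await file.read()}
    # A real server would hand the audio to a background worker that runs
    # VAD, noise reduction, speaker identification, and transcription.
    return {"id": task_id, "status": "processing"}
```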
Architecture

MyAgent is structured in a modular, extensible way to make adding new features straightforward:
```
📂 myagent-framework/
├── 📂 vad/                  # Voice Activity Detection
│   ├── vad_silero.py        # Silero-based detection
│   └── __init__.py
├── 📂 noise_reduction/      # Noise Suppression
│   ├── noisereduce.py       # Spectral noise reduction
│   ├── demucs.py            # Advanced suppression with Demucs
│   └── __init__.py
├── 📂 speaker_id/           # Speaker Recognition
│   ├── train_model.py       # Local voice model training
│   ├── recognize.py         # Speaker identification
│   ├── reinforcement.py     # Continuous model improvement
│   ├── export_model.py      # Voice model export/import
│   └── __init__.py
├── 📂 transcription/        # Audio-to-Text Transcription
│   ├── whisper_local.py     # Local transcription (Faster-Whisper)
│   ├── whisper_api.py       # OpenAI API transcription (optional)
│   └── __init__.py
├── 📂 api/                  # Mobile Communication API
│   ├── server.py            # Main API (FastAPI)
│   ├── routes.py            # Interaction endpoints
│   └── __init__.py
├── 📂 utils/                # Miscellaneous Tools
│   ├── audio_utils.py       # Audio conversion, normalization
│   ├── config.py            # Configuration file
│   └── __init__.py
├── setup.py                 # Installation with pip install .
├── README.md                # Main documentation
├── requirements.txt         # Dependencies
└── .gitignore               # Files ignored by Git
```
Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/myagent.git
cd myagent

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install .
```
Usage

```python
from myagent import VAD, NoiseReduction, SpeakerID, Transcription

# Initialize components
vad = VAD()
noise_reducer = NoiseReduction()
speaker_id = SpeakerID()
transcriber = Transcription()

# Process an audio file
audio_path = "path/to/audio.wav"
speech_segments = vad.detect(audio_path)
clean_audio = noise_reducer.process(audio_path)
speakers = speaker_id.identify(clean_audio)
transcription = transcriber.transcribe(clean_audio)

print(f"Transcription: {transcription}")
print(f"Speakers identified: {speakers}")
```
To expose the framework to mobile devices, run the API server:

```python
from myagent.api import server

# Start the server on the default port 8000
server.run()

# Or specify a custom port
server.run(port=5000)
```
API

POST /upload

Upload an audio file for processing.

Request:

```
POST /upload
Content-Type: multipart/form-data

file: [audio_file]
```

Response:

```json
{
  "id": "task_12345",
  "status": "processing"
}
```
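For example, from a Python client (the localhost URL assumes the default server from the Usage section):

```python
# Sketch: upload a recording to a locally running MyAgent server.
import requests

with open("recording.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/upload", files={"file": f})
print(resp.json())  # e.g. {"id": "task_12345", "status": "processing"}
```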
GET /transcription/{id}

Get the transcription and speaker information for a processed audio file.

Response:

```json
{
  "id": "task_12345",
  "status": "completed",
  "transcription": "Hello, this is a test message.",
  "speakers": [
    {
      "id": "speaker_1",
      "name": "John",
      "segments": [{"start": 0.0, "end": 2.5}]
    }
  ]
}
```
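Because /upload answers before processing finishes, a client typically polls this endpoint until the status becomes completed; a sketch (the URL and poll interval are illustrative):

```python
# Sketch: poll until the server has finished processing a task.
import time

import requests

task_id = "task_12345"  # as returned by POST /upload
while True:
    result = requests.get(f"http://localhost:8000/transcription/{task_id}").json()
    if result["status"] == "completed":
        break
    time.sleep(1)  # avoid hammering the server

print(result["transcription"])
for speaker in result["speakers"]:
    print(speaker["name"], speaker["segments"])
```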
POST /train_speaker

Add a new speaker to the recognition model.

Request:

```
POST /train_speaker
Content-Type: multipart/form-data

name: "John Doe"
file: [audio_file]
```

Response:

```json
{
  "speaker_id": "speaker_12345",
  "status": "trained"
}
```
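A sketch of enrolling a speaker from Python (the URL and sample file are illustrative):

```python
# Sketch: enroll a new speaker with a labeled voice sample.
import requests

with open("john_sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/train_speaker",
        data={"name": "John Doe"},
        files={"file": f},
    )
print(resp.json())  # e.g. {"speaker_id": "speaker_12345", "status": "trained"}
```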
GET /export_model

Export a trained speaker model.

Response:

```json
{
  "model_data": "base64_encoded_model",
  "speaker_id": "speaker_12345"
}
```
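A sketch of fetching and saving a model on the client side; passing the speaker as a speaker_id query parameter is an assumption, since the endpoint's parameters are not documented above:

```python
# Sketch: download a voice model and decode it to a file.
# The speaker_id query parameter is an assumption.
import base64

import requests

resp = requests.get(
    "http://localhost:8000/export_model",
    params={"speaker_id": "speaker_12345"},
)
with open("speaker_12345.model", "wb") as f:
    f.write(base64.b64decode(resp.json()["model_data"]))
```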
Contribution

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add some amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
License

This project is licensed under the MIT License; see the LICENSE file for details.
Made with ❤️ by the MyAgent Team