This issue follows on from a short thread in a related side project: nomadkaraoke/python-audio-separator#8 (comment)
Problem: lyrics-transcriber currently transcribes the given audio file using whisper-timestamped and writes the detected words directly to a lyrics file, with no cleanup or modification.
This results in highly variable accuracy for the lyrics output, as Whisper is far from perfect at correctly detecting lyrics from music audio.
For example, compare these two synced lyrics videos:
- Synced manually by me in MidiCo, using the correct lyrics. The manually synced .lrc file is here for comparison.
- Generated using lyrics-transcriber - pure Whisper transcription only, taking the output .lrc file and loading that into MidiCo
Fortunately, for the majority of songs, as long as we know the artist and title, we can download lyrics from the internet and hopefully use these to correct the lyrics detected by Whisper.
I've already implemented fetching lyrics from both Genius and Spotify.
This issue is to track the implementation of the hard part - using those lyrics to correct the detected lyrics.
Before discussing ways to approach this, it's worth being aware of the biggest limitations first:
1 - Lyrics from the internet are often wrong in various ways
Common examples include:
- Missing repetitions of chorus/refrain or bridge sections of songs
- Missing intro or outro sections
- Incorrect words, e.g. where the person typing up the lyrics has misheard them
- Incorrect words, e.g. where the "official" lyrics don't match what the artist actually sang in the commercial recording
2 - Whisper-timestamped transcriptions are almost always wrong in various places
- It will almost always get some words wrong, depending on the singer's style, accent, background music, recording quality, etc. This is especially likely when the lyrics include names or less common words, and the results are sometimes hilarious to read, e.g. mishearing "Whitehall" as "Phytol" in one song I recently created a karaoke version of 😄
- While it usually gets the timestamps of words correct (even if the word itself is wrong), there are still some issues which may need to be solved in the whisper-timestamped project itself, e.g. it commonly gets the timestamp of the very first word wrong, and occasionally starts sentences too soon.
- Fortunately, it at least provides a confidence score for each detected word, which we can hopefully use to improve the transcription by replacing low confidence words with more-likely words from the internet lyrics
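As a rough illustration of using those confidence scores, the sketch below flags the words that would be candidates for replacement. It assumes the transcription result is a list of segments, each with a `words` list whose entries carry `text`, `start`, `end` and `confidence` keys; that layout, the `low_confidence_words` helper and the 0.5 threshold are assumptions for illustration, and the sample data is hand-written rather than real output.

```python
from typing import Dict, List

# Hypothetical cut-off; the right value would need tuning on real songs.
LOW_CONFIDENCE = 0.5


def low_confidence_words(segments: List[Dict], threshold: float = LOW_CONFIDENCE) -> List[Dict]:
    """Return every word whose confidence falls below the threshold."""
    flagged = []
    for segment in segments:
        for word in segment.get("words", []):
            if word["confidence"] < threshold:
                flagged.append(word)
    return flagged


# Hand-written sample data for illustration, not real transcription output.
sample_segments = [
    {
        "text": "marching down Phytol",
        "words": [
            {"text": "marching", "start": 12.10, "end": 12.60, "confidence": 0.91},
            {"text": "down", "start": 12.60, "end": 12.90, "confidence": 0.88},
            {"text": "Phytol", "start": 12.90, "end": 13.50, "confidence": 0.31},
        ],
    }
]

flagged = low_confidence_words(sample_segments)
print([word["text"] for word in flagged])
```

Because the flagged entries keep their `start` and `end` values, a corrected word could in principle reuse the original timing.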
So, given these challenges, I'm holding out hope for the following approach (roughly):
- Take the internet lyrics and split them into lines (from both Genius and Spotify, if both were successfully fetched)
- For each line returned from the whisper transcription, find a couple of "anchor words" which have a high confidence score
- Attempt to match up the line with a lyrics line from the internet lyrics using these "anchor words"
- Attempt to replace the low confidence (less than 50%?) words with words from the matched internet lyrics line, potentially replacing the entire line if there are multiple low confidence words in the line or if the number of words doesn't match up
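The steps above could be sketched roughly as follows. This is only a minimal illustration under several assumptions: transcribed words are represented as `(text, confidence)` tuples, the 0.8 anchor threshold and 0.5 low-confidence threshold are guesses, and all function names (`split_lines`, `find_matching_line`, `correct_line`) are hypothetical rather than part of lyrics-transcriber. The sample lyrics are made up around the "Whitehall" example.

```python
import re
from typing import List, Optional, Tuple

Word = Tuple[str, float]  # (text, confidence); assumed representation

ANCHOR_CONFIDENCE = 0.8  # assumed threshold for "anchor words"
LOW_CONFIDENCE = 0.5     # assumed threshold for words worth replacing


def normalise(word: str) -> str:
    """Lower-case a word and strip punctuation so comparisons are forgiving."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def split_lines(lyrics: str) -> List[List[str]]:
    """Split fetched lyrics into non-empty lines, each a list of words."""
    return [line.split() for line in lyrics.splitlines() if line.strip()]


def find_matching_line(words: List[Word], reference: List[List[str]]) -> Optional[List[str]]:
    """Pick the reference line sharing the most high-confidence anchor words."""
    anchors = {normalise(text) for text, conf in words if conf >= ANCHOR_CONFIDENCE}
    anchors.discard("")
    best, best_hits = None, 0
    for line in reference:
        hits = len(anchors & {normalise(w) for w in line})
        if hits > best_hits:
            best, best_hits = line, hits
    return best


def correct_line(words: List[Word], reference: List[List[str]]) -> List[str]:
    """Replace low-confidence words, or the whole line, from the matched line."""
    match = find_matching_line(words, reference)
    if match is None:
        return [text for text, _ in words]  # no anchors matched; keep as-is
    low = [i for i, (_, conf) in enumerate(words) if conf < LOW_CONFIDENCE]
    if len(match) != len(words) or len(low) > 1:
        return list(match)  # word counts differ or several doubtful words
    corrected = [text for text, _ in words]
    for i in low:
        corrected[i] = match[i]
    return corrected


# Made-up reference lyrics and transcription, for illustration only.
reference_lines = split_lines("We were marching down Whitehall\n\nSinging all the way home")
transcribed = [("We", 0.95), ("were", 0.9), ("marching", 0.92), ("down", 0.85), ("Phytol", 0.3)]
print(correct_line(transcribed, reference_lines))
```

Note that this sketch ignores timestamps: when a whole line is swapped in and the word counts differ, the timing of each replacement word would still need to be worked out separately.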
This is a super rough set of thoughts though, and I'm sure the reality of this approach will become apparent when I attempt to implement it ;)