
Correct the synced lyrics heuristically #1

@beveradb

Description

This issue is a follow-on from this short thread in a related side project: nomadkaraoke/python-audio-separator#8 (comment)

Problem: lyrics-transcriber currently transcribes the given audio file using whisper-timestamped and writes the detected words to a lyrics file directly with no cleanup or modification.
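For context, the word-level output that currently gets written straight to the lyrics file comes from a call along these lines (a rough sketch, not the exact code in lyrics-transcriber; the model size and file name are placeholders):

```python
import whisper_timestamped as whisper

# Placeholders: the model size and audio path used by lyrics-transcriber may differ.
audio = whisper.load_audio("song.mp3")
model = whisper.load_model("medium")
result = whisper.transcribe(model, audio)

# whisper-timestamped returns segments containing word-level entries,
# each with the word text, start/end timestamps and a confidence score.
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["text"], word["start"], word["end"], word["confidence"])
```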

This results in highly variable accuracy in the lyrics output, as Whisper is far from perfect at correctly detecting lyrics in music audio.
For an example, compare these two synced lyrics videos:

Fortunately, for the majority of songs, as long as we know the artist and title, we can download lyrics from the internet and hopefully use them to correct the lyrics detected by Whisper.

I've already implemented fetching lyrics from both Genius and Spotify.
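As an illustration, the Genius side can be as simple as the sketch below (using the lyricsgenius package; the token and song details are placeholders, and the actual implementation in lyrics-transcriber may differ):

```python
import lyricsgenius

# "GENIUS_API_TOKEN" is a placeholder; a real client access token is required.
genius = lyricsgenius.Genius("GENIUS_API_TOKEN")
song = genius.search_song("Song Title", "Artist Name")
genius_lyrics = song.lyrics if song else None
```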

This issue is to track the implementation of the hard part: using those lyrics to correct the detected lyrics.

Before discussing ways to approach this, it's worth being aware of the biggest limitations:

1 - Lyrics from the internet are often wrong in various ways
Common examples include:

  • Missing repetitions of chorus/refrain or bridge sections of songs
  • Missing intro or outro sections
  • Incorrect words, e.g. where the person typing up the lyrics has misheard something
  • Incorrect words, e.g. where the "official" lyrics don't match what the artist actually ended up singing in the commercial recording

2 - Whisper-timestamped transcriptions are almost always wrong in various places

  • It will almost always have some words which are wrong, depending on the singer's style, accent, background music, recording quality, etc. This is especially likely when the lyrics include names or less common words, and the results are sometimes hilarious to read, e.g. mishearing "Whitehall" as "Phytol" in one song I recently created a karaoke version of 😄
  • While it usually gets word timestamps right (even if the word itself is wrong), there are still some issues with this which may need to be solved in the whisper-timestamped project itself, e.g. it commonly gets the timestamp of the very first word wrong, and occasionally starts sentences too soon.
  • Fortunately, it at least provides a confidence score for each detected word, which we can hopefully use to improve the transcription by replacing low-confidence words with more likely words from the internet lyrics
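For example, assuming the `result` structure from the transcription sketch above, picking out the words we'd want to reconsider could be as simple as:

```python
# Hypothetical threshold; the "less than 50%?" figure mentioned below is just a starting guess.
LOW_CONFIDENCE = 0.5

suspect_words = [
    word
    for segment in result["segments"]
    for word in segment["words"]
    if word["confidence"] < LOW_CONFIDENCE
]
```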

So, given these challenges, I'm holding out hope for the following approach (roughly; a code sketch follows the list):

  • Take the internet lyrics and split them into lines (both Genius and Spotify, if both were successfully fetched)
  • For each line returned from the Whisper transcription, find a couple of "anchor words" which have a high confidence score
  • Attempt to match the line up with a line from the internet lyrics using these "anchor words"
  • Attempt to replace the low-confidence (less than 50%?) words with words from the matched internet lyrics line, potentially replacing the entire line if there are multiple low-confidence words in the line or if the number of words doesn't match up
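Very roughly, that heuristic might look something like the sketch below. The thresholds, helper names and the line-matching strategy (anchor-word overlap, falling back to difflib similarity) are placeholders of my own rather than a settled design, and it only corrects word text, ignoring timestamps entirely:

```python
import re
from difflib import SequenceMatcher

# Hypothetical thresholds -- would need tuning against real songs.
LOW_CONFIDENCE = 0.5
HIGH_CONFIDENCE = 0.9


def normalise(word: str) -> str:
    """Lowercase and strip punctuation so "Hello," matches "hello"."""
    return re.sub(r"[^\w']", "", word.lower())


def correct_segment(segment: dict, internet_lines: list[str]) -> list[str]:
    """Return corrected word texts for one whisper-timestamped segment.

    segment["words"] is assumed to be a list of dicts with "text" and
    "confidence" keys, as produced by whisper-timestamped.
    """
    words = segment["words"]
    anchors = {normalise(w["text"]) for w in words if w["confidence"] >= HIGH_CONFIDENCE}

    def score(line: str) -> float:
        # Prefer the internet lyrics line sharing the most anchor words;
        # if there are no anchors, fall back to fuzzy whole-line similarity.
        line_words = {normalise(w) for w in line.split()}
        if anchors:
            return len(anchors & line_words)
        transcribed = " ".join(w["text"] for w in words)
        return SequenceMatcher(None, transcribed.lower(), line.lower()).ratio()

    best_line = max(internet_lines, key=score) if internet_lines else None
    if best_line is None:
        return [w["text"] for w in words]

    internet_words = best_line.split()
    low_conf_count = sum(1 for w in words if w["confidence"] < LOW_CONFIDENCE)

    # If several words are suspect, or the word counts disagree, replace the
    # whole line rather than trying to patch individual words.
    if low_conf_count > 1 or len(internet_words) != len(words):
        return internet_words

    # Otherwise swap out only the individual low-confidence words.
    return [
        internet_words[i] if w["confidence"] < LOW_CONFIDENCE else w["text"]
        for i, w in enumerate(words)
    ]
```

One obvious consequence of replacing a whole line is that the word count can change, so Whisper's per-word timestamps no longer map one-to-one onto the replacement words; deciding how to redistribute timings in that case is part of the hard problem this issue is tracking.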

This is a super rough set of thoughts though, and I'm sure the reality of this approach will become apparent when I attempt to implement it ;)
