Some songs have duplicate rows (due to artist aliases?) #5

@colinmorris

In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in azlyrics_lyrics_l.csv under the artist name "Liz Phair". 13 are in azlyrics_lyrics_p.csv under "Phair, Liz".

There are 11 songs that appear in both files. As far as I can tell, the lyrics, song URL, and song title are identical between the two files; the only field that differs is the artist name.

I guess this is ultimately an issue of jank on the Azlyrics side, since the site's artist directory has separate listings for 'Liz Phair' and 'Phair, Liz' (which both lead to the same URL, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.

I did a quick analysis and found 6,513 total rows with duplicate song urls.
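
For anyone who wants to reproduce the count or work around it, here is a rough sketch of the kind of check I mean. It is not the pipeline's actual code: the column names (`ARTIST_NAME`, `SONG_NAME`, `SONG_URL`), the helper names, and the sample rows are all assumptions made up for illustration, and "rows with duplicate song urls" is taken to mean every row whose URL occurs more than once.

```python
import csv
import io
from collections import Counter

# Hypothetical column name; adjust to match the real CSV header.
URL_COLUMN = "SONG_URL"


def count_duplicate_rows(rows, key=URL_COLUMN):
    """Count every row whose key value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return sum(n for n in counts.values() if n > 1)


def dedupe_rows(rows, key=URL_COLUMN):
    """Keep the first row seen for each key value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Made-up sample standing in for the concatenated per-letter CSV files.
sample = io.StringIO(
    "ARTIST_NAME,SONG_NAME,SONG_URL\n"
    "Liz Phair,Song A,https://example.com/a.html\n"
    '"Phair, Liz",Song A,https://example.com/a.html\n'
    "Liz Phair,Song B,https://example.com/b.html\n"
)
rows = list(csv.DictReader(sample))

print(count_duplicate_rows(rows))  # 2
print(len(dedupe_rows(rows)))      # 2
```

Keying on the song URL rather than on artist plus title sidesteps the alias problem, since the two listings point at the same page. Which artist name survives depends only on row order here, so a real fix would probably want an explicit rule for picking the canonical name.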

Labels: bug (Something isn't working)
