Wikipedia articles can contain slashes (/). Wikipedia accepts them in urls escaped or not, e.g.
https://en.wikipedia.org/wiki/Baltimore%2FWashington_International_Airport
and
https://en.wikipedia.org/wiki/Baltimore/Washington_International_Airport
return the same page, and neither redirects to the other.
The generator attempts to decode urls from OSM tags, and then encodes '%' again when it converts them back into urls.
My guess is that some of the tags that are not urls still have url encoding in them, but determining which are actually url-encoded and which just have % in them is a little tricky, and the generator doesn't do that.
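To illustrate how the double encoding could arise, here is a minimal Python sketch. It is not the generator's code: `reencode_percent` is a hypothetical stand-in for the step that escapes '%' when building the url, and it assumes the tag value reaches that step without having been decoded.

```python
def reencode_percent(title: str) -> str:
    """Hypothetical stand-in for the generator step that escapes '%'."""
    return title.replace("%", "%25")


# A tag value that was already url-encoded and was never decoded:
encoded_tag = "Kanngjutarm%C3%A4starens_hus"
print(reencode_percent(encoded_tag))  # Kanngjutarm%25C3%25A4starens_hus

# A title that contains a literal percent sign:
literal_tag = "100%_Banco"
print(reencode_percent(literal_tag))  # 100%25_Banco
```

Both inputs go through the same step, but only the second result is a correct url path. That is the ambiguity described above: the '%' alone does not say whether the value was already encoded.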
It looks like some of the resulting urls are encoded twice, though thankfully only a small number:
$ tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 | grep -F '%' | sort | uniq
https://de.wikipedia.org/wiki/Georg-B%25C3%25BCchner-Platz
https://de.wikipedia.org/wiki/Kontorhaus_am_J%25C3%25B6debrunnen
https://en.wikipedia.org/wiki/Brighton_%2526_Hove_Greyhound_Stadium
https://en.wikipedia.org/wiki/de:Liste_der_Kulturdenkmäler_in_Schwachhausen#0218%252CT003
https://en.wikipedia.org/wiki/McMullen%2527s_Brewery
https://en.wikipedia.org/wiki/P%25C3%25A9cs_TV_Tower
https://en.wikipedia.org/wiki/Sedbergh_People%2527s_Hall
https://en.wikipedia.org/wiki/Sight_%2526_Sound_Theatres
https://es.wikipedia.org/wiki/100%25_Banco
https://es.wikipedia.org/wiki/Ruta_de_los_D%25C3%25B3lmenes
https://FR.wikipedia.org/wiki/Maisons_industrialis%25C3%25A9es_%25C3%25A0_Meudon
https://fr.wikipedia.org/wiki/Salm_(rivi%25C3%25A8re_de_Belgique)
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
https://sv.wikipedia.org/wiki/Kungliga_Tr%25C3%25A4dg%25C3%25A5rden_3
https://sv.wikipedia.org/wiki/Sverigev%25C3%25A4ggen
https://sv.wikipedia.org/wiki/V%25C3%25A4ttern,_Storfors_kommun
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598
Of those, all except the three below are malformed:
https://es.wikipedia.org/wiki/100%25_Banco
https://ka.wikipedia.org/wiki/%25E1%2583%25AD%25E1%2583%2590%25E1%2583%25A3%25E1%2583%25AE%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A6%25E1%2583%2594%25E1%2583%259A%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25AE%25E1%2583%2598%25E1%2583%259A%25E1%2583%2598
https://xmf.wikipedia.org/wiki/%25E1%2583%2592%25E1%2583%25A3%25E1%2583%2593%25E1%2583%2590%25E1%2583%259B%25E1%2583%2590%25E1%2583%25A7%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2598%25E1%2583%25A8_%25E1%2583%25B8%25E1%2583%2590%25E1%2583%259A%25E1%2583%2590%25E1%2583%2598%25E1%2583%2591%25E1%2583%259D%25E1%2583%259C%25E1%2583%2598
Some seem to be arbitrary character data, for example:
https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus
with the extra escaped %25s removed becomes:
https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
which the browser converts to:
https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus
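The manual fix above can be sketched in Python as a heuristic: treat `%25` as a doubled escape only when it is followed by two hex digits. The name `collapse_double_encoding` and the regex are assumptions for illustration, not something the generator does today.

```python
import re
from urllib.parse import unquote

# "%25" followed by two hex digits looks like an escaped escape, e.g. "%25C3".
DOUBLE_ESCAPE = re.compile(r"%25(?=[0-9A-Fa-f]{2})")


def collapse_double_encoding(url: str) -> str:
    """Remove one layer of '%' escaping where it looks doubled (heuristic)."""
    return DOUBLE_ESCAPE.sub("%", url)


url = "https://sv.wikipedia.org/wiki/Kanngjutarm%25C3%25A4starens_hus"
once = collapse_double_encoding(url)
print(once)           # https://sv.wikipedia.org/wiki/Kanngjutarm%C3%A4starens_hus
print(unquote(once))  # https://sv.wikipedia.org/wiki/Kanngjutarmästarens_hus

# A literal percent sign is left alone because "_B" is not a hex pair:
print(collapse_double_encoding("https://es.wikipedia.org/wiki/100%25_Banco"))
```

This is only a heuristic. A title with a literal '%' directly followed by two hex digits would be rewritten incorrectly, and it would also rewrite the ka and xmf urls listed as exceptions above, so those would need separate handling.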