Parse etymologies from Wiktionary #1

@skalyan91

Write a function that does the following:

  1. Take a language and a wordform (you may need to map from ISO code to language name; CLiCS might have a mapping table, otherwise use the one from Glottolog).
  2. Visit the Wiktionary page for the wordform and go to the section for the appropriate language.
  3. Go to the "Etymology" subsection (or if there are multiple etymology subsections, loop over them).
  4. Extract the first sentence, of the form "From <language name> <ancestral form>".
  5. Take the <language name> <ancestral form> pair and use it as the input to another call of the same function (i.e. use tail recursion).
  6. Keep chasing etymologies until you’ve hit a dead end. Along the way, save each etymological link to a table (with the fields "Language", "Wordform", "Source language", and "Source wordform").
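
Steps 4 to 6 could be sketched roughly as follows, assuming the etymology text has already been fetched and reduced to plain text. The names `parse_first_source`, `chase_etymology` and the `get_etymology` callback are hypothetical, and the regular expression assumes that a language name is a run of capitalised words (e.g. "Middle English", "Proto-Germanic") followed by a form that does not start with an ASCII capital; real Wiktionary entries will not always fit that shape. The sketch uses a loop instead of literal recursion, since Python does not optimise tail calls, and it remembers visited pairs so that a circular etymology cannot loop forever.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Assumed sentence shape: "From <Language Name> <form>".
# Language name: one or more capitalised words (hyphens allowed).
# Form: the next token, optionally starred, without trailing punctuation.
FROM_PATTERN = re.compile(
    r"^From\s+((?:[A-Z][\w-]*\s+)+)(\*?[^\sA-Z,.;()][^\s,.;()]*)"
)

FIELDS = ("Language", "Wordform", "Source language", "Source wordform")


def parse_first_source(etymology_text: str) -> Optional[Tuple[str, str]]:
    """Return (source language, source wordform), or None if no match."""
    match = FROM_PATTERN.match(etymology_text.strip())
    if match is None:
        return None
    language = " ".join(match.group(1).split())
    return language, match.group(2)


def chase_etymology(
    language: str,
    wordform: str,
    get_etymology: Callable[[str, str], Optional[str]],
) -> List[Dict[str, str]]:
    """Follow "From ..." links until a dead end; return one row per link.

    get_etymology is a hypothetical callback that returns the plain text of
    the Etymology subsection for (language, wordform), or None if the page
    or section does not exist.
    """
    links: List[Dict[str, str]] = []
    seen = {(language, wordform)}
    while True:
        text = get_etymology(language, wordform)
        source = parse_first_source(text) if text else None
        if source is None:
            break  # dead end: no page, no section, or unparseable sentence
        links.append(dict(zip(FIELDS, (language, wordform, *source))))
        if source in seen:
            break  # cycle guard
        seen.add(source)
        language, wordform = source
    return links
```

Passing the page lookup in as a callback keeps the parsing and chasing logic testable without network access; the actual Wiktionary fetch (steps 1 to 3) would be plugged in as `get_etymology`.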

Once we have a function that does the above, we can just loop it over all the words in NorthEuraLex.
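
The outer loop might look something like the sketch below. It assumes the NorthEuraLex words have already been loaded as (language name, wordform) pairs, and that `chase` is any function returning a list of rows with the four fields listed in step 6; `write_etymology_table` and both parameter names are hypothetical. Because many words will share ancestors, it skips links that have already been written.

```python
import csv
from typing import Callable, Dict, Iterable, List, TextIO, Tuple

FIELDS = ["Language", "Wordform", "Source language", "Source wordform"]


def write_etymology_table(
    words: Iterable[Tuple[str, str]],
    chase: Callable[[str, str], List[Dict[str, str]]],
    out: TextIO,
) -> int:
    """Write every distinct etymological link as a TSV row; return the count."""
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    seen = set()
    count = 0
    for language, wordform in words:
        for link in chase(language, wordform):
            key = tuple(link[field] for field in FIELDS)
            if key in seen:
                continue  # link already recorded via another word
            seen.add(key)
            writer.writerow(link)
            count += 1
    return count
```

Any text stream works as `out`, for example a file opened with `open("etymologies.tsv", "w", encoding="utf-8", newline="")` (the file name is only an illustration).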
