Generate word embeddings for YouTube comments on K-pop idols.
Word embedding projection: embeddings trained on YouTube comments on videos about Korean idols.
- Used the YouTube API to collect comments on videos about each artist.
- Loaded FastText's pretrained Korean embeddings and built the vocabulary from the youtube_comments dataset.
- Generated tensor.tsv and metadata.tsv for the Embedding Projector, as an attempt to inspect the nearest neighbors of a query.
query='장원영'
query='카리나'
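For queries like the ones above, the neighbor lookup can be approximated outside the projector with a cosine-similarity ranking. This is a minimal sketch, not the code used in this project: the function names `cosine_similarity` and `nearest_neighbors` are hypothetical, and the 3-dimensional vectors are made-up toy values rather than real FastText output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, top_k=5):
    """Return the top_k (token, similarity) pairs closest to `query`."""
    if query not in vectors:
        raise KeyError(f"query token not in vocabulary: {query}")
    query_vec = vectors[query]
    scored = [
        (token, cosine_similarity(query_vec, vec))
        for token, vec in vectors.items()
        if token != query  # skip the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only (assumption, not project data).
toy_vectors = {
    "장원영": [0.9, 0.1, 0.0],
    "아이브": [0.8, 0.2, 0.1],
    "카리나": [0.1, 0.9, 0.0],
    "에스파": [0.2, 0.8, 0.1],
}

neighbors = nearest_neighbors("장원영", toy_vectors, top_k=2)
for token, score in neighbors:
    print(f"{token}\t{score:.3f}")
```

With real embeddings, `toy_vectors` would be replaced by a token-to-vector mapping taken from the trained model.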
- Exporting the full vocabulary and all tensors from FastText's pretrained model would create files larger than 10 GB.
- Therefore, after continuing training on the youtube_comments dataset from the pretrained model, I exported only the vectors for tokens that appear in my dataset.
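The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: `export_for_projector` is a hypothetical helper, the vectors are toy values, and a plain dict stands in for the model's token-to-vector lookup. The output layout follows what the Embedding Projector accepts: one tab-separated vector per line in tensor.tsv, and one label per line (no header for a single column) in metadata.tsv.

```python
import tempfile
from pathlib import Path


def export_for_projector(vectors, dataset_vocab, out_dir):
    """Write tensor.tsv and metadata.tsv for tokens found in the dataset only.

    vectors:       mapping of token -> list of floats (stand-in for the model)
    dataset_vocab: iterable of tokens seen in the comments dataset
    Returns the list of tokens that were actually written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Keep dataset tokens that have a vector; everything else is skipped,
    # which is what keeps the exported files small.
    kept = [token for token in dataset_vocab if token in vectors]

    with open(out / "tensor.tsv", "w", encoding="utf-8") as tensor_file, \
         open(out / "metadata.tsv", "w", encoding="utf-8") as meta_file:
        for token in kept:
            row = "\t".join(f"{value:.6f}" for value in vectors[token])
            tensor_file.write(row + "\n")   # line i of tensor.tsv ...
            meta_file.write(token + "\n")   # ... matches line i of metadata.tsv
    return kept


# Toy data, invented for illustration only (assumption, not project data).
pretrained_vectors = {
    "장원영": [0.9, 0.1, 0.0],
    "카리나": [0.1, 0.9, 0.0],
    "하다": [0.3, 0.3, 0.4],   # in the model, but not in the dataset vocab
}
dataset_vocab = ["장원영", "카리나", "없는토큰"]  # last token has no vector

out_dir = tempfile.mkdtemp()
kept = export_for_projector(pretrained_vectors, dataset_vocab, out_dir)
print(kept)
```

Writing both files in the same loop keeps the row order aligned, which the projector relies on to attach each label to its vector.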
query='하니'
- '하니' is a member of the girl group NewJeans.
- However, '하니' is also a conjugated form of the verb '하다' (to do), and this query returned a disappointing result, as seen above.
- The pretrained model (with a new vocabulary built from the comments dataset) may capture a more general sense of each word, but it failed to return meaningfully similar vectors for '하니'.
- Since the purpose and the dataset are narrowly focused on a specific theme, it may be better to train word embeddings from scratch rather than starting from FastText's pretrained embeddings.