Continuation from #637 (comment) by @pannxe:
Looks pretty good to me! I think English Wiktionary for Chinese is now totally usable. If anything is left to improve, I think it is the items below. Please take a look and see if you agree.
1 — Pronunciation: handle duplication generated by Wiktionary's zh-pron
If I understand correctly, each phonetic element is entered manually once and then expanded automatically by the template.
This results in the same phonetic element being rendered differently and thus having different `tags` in the Wiktextract JSONL data.[^1] For example, in the word 不安, the Hakka section contains only one example from the Sixian dialect, but it is rendered twice.
The two renderings come from two different list elements (the one shown by default and the one you have to click `more ↓` to see).
For the same reason, Mandarin Pinyin contains many duplicates.
Theoretically we can take a look at the generator script (which is open source) and reverse the pattern. The script is written in PHP, which I am not familiar with, so I'll take a look when I have some free time.
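Until the generator pattern can be reversed, a simpler post-processing pass might already help. Below is a minimal sketch, assuming each pronunciation entry is a dict with a `zh-pron` string and a `tags` list; the function name `merge_duplicate_sounds` and the sample values are illustrative only, not taken from the actual data.

```python
def merge_duplicate_sounds(sounds, key="zh-pron"):
    """Merge entries that share the same pronunciation string, unioning their tags.

    Assumption: each entry is a dict like {"zh-pron": "...", "tags": [...]}.
    Entries without `key` (e.g. IPA or audio entries) are passed through unchanged.
    """
    merged = {}   # pronunciation string -> merged entry
    result = []   # preserves first-seen order
    for sound in sounds:
        value = sound.get(key)
        if value is None:
            result.append(sound)
            continue
        if value not in merged:
            entry = {key: value, "tags": list(sound.get("tags", []))}
            merged[value] = entry
            result.append(entry)
        else:
            tags = merged[value]["tags"]
            for tag in sound.get("tags", []):
                if tag not in tags:
                    tags.append(tag)
    return result


# Made-up sample: one reading rendered twice with different tags.
sample = [
    {"zh-pron": "put-ôn", "tags": ["Hakka", "Sixian", "PFS"]},
    {"zh-pron": "put-ôn", "tags": ["Hakka", "Sixian", "Miaoli"]},
    {"ipa": "/x/"},
]
deduped = merge_duplicate_sounds(sample)
```

Note that this also merges identical strings that belong to different lects, so a real implementation might want to include the lect tag in the merge key.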
2 — Examples: maybe also put the language/dialect and writing system in the `<small></small>` too?
Just to be more consistent with Wiktionary.
3 — More robust handling of the Synonyms section
The Chinese Synonyms section, unlike those of other languages, can contain a table, which is currently not handled and so produces a large concatenation of nonsense. Here is an example from the word 「我」 (I, me), which I would argue might have the largest number of synonyms.
Synonyms: 吾, 我, 余, 予, 台, 朕 for emperors, 臣, 愚 humble, 我, 本人, 鄙人 humble, 在下 humble, 不才 humble, 我, 我, 我, 我, 我, 咱, 我, 我, 俺, 我, 俺, 我, 俺, 我, 俺, 咱, 我, 我, 我, 咱, 我, 我, 我, 我, 我, 我, 我, 俺, 我, 我, 俺, 我, 俺, 我, 俺, 我, 俺, 我, 我, 俺, 我, 俺, 我, 俺, 我, 俺, 俺, 我, 俺, 我, 俺, 我, 我, 俺, 我, 俺, 我, 我, 俺, 我, 俺, 我, 俺, 我, 我, 我, 俺 †, 我, 我, 我, 我, 俺, 咱, 我, 俺, 我, 俺, 我, 俺, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 頑, 俺, 我, 我, 我, 我, 我, 俺, 呣, 我, 我, 俺, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 𠊎, 我, 我, 我, 我, 我, 我, 我, 我, 我, 儂, 儂, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我仂, 我, 我濃, 阿, 我郎, 阿仂, 我, 阿仂, 我, 阿, 巷, 𠊎, 𠊎, 𠊎, 我, 𠊎, 我, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 𠊎, 我, 𠊎, 𠊎, 𠊎, 我, 我, 阿, 我, 黨, 卬, 我儂, 我, 俺 †, used only by females, 我, 俺, 我, 我, 俺, 我, 俺, 我, 我, 我, 俺, 我, 俺, 我, 俺, 我, 俺, 我, 我, 俺, 我, 俺, 我, 我, 俺, 我, 奴, 我, 我, 阮, 遮人, 遮的人, 我, 我, 阮, 我, 我, 我, 我, 阮, 遮人, 遮的人, 我, 我, 我, 我, 我, 阮 GT, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 我, 儂 polite/humble, 我, 我, 儂 polite/humble, 我, 㑚, 伉, 我, 我, 我, 吾, 我, 阿拉, 我儂, 唔, 唔儂, 我, 我, 奴, 吾, 吾奴, 我, 我, 我奴, 活儂, 活奴, 我, 我, 是我, 我, 是我, 什我, 我, 是我, 我, 我儂, 像我, 像我, 伢, 我俚, 我, 動我, 我儂, 我, 我儂, 阿儂, 阿, 儂, 我, 我, 我, 卬, 我, 我, † - as an attributive, used without 的; GT - General Taiwanese (no specific region identified), 吾儕, 吾等, 我等, 吾曹, 我曹, 我輩, 吾輩, 我們, 我們, 姆們, 我們, 我們, 俺們, 我們, 我們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 俺們, 我們, 我們, 我們, 我們, 我們, 我們, 俺們, 我們, 俺們, 俺們, 我們, 我們, 俺們, 俺這夥, 我們, 我們, 俺們, 我們, 我們, 我們, 俺們, 我們, 俺們, 我的, 我們, 我家, 我們, 俺們, 我們, 我們 literary, 我們, 我們, 俺的, 俺, 我的, 我們, 我們, 我們, 我的, 𠊎的, 我們, 我的, 我們, 我們, 我們, 俺們, 俺幾個, 我們, 卬們, 阿們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們, 我們兒, 我們, 我們, 我哋, 我們, 我們, 我們, 我們, 我們, 我們, 們, 我輩兒, 我們, 我們, 我恁, 我倈, 我們, 我們, 我們, 我家, 我們, 我家, 我哋, 我哋, 我俚, 我哋, 我哋, 我哋, 我啲, 我班人, 我哋, 我哋, 我哋, 我哋, 我哋, 我哋, 我哋, 我哋, 我俚, 偔, 我哋, 我哋, 偔, 偔, 偔, 我喊齊, 我哋, 我哋, 我哋, 我哋, 偔, 我哋, 我哋, 我哋, 我人, 俺, 我哋, 我哋, 我哋, 我個俚, 我等, 
我們, 我多, 我俚, 巷俚, 𠊎等人, 𠊎兜人, 𠊎儕, 𠊎兜人, 我咧, 我大家, 我茶家, 我俚, 我咧, 吾兜, 我俚, 我齊家, 𠊎兜, 吾兜, 𠊎兜, 吾哋, 吾哋, 𠊎兜, 我哋, 𠊎等, 𠊎俚, 我根, 阿俚, 𠊎俚, 𠊎兜, 𠊎兜儕, 𠊎儕們, 𠊎儕, 𠊎等人, 𠊎兜, 𠊎大全, 𠊎人, 𠊎多人, 𠊎底, 𠊎班人, 𠊎郎人, 𠊎子人, 𠊎兜, 𠊎等, 𠊎兜人, 𠊎兜儕, 𠊎這兜, 𠊎等, 𠊎等人, 𠊎兜, 𠊎等, 𠊎兜人, 𠊎兜儕, 吾等, 吾這兜, 吾這兜人, 𠊎人, 𠊎哋, 𠊎俚, 我們, 𠊎人, 𠊎大家人, 𠊎哋, 𠊎啲人, 𠊎兜, 𠊎兜, 吾兜, 我儕, 吾儕, 𠊎兜儕, 我人, 我人, 我人, 我大家, 俺家, 刷俺, 我拉, 阿啊, 我些人, 我們, 俺們, 我家, 我家們家, 我們, 我們, 我們, 我們, 俺們, 俺們, 我家, 我們, 我們, 我們, 我們, 俺們, 我們, 俺們, 我們, 我們, 俺們, 我們, 俺們, 我們, 我人, 我夥人, 俺叢人, 我人, 我夥人, 我多人, 我各儂, 儂家各儂, 儂家, 儂家, 儂家各儂, 我各儂, 我儕, 我儕儂, 我儂, 我各儂, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 阮, 我儂, 我儂, 阮, 阮儂, 阮, 我儂, 我夥, 阮儂, 阮, 阮, 阮, 阮儂, 我儂, 我儂, 我儂, 我儂, 我儂, 我輩, 㑚輩, 滾, 我哋, 我哋, 𠊎人, 𠊎仔, 𠊎個人, 我伲 dated, 伲 dated, 我㑚, 我俚, 伲, 我己, 我們, 阿拉, 象拉, 我倈, 我郎, 昂, 我道, 我們, 我屋裡, 我人, 我俺, 我俚, 卬俚, 俺, 吾, 余, 予, 台, 朕 for emperors, 臣, 愚 humble, 本人, 鄙人 humble, 在下 humble, 不才 humble, Beijing, Taiwan, Langfang, Chengde, Ulanhot, Tongliao, Chifeng, 咱, Hulunbuir, Heihe, Qiqihar, Harbin, Jiamusi, Baicheng, Changchun, Tonghua, Shenyang, Jinzhou, Malaysia, Singapore, Olginsky, Tianjin, Tangshan, Cangzhou, Baoding, Shijiazhuang, Lijin, Weifang, Weifang, Changle, Shouguang, Rizhao, Wulian, Jinan, Dalian, Dandong, Yantai, Yantai, Qingdao, Weifang, Changyi, Gaomi, Zhucheng, Anqiu, Linqu, Qingzhou, Luoyang, Lingbao, Jining, Wanrong, Linfen, Shangqiu, Yuanyang, Zhengzhou, Kaifeng, Xinyang, Ankang, Baihe, Xi'an, 俺 †, Baoji, Tongxin, Guyuan, Tianshui, Xining, Yanqi, Xuzhou, Xuzhou, Pizhou, Suining, Xinyi, Fengxian, Suqian, Lianyungang, Donghai, Fuyang, Bengbu, Sokuluk, Yinchuan, Zhongwei, Lanzhou, Dunhuang, Hami, Ürümqi, Chengdu, Chengdu, Nanchong, Nanbu, Dazhou, Hanyuan, Xichang, Zigong, Chongqing, Wuhan, Yichang, Xiangyang, Tianmen, Lhasa, Guiyang, Zunyi, Bijie, Liping, Zhaotong, Dali, Kunming, Mengzi, Guilin, Guilin, Guanyang, Lipu, Pingle, Yangshuo, Liuzhou, Nanning, Nanning, Nanning, Nanning, Binyang, Shanglin, Hechi, Jishou, Changde, 頑, Xiangtan, Ziyang, Hanzhong, Dagudi, Reshuitang, Mae Salong, Mae Sai, Nanjing, Yangzhou, Yangzhou, Baoying, 
Gaoyou, Yizheng, Taizhou, Taixing, Taizhou, Jingjiang, Zhenjiang, Jurong, Lianyungang, Guanyun, Guannan, 呣, Huai'an, Huai'an, Lianshui, Xuyi, Jinhu, Xinghua, Nantong, Rugao, Rudong, Hai'an, Yancheng, Dongtai, Sheyang, Funing, Jianhu, Xiangshui, Shuyang, Sihong, Anqing, Wuhu, Hefei, Chuzhou, Huanggang, Hong'an, Qinzhou, Guangzhou, Hong Kong, Hong Kong, Macau, Guangzhou, Guangzhou, Guangzhou, Guangzhou, Foshan, Foshan, Foshan, Foshan, Foshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhongshan, Zhuhai, Zhuhai, Zhuhai, Jiangmen, 𠊎, Jiangmen, Taishan, Kaiping, Enping, Heshan, Dongguan, Shenzhen, Shenzhen, Qingyuan, Fogang, Yingde, Yangshan, Lianshan, Lianzhou, Shaoguan, Shaoguan, Renhua, Lechang, Zhaoqing, Sihui, Guangning, Deqing, Huaiji, Fengkai, 儂, Yunfu, Xinxing, Luoding, Yunan, Yangjiang, Maoming, Lianjiang, Wuchuan, Nanning, Wuzhou, Yulin, Hepu, Hepu, Guiping, Mengshan, Guigang, Beiliu, Baise, Bobai, Lingshan, Pubei, Qinzhou, Beihai, Beihai, Beihai, Ningming, Hengzhou, Hezhou, Fangchenggang, Danzhou, Kuala Lumpur, Penang, Singapore, Ho Chi Minh City, Móng Cái, Nanchang, Nanchang, Anyi, 我仂, Lushan, Pengze, Duchang, 我濃, Wuning, Poyang, Yugan, 阿, 我郎, Wannian, Hengfeng, 阿仂, Yanshan, Leping, Yichun, Yifeng, Gao'an, Fengxin, Shanggao, Wanzai, Fengcheng, Xinyu, Fuzhou, Fuzhou, Nancheng, Nanfeng, Yihuang, Lichuan, Chongren, Pingxiang, 巷, Lianhua, Ji'an, Yongfeng, Taihe, Xiajiang, Yongxin, Yingtan, Guixi, Jianning, Jinxian, Jinxi, Le'an, Guangchang, Anfu, Suichuan, Wan'an, Jing'an, Zhangshu, Xingan, Fenyi, Meixian, Xingning, Huizhou, Huizhou, Huizhou, Huiyang, Huidong, Huidong, Dongguan, Longmen, Longmen, Boluo, Shenzhen, Guangzhou, Zhongshan, Zhongshan, Wuhua, Wuhua, Wuhua, Wuhua, Wuhua, Heyuan, Zijin, Zijin, Longchuan, Longchuan, Heping, Lianping, Lianping, Wengyuan, Nanxiong, Xinfeng, Xinfeng, Lianshan, Liannan, 
Guangzhou, Jiexi, Luhe, Zhao'an, Changting, Shanghang, Longyan, Wuping, Wuping, Wuping, Liancheng, Ninghua, Qingliu, Yudu, Ningdu, Ruijin, Shicheng, Shangyou, Tonggu, Ganzhou, Ganzhou, Dayu, Dingnan, Longnan, Xunwu, Huichang, Chongyi, Xingguo, Miaoli, Pingtung, Hsinchu County, Taichung, Hsinchu County, Yunlin, Yangxi, Yangchun, Xinyi, Xinyi, Gaozhou, Maoming, Huazhou, Lianjiang, Lianjiang, Luchuan, Luchuan, Bobai, Bobai, Bobai, Bobai, Beiliu, Mashan, Sabah, Senai, Sungai Tapang, Batu Kawa, Singkawang, Singapore, Bangkok, Jixi, Shexian, Wuyuan, Fuliang, Dexing, Chun'an, Jiande, 黨, 卬, Jiande, Taiyuan, used only by females, Yangyuan, Datong, Xinzhou, Lüliang, Changzhi, Linhe, Hohhot, Erenhot, Pingshan, Zhangjiakou, Handan, Linzhou, Suide, Jian'ou, Songxi, Zhenghe, Jianyang, Fuzhou, 奴, Fuzhou, Fuqing, Pingtan, Yongtai, Minqing, Gutian, Pingnan, Luoyuan, Fu'an, Ningde, Xiapu, Zherong, Shouning, Zhouning, Fuding, Taishun, Cangnan, Singapore, Sitiawan, Xiamen, 遮人, 遮的人, Xiamen, Quanzhou, Jinjiang, Nan'an, Shishi, Hui'an, Anxi, Yongchun, Dehua, Zhangzhou, Zhangzhou, Zhangzhou, Hua'an, Pinghe, Zhangpu, Yunxiao, Zhao'an, Dongshan, Taipei, 阮 GT, Taipei, New Taipei, New Taipei, New Taipei, Kaohsiung, Kaohsiung, Kaohsiung, Kaohsiung, Yilan, Changhua, Taichung, Taichung, Tainan, Taitung, Hsinchu, Kinmen, Penghu, Penang, Singapore, Manila, Longyan, Zhangping, Datian, Shunchang, Pingnan, Guilin, Chaozhou, Shantou, Shantou, Shantou, Jieyang, Haifeng, Johor Bahru, Singapore, Batam, Pontianak, Wenchang, 儂 polite, humble, Haikou, Singapore, Putian, 㑚, Putian, Putian, Putian, Xianyou, Xianyou, Xianyou, Yong'an, Sanming, Nanping, Shaowu, Guangze, Jiangle, Mingxi, Zhongshan, Nanning, Guilin, Jingning, Guzhang, Yuanling, Shanghai, 阿拉, Shanghai, Shanghai, 唔, 唔儂, Shanghai, Suzhou, Suzhou, 吾奴, Wuxi, Changshu, Nantong, Jiaxing, 我奴, Jiashan, Pinghu, Haining, Tongxiang, Haiyan, 活儂, 活奴, Changzhou, Danyang, Nanjing, Huzhou, 是我, Changxing, Anji, Hangzhou, Hangzhou, Hangzhou, 什我, Hangzhou, Hangzhou, 
Tonglu, Tonglu, Shaoxing, Zhuji, Shengzhou, Xinchang, Ningbo, 像我, Ningbo, Ningbo, Yuyao, Cixi, Xiangshan, Zhoushan, 伢, Taizhou, Tiantai, Xianju, Sanmen, Linhai, Wenling, 我俚, Wenzhou, Yueqing, Yongjia, Rui'an, Pingyang, Wencheng, Lishui, Qingtian, Jinyun, 動我, Wuyi, Songyang, Yunhe, Jingning, Longquan, Quzhou, Suichang, Jiangshan, Changshan, Kaihua, Longyou, Jinhua, Jinhua, Yiwu, Yongkang, Pujiang, Dongyang, Wuyi, Lanxi, Shangrao, 阿儂, Shangrao, Yushan, Changsha, Loudi, Shuangfeng, Hengyang, Jiangyong, used without 的, GT, 吾儕, 吾等, 我等, 吾曹, 我曹, 我輩, 吾輩, 姆們, 俺這夥, 我的, 我家, 我們 literary, 俺的, Dingbian, 𠊎的, 俺幾個, Masanchin, 卬們, 阿們, Bayanhot, 我們兒, 我哋, 們, 我輩兒, 我恁, 我倈, Hong Kong, Hong Kong, Hong Kong, 我啲, 我班人, 偔, 我喊齊, 我個俚, 我多, 巷俚, 𠊎等人, 𠊎兜人, 𠊎儕, 我咧, 我大家, 我茶家, 吾兜, 我齊家, 𠊎兜, 吾哋, 𠊎等, 𠊎俚, 我根, 阿俚, 𠊎兜儕, 𠊎儕們, 𠊎大全, 𠊎多人, 𠊎底, 𠊎班人, 𠊎郎人, 𠊎子人, 𠊎這兜, 吾這兜, 吾這兜人, 𠊎哋, 𠊎大家人, 𠊎啲人, Sabah, Kuching, Huangshan, Xiuning, Yixian, Qimen, 俺家, 刷俺, 我拉, Jingde, 阿啊, Shitai, 我些人, Pingyao, 我家們家, Taibus, Baotou, Dongsheng, Haibowan, Jian'ou, Zhenghe, Wuyishan, Pucheng, Matsu, Medan, Zhangping, Leizhou, 㑚輩, 滾, Sanming, Shunchang, 我伲 dated, 伲 dated, 我㑚, 伲, 我己, 象拉, 昂, 我道, 我屋裡, Xiangtan, 我俺, 卬俚, 予, 余, 俺, 僕, 仆, 區區, 区区, 台, 吾, 妾, 孤, 我佬, 朕, 爺, 爷, 窩, 窝, 老子
I think this can be done in a pre-processing step, basically by grouping them by `tags`. But again, I think these are currently not extracted correctly by Wiktextract itself. If you scroll to about the last 3/4 of the text, you'll see a list of language/dialect names which should be interpreted as `tag`, not `word`.
Maybe report this to the Wiktextract team? They mentioned that Wiktionary's modules for handling Chinese have changed a lot since their last update.
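The grouping step could look roughly like the sketch below. It assumes the synonyms arrive as dicts with a `word` string and an optional `tags` list (that is, it assumes the tag extraction problem described above has been fixed upstream); `group_synonyms_by_tags` is a hypothetical helper name.

```python
from collections import defaultdict


def group_synonyms_by_tags(synonyms):
    """Group synonym words by their tag combination, dropping repeated words.

    Assumption: each synonym is a dict like {"word": "...", "tags": [...]}.
    Returns a dict mapping a tuple of tags to the unique words carrying it,
    in first-seen order.
    """
    groups = defaultdict(list)
    for syn in synonyms:
        key = tuple(syn.get("tags", []))
        word = syn["word"]
        if word not in groups[key]:
            groups[key].append(word)
    return dict(groups)


# Made-up sample in the assumed shape.
sample = [
    {"word": "我", "tags": ["Beijing"]},
    {"word": "俺", "tags": ["Beijing"]},
    {"word": "我", "tags": ["Beijing"]},
    {"word": "吾"},
]
grouped = group_synonyms_by_tags(sample)
```

Grouping by the full tag tuple keeps the rendering compact; grouping by each individual tag instead would be a small change to the key.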
4 — Add descendants section
This is more of a feature request from me, but the biggest advantage of using Wiktionary over other dictionaries is that it shows descendants in other languages.
This is especially helpful in a Chinese dictionary, as many Chinese words have descendants in Japanese (which I'm actively studying), Korean, Vietnamese, and, to a lesser extent, Tai languages (of which I am a native speaker). Other languages that could benefit from this addition include, for example, Latin, Greek, Sanskrit, English, and Arabic.
Wiktionary descendant sections are usually just a simple list (including for Chinese), and Wiktextract extracts them pretty neatly, so I think that once implemented, this can be used for any language.
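As a rough idea of how little rendering this might need, here is a sketch that turns a flat list of descendants into an indented list. The `depth` and `text` field names and the sample entries are assumptions for illustration; the real Wiktextract output should be checked before relying on them.

```python
def render_descendants(descendants, indent="  "):
    """Render descendants as an indented bullet list.

    Assumption: each item is a dict with a 1-based "depth" and a "text" string.
    """
    lines = []
    for item in descendants:
        depth = max(item.get("depth", 1), 1)
        lines.append(f"{indent * (depth - 1)}- {item['text']}")
    return "\n".join(lines)


# Made-up sample in the assumed shape.
sample = [
    {"depth": 1, "text": "Sino-Xenic (不安):"},
    {"depth": 2, "text": "Japanese: 不安 (fuan)"},
    {"depth": 2, "text": "Vietnamese: bất an"},
]
rendered = render_descendants(sample)
```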
5 — Examples: more mature ruby text handling
This mainly applies to Japanese. I'd like to ask your opinion on how to render ruby text. I'm thinking of just adding another line of text in which each kanji is replaced with its ruby reading (maybe put the readings in `[ ]` so we know which reading belongs to which kanji). For example, again for the word 「不安(ふあん)」. And, as Wiktionary does not explicitly state the romanization system being used for any word I have come across so far, perhaps remove `roman` entirely?
Now:
- 不(ふ)安(あん)はなかった。
- roman: Fuan wa nakatta.
- english: There was no uncertainty on this matter.
- english: There will come times when we have to stop and look back, times when we cry, when we laugh… But even in the darkest days when we’re unsure, when we’re confused — slowly but surely, we’ll realize what matters most to us. This is the green season of our lives — our ‟Salad Days”…
- 立(た)ち止(ど)まったりふり返(かえ)ったり、泣(な)いたり笑(わら)ったり……不(ふ)安(あん)と混(こん)沌(とん)の中(なか)でもがきながら——少(すこ)しずつ、大(だい)事(じ)なモンが見(み)えてくれ。そんな青(あお)の時(じ)代(だい)——〝SALAD(サラダ) DAYS(デイズ)〞——…
- roman: Tachidomattari furikaettari, naitari warattari……Fuan to konton no naka de mogaki nagara——Sukoshizutsu, daiji na mon ga mietekure. Sonna ao no jidai——〝Sarada Deizu〞——…
Proposed change:
- 不安はなかった。
- [ふ][あん]はなかった。
- roman: Fuan wa nakatta.
- english: There was no uncertainty on this matter.
- 立ち止まったりふり返ったり、泣いたり笑ったり……不安と混沌の中でもがきながら——少しずつ、大事なモンが見えてくれ。そんな青の時代——〝SALAD DAYS〞——…
- [た]ち[ど]まったりふり[かえ]ったり、[な]いたり[わら]ったり……[ふ][あん]と[こん][とん]の[なか]でもがきながら——[すこ]しずつ、[だい][じ]なモンが[み]えてくれ。そんな[あお]の[じ][だい]——〝[サラダ] [デイズ]〞
- Tachidomattari furikaettari, naitari warattari……Fuan to konton no naka de mogaki nagara——Sukoshizutsu, daiji na mon ga mietekure. Sonna ao no jidai——〝Sarada Deizu〞——…
- English: There will come times when we have to stop and look back, times when we cry, when we laugh… But even in the darkest days when we’re unsure, when we’re confused — slowly but surely, we’ll realize what matters most to us. This is the green season of our lives — our ‟Salad Days”…
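The bracketed reading line could be produced along these lines. This is only a sketch: it assumes the example comes with its plain text plus an ordered list of `(base, reading)` ruby pairs, and `kana_line` is a hypothetical helper name.

```python
def kana_line(text, ruby):
    """Replace each ruby base in `text` with its reading wrapped in [ ].

    Assumption: `ruby` is a list of (base, reading) pairs in the order the
    bases appear in `text`. Pairs whose base cannot be found are skipped.
    """
    out = []
    pos = 0
    for base, reading in ruby:
        idx = text.find(base, pos)
        if idx == -1:
            continue
        out.append(text[pos:idx])   # untouched text before the base
        out.append(f"[{reading}]")  # bracketed reading replaces the base
        pos = idx + len(base)
    out.append(text[pos:])          # remainder after the last base
    return "".join(out)


line = kana_line("不安はなかった。", [("不", "ふ"), ("安", "あん")])
```

Searching from the current position rather than from the start keeps repeated kanji (such as a second 不安 later in the sentence) matched to the right reading.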
I don't know much about real-world dictionaries; perhaps they have some sort of convention for this?
Another idea is to give the ruby text a different, less noticeable color and make it smaller.
Also, I just noticed that Japanese examples can be unsorted, so maybe sort them so that they always come in `"text"-"kana text"-"romaji"-"english"` order?
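A fixed ordering like that is straightforward with a stable sort. The sketch below assumes each example has already been split into `(kind, content)` pairs; the kind names and `sort_example_fields` are hypothetical.

```python
# Desired display order; kinds not listed here are placed at the end.
FIELD_ORDER = ["text", "kana", "roman", "english"]


def sort_example_fields(fields):
    """Sort (kind, content) pairs into FIELD_ORDER.

    sorted() is stable, so pairs of the same kind keep their relative order.
    """
    rank = {name: i for i, name in enumerate(FIELD_ORDER)}
    return sorted(fields, key=lambda pair: rank.get(pair[0], len(FIELD_ORDER)))


unsorted_example = [
    ("english", "There was no uncertainty on this matter."),
    ("text", "不安はなかった。"),
    ("roman", "Fuan wa nakatta."),
]
ordered = sort_example_fields(unsorted_example)
```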