Show HN: An adaptive Chinese reader. Only helps where you need it

yorwba · on Nov 28, 2022

The mechanic of clicking the pinyin or translation above a word to hide it for every occurrence is great and feels oddly satisfying to use, but I think the NLP quality leaves something to be desired.

I tried to construct an example sentence demonstrating multiple issues: 几百个人中没有一个愿意帮李苧。 https://www.mandopando.com/file/ffa240af-2713-4d75-b1a9-b62e... (I used the Simplified Chinese view; having to explicitly select it when the input was already Simplified Chinese was a bit weird.)

几 jī "small table" should be jǐ "a few"

个人 gě rén "individual" should be 个 gè "measure word" + 人 rén "human"

中 zhōng "China" should be "among"

个 gě "used in 自個兒|自个儿" should be gè "measure word"

李 lǐ "plum" should probably be "surname Li"

苎 zhù "Boehmeria nivea" should be 苧 níng "tangled" (苧 is an edge case I like to use for testing, see https://en.wikipedia.org/wiki/Ambiguities_in_Chinese_charact... for more)

(Not being able to select ruby text was a bit annoying when making this list.)

The translation "used in 自個兒|自个儿" makes me think you're probably using a dictionary based on CC-CEDICT, but the third tone on the 个 in 个人 suggests it's an older version with many errors. You should be able to improve the quality a bit by switching to the latest release: https://www.mdbg.net/chinese/dictionary?page=cc-cedict

Based on the pinyin for 几 and 个 I suspect you're sorting possible candidates lexicographically and picking the first (ji1 < ji3, ge3 < ge4). If you have a lot of free time you can just make a big list of the best choice for common words, or crib my work here: https://github.com/Tatoeba/sinoparserd/pull/2
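To make the suspicion concrete, here is a rough Python sketch of what I think is happening and how a curated list would fix it. The candidate table, the `PREFERRED` mapping, and both function names are made up for illustration; I don't know how the site actually stores or ranks its entries.

```python
# Hypothetical candidate readings as (numbered pinyin, gloss) pairs.
# A real system would load these from the dictionary.
CANDIDATES = {
    "几": [("ji3", "a few"), ("ji1", "small table")],
    "个": [("ge4", "measure word"), ("ge3", "used in 自個兒|自个儿")],
}

# Hand-curated best choice for common words (illustrative only).
PREFERRED = {"几": "ji3", "个": "ge4"}


def pick_lexicographic(word):
    """Suspected current behavior: take the smallest candidate in sort order."""
    return min(CANDIDATES[word])


def pick_preferred(word):
    """Use the curated reading when one exists, else fall back to sorting."""
    wanted = PREFERRED.get(word)
    for candidate in CANDIDATES[word]:
        if candidate[0] == wanted:
            return candidate
    return pick_lexicographic(word)


for word in CANDIDATES:
    print(word, pick_lexicographic(word), pick_preferred(word))
```

Because "ji1" sorts before "ji3" and "ge3" before "ge4", the first function returns exactly the wrong readings listed above, while the second returns the common ones.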

But there will always be some instances where your choice of dictionary entry will turn out to be incorrect, so I think it would be nice to have a way to see alternative possible interpretations.
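As a sketch of what I mean, assuming the reader already has a list of dictionary entries per word: return the chosen entry together with the remaining ones, so the UI can reveal them on click or hover. The `Reading` type, the `ENTRIES` table, and `interpretations` are hypothetical names, not anything from the site.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    pinyin: str
    gloss: str


# Illustrative entries for 中; a real dictionary would supply these.
ENTRIES = {
    "中": [
        Reading("zhong1", "China"),
        Reading("zhong1", "among"),
        Reading("zhong4", "to hit (the mark)"),
    ],
}


def interpretations(word, chosen_index=0):
    """Return (chosen, alternatives) so the UI can show the rest on demand."""
    readings = ENTRIES.get(word, [])
    if not readings:
        return None, []
    chosen = readings[chosen_index]
    alternatives = [r for i, r in enumerate(readings) if i != chosen_index]
    return chosen, alternatives


chosen, alternatives = interpretations("中")
print("shown:", chosen)
for alt in alternatives:
    print("alternative:", alt)
```

Even when the default pick is wrong, as with 中 "China" in my example sentence, the correct "among" is then only one click away.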