It's most likely for round-trip compatibility with another encoding. Many Unicode codepoints simply represent combinations of other codepoints. If you don't care about round-tripping, just normalize everything to NFKC or NFKD (the difference being that an accented letter like á is one codepoint in NFKC, and two codepoints, the base letter plus a combining mark, in NFKD).
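
As a rough illustration, here is a minimal sketch using Python's standard `unicodedata` module. The sample characters (á and the "fi" ligature U+FB01) are my own picks for demonstration, not necessarily the codepoint you're asking about.

```python
import unicodedata

# "á" as a single precomposed codepoint (U+00E1); sample chosen for illustration.
s = "\u00e1"

nfkc = unicodedata.normalize("NFKC", s)  # composed form
nfkd = unicodedata.normalize("NFKD", s)  # decomposed form

print([hex(ord(c)) for c in nfkc])  # ['0xe1']
print([hex(ord(c)) for c in nfkd])  # ['0x61', '0x301']

# A compatibility character: the "fi" ligature (U+FB01), another illustrative pick.
# The K forms fold it into plain letters; the non-K form NFC leaves it alone.
lig = "\ufb01"
lig_nfkc = unicodedata.normalize("NFKC", lig)
lig_nfc = unicodedata.normalize("NFC", lig)

print(lig_nfkc)        # fi
print(lig_nfc == lig)  # True
```

Note that the K forms are lossy: once the ligature is folded to "fi" you can't get the original codepoint back, which is exactly why this only makes sense if you don't need round-tripping.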