A little regex help

Hey Code, I'm working on a new Sanskrit declension module that can work across different scripts, and I was having some issues formulating a regex given Lua's weak pattern syntax. I want to check whether a lemma is monosyllabic ending in ī, but to check for an onset cluster in Devanagari encoding, you need the ् character between each consonant in the cluster, and I can't figure out how to check for:

  • zero or more (consonants each followed by ्)
  • followed by a consonant with ी

or approximately '^(?:[Deva consonants]्)*[Deva consonants]ी$'

This would be easier if I could use a Kleene star after a non-capturing group, but Lua doesn't allow that. Do you have any advice for this? Thanks.

JohnC5 02:49, 23 November 2015

You could just capture it and then ignore the capture?

CodeCat 15:47, 23 November 2015

Annoyingly, according to the MW docs, Lua patterns do not allow quantifiers (greedy or otherwise) on capture groups.

JohnC5 16:28, 23 November 2015

Then I guess you'll have to make do with several matches, with increasing numbers of preceding consonants.

CodeCat 16:55, 23 November 2015

I think so too. I believe the maximum onset size in Sanskrit is 3. It is also an interesting question how many words possess this edge case. I'll look into it. Thanks for the advice, though it's odd there isn't an easy way to do this.

JohnC5 17:33, 23 November 2015
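
The "several matches, with increasing numbers of preceding consonants" idea from this thread could be sketched roughly as follows. This is only an illustration under stated assumptions, not the module's actual code: the function name is_monosyllabic_in_long_i and the constant names are made up here, the maximum onset size of 3 is the figure believed (not verified) above, and the consonant class is approximated as the whole block U+0915 (क) to U+0939 (ह), which ignores nukta forms and includes a few letters not used in Sanskrit. It is written with UTF-8 byte escapes so that it runs in plain Lua; inside a Scribunto module one would more likely use mw.ustring.find with the literal characters.

```lua
-- Devanagari pieces as UTF-8 byte sequences (decimal escapes), so that the
-- byte-oriented string library of plain Lua can match them.
local CONSONANT = "\224\164[\149-\185]" -- approximation: U+0915 (क) .. U+0939 (ह)
local VIRAMA    = "\224\165\141"        -- U+094D (्)
local LONG_I    = "\224\165\128"        -- U+0940 (ी), the vowel sign for ī

-- Assumption from the thread: a Sanskrit onset has at most 3 consonants.
local MAX_ONSET = 3

-- Lua cannot repeat a group, so build one anchored pattern per onset size:
-- 0, 1 or 2 "consonant + virama" pairs, then the final consonant and ī.
local patterns = {}
for extra = 0, MAX_ONSET - 1 do
    patterns[#patterns + 1] = "^"
        .. (CONSONANT .. VIRAMA):rep(extra)
        .. CONSONANT .. LONG_I .. "$"
end

-- Hypothetical helper: true if the lemma is (C + virama)* C ī within the
-- assumed onset limit, false otherwise.
local function is_monosyllabic_in_long_i(lemma)
    for _, pattern in ipairs(patterns) do
        if lemma:find(pattern) then
            return true
        end
    end
    return false
end
```

With a UTF-8 source file, lemmas such as धी, श्री and स्त्री should be accepted, while नदी should be rejected because its two consonants are not joined by ्. If the onset limit turns out to be larger, only MAX_ONSET needs to change.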