User:DCDuring/regex

regex help?

I was wondering how to limit a pattern match from the search box to a single heading, e.g. L2 or L3/4. My regex and general programming knowledge is very elementary. Can you point to any examples of good regexes that do what I want, or to a kind of regex capability that might work? In the simple cases I've tried, I seem to have run afoul of the fact that the regex search is "greedy".

I am optimistic that I could accomplish what I want using Perl on the xml dump, but that is less convenient for many purposes. DCDuring (talk) 18:12, 6 March 2023 (UTC)

@DCDuring I am very familiar with regexes but unfortunately not so much with the limitations of the Cirrus search box because I don't normally use it. (I typically use regex searching through the contents of either a category or all references to a given template, and if that won't work, I search through the dump.) I can definitely help you with Perl or Python regexes applied to the dump file and might be able to help you with the search box if you give me some more details: What exact pattern were you using, what did you expect to happen and what actually happened? Benwing2 (talk) 08:01, 7 March 2023 (UTC)

I've had some trouble interpreting the various fragments of documentation at mediawiki.

I was trying to show how easy it was to find HTML comments using the search box, using "insource:" and 'filters'. I thought it would be handy to show a search focused on one L2 and one L3/4.

My search line is "Pronunciation incategory:"English nouns" insource:/[=]+Pronunciation[=]+[^<]+\<!--.+--\>/"

The regex pattern is what follows "insource". I tried many variations. This search finds any HTML comment in an entry that is in Cat:English nouns that has a pronunciation section, not limited to a section. DCDuring (talk) 13:51, 7 March 2023 (UTC)
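
A hedged illustration of why this pattern is not confined to one section: `[^<]+` also matches newlines and `=` signs, so it runs past any later heading until it reaches the first `<`. The sketch below uses Python's `re` module, not the search box's own regex engine, drops the backslashes before `<` and `>`, and runs against an invented sample entry. It only demonstrates the character-class behaviour.

```python
import re

# Python rendering of the insource pattern above (backslashes before < and > dropped).
PATTERN = re.compile(r"=+Pronunciation=+[^<]+<!--.+-->")

# Hypothetical wikitext: the only HTML comment sits in the Noun section.
SAMPLE = (
    "==English==\n"
    "\n"
    "===Pronunciation===\n"
    "* {{IPA|en|/fuː/}}\n"
    "\n"
    "===Noun===\n"
    "# A sense. <!-- needs a citation -->\n"
)

match = PATTERN.search(SAMPLE)
# [^<]+ consumed the rest of the Pronunciation section and the Noun heading too.
crossed_heading = match is not None and "===Noun===" in match.group(0)
print(crossed_heading)
```

In this sample the match starts at the Pronunciation heading and ends at a comment in a different section, which is the behaviour described above.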

@DCDuring: I tried to come up with a regex to find "bor" only within an Etymology section (insource:"Etymology" incategory:"Greek lemmas" insource:/Etymology *[0-9]* *=+((?![^ -􏿿]=).)*bor/) and it didn't work. I guess the negative lookahead syntax ((?!)) is disabled in insource:// though it exists in PHP regex. Negative lookahead is the only way I know to really restrict the search to be within a section. (All this assumes the headers are not commented out and don't have HTML comments interspersed in them, which is legal MediaWiki syntax but not allowed by the style guide.) [^ -􏿿] matches ASCII control characters U+0000-U+001F, which is only newline (U+000A) and tab (U+0009) in wiki pages because all other ASCII control characters are replaced with a replacement character (�). (\n matches a literal n in insource://.) So I think it's impossible to match text only within a section with CirrusSearch. — Eru·tuon 22:51, 7 March 2023 (UTC)
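
The negated character class above can be checked in isolation. This is a minimal sketch in Python's `re` module, which is an assumption of convenience: it is not the engine behind insource://, so it shows only what the class matches, not whether the search box accepts it. The sample string is invented.

```python
import re

# Negated range: everything from space (U+0020) up to U+10FFFF is excluded,
# which leaves only U+0000 through U+001F.
CONTROL = re.compile("[^ -\U0010FFFF]")

# Hypothetical text containing one tab and one newline.
sample = "==Etymology==\tx\nFrom Latin."
found = CONTROL.findall(sample)
print(found)
```

Here the class picks out the tab and the newline and nothing else, so it can stand in for `\n` where that escape is unavailable.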

What I feared, not what I hoped, but it's wonderful to be able to stop wasting time. Thanks. DCDuring (talk) 01:47, 8 March 2023 (UTC)

@DCDuring, Erutuon It seems to me it should be possible without negative lookahead. When I created a Python regex to find 'confer' within an Etymology section, I wrote this:

Etymology( [0-9]+)?==*\n((?!=).*\n){0,20}.*[Cc]onfer\b.*

which uses negative lookahead, but you should be able to rewrite it without the negative lookahead like this:

Etymology( [0-9]+)?==*\n([^=\n].*\n|\n){0,20}.*[Cc]onfer\b.*

That is, I'm searching for 0 through 20 occurrences of a line inside an Etymology section, which consists of either (a) a character that's not an equal sign or newline followed by any number of non-newlines followed by a newline, or (b) just a newline. You don't necessarily need the {0,20}; you can use * if it doesn't choke. You have to figure out how to avoid the use of \n but it seems like you've figured that out. Benwing2 (talk) 07:02, 8 March 2023 (UTC)
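
The two patterns above can be compared side by side in Python. This is a sketch on two invented sample entries, one with "confer" inside the Etymology section and one with it only in a later section. It says nothing about whether the search box accepts either pattern.

```python
import re

# The two patterns quoted above, unchanged.
WITH_LOOKAHEAD = re.compile(
    r"Etymology( [0-9]+)?==*\n((?!=).*\n){0,20}.*[Cc]onfer\b.*"
)
WITHOUT_LOOKAHEAD = re.compile(
    r"Etymology( [0-9]+)?==*\n([^=\n].*\n|\n){0,20}.*[Cc]onfer\b.*"
)

# Hypothetical entries: "confer" inside the Etymology section ...
INSIDE = "===Etymology 1===\nFrom Latin.\n\nCompare confer and defer.\n\n===Noun===\nfoo\n"
# ... and "confer" only after the next heading.
OUTSIDE = "===Etymology===\nFrom Latin.\n\n===Noun===\nSee confer.\n"


def hits(pattern):
    """Return whether the pattern matches INSIDE and OUTSIDE, in that order."""
    return [pattern.search(text) is not None for text in (INSIDE, OUTSIDE)]


results = {"with": hits(WITH_LOOKAHEAD), "without": hits(WITHOUT_LOOKAHEAD)}
print(results)
```

On these samples both patterns match the first entry and reject the second, because the repeated line group cannot step over a line that starts with an equal sign.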

Thanks, I hope. I'll try it when I can. DCDuring (talk) 15:14, 8 March 2023 (UTC)