Case-insensitive regexp matching of national characters in Python

The goal is to get a positive match on the strings ‘Šimánek’ and ‘šimánek’ when doing a case-insensitive comparison. Sounds like an easy task, right? It turns out it’s not, because of the national characters ‘Š/š’ at the beginning of the strings. A simple:

>>> re.match(u'šimánek', u'Šimánek', re.I)

returns None. Setting the right locale or using the re.L flag doesn’t help either. After a couple of experiments, I found a way to match these strings: decode both of them to Unicode and add the re.U flag, so that re.I also applies to non-ASCII letters:

>>> re.match('šimánek'.decode('utf-8'), 'Šimánek'.decode('utf-8'), re.I | re.U)

<_sre.SRE_Match object at 0x8d82480>
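
The same idea can be wrapped in a small helper. This is only a sketch: the name match_ignore_case is made up for illustration, and it assumes both arguments are already Unicode strings rather than UTF-8 byte strings. The letters are written as escapes so the result does not depend on the encoding of the source file or the terminal.

```python
# -*- coding: utf-8 -*-
import re

def match_ignore_case(pattern, text):
    """Match `pattern` at the start of `text`, ignoring case.

    Hypothetical helper; `pattern` is still interpreted as a regular
    expression, and both arguments should be Unicode strings.
    """
    # re.U makes re.I apply to non-ASCII letters such as Š/š on Python 2.
    # On Python 3 this is already the default for str patterns, so the
    # flag is redundant there but harmless.
    return re.match(pattern, text, re.I | re.U)

lower = u'\u0161im\xe1nek'   # šimánek
upper = u'\u0160im\xe1nek'   # Šimánek

print(match_ignore_case(lower, upper) is not None)   # prints True
```

If the input arrives as UTF-8 bytes, it would need to be decoded first, as in the session above.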

Hope this helps.

