Forum Archive

Regex oddity

userista

When I use the re module in Pythonista, I get some weird behavior. Specifically, the re.sub method doesn't seem to work as documented. Here's my code with sample text. The expression has been tested in multiple Python regex "testers" (e.g. http://regex101.com/r/yP7bA9/1).

gist here

import re

scores = [[u'Orlando 81   Washington 90 (3:55 IN 4TH)'], [u'Atlanta 59   Cleveland 87 (3:51 IN 3RD)'], [u'Utah 62   Toronto 69 (3:59 IN 3RD)'], [u'Indiana 46   Chicago 42 (0:03 IN 2ND)'], [u'Detroit 50   Memphis 51 (0:18 IN 2ND)'], [u'Minnesota 22   Dallas 28 (0:00 IN 1ST)'], [u'Brooklyn at Portland (10:00 PM ET)'], [u'San Antonio at Sacramento (10:00 PM ET)'], [u'Charlotte at Golden State (10:30 PM ET)'], [u'Phoenix at LA Clippers (10:30 PM ET)']]

for score in scores:
    print score
    print re.sub('([a-zA-Z^ ]+?)(\\d+|at)\\s+?([a-zA-Z^ ]+?)(\\d+)?\\s+?(\\(.+\\))\\s+?', 'whatever replacement', score[0])

The sample text is below (there are extra spaces at the end of some lines). It's an array of arrays:

Orlando 38   Washington 46 (1:36 IN 2ND) 
Atlanta 25   Cleveland 37 (0:28 IN 1ST) 
Utah 25   Toronto 23 (0:00 IN 1ST) 
Indiana at Chicago (8:00 PM ET) 
Detroit at Memphis (8:00 PM ET) 
Minnesota at Dallas (8:30 PM ET) 
Brooklyn at Portland (10:00 PM ET) 
San Antonio at Sacramento (10:00 PM ET) 
Charlotte at Golden State (10:30 PM ET) 
Phoenix at LA Clippers (10:30 PM ET)

The weird thing is that this seems to work when it's not in a for loop...

JonB

Part of the problem is that your regex101 expression does not match the one in your gist. You are missing a few ?'s.

In general, it is easier to use raw strings for your expressions, that is, prefixed by an r, since you can paste the expression directly from other tools without needing to escape them.
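
A small sketch of the raw-string point, using only the standard re module. The shortened pattern and the sample text are made up for illustration; they are not the full expression from the gist.

```python
import re

# Ordinary string literal: every regex backslash has to be doubled.
escaped = '(\\d+|at)\\s+?(\\(.+\\))'

# Raw string literal: the same expression, typed the way a regex tester shows it.
raw = r'(\d+|at)\s+?(\(.+\))'

# Both literals produce the same pattern text.
same = (escaped == raw)
groups = re.search(raw, 'at (10:00 PM ET)').groups()

print(same)
print(groups)
```

The two literals compile to the same regex; the raw one is simply easier to compare against what was pasted from the tester.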

Also, personally I find it easier to debug regular expressions by starting with one of the match or findall methods and building up the expression as I go, using implicit string concatenation across multiple lines with a comment on each group, e.g.:

re.findall((r'([a-zA-Z^ ]+?)'  # first team name: letters, spaces and carets
            r'(\d+?|at)'       # either the score, or the word "at"
            # .....
            ), score[0])

Then you can comment out entire lines to make sure each group works before enabling the next.
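
For reference, here is roughly what that layout could look like written out in full. This is only a sketch: it assumes the expression is the one copied from regex101 later in this thread, and the per-group comments are one reading of what each group is for.

```python
import re

# Adjacent string literals inside parentheses are joined into one pattern,
# so each piece can sit on its own line with its own comment.
pattern = (
    r'([a-zA-Z^ ]+?)'  # group 1: first team name (letters, spaces, carets)
    r'(\d+?|at)'       # group 2: first score, or the word "at"
    r'\s+'             # gap between the two halves
    r'([a-zA-Z^ ]+?)'  # group 3: second team name
    r'(\d+)?'          # group 4: second score (assumed absent before the game starts)
    r'\s+?'            # gap before the parenthesised part
    r'(\(.+\))'        # group 5: game clock or start time
    r'\s*?'            # optional trailing whitespace
)

live = re.findall(pattern, u'Orlando 81   Washington 90 (3:55 IN 4TH)')
upcoming = re.findall(pattern, u'Brooklyn at Portland (10:00 PM ET)')

print(live)
print(upcoming)
```

Commenting out the later lines (and keeping the closing parenthesis) leaves a shorter but still valid pattern, which is what makes testing one group at a time practical.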

Anyway, here's your code. All I did was copy the expression from regex101 and paste it as a raw string. Well, I also added a findall printout and showed how you can use your groups in a sub call.
Guessing at what you are doing, I suspect sub might not be what you want... I'm thinking findall might be what you really want, since it breaks each line up into a table.

for score in scores:
    print re.findall( r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))\s*?',score[0])
    print re.sub( r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))\s*?',r'\1 --- \3',score[0])
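
If a table is the goal, the findall results could be collected along these lines. This is a sketch under assumptions: the column names (team1, score1, and so on) are invented for illustration, only two of the sample lines are used, and the strip() calls are there because the lazy name groups can capture a trailing space.

```python
import re

PATTERN = r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))\s*?'

scores = [[u'Orlando 81   Washington 90 (3:55 IN 4TH)'],
          [u'Brooklyn at Portland (10:00 PM ET)']]

table = []
for score in scores:
    # findall returns one tuple per match, with '' for a group that did not match.
    for team1, score1, team2, score2, status in re.findall(PATTERN, score[0]):
        table.append((team1.strip(), score1, team2.strip(), score2, status))

for row in table:
    print(row)
```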

userista

@JonB
Yes, thank you!! - the raw string tip really helps - I was getting lost in "escape character hell"

userista

Turns out I was having an issue backreferencing an empty group (see http://bugs.python.org/issue1519638). Even though re.findall was returning 5 groups, I wasn't able to use re.sub to match/replace all the groups.

EDIT: I ended up using this workaround - adding an empty sub-group
http://bugs.python.org/msg69541
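
A sketch of what that kind of workaround can look like. It assumes the fix is to give the optional group an empty alternative, so that the group always takes part in the match; whether this is character-for-character what the linked message proposes is an assumption. On Python 2, re.sub raises an "unmatched group" error when the replacement refers to a group that did not participate in the match; since Python 3.5 such groups are replaced with an empty string instead.

```python
import re

line = u'Brooklyn at Portland (10:00 PM ET)'

# Original shape: group 4 is optional as a whole, so with no score it is unmatched.
optional_group = r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+)?\s+?(\(.+\))'

# Workaround shape: group 4 always participates, matching '' when there is no score.
always_matches = r'([a-zA-Z^ ]+?)(\d+?|at)\s+([a-zA-Z^ ]+?)(\d+|)\s+?(\(.+\))'

unmatched = re.search(optional_group, line).group(4)  # None: group never matched
empty = re.search(always_matches, line).group(4)      # '': group matched nothing

# With the workaround, every backreference is safe to use in the replacement.
replaced = re.sub(always_matches, r'\1|\2|\3|\4|\5', line)

print(repr(unmatched))
print(repr(empty))
print(replaced)
```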