Python advanced regex extraction example

This post covers advanced techniques for extracting data from text with regular expressions. There is often more than one way to solve a problem with regular expressions, and this case is a good example.

You can find the example data below:

NO_OF_DEVICES :2 NO_OF_ELEMENT :8 Hmi_IP_address :10.201.231.201 DEVICE_TYPE : OBU DEVICE_NAME : HV DEVICE_IP : 10.204.225.148 DEVICE_LOGIN : v2v DEVICE_PASSWD : v2v DEVICE_PORT : 80 DEVOCE_MTP : 1 DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" DEVICE_TYPE : OBU DEVICE_NAME : RV DEVICE_IP : 10.204.225.140 DEVICE_LOGIN : v2v DEVICE_PASSWD : v2v DEVICE_PORT : 80 DEVOCE_MTP : 1 DEVICE_WLANMAC :"1c:65:9d:a7:f7:b2" 

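The question is how to extract the values to the right of each colon. The data makes this harder in several ways:

  • the data is not well structured
  • the values have mixed formats (numbers, words, IP addresses, quoted MAC addresses)
  • the separator is also part of the data (the colon appears inside the MAC addresses)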

How do you deal with situations like this? There are two possible approaches:

Separate extraction:

  • extract the short values (1 to 4 letters or digits):
re.findall(r"(?:\s|:)([A-Za-z\d]{1,4})(?:\s)", text)

result:

['2', '8', 'OBU', 'HV', 'v2v', 'v2v', '80', '1', 'OBU', 'RV', 'v2v', 'v2v', '80', '1']
  • extract MAC addresses (this pattern does not match the IP addresses):
p = re.compile(r'(?:[0-9a-fA-F]:?){12}')
re.findall(p, text)

result:

['1c:65:9d:a7:f7:79', '1c:65:9d:a7:f7:b2']
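
The MAC pattern above cannot match IP addresses, because a dot is not part of its character class. Below is a minimal sketch of a separate IPv4 pattern, run next to the MAC pattern. The names `sample`, `ip_pattern` and `mac_pattern` are introduced here for illustration, and `sample` is a shortened string in the same layout as the example data, not the full input.

```python
import re

# Shortened, illustrative sample in the same layout as the example data.
sample = ('Hmi_IP_address :10.201.231.201 DEVICE_IP : 10.204.225.148 '
          'DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" ')

# Four dot-separated groups of 1-3 digits.
# Assumption: the 0-255 range of each group is not validated here.
ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Same MAC pattern as in the post: 12 hex digits, each optionally followed by ':'.
mac_pattern = re.compile(r"(?:[0-9a-fA-F]:?){12}")

ips = ip_pattern.findall(sample)
macs = mac_pattern.findall(sample)

print(ips)
print(macs)
```

Keeping one pattern per value type makes each expression easy to read, at the cost of scanning the text several times.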

Extract all at once - more complicated

The other approach, extracting everything at once, is more complicated, but it relies on a useful snippet: how to find the start and end position of every match. In Python, re.findall returns only the matched strings, while re.finditer yields match objects that expose start() and end(). Here p is the MAC address pattern from above:

cc = [(m.start(0), m.end(0)) for m in re.finditer(p, text)]
ccc = re.findall(p, text)

result:

[(215, 232), (384, 401)]
['1c:65:9d:a7:f7:79', '1c:65:9d:a7:f7:b2']
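
As a self-contained sketch of the same idea, the snippet below collects the positions with `m.span()`, which returns the `(start, end)` pair in one call, and then slices the text with them. The names `sample` and `spans` are illustrative, and `sample` is a shortened string rather than the full example data.

```python
import re

# Shortened, illustrative sample in the same layout as the example data.
sample = 'DEVICE_PORT : 80 DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" '

p = re.compile(r"(?:[0-9a-fA-F]:?){12}")

# span() is equivalent to (m.start(0), m.end(0)).
spans = [m.span() for m in p.finditer(sample)]

# Slicing the text with each span reproduces what findall returns.
for start, end in spans:
    print(start, end, sample[start:end])
```

The positions are useful whenever the surrounding text matters, for example to read what comes just before or after a match.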

If we combine this technique with a for loop, we can iterate through every second value:

  • find all runs of colons and whitespace: c = [(m.start(0), m.end(0)) for m in re.finditer(r"[:\s]+", text)]
    • print the text that follows every second separator:

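for i, l in enumerate(c):
    # even-indexed separators precede a value; skip a trailing one with no successor
    if i % 2 == 0 and i + 1 < len(c):
        print(text[l[1]:c[i+1][1]])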

result (first lines of the output):

2 
8 
10.201.231.201 
OBU 
HV 
10.204.225.148 
v2v 
v2v 
80 

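This solution has exceptions: the MAC addresses contain colons themselves, so the every-second-separator pattern breaks once the first MAC address is reached. It is better to analyze the data first and then pick the most suitable solution.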
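
The colons inside the MAC addresses are what trips up splitting on separators. One possible alternative, sketched here and not part of the approaches above, is to match each key together with its value in a single pattern. The sketch assumes that keys consist of word characters, that values contain no whitespace, and that MAC addresses are wrapped in straight double quotes. The names `sample`, `pair_pattern` and `pairs` are illustrative, and `sample` is a shortened string rather than the full example data.

```python
import re

# Shortened, illustrative sample in the same layout as the example data.
sample = ('NO_OF_DEVICES :2 Hmi_IP_address :10.201.231.201 '
          'DEVICE_TYPE : OBU DEVICE_PORT : 80 '
          'DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" ')

# Group 1: the key (word characters).
# Then optional spaces, a colon, optional spaces and an optional opening quote.
# Group 2: the value, a run of characters that are neither whitespace nor a quote,
# so colons inside a MAC address stay part of the value.
pair_pattern = re.compile(r'(\w+)\s*:\s*"?([^\s"]+)')

pairs = pair_pattern.findall(sample)

for key, value in pairs:
    print(key, "->", value)
```

The result is kept as a list of pairs and not a dict, because keys such as DEVICE_TYPE repeat for every device in the full data.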
