Python advanced regex extraction example

This post covers advanced techniques for extracting data from text with regular expressions. There is often more than one way to solve a problem with regular expressions, and this case is a good example.

You can find the example data below:

NO_OF_DEVICES :2 NO_OF_ELEMENT :8 Hmi_IP_address :10.201.231.201 DEVICE_TYPE : OBU DEVICE_NAME : HV DEVICE_IP : 10.204.225.148 DEVICE_LOGIN : v2v DEVICE_PASSWD : v2v DEVICE_PORT : 80 DEVOCE_MTP : 1 DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" DEVICE_TYPE : OBU DEVICE_NAME : RV DEVICE_IP : 10.204.225.140 DEVICE_LOGIN : v2v DEVICE_PASSWD : v2v DEVICE_PORT : 80 DEVOCE_MTP : 1 DEVICE_WLANMAC :"1c:65:9d:a7:f7:b2"

The question is how to extract the values on the right-hand side of each colon. Several things make this harder:

the data is not well structured
the data has mixed formats
the separator is also part of the data (colons appear inside the MAC addresses)

How do you deal with situations like this? There are two possible approaches:

Separate extraction:

Extract the short values (one to four letters or digits):

re.findall(r"(?:\s|:)([A-Za-z\d]{1,4})(?:\s)", text)

result:

['2', '8', 'OBU', 'HV', 'v2v', 'v2v', '80', '1', 'OBU', 'RV', 'v2v', 'v2v', '80', '1']

Extract the MAC addresses:

p = re.compile(r'(?:[0-9a-fA-F]:?){12}')
p.findall(text)

result:

['1c:65:9d:a7:f7:79', '1c:65:9d:a7:f7:b2']
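
The IP addresses can be collected in the same separate-extraction style. The snippet below is a minimal sketch, not part of the original solution: the name ip_pattern and the shortened sample text are assumptions made for illustration, and the pattern only checks the dotted shape of an IPv4 address, not that each number is in the 0-255 range.

```python
import re

# Shortened sample in the same shape as the example data (hypothetical subset).
text = "Hmi_IP_address :10.201.231.201 DEVICE_IP : 10.204.225.148 DEVICE_PORT : 80"

# Four groups of 1-3 digits separated by dots.
# This does not validate that each group is between 0 and 255.
ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

ips = ip_pattern.findall(text)
print(ips)  # ['10.201.231.201', '10.204.225.148']
```

The port value 80 is skipped because it has no dots, so the short values and the IP addresses do not overlap.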

Extract all at once - more complicated

The other approach extracts everything at once. It is more complicated, but it relies on a useful technique: finding the start and end position of every match with re.finditer, next to the matched text returned by re.findall. Here p is the compiled MAC address pattern:

cc = [(m.start(0), m.end(0)) for m in re.finditer(p, text)]
ccc = re.findall(p, text)

result:

[(215, 232), (384, 401)]
['1c:65:9d:a7:f7:79', '1c:65:9d:a7:f7:b2']
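
Positions and matched text can also be collected in a single pass, because each match object carries both. This is a small sketch under assumptions: the sample text is a shortened, hypothetical version of the data with straight quotes, and mac_pattern and matches are illustrative names.

```python
import re

# Shortened sample (hypothetical subset of the example data).
text = 'DEVICE_WLANMAC :"1c:65:9d:a7:f7:79" DEVICE_TYPE : OBU'

mac_pattern = re.compile(r"(?:[0-9a-fA-F]:?){12}")

# m.group(0) is the matched text, m.span() is the (start, end) pair.
matches = [(m.group(0), m.span()) for m in mac_pattern.finditer(text)]

for value, (start, end) in matches:
    # Slicing the text with the span gives back the matched value.
    print(value, start, end, text[start:end] == value)
```

Keeping the span next to the value makes it easy to slice the surrounding text later, which is what the loop-based approach relies on.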

If we combine this technique with a for loop, we can iterate through every second value:

- find all occurrences of the separator (runs of colons and whitespace):

c = [(m.start(0), m.end(0)) for m in re.finditer(r"[:\s]+", text)]

- list the text that follows every second separator:

for i, l in enumerate(c):
    # even-indexed separators come right before a value
    if i % 2 == 0 and i + 1 < len(c):
        print(text[l[1]:c[i + 1][1]])

result:

2 
8 
10.201.231.201 
OBU 
HV 
10.204.225.148 
v2v 
v2v 
80

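This solution breaks down on some inputs. For example, the colons inside the MAC addresses also match the separator pattern, which shifts the alternation between names and values. It is better to analyze the data first and then choose the most suitable approach.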
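
One alternative that avoids counting separators is to match each name together with its value. The following is only a sketch under stated assumptions, not the method used above: it assumes names consist of letters and underscores, values contain no whitespace, and quotes are straight double quotes. The names pair_pattern and pairs and the shortened sample are hypothetical.

```python
import re

# Shortened sample in the same shape as the example data (straight quotes assumed).
text = (
    'NO_OF_DEVICES :2 Hmi_IP_address :10.201.231.201 '
    'DEVICE_TYPE : OBU DEVICE_IP : 10.204.225.148 '
    'DEVICE_WLANMAC :"1c:65:9d:a7:f7:79"'
)

# A name (letters and underscores), a colon with optional spaces around it,
# then a value that runs up to the next whitespace.
pair_pattern = re.compile(r"([A-Za-z_]+)\s*:\s*(\S+)")

# Strip the surrounding quotes from values such as the MAC address.
pairs = [(key, value.strip('"')) for key, value in pair_pattern.findall(text)]

for key, value in pairs:
    print(key, "=", value)
```

Because the value is consumed up to the next whitespace, the colons inside the MAC address stay part of the value instead of being treated as separators.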
