Tuesday, September 18, 2018

Regex for Numerical Lists

Sometimes we extract text from a PDF or a text file and then try to identify the type of content. One problem I faced recently was correctly identifying numbered lists.
I found some pointers through internet searches. Here is how I combined them into a regex for identifying numbered list items.

You should be able to port this regex to other languages too. I have used Python in this example.
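
A minimal sketch of such a regex follows. It assumes a list marker is a run of numbers, uppercase Roman numerals, or single letters, joined by dots or spaces and optionally closed by "." or ")". The names ITEM, LIST_MARKER, and find_list_marker are illustrative choices, and this pattern is one way to arrive at the results listed under Output, not necessarily the only one.

```python
import re

# One piece of a marker: a number with an optional letter glued on ("4a", "6Z"),
# an uppercase Roman numeral ("IV"), or a single letter ("a").
ITEM = r"(?:\d+[A-Za-z]?|[IVXLCDM]+|[A-Za-z])"

# Pieces joined by "." or a space, an optional closing "." or ")",
# then at least one whitespace character.
LIST_MARKER = re.compile(rf"{ITEM}(?:[. ]{ITEM})*[.)]?\s+")


def find_list_marker(text):
    """Return (text, start, end, marker), or (text, 'No Match')."""
    match = LIST_MARKER.match(text)  # match() only looks at the start of the string
    if match is None:
        return (text, 'No Match')
    return (text, match.start(), match.end(), match.group())


samples = [
    '1.2. sample list',
    'a non sample list',
    'I. roman sample',
    'IV. roman sample 2',
    '1.II) num and roman',
    'a) alpha list1',
    'a. alpha list 2',
    '20 number > 9',
    'this is a negative test',
    '1.a num and alpha 1',
    '1.1.1 multi num 1',
    '4a) num alpha 2',
    '6Z num alpha 3',
    '4 a) num alpha 4',
]

for sample in samples:
    print(find_list_marker(sample))
```

Because the pattern is plain regex syntax (alternation, character classes, quantifiers), it should carry over to most other regex engines with little change.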


Output


('1.2. sample list', 0, 5, '1.2. ')
('a non sample list', 0, 2, 'a ')
('I. roman sample', 0, 3, 'I. ')
('IV. roman sample 2', 0, 4, 'IV. ')
('1.II) num and roman', 0, 6, '1.II) ')
('a) alpha list1', 0, 3, 'a) ')
('a. alpha list 2', 0, 3, 'a. ')
('20 number > 9', 0, 3, '20 ')
('this is a negative test', 'No Match')
('1.a num and alpha 1', 0, 4, '1.a ')
('1.1.1 multi num 1', 0, 6, '1.1.1 ')
('4a) num alpha 2', 0, 4, '4a) ')
('6Z num alpha 3', 0, 3, '6Z ')
('4 a) num alpha 4', 0, 5, '4 a) ')

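As you can see, this is not without issues: there are some false positives. For example, 'a non sample list' is ordinary text, yet its leading 'a ' is matched as a list marker. I hope you guys can tweak it to your liking.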
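
As one example of such a tweak, here is a hedged sketch of a stricter variant. It assumes that a marker ending in a letter or Roman numeral must be closed by "." or ")", while a marker ending in a digit may be followed directly by whitespace. STRICT_MARKER and strict_marker are hypothetical names, and the rule is only one of many possible trade-offs: it also rejects forms like '1.a ' and '6Z ', and a leading bare number such as '20 ' still matches.

```python
import re

# Same building block as a permissive marker: number (+ optional letter),
# uppercase Roman numeral, or single letter.
ITEM = r"(?:\d+[A-Za-z]?|[IVXLCDM]+|[A-Za-z])"

# Stricter ending: require "." or ")" unless the marker ends in a digit.
STRICT_MARKER = re.compile(rf"{ITEM}(?:[. ]{ITEM})*(?:[.)]|(?<=\d))\s+")


def strict_marker(text):
    """Return the list marker at the start of text, or None."""
    match = STRICT_MARKER.match(text)
    return match.group() if match else None


for sample in ['a non sample list', 'a) alpha list1', '1.1.1 multi num 1', '6Z num alpha 3']:
    print((sample, strict_marker(sample)))
```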