Wednesday, March 27, 2019

Date Validation and Standardization using python

Last month I had an interesting problem of standardizing dates in an input. 

The dataset I got had a date field filled from different document types. There were valid dates in most of the cases but there were some numbers too which should not be considered as dates. 


The challenge is to first identify whether the given value is date and if so, convert it to a standardized date format. 


I went with the datetime package which offers robust functions for date/time manipulation

The two functions I used are srtptime and strftime 

  • strptime function accepts a string and a pattern and converts the string to datetime if the string matches the pattern provided. If not it throws an exception
  • strftime function formats datetime in the given pattern.

I used a list of know patterns I could gather from the dataset. You could increase the number of patterns as per your requirement.


Code


Output


Is "21-2-2019" a date?: True
Standardized date: 2019-02-21
Is "21" a date?: False

Tuesday, September 18, 2018

Regex for Numerical Lists

There are scenarios where we extract some text from a pdf or text files and try to identify the type of content. One issue I had faced recently is to correctly identify numbered lists. 
I found some pointers from internet searches. Here is how I have combined all that and created a regex for identifying numerical list items. 

You should be able to port this regex to other codes too. I have used python in this example.


Output


('1.2. sample list', 0, 5, '1.2. ')
('a non sample list', 0, 2, 'a ')
('I. roman sample', 0, 3, 'I. ')
('IV. roman sample 2', 0, 4, 'IV. ')
('1.II) num and roman', 0, 6, '1.II) ')
('a) alpha list1', 0, 3, 'a) ')
('a. alpha list 2', 0, 3, 'a. ')
('20 number > 9', 0, 3, '20 ')
('this is a negative test', 'No Match')
('1.a num and alpha 1', 0, 4, '1.a ')
('1.1.1 multi num 1', 0, 6, '1.1.1 ')
('4a) num alpha 2', 0, 4, '4a) ')
('6Z num alpha 3', 0, 3, '6Z ')
('4 a) num alpha 4', 0, 5, '4 a) ')

As you can see this is not without issues as there are some false positives. I hope you guys can tweak it to your liking. 

Monday, April 10, 2017

Using JCL Sort to create a report

Reports are an integral part of the process. In mainframes we use cobol to create reports usually. However SORT is a powerful tool to create a report if you need one pretty quick. 

The below example is for a report creating sort with headers, summary and trailer.

The Input file has Header (HDR) and Detail (DTL) lines. 
The header has success record count on position 29,6 and failure record count on 35,06
The detail record has data with status code and description

First of all we sort the data so that HDR record comes up last and DTL records come up first. This is because the count information is gathered from the header and shown at the end of the report. 

/ is used in the build statements to initiate a new Line. 2/ gives us two new lines
COUNT=(M11,LENGTH=6)gives the total number of records. 


//STEP0001 EXEC PGM=SORT                                                
//*                                                                     
//SORTIN   DD *                                                         
HDR SAMPLERPT04042017120322 000005000002                                
DTL DATA-1 NAME-1 ADDR-1                  0000 SUCCESS                  
DTL DATA-2 NAME-2 ADDR-2                  0000 SUCCESS                  
DTL DATA-3 NAME-3 ADDR-3                  0000 SUCCESS                  
DTL DATA-4 NAME-4 ADDR-4                  0000 SUCCESS                  
DTL DATA-5 NAME-5 ADDR-5                  0000 SUCCESS                  
DTL DATA-6 NAME-6 ADDR-6                  0020 NAME ERROR               
DTL DATA-7 NAME-7 ADDR-7                  0010 ID ERROR                 
/*                                                                      
//*                                                                     
//SORTOUT  DD SYSOUT=*                                                  
//*                                                                     
//SYSOUT   DD SYSOUT=*                                                  
//SYSPRINT DD SYSOUT=*                                                  
//*                                                                     
//SYSIN    DD *                                                         
   SORT FIELDS=(01,02,CH,A)                                              
   OUTFIL REMOVECC,                                                      
          IFTHEN=(WHEN=(01,02,CH,EQ,C'DTL'),                             
                  BUILD=(05,07,X,12,07,X,19,15,X,43,04,X,48,20)),        
          IFTHEN=(WHEN=(01,02,CH,EQ,C'HDR'),                             
                  BUILD=(/,C'TOTAL NUMBER OF SUCCESS RECORDS : ',29,06,  
                         /,C'TOTAL NUMBER OF ERROR RECORDS   : ',35,06,  
                         /,C'TOTAL NUMBER OF DETAIL RECORDS  : ',        
                           (29,06,ZD,ADD,35,06,ZD),EDIT=(TTTTTT))),      
   HEADER2=(1:C'REPORT ID  : XXXXXX',                                    
            26:C' YOUR COMPANY NAME HERE  ',69:C'PAGE: ',75:PAGE,/,      
            1:C'REPORT DATE: ',                                          
            14:DATE=(MD4-),27:C' YOUR REPORT NAME HERE ',                
            2/,                                                          
            C'ID',6X,                                                    
            C'NAME',4X,                                                  
            C'ADDRESS',9X,                                               
            C'ERR ',X,                                                   
            C'ERR-DESC',                                                 
            /,                                                   
            C'-------',X,                                        
            C'-------',X,                                        
            C'---------------',X,                                
            C'----',X,                                           
            C'------------------------------'),                  
   TRAILER1=(C'TOTAL NUMBER OF RECORDS         : ',              
             COUNT=(M11,LENGTH=6),                               
             /,29C'*',C' E N D O F R E P O R T ',28C'*')         

/*                                                              

The Output is as follows

REPORT ID  : XXXXXX       YOUR COMPANY NAME HERE                    PAGE:      1
REPORT DATE: 04-10-2017    YOUR REPORT NAME HERE                                
                                                                                
ID      NAME    ADDRESS         ERR  ERR-DESC                                   
------- ------- --------------- ---- ------------------------------             
DATA-1  NAME-1  ADDR-1          0000 SUCCESS                                    
DATA-2  NAME-2  ADDR-2          0000 SUCCESS                                    
DATA-3  NAME-3  ADDR-3          0000 SUCCESS                                    
DATA-4  NAME-4  ADDR-4          0000 SUCCESS                                    
DATA-5  NAME-5  ADDR-5          0000 SUCCESS                                    
DATA-6  NAME-6  ADDR-6          0020 NAME ERROR                                 
DATA-7  NAME-7  ADDR-7          0010 ID ERROR                                   
                                                                                
TOTAL NUMBER OF SUCCESS RECORDS : 000005                                        
TOTAL NUMBER OF ERROR RECORDS   : 000002                                        
TOTAL NUMBER OF DETAIL RECORDS  : 000007                                        
TOTAL NUMBER OF RECORDS         : 000008                                      
***************************** E N D O F R E P O R T ****************************





Wednesday, December 28, 2016

DB2 Compare numeric values in a char column

Consider the following scenario. You have a column which is defined as CHAR but contains numeric values. Your task is to select the rows which has values greater than an integer constant.
Here is how you can compare text values to integer in DB2

T1 is the table with a CHAR(2) column C1

SELECT * FROM T1;
---------+---------+-
C1                   
---------+---------+-
6
--
                    
 6                   
30                   
1                   
 1                

Do note that the column has null (2nd row) and spaces(3rd row). So we need to accommodate for that. The CAST function we use would work with NULL values, but if there are SPACES, it will throw SQL error -420.
You can achieve the objective with below query

SELECT * FROM T1 WHERE CAST(REPLACE(C1,'  ','00') AS INT) > 5;
---------+---------+---------+---------+---------+---------+--
C1                                                            
---------+---------+---------+---------+---------+---------+--
6                                                             
 6                                                        
30                                                

Note : When using replace function make sure you replace for the whole length of the Char field else you would end up with unexpected results. 

Monday, November 14, 2016

DB2 Aggregate function behavior with Null Values in Table

Aggregate functions are very useful in getting meaningful information from a group. 
Here we are looking at what will happen if the table has NULL values. Aggregate functions will ignore the Null values in the column. 

Consider a Table TEMP1 with two columns
C1 - CHAR(1)
C2 - Integer

The Values in the table are 

C1 C2
A 100
B NULL
A 200
B 100
B 400
C NULL
C NULL

SELECT C1, SUM(C2) SUM_C2,     
           MIN(C2) MIN_C2,     
           MAX(C2) MAX_C2,     
           AVG(C2) AVG_C2,     
           COUNT(*) COUNT_STAR,
           COUNT(C2) COUNT_C2  
  FROM SESSION.TEMP1           
  GROUP BY C1;   

Result will be 

C1       SUM_C2       MIN_C2       MAX_C2       AVG_C2   COUNT_STAR     COUNT_C2
---------+---------+---------+---------+---------+---------+---------+---------+
A           300          100          200          150            2            2
B           500          100          400          250            3            2
C   -----------  -----------  -----------  -----------            2            0


As you can see only Count(*) considers the rows with the Null values. Rest all functions ignore the Null Values.
Hope this helps you guys out there.        
     

Thursday, July 28, 2016

Removing spaces from delimited record

When you create a delimited file in COBOL, you are stuck with the fact that the extra spaces in a field is there to stay. Removing those unwanted spaces and squeezing the data using COBOL is a tricky task, which am not sure if it will even be possible. 
This brings me to achieving this result using SORT. 

Below example shows a semicolon delimited file created by a COBOL reporting program. 
We have the first step adding Quotes to fields which is supposed to have spaces in them like name, address etc. Do make sure that when you create your COBOL layout, have an additional leading and trailing space for the quotes. Else the field might get truncated on addition of quotes.
The second step uses the quotes to not squeeze the valid spaces in a field.

//*****************************************************************
//STEP0001 EXEC PGM=SORT
//*****************************************************************
//*  
//SORTIN   DD *   
123   ;SAMUEL JACKSON     ; ALABAMA ST  ;23456 
1456  ;PETER SAMUEL       ; DO NOT MAIL ;
2343  ;HARRY POTTER       ; OHIO ST     ;23356 
/*           
//SORTOUT  DD DSN=&&TEMP,            
//            DISP=(,PASS,DELETE),           
//            UNIT=SYSDA,                    
//            SPACE=(CYL,(1,1),RLSE),         
//            DCB=(RECFM=FB,LRECL=50,BLKSIZE=0)       
//SYSIN    DD *       
  OPTION COPY             
  OUTREC FIELDS=(1,7,
                 8,19,JFY=(SHIFT=LEFT,LEAD=C'"',TRAIL=C'"'),27,1,
                 28,13,JFY=(SHIFT=LEFT,LEAD=C'"',TRAIL=C'"'),41,6)
//SYSOUT   DD SYSOUT=* 
//SYSPRINT DD SYSOUT=*
//*         
//******************************************************************
//STEP0002 EXEC PGM=SORT 
//******************************************************************
//*      
//SORTIN   DD DISP=SHR,DSN=&&TEMP       
//SORTOUT  DD SYSOUT=*           
//SYSIN    DD *                    
  OPTION COPY         
  INREC FIELDS=(1,50,SQZ=(SHIFT=LEFT,PAIR=QUOTE))
  OUTREC FINDREP=(IN=C'"',OUT=C'')            
//SYSOUT   DD SYSOUT=*                     
//SYSPRINT DD SYSOUT=*                           
//*          


Output after Step 1
123   ;"SAMUEL JACKSON"   ;"ALABAMA ST" ;23456 
1456  ;"PETER SAMUEL"     ;"DO NOT MAIL";      
2343  ;"HARRY POTTER"     ;"OHIO ST"    ;23356 

Output after Step 2
123;SAMUEL JACKSON;ALABAMA ST;23456   
1456;PETER SAMUEL;DO NOT MAIL;        
2343;HARRY POTTER;OHIO ST;23356       


Note: All this effort to save space will help only if you convert this to a VB file
Use the below Sort step to do that


  OPTION COPY   
  OUTFIL FTOV,VLTRIM=C' ' 

Sunday, February 21, 2016

Get Running total in Db2

Using an aggregate function in SQL is pretty straightforward. However we sometimes need to have other columns in a table along with the corresponding aggregate we are trying to calculate. This is not possible if we use the functions in the traditional way using group by clause. 

Lets take an example. Below is the table BookSales. 

BookID DateOfSale SaleCnt
1001 2016-01-01 33
1001 2016-01-02 10
1001 2016-01-03 16
1002 2016-01-04 22
1001 2016-01-05 12
1001 2016-01-06 20
1002 2016-01-07 10
1001 2016-01-08 23
1002 2016-01-09 12
1002 2016-01-10 50
1001 2016-01-12 14
1001 2016-01-13 24


What I want to find is the running total for each book along with the dates. The result I am expecting is something like this

BookID DateOfSale SaleCnt RunningTot
1001 2016-01-01 33 33
1001 2016-01-02 10 43
1001 2016-01-03 16 59
1001 2016-01-05 12 71
1001 2016-01-06 20 91
1001 2016-01-08 23 114
1001 2016-01-12 14 128
1001 2016-01-13 24 152
1002 2016-01-04 22 22
1002 2016-01-07 10 32
1002 2016-01-09 12 44
1002 2016-01-10 50 94

Query

SELECT A*,
       SUM(SALECNT) OVER(PARTITION BY BOOKID
                         ORDER BY DATEOFSALE) AS RUNNINGTOT
FROM BOOKSALES A;

You could have the same result even if we change the display order using OrderBy clause in the query.

For Example

SELECT A*,
       SUM(SALECNT) OVER(PARTITION BY BOOKID
                         ORDER BY DATEOFSALE) AS RUNNINGTOT
FROM BOOKSALES A
ORDER BY BOOKID, DATEOFSALE DESC;

---------------------------- RESULTS IN ------------------------------

BookID DateOfSale SaleCnt RunningTot
1001 2016-01-13 24 152
1001 2016-01-12 14 128
1001 2016-01-08 23 114
1001 2016-01-06 20 91
1001 2016-01-05 12 71
1001 2016-01-03 16 59
1001 2016-01-02 10 43
1001 2016-01-01 33 33
1002 2016-01-10 50 94
1002 2016-01-09 12 44
1002 2016-01-07 10 32
1002 2016-01-04 22 22