A few weeks ago, I was assigned a data cleansing, standardization and manipulations task for one of our clients, a B2B start-up with a relatively young email program. There were some 16,000 records in the file and they needed all of these normalized and squeaky-clean in order to import it all into their automation software of choice, Pardot.
The client was very specific about how they wanted their data to be but here’s the gist of the rules:
– No initials after first name
– No prefix, i.e. no Mr., Dr. or Mrs. in first name
– Proper capitalization of all fields
– No LLC, Ltd or any other abbreviation in the company name
– No apartment number, suite number, P.O box or any combination of these in Street Address 1
– Any of the above should be in Street Address 2
– No state abbreviation, i.e. Tx , TX or TEXAS should be replaced by Texas
– All the above rules were to be done for 10+ fields
There were many combinations that any of the above could appear in the respected fields. Obviously, it would have been very time consuming to “find and replace” for all of the standardization, one field at a time.
It was a very tedious job any way you sliced it since the client was so specific about the how they wanted their data to be formatted and there was no place for missing something. One fine Thursday, however, I set my mind to it and stayed at the office until 8 p.m. to complete this task of 16,000 records. I decided to create a process that would do this in the most efficient way possible. Since my expertise is SQL (database) programing, I thought I could find a solution for this by creating stored procedures, functions and script in T-SQL.
I created a function to take a string of characters and return the same string in proper capitalized format. A stored procedure was also created to look for any of the client-specified characters that shouldn’t be in that field. If the procedure comes across something on the no-no list above, it can automatically either remove them or truncate the field according to the given specifications starting from the first character in the effective character string. The process I followed was basically importing the data file into the database and then running the script on the specified fields, passing them through the stored procedure.
The scripts, stored procedure and the functions did the above client requested data formatting in one-third of the time it would have taken in Excel or another typical spread sheet setup. It is still somewhat of a manual process to spot-check that nothing is missed or added when it shouldn’t be but, with a little savvy engineering, the time spent on this and the work intensity level has reduced to half. New data still comes in every few weeks, and when it does, it can be quickly formatted just how they like it so the client can apply the cleansed data to dynamic content in their email program and even sales follow-ups in other channels.
It’s all in a day’s work for a BrightWave data engineer.