Gerald Stanley – CEO
The information age is enlightening… well, not exactly. Those of us who see the dirty little secrets of data know that too often it is not enlightening at all, but rather a monstrous leviathan that’s just sitting there waiting to devour anyone who sets foot in its labyrinth.
Alright, enough with the Middle-earth references. As an organization, we frequently run into situations where the data we need to run predictions against is less than desirable and at times downright evil. Have you ever seen the 20+ ways you can represent a date? Anyone?
In order to run predictions against historical data, the data needs scrubbing and massaging before it yields meaning. Each organization is different, but the goal isn’t to have data, or even clean data. The goal is to create meaning.
Here are the necessary preparations to run predictions against historical data and create that meaning.
Vision – What do you want to accomplish?
With anything that is going to be great, you must have a vision. Take for example a fine piece of artwork. Monet’s paintings are the results of having a vision, the right tools, and an ample amount of time to create beauty. Building a solid data foundation will echo these same steps. An artist’s vision is to create something that gives the observer meaning in their art. You want to achieve the same with your data. Visualizing the future of your organization through the lens of clean and meaningful data will drive the direction of the organization’s policies and systems.
Foundation – A foundation must be laid to have consistent and complete data.
A CRM is a good example of a foundation that is universal to most for-profit or non-profit organizations. Many organizations will go to extremes when selecting a foundation. On one hand, the choice can be so budget-conscious that a spreadsheet or a subpar database is selected for enterprise-level activities. This can be costly to future success and growth. On the flip side, an organization might select the “Rolls Royce” of systems. The risk here is creating unhealthy dependencies on a third party or on specialized employees, which will be detrimental in the event that cuts need to be made. An integral employee may leave, taking their knowledge capital with them. Selecting a reasonable foundation is important for both knowledge transfer and costs.
Accessibility – Accessibility is a no-brainer when it comes to data, but it’s worth discussing to make sure no “t” is left uncrossed and no “i” is left undotted.
There are many types of users within an organization. First and foremost, the system that is utilized must have a quality front end. Your average users (sales, administrative, support, etc.) must be able to easily Create, Read, Update, and Delete (CRUD) data. Additionally, the system must be well documented to ensure consistency and knowledge transfer. The data must also be accessible to high-end users such as data scientists, database administrators, and developers. An API is the preferred way for high-end users to access the data. It allows an organization to extend the foundation with new and exciting features that enhance user experience and create efficiencies the foundation does not provide on its own.
Communication – Putting policies in place that are easy to understand and not overly burdensome will put the organization on the road to success.
Here are the key points to communicate to your organization: Who, What, When, Where, and Why. Sounds simple, huh? If you can concisely answer these questions for your organization, it will help to prevent bad data from creeping into your system and ultimately producing bad predictions in the future.
Cleanup – Those regularly using a system are the same users who are maintaining the quality of the data.
The first step to clean data is to create entry points that prevent data from being messy in the first place. If, for instance, you expect a valid email address to be added to a user record, then create an “email” field and not just a generic text field. Second, impress upon your high-end users the need for scripts that find, report, and clean the data. If “email” is your most prized data possession, then running scripts to assert that your email addresses are both formatted correctly and active will keep the data valid and valuable. Reporting on the validity of any data will help an executive to drive policies, marketing, etc.
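As a rough illustration of the "find and report" half of such a script, here is a minimal Python sketch. The record layout, the field name "email", and the deliberately loose pattern are all assumptions made for the example. It only checks the shape of an address; confirming that a mailbox is actually active would need a separate step (such as a confirmation email) that is not shown here.

```python
import re

# Assumption: a loose "something@something.tld" shape check, not a full
# standards-grade validator. It cannot tell whether the mailbox is active.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_well_formed(email):
    """Return True if the string looks like an email address."""
    return bool(EMAIL_PATTERN.match(email.strip()))


def report_emails(records):
    """Split records into (valid, invalid) lists based on their 'email' field."""
    valid, invalid = [], []
    for record in records:
        email = record.get("email") or ""  # treat a missing value as empty
        (valid if is_well_formed(email) else invalid).append(record)
    return valid, invalid


# Hypothetical sample records, for illustration only.
records = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": " bob@example.org "},
    {"id": 4, "email": None},
]

valid, invalid = report_emails(records)
print(f"{len(valid)} valid, {len(invalid)} invalid")
print("needs attention:", [r["id"] for r in invalid])
```

The counts from a report like this are the kind of number an executive can track over time, while the list of record IDs gives the people maintaining the system something concrete to fix.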
Data Typing – The endless amounts of clean addresses, emails, sales numbers, zip codes, etc., need to be evaluated and understood as three distinct types.
Referential data exists simply to be reported on for understanding by the end user. For example, a zip code can be categorical or referential. You may want to generate predictions by zip code as a reference, not as an actual value, so that you can point to a spreadsheet and say “This zip code has more XYZ than any other zip code.”
Categorical data is data that has been pushed into buckets. It is valuable when you have subjective data or an overabundance of numerical data. Try using a simple text analyzer that allows you to categorize emails as “Happy”, “Sad”, “Frustrated”, “Mad”, etc. You gain a snapshot of the mood of the text and can tie those values to an end user. Furthermore, you could place millions of salary records into categories of “Upper”, “Middle” or “Lower”. This can easily be accomplished by distributing the data in a histogram and pushing those values into corresponding buckets.
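To make the salary example concrete, here is one possible sketch in Python. It assumes the three buckets should hold roughly equal numbers of records, so it cuts the data at its tertiles; fixed dollar thresholds would work just as well if your organization already has them. The sample salaries and the function name make_bucketer are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

LABELS = ["Lower", "Middle", "Upper"]


def make_bucketer(salaries):
    """Build a function that maps a salary to Lower / Middle / Upper.

    Assumption: the cut points are the tertiles of the data provided,
    so each bucket ends up with roughly a third of the records.
    """
    cuts = quantiles(salaries, n=3)  # two cut points for three buckets

    def bucket(salary):
        # bisect_right counts how many cut points the salary has reached.
        return LABELS[bisect_right(cuts, salary)]

    return bucket


# Hypothetical salary records, for illustration only.
salaries = [28000, 35000, 41000, 52000, 60000, 75000, 98000, 120000, 250000]

bucket = make_bucketer(salaries)
counts = Counter(bucket(s) for s in salaries)
for label in LABELS:
    print(label, counts[label])
```

Note that an outlier such as the 250,000 record simply lands in “Upper” instead of skewing an average, which is part of the appeal of bucketing an overabundance of numbers.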
Finally, numerical data must contain numbers and numbers only. If you introduce non-numerical data into a numerical algorithm then it will attempt to interpret your rogue data. This results in invalid predictions and ultimately poor decision making.
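One way to keep rogue values out of a numerical algorithm is to screen the column first and set aside anything that is not cleanly a number. The sketch below is a minimal Python example under a few assumptions: values arrive as strings, commas are thousands separators, and anything unparseable should be reported, not guessed at. The function names and sample values are hypothetical.

```python
import math


def to_number(raw):
    """Return a float, or None if the value is not cleanly numeric."""
    if raw is None:
        return None
    # Assumption: commas are thousands separators, as in "1,250.50".
    text = str(raw).strip().replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None
    # float() also accepts "nan" and "inf"; treat those as invalid here.
    return value if math.isfinite(value) else None


def split_numeric(values):
    """Separate a column into clean numbers and rejected raw values."""
    clean, rejected = [], []
    for raw in values:
        number = to_number(raw)
        if number is None:
            rejected.append(raw)
        else:
            clean.append(number)
    return clean, rejected


# Hypothetical sales column, for illustration only.
raw_sales = ["1200", "1,250.50", "N/A", "", "980", "12abc", None]

clean, rejected = split_numeric(raw_sales)
print("usable values:", clean)
print("rejected values:", rejected)
```

Only the clean list should be handed to the prediction step; the rejected list goes back to whoever owns the data so the source can be fixed.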
Analyzing Tools – Which tools should you use to analyze your data?
There is a plethora of data analysis tools, including scripting languages such as R and Python. Selecting a common analysis tool is important so that a team can share resources and scripts. It also preserves consistency in the event that a third party is removed from your organization or an employee leaves. I have all too often seen situations where an analyst or programmer uses their own home-brew tools because that is what they are comfortable with. Later it bites the organization in the tail when those resources exit.
Now that you have taken the steps to establish a solid foundation with valid data, it’s time to accomplish those goals of making predictions on your data and finding valuable meaning. The thought of starting this process can be daunting. However, not taking on the leviathan will ultimately lead to a beast far more damaging than the one lurking in your organization today.