Have you ever ended up at the wrong end of a one-way road while navigating with Google Maps? Well, I have, and more than once. Many of you will also remember a major gaffe by Apple in 2012, when Apple Maps placed the Australian city of Mildura inside a national park about 70 km from where it actually is. This resulted in Australian police having to rescue several individuals. They warned the public to be careful with Apple Maps and called the issue “potentially life threatening”.
Such situations are not uncommon. In a military context too, surveillance data can throw up wrong warnings, and false positives as well as false negatives can have serious ramifications, especially in the high-tempo operations of a connected battlefield. In business, the stakes are just as high. Today, data and analytics underpin a company's competitive advantage. More and more companies are embracing digital transformation and relying on data and analytics for decision making rather than on the gut feeling of experienced executives. Such high-stakes decisions require that we trust both the data and the algorithms used to analyse it. The algorithm may be a simple statistical regression to predict the linear progression of a production parameter or the demand for rations or spares in a military unit, or it may be a complex deep neural network embedded in a surveillance device, trying to classify movement in thick foliage as animal, own troops or enemy troops!
In the case of simple algorithms, we can easily plot the data and visually verify the trend. With algorithms employing neural nets, however, plotting a graph may not be possible, especially when the dataset has more than three dimensions! How do we trust our algorithms in such a scenario?
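To see why plotting matters even for a simple regression, here is a small sketch in Python. It fits a straight line to two datasets from Anscombe's quartet, a well-known teaching example that is not part of this article's own data. The helper `fit_line` is written here purely for illustration. Both datasets give roughly the same line, yet the residuals of the second show that its points follow a curve, which a plot would reveal at a glance.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Datasets I and II of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

a1, b1 = fit_line(x, y1)
a2, b2 = fit_line(x, y2)
print(f"dataset I:  y = {a1:.2f} + {b1:.2f}x")
print(f"dataset II: y = {a2:.2f} + {b2:.2f}x")

# Residuals of dataset II, ordered by x. They are negative at both ends and
# positive in the middle: a curve has been forced onto a straight line.
residuals2 = [round(y - (a2 + b2 * xi), 2) for xi, y in sorted(zip(x, y2))]
print(residuals2)
```

The fitted coefficients alone would not warn you that the second dataset is a poor match for a linear model; the pattern in the residuals does.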
Even before talking about complex deep learning, let’s talk about scenarios where a simple regression is going to give you incorrect results. Let’s ask ourselves what the steps in an analytics project are; shall we call it the Data Analytics Life Cycle? Broadly, six steps are defined as part of the life cycle:
- Framing the business objective.
- Data preparation.
- Descriptive analytics.
- Predictive analytics.
- Model validation.
- Model implementation.
The first step, of course, is to objectively define the business problem, scope and approach. At this step we can end up asking questions that the data cannot answer, situations where the analytics is simply not within the scope of the data. To prevent this, one should carefully examine ‘causality’: does what we want to predict or prescribe actually depend upon the available data? This is where a subject matter expert should guide the data scientist in formulating the right problem. Interestingly, formulating the problem is more important and more difficult than finding the answer! It is an art and not a science. Perhaps those who are good at bridging this gap and formulating good problems can be called data artists? Just a thought.
The second step involves preparing the data: fetching it, cleaning it and validating it. Most of the problems are found here – missing values, misrepresented values, incorrectly formatted fields – you name it, the problem is here. What major issues have you come across at this stage?
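As a minimal sketch of the kinds of checks this step involves, the following Python snippet scans a few records for missing values, implausible values and badly formatted dates. The field names (`unit`, `date`, `ration_kg`), the sample records and the rules are all hypothetical, chosen only for illustration; real checks depend on your own data.

```python
from datetime import datetime

# Hypothetical records; field names and values are made up for illustration.
records = [
    {"unit": "A", "date": "2021-03-01", "ration_kg": "120.5"},
    {"unit": "B", "date": "01/03/2021", "ration_kg": "98"},   # wrong date format
    {"unit": "C", "date": "2021-03-01", "ration_kg": ""},     # missing value
    {"unit": "D", "date": "2021-03-02", "ration_kg": "-40"},  # implausible value
]

def find_issues(rows):
    """Return a list of (row_index, field, problem) tuples."""
    issues = []
    for i, row in enumerate(rows):
        # Missing values: empty strings or absent keys.
        for field in ("unit", "date", "ration_kg"):
            if not row.get(field, "").strip():
                issues.append((i, field, "missing"))
        # Incorrectly formatted fields: dates must be ISO (YYYY-MM-DD).
        if row.get("date", "").strip():
            try:
                datetime.strptime(row["date"], "%Y-%m-%d")
            except ValueError:
                issues.append((i, "date", "bad format"))
        # Misrepresented values: the quantity must be a non-negative number.
        raw = row.get("ration_kg", "").strip()
        if raw:
            try:
                if float(raw) < 0:
                    issues.append((i, "ration_kg", "out of range"))
            except ValueError:
                issues.append((i, "ration_kg", "not a number"))
    return issues

issues = find_issues(records)
for issue in issues:
    print(issue)
```

The point is less the specific rules than the habit: write the checks down as code so that they run every time the data is refreshed.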
The third step is to carry out simple descriptive analysis, looking at basic descriptive statistics to better understand the ranges and central tendencies. It is always a good idea to plot the datasets. Intelligent analytics at this stage can highlight issues with the data; not all of them can be corrected, but most can be found.
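Here is a small sketch of how descriptive statistics can flag a data problem. The demand figures are invented, and the cut-off of five median absolute deviations is an arbitrary assumption, not a standard. A single suspicious value drags the mean far away from the median, and a robust spread measure picks it out.

```python
import statistics

# Hypothetical daily demand figures; 1200 is a likely data-entry error for 120.
demand = [118, 121, 119, 1200, 122, 120, 117]

mean = statistics.mean(demand)
median = statistics.median(demand)

# Median absolute deviation (MAD): a spread measure that the outlier
# barely affects, unlike the standard deviation.
mad = statistics.median([abs(x - median) for x in demand])

# Flag values far from the median; the factor 5 is an arbitrary choice.
suspects = [x for x in demand if abs(x - median) > 5 * mad]

print(f"mean={mean:.1f} median={median} mad={mad}")
print("suspect values:", suspects)
```

A large gap between mean and median is often the first hint that something in the data deserves a closer look.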
In the case of geospatial data there is one more complexity: because geospatial data is heterogeneous, basic statistical analysis may not be very useful and may not give a correct understanding of the data. Different plotting and statistical techniques are required to understand it meaningfully.
If your data is still good enough for analytics, you can go on to the fourth step.
The fourth stage, predictive analytics, is when a data scientist actually gets to do what he learnt in the online courses, and not before! This is when he applies all the fancy ‘algorithms’ and comes up with some results. This is when GPUs are thrown at the data, and in my opinion most of the GPUs in the world are busy distinguishing between a cat and a dog! That doesn’t always work as planned: several cats are labelled as dogs and several cute dogs are labelled as cats. The situation is funny as long as we are talking about cats and dogs, but what if we are talking about a smart bomb distinguishing between own ship and enemy ship?
This is where the most complex algorithms pose the most important questions. The issue of explainability also creeps in, as more complex algorithms act more like a black box. Their internal working is more of a mystery: it can be abstracted into algebraic expressions but is difficult to understand intuitively. New techniques are being built to make such explanations possible. Techniques like sensitivity analysis can point out the most important features, those whose variation produces the maximum variation in the output.
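The simplest form of sensitivity analysis can be sketched in a few lines of Python: nudge one input at a time and see how much the output moves. The `model` function below is a hypothetical stand-in for a black box, with made-up feature names and weights. This is a local, one-at-a-time version of the idea, not the more elaborate methods used on real neural networks.

```python
def model(features):
    """Stand-in for a black-box model; the weights are hypothetical."""
    speed, size, heat = features
    return 0.1 * speed + 2.0 * size + 0.5 * heat * heat

def sensitivity(predict, baseline, delta=0.01):
    """Absolute change in output when each input is nudged by a small fraction."""
    base_out = predict(baseline)
    scores = []
    for i, value in enumerate(baseline):
        nudged = list(baseline)
        nudged[i] = value * (1 + delta)  # relative nudge; has no effect if value is 0
        scores.append(abs(predict(nudged) - base_out))
    return scores

baseline = [10.0, 3.0, 4.0]
names = ["speed", "size", "heat"]
scores = sensitivity(model, baseline)

# Rank features from most to least influential around this baseline.
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Note that the ranking holds only near the chosen baseline; a non-linear model can rank the same features differently elsewhere in the input space.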
The fifth stage, model validation, is the one that brings trust to the whole process. Ironically, this step is not elaborated in many cases.
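One common way to validate a model, offered here as a sketch and not as the only approach, is to hold back part of the data, fit on the rest and compare the error on the held-back part against a naive benchmark. The data below is synthetic, and the 25% split and the mean-only benchmark are assumptions made for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mean_abs_error(preds, actual):
    return statistics.mean([abs(p - a) for p, a in zip(preds, actual)])

# Synthetic data: a linear trend plus noise (made up for illustration).
rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [5.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

# Hold out a random 25% of the points; the model never sees them while fitting.
idx = list(range(len(xs)))
rng.shuffle(idx)
test_idx, train_idx = idx[:10], idx[10:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

a, b = fit_line(train_x, train_y)
model_err = mean_abs_error([a + b * x for x in test_x], test_y)

# Naive benchmark: always predict the mean of the training targets.
baseline = statistics.mean(train_y)
baseline_err = mean_abs_error([baseline] * len(test_y), test_y)

print(f"model error:    {model_err:.2f}")
print(f"baseline error: {baseline_err:.2f}")
```

If the model cannot clearly beat such a benchmark on data it has not seen, there is little reason to trust it in deployment.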
The final step, model implementation, is primarily an engineering effort, but unforeseen circumstances can crop up and make the model behave in an unexpected manner. An adequate amount of testing is necessary for a smooth deployment.
While steps can be taken throughout the analytics process to instil confidence and trust in a human, the following important points should be kept in mind:
- Question everything: question the data, the source of the data and the method of collection. Start questioning from the very first steps, such as the preparation for data collection and the data collection strategy.
- Follow the trail of the data. How was the data captured, who worked with it and what transformations were carried out?
- Follow the analytic process and always benchmark it against alternatives.
- Finally, document everything so that another pair of eyes can review the analytics and possibly audit the process.
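The trail-of-data and documentation points above can be partly automated. The sketch below, with hypothetical helper names (`fingerprint`, `apply_step`) and made-up records, logs each transformation together with who ran it and a hash of the data before and after, so that a reviewer can later confirm that nothing changed between steps.

```python
import hashlib
import json

def fingerprint(rows):
    """Stable SHA-256 hash of the data, so a reviewer can confirm what was used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

trail = []  # one entry per transformation, in the order applied

def apply_step(rows, name, who, func):
    """Run a transformation and record who did what, plus before/after hashes."""
    result = func(rows)
    trail.append({
        "step": name,
        "by": who,
        "rows_in": len(rows),
        "rows_out": len(result),
        "hash_in": fingerprint(rows),
        "hash_out": fingerprint(result),
    })
    return result

# Hypothetical raw data and analysts, for illustration only.
raw = [{"id": 1, "value": 10}, {"id": 2, "value": None}, {"id": 3, "value": 30}]
clean = apply_step(raw, "drop missing values", "analyst_a",
                   lambda rows: [r for r in rows if r["value"] is not None])
scaled = apply_step(clean, "scale by 0.1", "analyst_b",
                    lambda rows: [{**r, "value": r["value"] * 0.1} for r in rows])

for entry in trail:
    print(entry["step"], entry["by"], entry["rows_in"], "->", entry["rows_out"])
```

An auditor can check that each step's `hash_in` matches the previous step's `hash_out`; a mismatch means the data was altered outside the recorded process.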
Analytics may become fancy and intelligent, but human diligence will always be needed to ensure quality and trust in the analytics.