It’s all in the data

Jeff Jonas
IBM Distinguished Engineer
Chief Scientist, IBM Entity Analytics

You have postulated that in unique identification, space-time-travel data is the ultimate metric, over other “key features” like name, address, date of birth etc. Recently, India has undertaken an initiative called Universal ID. This project takes into account three criteria – name of the person, date of birth and place of birth. Is the place of birth in such a case the kind of location data that you have referred to? Incidentally, the project often faces problems like a person being registered as a voter in two different districts or somebody claiming an identity other than his own.
In my opinion, all geographies are different from each other. In the US, for example, name and date of birth alone are not good enough for identity systems, especially if the place of birth is a big city. I don’t understand enough about the Indian domain, for example trends in names and the distribution of names in India, so it would not be appropriate for me to say what would work here. But a statistician with access to a wide set of data from India would be able to tell you how likely it is that two people share the same name, the same date of birth and the same place of birth. It takes regional knowledge to answer this question.

But I can tell you that the more data you have, the more accurate you can be. Let’s say there are two people with the same name, the same date of birth and the same place of birth. If there were a piece of data revealing that they were in two different places at the same time, say one worked in one city for ten years while the other worked in another city for those same ten years, then that piece of information would reveal that these are probably two different people.
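
To make the space-time idea concrete, here is a minimal sketch in Python. The records, the city names and the helper space_time_conflict are all invented for illustration; the point is only that overlapping years spent in different cities argue for two distinct people.

```python
# Minimal sketch (illustrative only): two records share name, date of birth and
# place of birth, but their residence histories place them in different cities
# over the same years -- evidence that they are probably two different people.
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    dob: str
    birthplace: str
    residences: list  # (city, start_year, end_year) tuples

def space_time_conflict(a: Record, b: Record) -> bool:
    """Return True if the two records were in different cities at the same time."""
    for city_a, s_a, e_a in a.residences:
        for city_b, s_b, e_b in b.residences:
            overlap = min(e_a, e_b) - max(s_a, s_b)
            if overlap > 0 and city_a != city_b:
                return True
    return False

r1 = Record("A. Kumar", "1980-01-01", "Delhi", [("Mumbai", 2000, 2010)])
r2 = Record("A. Kumar", "1980-01-01", "Delhi", [("Chennai", 2000, 2010)])
print(space_time_conflict(r1, r2))  # True -> probably two different people
```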

While classifying an image, we essentially look at patterns. We try to put together patterns to get an indication of what the image contains. Are there any statistical techniques that can be used while working on huge amounts of data?
My point of view is that even if you use the biggest computers on earth, the best maps and the best training datasets, and you work on it for 100 years, the classification quality would still be substantially below your expectations. The reason is that if one stares only at the images and not at secondary data, substantial clues can be missed.

The scope of the problem changes if you use secondary data. You could use one-tenth of the compute power and one-tenth of the algorithmic sophistication being used today and still get a 100-fold increase in classification accuracy.

Of course, these numbers are just to give a sense of scale. No matter how substantial the compute power or how complex the algorithm, the use of secondary data sets will make that kind of difference in the accuracy of feature extraction and classification.
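
One hedged way to read this claim in code is as a simple Bayes update, where a weak image-only likelihood is combined with a prior taken from secondary data. The numbers and the posterior helper below are purely illustrative, not figures from the interview.

```python
# Illustrative only: a weak image-only likelihood combined with a prior derived
# from secondary data (e.g., registry or logistics records) can shift the
# classification decisively. All numbers are made up for the example.
def posterior(prior, likelihood_if_present, likelihood_if_absent):
    """Simple Bayes update: P(object present | image evidence)."""
    num = prior * likelihood_if_present
    den = num + (1 - prior) * likelihood_if_absent
    return num / den

# Image-only: ambiguous evidence for a rare vehicle type in the scene.
print(posterior(prior=0.5, likelihood_if_present=0.6, likelihood_if_absent=0.4))   # ~0.60

# With secondary data saying such vehicles are very rare in this region,
# the prior drops and the conclusion changes.
print(posterior(prior=0.02, likelihood_if_present=0.6, likelihood_if_absent=0.4))  # ~0.03
```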

Would you call it a convergence of evidence?
Yes, I would. I have blogged about co-location – information co-location. If you put the hypothesis, the reference data and the observation data from sensor ‘a’ and sensor ‘b’ in the same place, a very interesting thing happens: you get a general-purpose sense-making capability that is capable of much higher quality prediction. A more technical note about the devil in the details is this: if I have observed two things, made an assertion about them and later I get a new record, then I have just learnt something new. So I ask, now that I have learnt this, had I known it at the beginning, are there any earlier assertions that need to be modified? Doing this in real time is not trivial.
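
A toy sketch of that revision step, assuming a naive matching rule (two records refer to the same entity if they share at least two attribute values); this is not the actual IBM Entity Analytics implementation. A newly arrived record can glue two previously separate entities together, which forces earlier assertions to be remade.

```python
# Toy incremental resolver: every new record is matched against what is already
# known, and earlier "these are separate entities" assertions are revised when
# the new record links them. Matching rule and data are invented.
def shared(a, b):
    return sum(1 for k in a if k in b and a[k] == b[k])

class Resolver:
    def __init__(self):
        self.records = []
        self.entity_of = []          # record index -> entity id

    def add(self, rec):
        self.records.append(rec)
        i = len(self.records) - 1
        # entities this record matches (>= 2 shared attribute values)
        matched = {self.entity_of[j] for j in range(i)
                   if shared(rec, self.records[j]) >= 2}
        if not matched:
            eid = i                  # assert a brand-new entity
        else:
            eid = min(matched)
            # revision step: the new record may merge previously separate
            # entities, so earlier assertions are remade here
            for j in range(i):
                if self.entity_of[j] in matched:
                    self.entity_of[j] = eid
        self.entity_of.append(eid)
        return eid

r = Resolver()
r.add({"name": "J. Smith", "dob": "1970-05-01"})                      # entity 0
r.add({"phone": "555-0100", "addr": "12 Main St"})                    # entity 1
r.add({"name": "J. Smith", "dob": "1970-05-01",
       "phone": "555-0100", "addr": "12 Main St"})                    # merges 0 and 1
print(r.entity_of)   # [0, 0, 0] -- the earlier "two entities" assertion was revised
```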

In a paper that you co-published at the Cato Institute with Jim Harper, the Institute’s director of information policy studies, you talked about the September 11 terrorist attacks and the whole business of not looking for patterns because that may be tantamount to impinging upon the privacy of individuals. What are your views on predictive data mining vis-à-vis terrorism?
In that paper (https://www.cato.org/pub_display.php?pub_id=6784), which Jim Harper and I co-published at Cato, we argued that looking for new patterns using only training data on terrorism is insufficient, because terrorism events are so few.

In banking, where there are 800,000 cases of fraud and 30 million cases of non-fraud, there are training sets large enough for a computer to discover normal patterns. When you have really tiny datasets and not enough instances, the kind of novel discovery that gets produced isn’t that useful. So one of our claims in that paper is that the frequency of terrorist events is insufficient to derive training data.
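
A small illustration of why tiny training sets mislead, with invented numbers: the “pattern” here is just the rate at which a feature appears among positive cases, estimated from many positives versus only a handful.

```python
# Illustration (numbers invented): the same quantity -- the rate of a feature
# among positive cases -- is stable when estimated from hundreds of thousands
# of examples and essentially noise when estimated from five.
import random

random.seed(42)
TRUE_RATE = 0.3     # assumed true rate of the feature among positive cases

def estimated_rate(n_positives):
    hits = sum(random.random() < TRUE_RATE for _ in range(n_positives))
    return hits / n_positives

print([round(estimated_rate(800_000), 3) for _ in range(3)])  # stable, about 0.300
print([round(estimated_rate(5), 3) for _ in range(3)])        # all over the place
```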


This theory is in sync with the idea that if one tries to train a computer to recognise features and the training data is not adequate, one ends up with wrong conclusions. Do you agree?
People have been training based on the data from other images. I think this really must change. As an example, if you know that in a particular province there is only one vehicle that is 17 metres long, and that it is a huge hauler, then having that knowledge as part of the pool can help guide feature extraction and classification.
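
As a hedged sketch of how such knowledge could guide classification, the snippet below filters an image model’s candidate labels against a small regional registry; the registry, lengths and labels are invented for the example.

```python
# Hedged sketch: candidate labels from an image model are filtered against
# known regional facts. Registry contents and the measurement are invented.
REGIONAL_REGISTRY = {
    # vehicle class -> known length range (metres) in this province
    "hauler": (16.5, 17.5),   # only one such vehicle is known to exist here
    "bus":    (10.0, 13.0),
    "truck":  (6.0, 9.0),
}

def classify(measured_length_m, image_candidates):
    """Keep only candidate labels whose known size fits the measured object."""
    return [label for label in image_candidates
            if REGIONAL_REGISTRY[label][0] <= measured_length_m <= REGIONAL_REGISTRY[label][1]]

# The image model alone is unsure between three labels; the measured 17 m
# length plus regional knowledge narrows it to one.
print(classify(17.0, ["hauler", "bus", "truck"]))   # ['hauler']
```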

Do you believe that there is a requirement for large amounts of data?
My point of view on that is that an organisation is only as smart as the observations available to it. That’s as smart as it can be.

So there’s no point in getting massive amounts of irrelevant data?
Actually, I have a few principles related to this. One principle is that the smartest your organisation can be is a function of the net sum of the observations available to it. With this in mind, the next question is what kind of data (observations) is within policy and law for you to have. Within that universe of data, one often has to decide what data is unnecessary, irrelevant and so on.

In a sense you are applying the filter to the data…
Yes, but the filter is policy and situation dependent. When one stands at the side of a road, cars go by. They are of different brands. It’s in a person’s observation space, but the person chooses not to act on it. But if the person hears a bang, he turns his head. What he might be looking for is whether somebody is pointing a weapon, smoke is coming out of the back of a car, or somebody is injured. In this scenario, the person has just changed his collection interest. I’ve been thinking deeply about these things, about how to make systems capable of better prediction – in traffic, healthcare and life sciences.
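
A toy rendering of that situation-dependent filter, with invented event names: the same stream of roadside observations is ignored until a trigger changes the collection interest.

```python
# Toy sketch: observations are always in the observation space, but only acted
# on once a trigger event switches the collection interest. Names are invented.
def relevant(event, alert_mode):
    if not alert_mode:
        return False                      # passing cars are observed but not acted on
    return event in {"weapon_visible", "smoke_from_car", "person_injured"}

stream = ["car_passes", "car_passes", "loud_bang", "smoke_from_car", "car_passes"]

alert = False
for event in stream:
    if event == "loud_bang":
        alert = True                      # the trigger changes what we collect
    if relevant(event, alert):
        print("act on:", event)           # -> act on: smoke_from_car
```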

Social networking sites are becoming all-pervasive. You have observed that Facebook allows one to look at the linkages between people. On one hand, it uses that information to display advertisements that it thinks users would be interested in. On the other hand, the site can also use these linkages to learn more about users’ private lives. Could you elaborate on this?
When an organisation like a bank thinks about its customers and servicing them right, one of the things it might want to think about is their value – that of the person and his friends. If it can better understand that cluster and the value of that cluster, then maybe it can be more competitive. Therefore, organisations like banks, telcos and insurance companies would like to use this, if it is legal and within their policy, to offer their customers better services. I met one wireless carrier in Asia that had 80 percent churn a year (customer attrition). Telcos would like to have a better handle on churn, and one way to understand churn is to look at calling groups. Maybe if a third of a cluster cancels service, the rest of the group is at risk. If they get better at this kind of analysis, they can become more competitive.
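
A minimal sketch of the calling-group idea, with invented data and an assumed one-third threshold: when enough of a cluster has already cancelled, the remaining members are flagged as at risk.

```python
# Sketch of calling-group churn risk (data and threshold invented): if a third
# or more of a cluster has cancelled, flag its remaining active members.
clusters = {
    "cluster_1": {"alice": "active", "bob": "cancelled", "carol": "cancelled"},
    "cluster_2": {"dev": "active", "ela": "active", "finn": "active"},
}

def at_risk(cluster, threshold=1/3):
    cancelled = sum(1 for status in cluster.values() if status == "cancelled")
    if cancelled / len(cluster) >= threshold:
        return [name for name, status in cluster.items() if status == "active"]
    return []

for cid, members in clusters.items():
    print(cid, "at-risk members:", at_risk(members))
# cluster_1 at-risk members: ['alice']
# cluster_2 at-risk members: []
```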

When I think about social networks, or social circles, my first question is, what kind of social circle? My next question is, for what purpose? I have spent a lot of time with the privacy advocacy community, and the main thing to keep in mind when dreaming up new capabilities is avoiding consumer surprise. Let people know what they are getting into, and let them opt into things rather than requiring them to opt out.

The use of mobile phones, ATMs, credit cards etc generates a huge amount of data with geo-location, leaving behind a trail. In such a scenario, how can consumer security be ensured?
Yes, the trail is there, but fortunately it is isolated. Every time a person makes an ATM transaction, he would expect his bank to know about it but would not expect his barber to know. Every time a person calls from his cell phone, he would expect his telecom company to know that, but not his bank. The question is, if that data is to flow across non-traditional lines, I think consumers need to be aware of it. Consumers should have the option of being part of that or not. Much of the tension right now is that the convergence of data is so valuable that it is going to be, in many ways, irresistible to the consumer. Some organisation is going to say, statistically you will live an extra two months if you let us use geolocational data about you.

In the geospatial context, position location is very important, and so is time. What else do you see as significant?
In addition to space and time, there is motion. In fact, I think of it as analytical super-food because when it is overlaid on other data, it helps one make sense of things, bringing order to the data in a way that would be much harder without geospatial data.

I don’t know if I have made this point strongly enough. The future is less about how to present a mass of data to humans. The big trend in geospatial will be really deep analytics performed without humans. There is going to be deep computation resulting in insight that might then inform a human being where to look on a map. Geospatial analytics will be used for directing human attention. It will inform sensors, and it will inform cars and tractors, and sometimes even people.
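
A minimal sketch of that attention-directing idea, assuming invented space-time points and an arbitrary threshold: the analytics run without a human in the loop, and only the anomaly is surfaced for a person to look at on a map.

```python
# Minimal sketch (data and threshold invented): deep(er) analytics scan the
# space-time stream and surface only the points worth a human's attention.
points = [
    {"id": "t1", "lat": 28.61, "lon": 77.21, "speed_kmh": 40},
    {"id": "t2", "lat": 28.62, "lon": 77.22, "speed_kmh": 45},
    {"id": "t3", "lat": 28.63, "lon": 77.20, "speed_kmh": 160},  # unusual motion
]

def attention_worthy(stream, speed_threshold=120):
    """Return only the observations a human should be pointed at."""
    return [p for p in stream if p["speed_kmh"] > speed_threshold]

for p in attention_worthy(points):
    print(f"look here: {p['id']} at ({p['lat']}, {p['lon']})")
```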