When Does it Make Sense to Use Computer Vision and AI

· October 5, 2018

Every day we interact with software whose inputs and outputs are well-defined. And as with gadgets like cars and smartphones, we have come to expect 100% correctness of software. Otherwise you cannot trust that your bank balance is correct, or that the flight reservation you booked last month was actually executed. Correctness is the most basic and minimal expectation we have of software. With correctness a given, the focus of traditional software development is on other, equally important, things like time- and space-efficiency, reliability, maintainability, scalability, etc.

Heraclitus said that you can never step into the same river twice. It is much the same with a computer vision system deployed in real-life situations. In contrast to the structured and well-defined input that characterizes traditional software systems, the input to computer vision systems comes from sensors that sense light from the real world. The inputs can never be exactly the same twice, even from the same camera. Trivial changes in ambient lighting, or in the position and pose of the objects being imaged, can change the input significantly, to mention only two influences on the input imagery. Such variables are affectionately known as “nuisance variables” in statistics lingo. Sometimes even the expected output is not clear - e.g. “is that hand-drawn character a 7 or a 1?”. One can try to mathematically model all the different variations in input, but reality is messy and complex, and so one cannot help but abstract away certain influences on the input. But there, one runs into a problem highlighted by Einstein: “As far as the laws of mathematics refer to reality, they are not certain, and as far as they are certain, they do not refer to reality”. It is therefore clear that one cannot expect 100% correctness with such an approach.

Of course, modern computer vision has embraced the “machine learning” paradigm: Gather thousands of examples of the expected input-output pairs, feed them to a “machine”, typically a deep neural net, and hope that it learns the transformation well on training data and that it will transform novel input into the correct output. This “sort of” works so long as novel input “generally looks like” the training data. If that assumption is violated, things can fall apart. For example, if you train a car or pedestrian detector on images taken in daylight and run it on nighttime footage, accuracy will in general drop significantly. There is no 100% correctness possible with this machine learning approach either, because except in highly controlled conditions, there will always be inputs that are very different from those in your training set.
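
To make this concrete, here is a minimal sketch of how one might measure that gap, assuming a trained detector is already in hand. The loader `load_split`, the split names, and `pedestrian_detector` are hypothetical placeholders, not any real API; the only point is that accuracy has to be measured separately on data that resembles the training set and data that does not.

```python
# A minimal sketch, not a real API: `load_split` and `pedestrian_detector`
# are hypothetical placeholders standing in for whatever data loader and
# trained model a project actually has.

def evaluate(detector, images, labels):
    """Fraction of inputs on which the detector's prediction matches the label."""
    correct = sum(1 for img, lbl in zip(images, labels) if detector(img) == lbl)
    return correct / len(labels)

day_imgs, day_lbls = load_split("validation_daytime")        # hypothetical loader
night_imgs, night_lbls = load_split("validation_nighttime")  # data unlike the training set

print("daytime accuracy:  ", evaluate(pedestrian_detector, day_imgs, day_lbls))
print("nighttime accuracy:", evaluate(pedestrian_detector, night_imgs, night_lbls))
# If the detector was trained only on daylight imagery, expect the second
# number to be substantially lower than the first.
```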

While I am focusing on computer vision systems in this article, much of the discussion also applies to machine learning and artificial intelligence based systems more broadly.

The Promise and Perils of Computer Vision Systems

Computer vision, and for that matter even AI, remained academic curiosities for decades, as researchers tickled themselves pink with cute theory and algorithms that worked on a limited set of data. Full disclosure: yours truly has also been guilty of this. Industry at large did not really take computer vision too seriously, although many large companies did, and do, have a staff of motivated computer vision scientists working on tough problems. Until a few years ago, it was clear that other than in niche areas like visual inspection or machine vision, computer vision “in the wild” was a promise yet to be fulfilled. Getting acceptable performance for a given task was, and still is, quite a challenge, requiring a lot of art. So there was little hype, and businesses did not invest too heavily in computer vision systems. I hasten to add that pockets of teams in various labs around the world have been pioneers and have thought about industrial applications of computer vision in a systematic way. There have been more than a handful of success stories - my colleagues and I have delivered highly successful vision systems in many cases, while failing in some others. That got me thinking about where and why computer vision systems succeed or fail, which is the main focus of this article.

While previously there was some skepticism about the viability of computer vision systems, the sentiment is different today: Computer vision applications are viewed with great promise and are even considered easy. To repeat a cliche, deep and convolutional neural networks, coupled with large amounts of training data, have revolutionized classic computer vision tasks such as object detection, classification, and image segmentation. New network architectures besting the state of the art for many standard computer vision tasks are being churned out regularly, with code and models freely distributed on top of software frameworks that are also free! To be sure, this is a highly positive development for the field. It has started a gold rush - established companies and startups are racing to build computer vision systems for problems that now suddenly appear solvable. However, a trap is waiting: Very few people outside of the computer vision community, salivating at the prospect of computer vision solving a business problem, actually understand its limits. Their expectation is that such systems will work reliably in every situation, much like they expect traditional software to work, because that is their only reference point. Even those who do acknowledge that computer vision cannot produce correct results in all situations often do not know when and how to derive value out of such systems - and there is plenty of value to be derived. The computer vision research community is also partly to blame for implicitly overstating the effectiveness of its algorithms: you would be hard pressed to find a single paper that, while touting its superior performance curves, also talks seriously about its limits any more than in passing. Doing so would be cause for rejection by reviewers. So everybody ends up playing the game, and a typical “top conference” paper essentially becomes a sales pitch rather than a genuine attempt at understanding a problem and a proposed solution deeply.

So Venn Does it Really Make Sense to Use CV?

A quick personal background for where my insights come from: I have been building computer vision systems for industrial applications since about 2004. Starting as a fresh PhD, I cut my teeth at Siemens Corporate Research in the Real Time Vision and Modeling department, where I was exposed to a slew of computer vision problems arising in real-world situations at many Siemens business units: visual scene monitoring, change detection in highly dynamic scenes, person detection, tracking, people count estimation, and many others. Later, at a few companies in the valley, my team and I built several computer vision systems for 2D and 3D maps. We also got the opportunity to work with very noisy GPS and MEMS data from phones for maps-related projects, ETA estimation, and activity recognition. Currently I work on problems in video and sensor based surveillance.

From my experience, it makes sense to embark on a computer vision solution to a problem when a confluence of factors holds true, as in the diagram below. I can classify the projects I’ve worked on into successes and failures based on whether or not they fell in the intersection of the three factors. In fact, the focus of computer vision scientists, engineers, solution architects, and product managers should be on creatively re-formulating problems so that they fall in this intersection.

Feasibility

The set of problems that CV can solve well is expanding, but it is still relatively small. It is not hard to find examples where the current state of the art is inadequate for a reliable customer-facing product. Tasks like gait based person identification, single and multi-person activity recognition in highly dynamic scenes from a single camera, change detection from satellite imagery, etc. remain largely unsolved, except in constrained settings. The precision and recall obtainable from state of the art algorithms, when deployed as-is in real-world settings, are pretty low. So clearly, if you seek a computer vision solution to your problem, the underlying algorithms need to have a minimal level of accuracy to deserve serious consideration. Of course, as I said before, one can stretch the limits of feasibility by formulating problems ingeniously so that they become feasible. Computer vision scientists and engineers working in industry can be a lot more effective if they can simultaneously wear the product management hat. This will help them close the gap between what is theoretically possible and what is practically useful.

Errors

This factor is a common bane of computer vision systems. You come up with an algorithm for a vision task, implement it, test it, find good performance in most cases, and your demo is a huge hit with all stakeholders. The great demo turns out to be a big problem in disguise, because it unconsciously sets up an expectation that your system is going to work in every situation. Of course, the “demo versus reality” problem plagues even traditional software development. The key difference is that for computer vision systems, perfect accuracy is pretty much impossible, whereas for traditional software systems, given enough time, an application can achieve perfect accuracy in every conceivable situation it is deployed in. Remember how clunky voice over IP was in the early 2000s? Given a decade or so of hardening, it became mainstream. What many pundits get wrong about computer vision systems is that they assume that, given enough time for hardening, these systems will become just as reliable. This is incorrect. We need to take as an axiom that a given computer vision system cannot be 100% accurate - recall the arguments in the first section that reinforce this point.

The next logical question is whether the lack of 100% accuracy automatically invalidates the use of computer vision in a practical application. The answer is a resounding no. Computer vision systems are characterized by their precision and recall, or alternatively, their Receiver Operating Characteristic (ROC) curve. The curve tells us how effective the underlying algorithm is, and it has a “Heisenberg’s uncertainty principle”-like feel to it. Taking object detection as an example, one can never simultaneously detect 100% of the instances of an object of interest in an image and have no false positives in the output. If you want 100% of the objects detected, you will need to tolerate false positives. And if false positives are too costly, you will need to accept some missed detections. One can operate anywhere along the curve, but is forced to accept some combination of missed detections and false positives. The sketch below makes this trade-off concrete.
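
A minimal, self-contained sketch of that trade-off: sweep a detector's confidence threshold and compute precision and recall at each setting. The confidence scores and ground-truth labels below are made-up toy values, not the output of any real detector.

```python
# Toy illustration: precision rises and recall falls as the detection
# threshold is raised, and no threshold gives 100% of both.

def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule 'detect if score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up detector confidences and whether each candidate truly is the object.
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    1]

for t in (0.9, 0.6, 0.3, 0.05):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Lowering the threshold recovers more true objects (recall goes up) but
# lets in more false positives (precision goes down).
```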

Applications that need perfect accuracy can still be made possible with computer vision by leveraging an old trick: Use the combination of “machine + humans”. Computer vision provides some output, and an efficient cleanup tool plus some minimal manual labor cleans up that output. One chooses an operating point along the precision/recall curve which minimizes the total cost; a rough sketch of that calculation follows. I will discuss an example of this approach below. This works pretty well as long as you don’t run into a pathological case where, in order to get perfect accuracy, the user needs to examine every image, or video clip, and verify the output. Especially if the manual work needed per image is small to begin with, having to verify every output essentially negates any efficiencies that automation can provide. I will provide an example of this failure mode below.
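
Here is how picking that operating point might look, under assumed costs: every false positive must be manually rejected, and every missed detection must be manually found and fixed. All of the numbers below - the per-item costs, the thresholds, and the recall / false positive figures - are illustrative assumptions, not measurements from any real system.

```python
# Pick the operating point that minimizes total manual cleanup time,
# given assumed per-item costs and assumed detector behavior.

REVIEW_COST_PER_FP = 0.05    # minutes to reject one false positive (assumption)
FIX_COST_PER_MISS = 2.00     # minutes to find and add one missed object (assumption)
OBJECTS_PER_1K_IMAGES = 500  # assumed density of true objects

# (threshold, recall, false positives per 1,000 images) - illustrative values.
operating_points = [
    {"threshold": 0.9, "recall": 0.70, "fp_per_1k": 20},
    {"threshold": 0.6, "recall": 0.90, "fp_per_1k": 150},
    {"threshold": 0.3, "recall": 0.98, "fp_per_1k": 900},
]

def cleanup_minutes_per_1k(op):
    misses = OBJECTS_PER_1K_IMAGES * (1.0 - op["recall"])
    return op["fp_per_1k"] * REVIEW_COST_PER_FP + misses * FIX_COST_PER_MISS

for op in operating_points:
    print(f"threshold={op['threshold']}  cleanup={cleanup_minutes_per_1k(op):.1f} min per 1k images")

best = min(operating_points, key=cleanup_minutes_per_1k)
print("cheapest operating point: threshold", best["threshold"])
# With these (made-up) costs the low threshold wins: false positives are cheap
# to reject, while missed detections are expensive to find by hand.
```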

Another class of product where computer vision can be used, despite the lack of perfect accuracy, is one where imagery data is not currently exploited at all, even manually. Here, results obtained via computer vision can provide great value if correct. Along with quick and efficient tools that make manual clean-up of false positives cheap, this can create a viable solution. On the other hand, if manual work is out of the question, one chooses an operating point where false positives are zero or close to zero. In this case, computer vision output can be used as-is, because it will consist only, or almost only, of true positives, and can be put to use immediately. Entirely new lines of business get opened up in such cases. My previous work has three great examples, which I unfortunately cannot talk about due to company confidentiality.

Human Labor and Alternative Solutions

This factor is also a common bane of computer vision systems. Computer vision researchers, engineers, and even product managers can sometimes get unduly enamored with the coolness factor of a computer vision based solution to a problem, and fail to consider alternative technologies, or even purely manual tools, for a task. If a manual tool can enable humans to perform a task swiftly and cost-effectively, it makes little sense to invest in a costly development process for a computer vision based solution, which will not give you perfect accuracy anyway. A manual tool will guarantee 100% accuracy, because it falls in the realm of traditional software development. Aside from manual tools, alternative technologies such as LiDAR and other depth sensors can make a problem much easier than it would be with an imagery-only solution, and so should be considered where feasible. Here again, computer vision researchers, having come out of academia, sometimes unnecessarily approach a problem from a research perspective. The question needs to change from “How cool would it be to solve this with imagery alone?” to “What alternative sensors and/or manual tools can help simplify the computer vision problem and get us to a solution?”. If it is clear that a manual workflow would be prohibitively expensive, and that no alternative technology can solve the problem cost-effectively, then it makes sense to seek a computer vision solution - again, provided that the other two criteria, i.e. mistakes being eventually correctable and the existence of a feasible computer vision algorithm for the task, are satisfied.

A Few Examples

About a decade ago, a couple of colleagues and I built a system for estimating wait times in queues at airports in real time from overhead cameras. First, a completely manual solution was too cumbersome and costly. Second, the system only needed to be accurate to within a few minutes - there was no expectation of extremely high accuracy. Looking back now, it is clear that we were operating within at least two of the circles of the Venn diagram. All we needed was a feasible computer vision algorithm. While today we have deep net based person detectors that perform well, in those days machine learning based person detectors were not terribly accurate, and were too slow for real-time use. And visual trackers were, and still are, an active research topic, far from mature. We got out of this bind by exploiting the fact that the cameras were stationary, and by leveraging our previous work on change detection in dynamic scenes and multi-person detection from stationary cameras. Using mathematical modeling to reason about the sizes and shapes of the blobs output by the change detector, each representing a collection of people, and using simulations to generate training data, we were able to create a feasible computer vision solution. The solution achieved an error of less than a minute on complex and dynamic queues, in crowded and sparse scenes, and with lots of sudden ambient light changes.
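
This is not that system, but the core stationary-camera idea can be sketched in a few lines of OpenCV: learn a background model, extract foreground blobs, and treat their total area as a rough proxy for how many people are present. The video filename and the area-per-person constant are assumptions for illustration; the real system calibrated per camera and reasoned about blob sizes and shapes far more carefully.

```python
# Rough sketch of change-detection-based occupancy estimation from a
# stationary camera, using OpenCV's MOG2 background subtractor.

import cv2

AREA_PER_PERSON = 4000.0  # foreground pixels per person; would be calibrated per camera

cap = cv2.VideoCapture("queue.mp4")  # assumed input clip from a fixed overhead camera
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)
    # Drop shadow pixels (marked as 127 by MOG2) and clean up speckle noise.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Sum the areas of the remaining foreground blobs.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    foreground_area = sum(cv2.contourArea(c) for c in contours if cv2.contourArea(c) > 500)
    rough_count = foreground_area / AREA_PER_PERSON
    print(f"rough people estimate: {rough_count:.1f}")

cap.release()
```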

There are a few interesting examples from my own and my team’s work on 2D and 3D maps. Computer vision algorithms would detect outdoor scene elements such as natural scene text, building entrances, and road signs from street-level imagery, and others, such as building footprints, from overhead imagery. Certain map elements, such as the one-way-ness of roads and turn restrictions at intersections, need to be 100% correct. Errors in these can be costly, and even dangerous: imagine routing someone the wrong way up a one-way street, or asking them to make an illegal left turn. This necessitates manual tools for verifying such road attributes. Certain other map elements, like speed limits, business names and their locations, building shapes, and entrance locations, are not 100% critical, and at the same time are too laborious to extract completely manually. For such tasks, one can trade off slight losses in recall for huge gains in manual labor. Needless to say, the computer vision systems targeting the former type of elements struggled to show value, while those tackling the latter were successes, some of them huge. Some of my work with other types of sensor data fared similarly. Many of these successful computer vision systems predated deep neural networks. With a lot of training data and deep nets, I imagine these projects could have been even more successful.

Outside of my own and my colleagues’ work, a recent example where computer vision provided real value from images that were not previously exploited is a facial recognition system used by British police. A recent headline suggested that the system was deeply flawed because it had a 92% false positive rate. A closer examination revealed the following figures: The total number of people attending the event where the system was used was about 170,000. The system picked out 2,470 faces as potentially criminal. Of these, 2,297 were false matches, leading to the 92% false positive number. Which means that the remaining 173 were real criminals. If there were no facial recognition, the police would have needed to manually inspect 170,000 faces - and probably many times more, because each face would likely have been seen many times - a gargantuan task. With the facial recognition system, the police needed to scan only 2,470 faces, a roughly 68x reduction in the manual work needed, making the system quite practical. Further, humans likely would not have been able to spot all 173 criminals among the 170,000 faces. To be sure, there is still the possibility that a dangerous criminal was there but went undetected. We have no way of knowing whether this happened, but netting 173 criminals is a huge win: Likely, many crimes were averted and even lives saved. And this, with a computer vision system that had, remember, a 92% false positive rate!
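
The arithmetic behind those figures, restated as a quick sanity check:

```python
# Figures quoted in the example above.
total_faces   = 170_000   # approximate number of people attending the event
flagged       = 2_470     # faces the system flagged as potential matches
false_matches = 2_297     # flagged faces that turned out to be wrong

true_matches = flagged - false_matches         # 173 genuine matches
false_positive_rate = false_matches / flagged  # close to the headline 92% figure
workload_reduction = total_faces / flagged     # ~68x fewer faces for humans to review

print(true_matches, round(false_positive_rate, 2), round(workload_reduction, 1))
```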

Conclusion

Advances in computer vision, on top of ubiquitous imagery and deep learning, are creating exciting opportunities. At the same time, in ways previously unseen in the computer software world, the pitfalls are plentiful. The key is to be judicious about when to embark on, and when to stay away from, a computer vision solution to a given problem. Many seemingly insurmountable problems can suddenly become feasible with an ingenious re-formulation. I hope this article leaves the reader with a better understanding of the opportunities and pitfalls inherent in designing computer vision solutions, and a few ways around the pitfalls.