Teaching Machines to Read
Is it possible to have a computer read an unfamiliar passage of text, comprehend, and answer questions from it?
With all the hype around artificial intelligence and computer vision, it’s easy to fall into the trap of thinking that everything can be automated and every problem can be solved. Unfortunately it’s not quite as easy as that - our previous blog posts have discussed why computer vision is so challenging from both a mathematical and philosophical perspective, and why we can’t expect computer vision to be as good as biological vision (yet). But there are real applications of this technology, and it’s clear that computer vision is working in some places.
What sorts of questions should we be asking to figure out if computer vision is a good solution-fit for any technical or business problem?
A lot of machine learning and pattern recognition is based on the idea of being able to distinguish (i.e. detect or recognise) things from other things. For example, you might have video of rugby players on the sports field. Can computer vision distinguish the players from the crowd in the stands, or from the grass on the ground? “Discriminability” is a technical term that indicates how easily we can do this separation and distinguish different things.
Another example might be an app that looks at photos of shoes. Is there enough detail to separate photos of sandals from boots? What about separating photos of Nikes from Adidas? Assessing discriminability doesn’t necessarily require technical knowledge - it is about understanding whether or not there is enough underlying difference in the visual information to expect a computer vision algorithm to produce a reasonable result. We might logically reason that there are significant visual differences between sandals and boots, like the amount of material and the style of shoe, but sneakers made by Nike or Adidas might look similar enough to cause some misidentification or confusion.
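One rough way to put a number on discriminability is the sensitivity index d', which measures how far apart two groups sit relative to how spread out they are. The sketch below is only an illustration: the single "coverage" feature and every measurement in it are invented for this example, and a real computer vision system would work with many learned features rather than one hand-picked number.

```python
from statistics import mean, stdev


def d_prime(class_a, class_b):
    """Sensitivity index d': gap between class means in units of pooled spread."""
    pooled_sd = ((stdev(class_a) ** 2 + stdev(class_b) ** 2) / 2) ** 0.5
    return abs(mean(class_a) - mean(class_b)) / pooled_sd


# Hypothetical measurements, invented for illustration only.
# Feature: fraction of the foot covered by material (0 = bare, 1 = fully covered).
sandals = [0.20, 0.25, 0.30, 0.22, 0.28]
boots = [0.90, 0.95, 0.85, 0.92, 0.88]
nike_sneakers = [0.70, 0.75, 0.72, 0.68, 0.74]
adidas_sneakers = [0.71, 0.73, 0.69, 0.76, 0.70]

sandals_vs_boots = d_prime(sandals, boots)
nike_vs_adidas = d_prime(nike_sneakers, adidas_sneakers)

print(f"sandals vs boots: d' = {sandals_vs_boots:.1f}")
print(f"Nike vs Adidas:   d' = {nike_vs_adidas:.1f}")
```

A large d' suggests the two groups barely overlap on that feature, while a value near zero suggests the feature carries almost no information for telling them apart. It says nothing about whether some other feature (a logo, for instance) might do better.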
An example of a case where there isn’t sufficient discriminability is the recent controversy around the computer vision “gaydar” that claimed to determine the sexual orientation of people based on photos of their face alone. Putting aside the ethical issues and quandaries associated with this research, subsequent replication of the study has shown that there are actually no underlying visual features that can distinguish whether someone is gay or not. In other words, there is no visual discriminability - there’s no data or information basis for an algorithm to make a robust and repeatable decision. There may be other tools or data sources we might be able to use (for example, social media habits) but computer vision is simply the wrong tool for the job.
Computer vision relies on having a good model of knowledge - for example, what does a chair look like, and what makes it visually different from a table? In order to build a model, we have to give the algorithm examples of the data that we want it to analyse. Think of it like reading a picture book to a toddler - we point at the cat in the book and say “this is a cat”, and eventually the toddler understands that the picture corresponds to the word cat. But imagine if we only ever showed that toddler one picture of a grey Siamese cat, and then later in life they came across an orange tabby cat - would they be able to identify it as a “cat” too? The problem is that the word “cat” corresponds to a wide range of visual information, with a lot of variety in size, colour, shape, breed, and so on. To show the toddler only one example of a cat is to limit their understanding of what a cat is.
The same thing happens with machine learning - if we ask the algorithm to build a model of knowledge from a single instance of data, then that model will be of limited use when we show it real-world data that doesn’t quite match the original. There is a lot of research effort being applied to this problem (you might hear terms like online learning, semi-supervised learning, or one-shot learning, which refer to algorithms that build a model from very small amounts of labelled data and then adapt over time). But at least for now, to successfully use a computer vision model you need a lot of training data to help the algorithm build a robust model that reflects the real world.
At a minimum, this means thousands (if not hundreds of thousands) of photos, segmented or labelled with the things you are interested in. If you wanted to use computer vision to monitor fishing stocks in a salmon farm, then you would need some photos segmented by a human to indicate where in the image the salmon are, and then a label that says how many fish there are in that image. When we feed these images into the algorithm, it uses the human-given labels as the “correct answer” and tries to learn so that the internal model represents those correct answers. If you don’t have access to existing images of what you’re looking for, then it will be very difficult for a computer vision algorithm to produce good results. Even if you do have images, you need to be prepared to face up to the cost of getting humans (who in some cases need to be domain experts) to label those images.
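To make the effect of limited training data concrete, here is a minimal sketch using a nearest-neighbour classifier, chosen only because it is easy to read and not because it is how modern computer vision systems work. The two features (colour warmth and relative body size, both scaled from 0 to 1) and all of the labelled examples are hypothetical.

```python
from math import dist


def nearest_neighbour(training_set, query):
    """Return the label of the labelled example closest to the query."""
    _features, label = min(training_set, key=lambda example: dist(example[0], query))
    return label


# Hypothetical features: (colour_warmth, relative_size), both between 0 and 1.
# The labels play the role of the human-given "correct answers".
single_examples = [
    ((0.10, 0.20), "cat"),  # one grey Siamese cat
    ((0.80, 0.70), "dog"),  # one golden dog
]

varied_examples = single_examples + [
    ((0.85, 0.22), "cat"),  # ginger cat
    ((0.50, 0.18), "cat"),  # brown cat
    ((0.30, 0.25), "cat"),  # grey tabby
    ((0.20, 0.80), "dog"),  # grey dog
    ((0.50, 0.60), "dog"),  # brown dog
    ((0.90, 0.75), "dog"),  # red dog
]

orange_tabby = (0.90, 0.25)  # an animal the model has never seen before

prediction_single = nearest_neighbour(single_examples, orange_tabby)
prediction_varied = nearest_neighbour(varied_examples, orange_tabby)

print("trained on one example per class:", prediction_single)
print("trained on varied examples:      ", prediction_varied)
```

With only the grey cat to go on, the orange tabby sits closer to the golden dog and is mislabelled. Once the training set covers more of the variety that the word "cat" really spans, the same algorithm gets it right.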
A lot of engineering depends on the tolerances around accuracy. If we are building a fruit sorting machine that separates apples and oranges, a couple of mistakes here and there probably won’t make a huge difference and people can pick out the wrong fruit later on. If we are building a surgery robot that cuts out brain tumours, then an error of even one millimetre could kill the patient. So we need to understand - what sort of accuracy rates do we need, and what is the cost or consequence of a mistake?
It is very rare for a computer vision algorithm to be 100% accurate. In fact, some researchers argue that it is impossible for a modern deep neural network-based algorithm to be 100% accurate. From a statistical perspective, unless we show the algorithm every single possible data combination when it is building the model, there will be some error in how the model represents the real world (even if it is very small). And if we had training data available for every possible case, we could easily solve the problem without the use of a deep neural network.
In some target applications the accuracy rates can be quite good - optical character recognition for printed characters is pretty close to 99% accurate and considered “solved”. But there are still a lot of challenging areas - determining whether a person is attacking another person in a piece of video footage is under 30% accurate. So before we put a lot of effort into developing a computer vision algorithm to solve a problem, we should understand what sort of accuracy we might expect, based on the performance of algorithms in similar or adjacent tasks, and what level of accuracy is acceptable for the end user. If these two don’t match, we can save a lot of time and money by moving on and trying to solve the problem in a different way. Developers can optimise their designs and work hard to incrementally increase the accuracy of a computer vision system, but this comes at a significant development cost, and managers need to ask the hard question of whether an increase in accuracy of 0.1% is worth the spend.
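The question of whether a 0.1% accuracy gain is worth paying for can be roughed out with simple arithmetic: multiply the error rate by the number of decisions and by the cost of each mistake, then compare the saving against the development cost. The sketch below assumes every error costs the same and compares a single year of savings against a one-off spend. All of the figures are invented for illustration.

```python
def expected_error_cost(accuracy, decisions_per_year, cost_per_error):
    """Expected yearly cost of mistakes at a given accuracy rate."""
    return (1 - accuracy) * decisions_per_year * cost_per_error


def yearly_saving(current_accuracy, improved_accuracy, decisions_per_year, cost_per_error):
    """Yearly reduction in error cost gained by the accuracy improvement."""
    before = expected_error_cost(current_accuracy, decisions_per_year, cost_per_error)
    after = expected_error_cost(improved_accuracy, decisions_per_year, cost_per_error)
    return before - after


# Hypothetical figures, invented for illustration only.
DEVELOPMENT_COST = 20_000  # one-off cost of squeezing out an extra 0.1% accuracy

# Fruit sorter: a million cheap decisions a year, each mistake costs 5 cents.
fruit_saving = yearly_saving(0.980, 0.981, 1_000_000, 0.05)

# High-stakes inspection: fewer decisions, but each mistake costs 5,000.
high_stakes_saving = yearly_saving(0.980, 0.981, 10_000, 5_000)

for name, saving in [("fruit sorter", fruit_saving), ("high-stakes", high_stakes_saving)]:
    pays_off = saving > DEVELOPMENT_COST
    print(f"{name}: saves about {saving:,.0f} per year, pays off in year one: {pays_off}")
```

The same 0.1% improvement is negligible in one setting and easily justified in the other, which is why the cost of a mistake matters as much as the headline accuracy figure.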
Apart from designing the computer vision system itself, it’s important to consider how ready the rest of the technical environment is. AI systems consume data, and so ideally there is an automated way to collect and feed in the data. This can be a non-trivial task - we could consider an application where a farmer wants to use a drone to monitor where their cattle are on a daily basis. If the farmer has to manually pilot the drone, pull an SD card out when the drone returns to base, copy-and-paste a file onto the computer, connect to expensive satellite internet, and then upload the video footage to a web portal, then the farmer may opt not to use the system at all. Interoperability of technical systems can be surprisingly difficult, so we need to evaluate how easily the data can be collected and fed into the algorithm.
Additionally, the output of an AI algorithm is rarely directly usable - we might need to combine it with other data or information in order to make a complete decision. For example, we might have a computer vision system monitoring bus stops to figure out how many people are waiting for buses. Normally the bus operator sends three buses down a particular route at 3pm, but based on the actual number of people waiting, an AI system might recommend sending only two buses and redeploying the third somewhere else. The manager making that call likely needs other information too, like how many staff and buses are currently available, whether there are any special events happening in the area, and what the weather is like outside. Making sure that this information is also upfront and visible to the manager helps them make a good decision, rather than assuming that the computer is correct and following the recommendation blindly.
The last question that we pose is not technical at all - it’s about the people and culture in the environment where the computer vision system might be implemented. You can spend a lot of money developing a complicated technical solution, but at the end of the day the end users have to be willing to use it. This is especially important in situations where computer vision is used as an analytical tool rather than for automated decision-making, and a human has to interpret the results and then decide on a course of action.
For example, IBM has been deploying their Watson artificial intelligence engine to medical applications for the last decade. They have produced some really impressive results, including achieving >90% accuracy at diagnosing certain types of cancers from radiology images. But at some hospitals where pilot trials were conducted, doctors ignored the AI system’s recommendations: when they disagreed with the AI, they preferred to trust their own judgement, even though the human doctors had lower accuracy rates than Watson. To make matters worse, patients consistently show a reluctance to trust the results provided by medical AI systems. As a result, some hospitals that were trialling the software have now decided not to use it, even though IBM would argue that it would lead to better clinical outcomes.
Managing expectations, and aligning those expectations as closely as possible to reality, is critical for encouraging adoption. While these algorithms are often hyped up, they are very rarely 100% accurate and the likelihood of there being some error is relatively high. People need to be prepared for this and know how to react appropriately. If people are likely to be replaced by automation, then they need to be given the knowledge and tools to prepare for that, or you might find that they will sabotage the system instead. It’s generally helpful to explain that these AI tools can help people do their jobs more efficiently, augmenting their existing roles.
In this article, we’ve covered five screening questions that you can ask when considering whether a computer vision system might be appropriate for the problem you are trying to solve. They include technical considerations about whether computer vision is capable of solving the problem, requirements engineering considerations about the needed level of accuracy, and broader people and culture considerations about how the system will actually be used. These are not the only questions, and there are many more to consider, like computation speed requirements, scalability, and budget limitations, but these five are a good place to start. In the next article, we’ll talk about some of the computer vision technology platforms that enable developers to build these solutions.
Think Computer Vision could be used to solve your technical or business problems?
Contact us to see how we can help you out!