Surveillance for security operations is normally performed by humans, who provide visual intelligence. However, this work can be dull and dangerous. In the future, machines or robots might be able to automatically recognize “suspicious behavior” or identify and interpret the acts of a potential terrorist.
However, for robots and other machines to become more useful in security applications, they must detect human activities that indicate a possible threat early enough for security forces to take adequate countermeasures. The machine must also be able to detect people in the scene, track them, and recognize their activities. Research is focused on real-time person detection and tracking, motion features, and 2D pose classification algorithms to identify both the likely activity and the possible intent.
Persistent surveillance has been pegged as a crucial capability in current and future military operations. DARPA launched the Mind’s Eye program in 2011 to improve conditions for warfighters on the ground. The agency worked with the U.S. Army, industry, and academia to create a way to educate video collection devices. “Although existing cameras and sensors capture activity in an area, the mounds of visual data they collect are overwhelming to analysts and warfighters alike. Once visual intelligence is achieved, these information mountains will become actionable knowledge molehills that can be sent to commanders and perhaps directly to warfighters’ handheld computers in the field.”
Automated recognition of human activities in video streams in real time
Early detection of human activities that indicate a possible threat is needed to protect military bases or other important infrastructure. Currently, human observers are much better than computers at detecting human activities in videos.
However, human operators have limitations. For example, an area is often covered by many cameras, but an operator can watch only one of them at a time. Fatigue also limits how long an operator can work effectively. In military situations, resources are limited, and a full-time operator may not be available at all. For these reasons, it is desirable that computers assist in such surveillance in the future. But for that to become reality, the computer system must be able to detect people in the scene, track them, and recognize their activities.
Automated recognition of human activities is a true challenge because activities occur in many ways. There are activities that are performed by one person (running), by two people (fighting), one person with an item (pickup), two people with an item (exchange), and one person interacting with the environment (digging). Recognition of a wide range of human activities requires that the system be able to represent all of these elements. To identify the focus of attention, people must be distinguished from other parts of the scene. To capture walking patterns and associate multiple observations over time, for example, people must be tracked. To analyze their activity, their movement and appearance must be described. Then, using all of this information, we must determine what the people are doing.
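The tracking and description steps above can be illustrated with a minimal sketch. It assumes a person detector (not shown) already supplies one list of bounding boxes per frame, and it links boxes across frames by greedy overlap matching, a common simple technique that is not necessarily what any of the systems discussed here use. The names `Tracker`, `iou`, `mean_speed`, and the `min_iou` threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tracker:
    """Greedy overlap tracker (illustrative; tracks are never retired)."""
    min_iou: float = 0.3  # assumed threshold for "same person"
    next_id: int = 0
    tracks: Dict[int, List[Box]] = field(default_factory=dict)

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Assign this frame's detections to track ids."""
        assigned: Dict[int, Box] = {}
        unmatched = list(detections)
        # Each existing track claims the detection that overlaps its last box most.
        for track_id, history in self.tracks.items():
            if not unmatched:
                break
            best = max(unmatched, key=lambda d: iou(history[-1], d))
            if iou(history[-1], best) >= self.min_iou:
                history.append(best)
                assigned[track_id] = best
                unmatched.remove(best)
        # Detections nobody claimed start new tracks.
        for det in unmatched:
            self.tracks[self.next_id] = [det]
            assigned[self.next_id] = det
            self.next_id += 1
        return assigned


def mean_speed(history: List[Box]) -> float:
    """A very simple motion feature: mean box-center displacement per frame."""
    if len(history) < 2:
        return 0.0
    centers = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for b in history]
    steps = [hypot(x2 - x1, y2 - y1)
             for (x1, y1), (x2, y2) in zip(centers, centers[1:])]
    return sum(steps) / len(steps)
```

A feature such as `mean_speed` would be only one input to the final step; a real system would combine many motion and appearance features in a trained classifier to decide what each tracked person is doing.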
Netherlands Organization for Applied Scientific Research (TNO) and the iCub
Sebastiaan van den Broek and colleagues from the Netherlands Organization for Applied Scientific Research (TNO) described, in an SPIE publication, a real-time system they developed under DARPA’s Mind’s Eye program for recognizing a set of human activities in video streams. The system, demonstrated during a field trial organized by the US Army, successfully detected activities such as a person digging, picking up items, or placing items in a scene. According to the researchers, they are now exploring methods to recognize scenarios, i.e., compounds or sequences of human activities. For instance, the placement of an improvised explosive device involves a sequence of actions.
“Automatically segmenting and recognizing an activity from videos is a challenging task, mainly because the execution of a similar activity could be performed in many different manners depending on the person or the place”, write Karinne Ramirez-Amaro, Michael Beetz, and Gordon Cheng. For example, a person preparing a pancake in their own kitchen may follow a predefined pattern, while the same person preparing a pancake in an office kitchen under time pressure will follow another pattern even though the task is the same. The authors propose a framework that combines the information from different signals via semantic reasoning to enable robots to segment and recognize human activities by understanding what they see in videos.
“Another important aspect of our system is its scalability and adaptability toward new activities, which can be learned on-demand. Our system has been fully implemented on a humanoid robot, the iCub to experimentally validate the performance and the robustness of our system during on-line execution of the robot.”
DARPA’s Mind’s Eye Program
DARPA’s Mind’s Eye Program aims to develop a smart camera surveillance system that can autonomously monitor a scene and report back human-readable text descriptions of activities that occur in the video. DARPA is sponsoring research to develop systems that will recognize human activities like walking, touching an object, or taking other actions. The aim of the program is to build better vision systems that could interpret meaningful actions on the battlefield, such as enemy troop movements.
Michael C. Burl, Russell L. Knight, and Kimberly K. Furuya of Caltech, working for NASA’s Jet Propulsion Laboratory, have developed software for detecting carried and dropped objects in surveillance video. An important aspect of activity recognition is whether objects are brought into the scene, exchanged between persons, left behind, picked up, etc.
While some objects can be detected with an object-specific recognizer, many others are not well suited for this type of approach. For example, a carried object may be too small relative to the resolution of the camera to be easily identifiable, or an unusual object, such as an improvised explosive device, may be too rare or unique in its appearance to have a dedicated recognizer. Hence, a generic object detection capability, which can locate objects without a specific model of what to look for, is used. This approach can detect objects even when partially occluded or overlapping with humans in the scene.
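To make the idea of detecting objects “without a specific model of what to look for” concrete, here is a minimal sketch based on background differencing. This is a swapped-in textbook technique used purely for illustration; the source does not describe how the JPL software works internally. The sketch assumes grayscale frames given as nested lists of pixel intensities and an empty-scene background image, and the function names and the `threshold` value are hypothetical.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]   # grayscale image: rows of pixel intensities
Mask = List[List[bool]]


def foreground_mask(background: Frame, frame: Frame, threshold: int = 30) -> Mask:
    """Mark pixels that differ noticeably from the empty-scene background."""
    return [[abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]


def static_foreground(background: Frame, frames: List[Frame],
                      threshold: int = 30) -> Mask:
    """Pixels that are foreground in every frame: candidates for a
    left-behind object, since moving people do not stay on one pixel."""
    height, width = len(background), len(background[0])
    if not frames:
        return [[False] * width for _ in range(height)]
    masks = [foreground_mask(background, f, threshold) for f in frames]
    return [[all(m[r][c] for m in masks) for c in range(width)]
            for r in range(height)]


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Smallest (top, left, bottom, right) box, inclusive, around all marked
    pixels, or None if the mask is empty."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, marked in enumerate(row) if marked]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))
```

Nothing in this sketch knows what the object looks like; it only asks which pixels changed and stayed changed. A practical system would also need to handle lighting changes, camera noise, and occlusion by people, which simple differencing does not.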
Army’s Grant to University of Texas at San Antonio
University of Texas at San Antonio computer science professor Qi Tian won a $399,067 grant from the Department of the Army to make it easier to comb through surveillance videos. “It’s large-scale image retrieval search,” Tian said. “This isn’t just big data, this is very big data.”
UTSA officials used the 2013 Boston Marathon bombing as an example, where investigators scoured massive amounts of surveillance video to correctly identify the Tsarnaev brothers among the thousands of spectators.
Tian said he and his team are developing technology that will be able to capture an individual’s face in a crowd.
The solution will then cross-reference large amounts of surveillance video from other locations throughout a city or a country to find accurate matches. “You can find the bad guys a little quicker,” he said. “Otherwise, you’re sitting and looking at an unimaginable number of surveillance videos, looking for this person.”
UTSA officials say there are also potential business applications for this technology.
“We teach the computer to see, to recognize a world or an object, and yes, a person’s face, which can be especially challenging. That’s the future. One day you might not need your credit card. You can pay with your face.”
Using Deep Learning to Make Video Surveillance Smarter
Camio, which offers an app that lets a smartphone or tablet act as a surveillance camera and works with some individual cameras, already uses machine learning to point out the most significant events captured by a user’s camera that day and to let users search for vehicles and passersby as they come and go.
“Now Camio is expanding its use of artificial neural networks—a machine-learning technique that draws on the way networks of neurons in the brain adapt to new information—to enable users to search their recordings for several trickier-to-identify objects like cats, dogs, bikes, trucks, and packages,” as reported by Rachel Metz.
While some other consumer surveillance cameras like Nest Cam can send users alerts based on motion, sound, and face detection, the use of deep learning could lead to much more nuanced observations of what’s being picked up by a camera lens.
Camio, based in San Mateo, California, determines that something being recorded by a user’s camera is interesting by detecting a significant amount of motion in a scene. Camio cofounder and CEO Carter Maslan says the company currently uses neural networks with each video camera that concurrently vote on what they think the user would consider interesting. The technology is proved right or wrong based on videos the user eventually opens, plays, and deletes. Users can help the system learn by giving clips a thumbs up or down.
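The voting-and-feedback loop described above can be sketched with the classic weighted majority scheme: each voter has a weight, the weighted vote decides whether a clip is interesting, and voters that disagree with the user's thumbs up or down lose influence. This is an illustrative stand-in, not Camio's actual method. The voters here are plain functions rather than neural networks, and the class name, the `penalty` factor, and the clip features are all hypothetical.

```python
from typing import Callable, Dict, List

Clip = Dict[str, float]          # hypothetical per-clip features
Voter = Callable[[Clip], bool]   # stand-in for one trained network


class VotingEnsemble:
    """Weighted majority vote with multiplicative penalties on mistakes."""

    def __init__(self, voters: List[Voter], penalty: float = 0.5) -> None:
        self.voters = voters
        self.weights = [1.0] * len(voters)
        self.penalty = penalty  # assumed factor applied to a wrong voter's weight

    def is_interesting(self, clip: Clip) -> bool:
        """True if voters holding more than half the total weight say yes."""
        yes = sum(w for voter, w in zip(self.voters, self.weights) if voter(clip))
        return yes > sum(self.weights) / 2

    def feedback(self, clip: Clip, thumbs_up: bool) -> None:
        """Shrink the weight of every voter that disagreed with the user."""
        for i, voter in enumerate(self.voters):
            if voter(clip) != thumbs_up:
                self.weights[i] *= self.penalty
```

For example, with three voters that each look at a different feature, a single thumbs down on a clip that two of them liked is enough to shift the balance toward the voter that was right, so the same clip would no longer be flagged.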
Maslan says the heaviest computational work—and therefore the most expensive—involves figuring out exactly what’s happening within the bits of video that are determined to be interesting. Typically, he says, this is about a minute’s worth of footage each day, so Camio is just using neural networks to further analyze that bit of video, rather than slogging through what was shot over the whole day.
However, visual surveillance also raises concerns about breaches of human rights, including the right to privacy and data protection as well as the freedom of movement, expression, and association.