As we were watching the materials from ECCV 2012 in Firenze, we found the talk by PhD candidate Jianxiong Xiao of the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT) especially interesting. Its combination of Computer Vision, Digital Humanities and Computer Science made for a unique and entertaining lecture on an interesting subject. Of course we decided to add him to our PhD interviews series.
A portrait of Jianxiong Xiao composed of thousands of real photos from SUN database without any modification.
1. Please tell us a little bit about your background. Who is Jianxiong Xiao? When did you become interested in scene and object recognition, data-driven approaches, dataset issues, etc.? How did you come to Antonio Torralba’s Lab at MIT, and what’s it like to be a PhD student there?
I was a normal kid until I had my first computer. After that, I realized that I could implement any crazy idea as software on a computer, and quickly my role transitioned from an intense user to a software engineer, a hacker, a reverse engineer, and eventually a computer scientist. I have always been fascinated by trying to address difficult and important questions that could have a huge positive impact on mankind. I believe that my talent comes with a responsibility to make our world a better place, not just my own life. I chose a scientific career to study “intelligence”, one of the most complicated questions in my opinion. To make this abstract problem very concrete, I take a computational approach, the goal of which is to build a computer system as smart as a human, and at the same time to make humans even smarter with the assistance of a computer.
In particular, I chose to study “vision”, one of the major pieces of the mystery, with about 30 percent of the brain devoted to it. While vision is a skill remarkably trivial for human beings and animals, it has proven remarkably difficult for computers. For example, while we have voice recognition systems that work fairly well (e.g. Apple Siri), there are no vision systems yet that perform at a comparable level of accuracy and robustness. While the field of “computer vision” has been around for about 50 years, we have made significant progress in the past decade, largely attributable to the huge amount of visual data at our disposal. For example, face detection, the automatic prediction of a rectangular box around your face that is now available in many cameras and smartphones, is the result of a big data approach to computer vision. While detecting faces is a much simpler problem than recognizing scenes and objects (because the shape of a face is mostly rigid compared to a scene or an object such as a cat), I believe that handling big visual data properly is the key to solving the problem of vision.
With this goal in mind, I came to Antonio Torralba’s Lab at MIT to pursue my PhD degree, because he is one of the major figures promoting big data. After three and a half years, I can safely say that it was one of the greatest matches I could imagine, because we are both crazy scientists! One special aspect of MIT is that everyone’s life there is very unique and diverse. My life at MIT is full of adventures with my research. Yes, I’ve had some tough moments with a lot of stress, not because of any external pressures, but because the problem we are trying to solve is so challenging and we always have very high expectations of ourselves, so high that sometimes no one can reach them. However, having smart people working together with a lot of freedom and resources often leads to fruitful collaborations and great inventions.
2. Your paper “Reconstructing the World’s Museums”, co-authored with Yasutaka Furukawa, won the best student paper award at the 12th European Conference on Computer Vision (ECCV 2012). What is the main novelty proposed in your article? It seems it connects Computer Vision with the Digital Humanities, or at least the Arts. Actually, the Google Art Project provided you with access to the museum data, right? What was the background of this idea, and why museums in particular?
I decided to spend the summer of 2011 in Seattle at Google. I was fortunate enough to work with Dr. Yasutaka Furukawa in the very productive team led by Prof. Steven Seitz. As the team is part of the Google Maps team, my goal was to think about what is next for better mapping, and to produce the next generation of maps to make lives easier by leveraging computer vision technology. Soon I realized that photorealistic maps are a useful navigational guide, especially for people seeing a place for the first time. Nowadays, many mapping systems, e.g. Google Maps, Microsoft Bing Maps and Apple Maps, have photorealistic maps for outdoor environments, while there is no counterpart for indoors. The closest option is Google indoor Street View imagery, which does not work well for large indoor environments such as museums.
Indoor environments are very important and likely to be the next big thing for the consumer mapping industry, especially for places like Boston, which have long winters during which most people stay indoors. People often explore museums virtually, with either blueprint-style 2D maps that lack photorealistic views of exhibits, or ground-level images, which are immersive but ill-suited for navigation. Therefore, we designed a technique for generating photorealistic 3D aerial maps of indoor scenes, which are useful both for viewing exhibits and for guiding navigation in large museums. Since it is not possible to take aerial pictures of indoor scenes, we proposed a 3D reconstruction and visualization system that automatically produces clean, well-regularized, texture-mapped 3D models of large museums from ground-level photographs and 3D laser points. The key component is a new algorithm called “Inverse CSG” for reconstructing a scene in a Constructive Solid Geometry (CSG) representation consisting of volumetric primitives such as cuboids. Our system enables users to easily browse a museum, locate specific pieces of art, fly to a place of interest, view immersive ground-level panoramic views, and zoom out again, all with seamless 3D transitions. Our system is the first to construct indoor aerial maps and the first to provide seamless 3D transitions between aerial maps and ground-level images for indoor scenes. Further, it is really useful as a navigation tool, helping map users find what they want easily and quickly.
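To give a rough feel for what a CSG representation built from cuboid primitives looks like, here is a minimal sketch. The class names and the point-membership query are hypothetical, made up for illustration; the paper's "Inverse CSG" algorithm fits such primitives to laser points automatically rather than composing them by hand as done here.

```python
from dataclasses import dataclass

@dataclass
class Cuboid:
    """Axis-aligned volumetric primitive given by its min/max corners."""
    xmin: float; ymin: float; zmin: float
    xmax: float; ymax: float; zmax: float

    def contains(self, x, y, z):
        return (self.xmin <= x <= self.xmax and
                self.ymin <= y <= self.ymax and
                self.zmin <= z <= self.zmax)

class CSGUnion:
    """Model free space as the union of cuboid primitives."""
    def __init__(self, primitives):
        self.primitives = primitives

    def contains(self, x, y, z):
        # A point is inside the scene if any primitive contains it.
        return any(p.contains(x, y, z) for p in self.primitives)

# Two overlapping "rooms" joined into one scene by their union.
hall = Cuboid(0, 0, 0, 10, 5, 3)
wing = Cuboid(8, 0, 0, 15, 8, 3)
museum = CSGUnion([hall, wing])
print(museum.contains(9, 4, 1))   # True: inside the overlap region
print(museum.contains(20, 0, 0))  # False: outside both cuboids
```

The appeal of such a representation over a raw point cloud is that the model is clean and well-regularized by construction: walls are planar, corners are sharp, and the whole scene is a short list of parameters rather than millions of points.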
We applied our system to various large museums, mostly because museums are a great place to demonstrate our idea! Let’s say a map user wants to locate a painting he has in mind. Most paintings are very visual, and people remember them more by their appearance than by their names. Thus, a photorealistic map would be very helpful for quickly finding what they want. Additionally, one of the greatest benefits of an internship at Google is that you can access the huge amount of data they have captured. The Google Art Project had already done an amazing job of capturing a lot of museum data in great quality. At the same time, working with museums also made my research experience more enjoyable. While writing and debugging the source code for my system, I enjoyed “touring” many museums at the same time. Besides providing an efficient navigation tool, this project also extends the boundary of how we appreciate art. As you know, when we observe a great painting, we look at it at different scales, from the whole painting down to a specific region. Similarly, a museum as a whole is also a great piece of art, of which we can normally enjoy only a specific region at a time. Our bird’s-eye view rendering of the museum provides a holistic way to appreciate it, i.e. to have a complete view of the art. When I was invited to give a keynote talk at an art-related workshop, many real artists and art lovers enjoyed this part the most, and loved having this new way to see museums as a whole, to appreciate the museum art better. In a broader sense, our technology can also be used to preserve a museum digitally and to provide a historical record over time for the digital humanities.
3. Your talk at VideoLectures.Net has had over 700 views since the video was published in mid-November (12/11/2012). Do you think the video at VideoLectures will add value to the dissemination of your paper and the promotion of your lab’s work at MIT?
Yes, in this Internet age with multimedia easily accessible, the most effective way to communicate an idea is to record a video. In particular, for scientific research, the best way to understand an idea is often to attend a talk and hear about it from the creator’s point of view. At the conference, there was an audience of more than one thousand listening to the talk you mention. Despite that, given the size of the computer vision community, there are many more people around the world who cannot make it to the conference. A powerful idea usually leads to the next powerful idea. VideoLectures.Net makes the talk accessible to everyone with an Internet connection. Thus, it is definitely valuable to the scientific community! Many colleagues and I greatly appreciate your efforts to contribute to the scientific community in general.
Jianxiong Xiao @ ECCV 2012
4. In addition, you have 420 citations and an h-index of 9. How would you summarize your research in general? Your most cited paper is called “SUN database”. What is it?
I am excited to conduct research in a fast-growing field: computer vision, the main goal of which is to build machines that can automatically understand 3D scenes from 2D images. By “scene” we mean a place that a human can act within and navigate, such as an office, a street, or a beach. Our visual world is extraordinarily complex, and this makes it difficult for computers to understand scenes.
Big data has led to a major rethinking of visual scene recognition. The subject of my doctoral thesis is “A Big and Rich Data Approach to Scene Understanding: Unifying Recognition and Reconstruction.” I focus on exploiting big visual data to build computer systems that can understand visual scenes, both extracting 3D information and inferring the semantics for a large variety of environments. My work in visual scene understanding brings together insights from machine learning, computer graphics, cognitive psychology, and a great deal of data and computation.
Bottom-up and Top-down
I am best known for constructing the SUN database, a large collection of images that exhaustively spans most scene categories, from “abbey” to “zoo”. The idea is to create an exhaustive list of scene categories by going through an English dictionary and manually selecting all the terms that describe scenes, places, and environments, and to construct a large image collection representing the diversity of scenes that are often encountered in the real world. With this, we can train computers to better recognize scenes and objects. Containing 250,000 objects consistently annotated with polygonal outlines, the SUN database has become a vital resource for computer vision researchers, with more than 100 scientific publications using it in the past two years. Furthermore, unlike object-centric databases that are often biased towards having a single object in the center of an image, the scene-centric SUN database contains realistic scene pictures with great diversity. Due to its realism, the SUN database has been popular for many practical applications, including robotics, graphics and data mining.
5. You won the Google US/Canada PhD Fellowship in Computer Vision, part of a program through which Google supports the graduate studies of 40 computer science students in Australia, Canada, China, Europe, India, and the United States. How do you feel about this?
Google is a unique company that not only delivers great products, but also has a working plan to advance humanity to the next level. As we all see today, Google is pushing cutting-edge technology into reality, for example in computer-vision-related projects such as the Google autonomous car, image search, and Google Glass. Google’s global fellowship program was created to support those willing to take on this noble endeavor. In particular, the Google U.S./Canada PhD Student Fellowship Program was created to recognize outstanding graduate students doing exceptional work in computer science, related disciplines, or promising research areas. It is a very prestigious fellowship because of the high value of the award and also because it is given to only one winner from each field per year.
Besides the PhD fellowship and my internship, Google has also been a long-time supporter of my research in general. For example, Google provided us with Street View images so that we could build the first system in academic research to demonstrate that segmentation of street view images into semantic regions (e.g. trees, buildings and roads) is mature enough for real-world applications. In the StreetSeg project we showed that semantic segmentation of street view images can be quite accurate and robust, and ready for massive use at city scale. This enables many potential applications for street-level scene understanding, such as autonomous driving. In CityRec, we showed that by introducing some recognition into 3D reconstruction, we are able to build an automatic pipeline that robustly reconstructs clean 3D mesh models of buildings, and to deploy it massively at city scale using images captured by a camera mounted on a car. Without the data provided by Google, these amazing research ideas would not have been turned into reality so easily.
6. What do you think is the next “killer app” in vision science? Or, to put it differently, what is the next big thing in Computer Vision research in the next few years?
In the past decade, we have witnessed many amazing computer vision applications, such as depth reconstruction and body pose estimation in the Microsoft Xbox Kinect and street-view recognition in the Google autonomous car. We can expect many more amazing applications to come, if we can make significant progress on core computer vision algorithms. In particular, for 3D reconstruction and understanding, I believe we are on the verge of a significant paradigm shift. For the past two decades, the dominant research focus for 3D reconstruction has been obtaining more accurate depth maps or 3D point clouds (e.g. Microsoft Kinect). However, even when a computer has a depth map, a robot still cannot manipulate an object, because there is no semantically meaningful representation of the 3D world. Essentially, 3D reconstruction is not just a pixel-level task. Obtaining a depth map that captures a distance at each pixel is analogous to inventing a digital camera that captures a color value at each pixel. The gap between low-level depth measurements and high-level shape understanding is just as large as the gap between pixel colors and high-level semantic perception. Moving forward, I would argue that we need higher-level intelligence for 3D reconstruction. I would like to draw the attention of the 3D reconstruction research community to putting greater emphasis on mid-level and high-level 3D understanding, instead of exclusively focusing on improving low-level reconstruction accuracy, as is the current situation. For example, very recently we proposed a 3D cuboid detector that localizes the corners of all boxy objects in an image. I believe that a new era of studying 3D reconstruction at a higher level is beginning.
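To make the "distance at each pixel" point concrete, here is a minimal sketch of the standard pinhole back-projection that turns a depth map into a raw 3D point cloud. The function name and the toy intrinsics (fx, fy, cx, cy) are assumptions for illustration; the output is exactly the kind of low-level, semantics-free result the answer argues is insufficient on its own.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a per-pixel depth map into a 3D point cloud using the
    pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid / missing depth readings
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# A tiny 2x2 depth map (in metres) with one missing reading, toy intrinsics.
depth = [[1.0, 1.0],
         [0.0, 2.0]]
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(cloud))  # 3 valid pixels -> 3 points
```

Each output point is just a coordinate: nothing in the cloud says which points belong to an object, where it can be grasped, or what it is. That is the gap between this pixel-level output and the mid- and high-level 3D understanding (e.g. cuboid detection) advocated above.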
The other emerging trend is to study the viewpoint problem in 3D understanding. Computer vision is hard not only because the world is 3D, but also because our eyes and cameras see the world from specific viewpoints. Just as when watching TV or a show in a theater, we need to recognize the screen side of the TV or the stage of the theater in order to look at the TV from the right angle, or to sit in the theater facing the stage. With 3D and viewpoint, in the end, we want to unify recognition and reconstruction, two major branches of computer vision that are mostly pursued independently. This provides insights into developing more realistic representations of scenes.