We utilize a multi-task learning framework to predict meteorological information from images and use it to detect discrepancies between metadata and visual content. Our multi-task model yields up to 15% relative improvement over a conventional single-task CNN baseline (ResNet) on this task.
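A minimal PyTorch sketch of the shared-backbone, multi-head idea; the two heads shown here (weather-condition classification and temperature regression), the tiny stand-in backbone, and all layer sizes are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn

class MultiTaskWeatherNet(nn.Module):
    def __init__(self, n_conditions=5):
        super().__init__()
        # Shared convolutional backbone (a small stand-in for a ResNet trunk).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.condition_head = nn.Linear(64, n_conditions)  # e.g. sunny/rainy/...
        self.temperature_head = nn.Linear(64, 1)           # scalar regression

    def forward(self, x):
        h = self.backbone(x)
        return self.condition_head(h), self.temperature_head(h)

model = MultiTaskWeatherNet()
logits, temp = model(torch.randn(2, 3, 224, 224))
# A metadata/visual discrepancy can then be flagged when the predicted
# weather disagrees with what the image metadata claims.
```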
We propose a general neural network configuration that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and the corresponding captions for a given video in the test phase. We believe these two tasks are complementary, and our experiments show that the model achieves better performance on both tasks.
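A compact sketch of the joint configuration, assuming per-frame CNN features are precomputed; the GRU encoder, per-step word logits, and all dimensions are illustrative placeholders rather than the actual architecture:

```python
import torch
import torch.nn as nn

class SummaryCaptionNet(nn.Module):
    def __init__(self, feat_dim=512, hid=256, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)  # shared video encoder
        self.summary_head = nn.Linear(hid, 1)      # per-frame importance score
        self.caption_head = nn.Linear(hid, vocab)  # per-step word logits

    def forward(self, frames):                     # frames: (B, T, feat_dim)
        h, _ = self.encoder(frames)
        return self.summary_head(h).squeeze(-1), self.caption_head(h)

net = SummaryCaptionNet()
scores, word_logits = net(torch.randn(2, 30, 512))
# Training sums a summary loss (against keyframe labels) and a caption loss,
# so gradients from both supervisory signals shape the shared encoder.
```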
We developed a novel framework for multimodal business venue recognition. We first mine from data a set of visual concepts relevant to venue recognition, then use these concepts to train a CNN (BA-CNN) to recognize business venues. Our model achieves a 78.5% recognition rate on our test set.
We propose a novel coding framework called Cross-Age Reference Coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC encodes the low-level features of a face image in an age-invariant reference space. To thoroughly evaluate our work, we introduce a new large-scale dataset for face recognition and retrieval across ages, the Cross-Age Celebrity Dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities, with ages ranging from 16 to 62. [Project]
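A rough NumPy sketch of the reference-coding idea under simplifying assumptions (random stand-in features, inner-product similarity, mean aggregation across age groups); it illustrates the concept, not CARC's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_refs, n_ages, dim = 100, 5, 128
# refs[a] holds features of the same reference people at age group a.
refs = rng.normal(size=(n_ages, n_refs, dim))

def carc_encode(x):
    # Similarity of the query feature to every reference person,
    # computed once per age group: shape (n_ages, n_refs).
    codes = np.stack([r @ x for r in refs])
    # Aggregating across age groups yields a representation indexed by
    # reference identity rather than by age-sensitive low-level values.
    return codes.mean(axis=0)

query = rng.normal(size=dim)
print(carc_encode(query).shape)  # (100,): an age-invariant reference code
```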
We address the IBM Challenge - NYC360 by mining multimodal data streams from different social media. I worked on the food recognition part of this project. We use weakly labeled images from Instagram to train a large convolutional neural network to recognize different foods and find popular restaurants in New York City. We further use sentiment analysis to gauge user opinions about the restaurants for real-time recommendations. Our work won the ACM Multimedia Grand Challenge Multimodal Award, 2014.
To organize large-scale face tracks, i.e., sequences of consecutively detected faces in videos, we propose an efficient method to retrieve human face tracks using a bag-of-faces sparse representation. With the proposed method, a face track is encoded as a single bag-of-faces sparse representation, allowing efficient indexing methods to handle large-scale data. To further account for possible variations within face tracks, we generalize our method to find multiple sparse representations, in an unsupervised manner, to represent a bag of faces and to balance the trade-off between performance and retrieval time. [Dataset]
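A small sketch of encoding a face track as one bag-of-faces sparse vector: each face is sparse-coded over a shared dictionary and the per-face codes are pooled into a single sparse representation. Pooling by maximum magnitude, the OMP solver, and the random data are assumptions made for illustration:

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(256, 64))            # 256 atoms, 64-d features
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
track = rng.normal(size=(12, 64))                  # 12 detected faces in one track

codes = sparse_encode(track, dictionary,
                      algorithm="omp", n_nonzero_coefs=5)   # (12, 256), sparse
# Pool the 12 per-face codes into one code by keeping, per atom, the
# coefficient with the largest magnitude across the track.
bag_of_faces = codes[np.abs(codes).argmax(axis=0), np.arange(256)]
# The nonzero entries index dictionary atoms, so an inverted file over
# atoms can retrieve candidate tracks without scanning the whole corpus.
print(np.count_nonzero(bag_of_faces))
```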
We utilize automatically detected human attributes, which carry semantic cues about face photos, to improve content-based face retrieval by constructing semantic codewords for efficient large-scale search. Leveraging human attributes in a scalable and systematic framework, we propose two orthogonal methods, attribute-enhanced sparse coding and attribute-embedded inverted indexing, which improve face retrieval in the offline and online stages, respectively.
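A toy sketch of the attribute-embedded inverted indexing idea: images are posted under their sparse codewords, each posting carries a binary attribute signature, and at query time candidates are filtered by Hamming distance on attributes while traversing the lists. The 3-bit signature, threshold, and example IDs are hypothetical:

```python
from collections import defaultdict

index = defaultdict(list)  # codeword -> [(image_id, attribute_bits)]

def add_image(image_id, codewords, attrs):
    for w in codewords:
        index[w].append((image_id, attrs))

def query(codewords, attrs, max_hamming=1):
    hits = set()
    for w in codewords:
        for image_id, a in index[w]:
            # Attribute check happens inside the inverted lists, so
            # faces with mismatched attributes are pruned online.
            if bin(a ^ attrs).count("1") <= max_hamming:
                hits.add(image_id)
    return hits

add_image("face_1", {3, 17}, 0b101)   # e.g. male / no glasses / smiling
add_image("face_2", {17, 42}, 0b010)
print(query({17}, 0b100))             # {'face_1'}: only one attribute bit away
```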
We propose a novel way to search for face images according to the facial attributes and face similarity of target persons. To better match the face layout the user has in mind, our system allows the user to graphically specify face positions and sizes on a query canvas, where each attribute or identity is represented as an icon for easier manipulation. [Demo]
We construct semantic codewords for face images using sparse coding with low-level features (i.e., LBP) and partially available label information. [Demo]