ML@GT Blog

Learning to Cluster

“Can machines categorize new things by learning how to group similar things together?”

The following describes work by Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira, which will be presented at the 2018 International Conference on Learning Representations (ICLR) in Vancouver. Read the paper here.

Clustering is the task of partitioning data into groups so that objects in a cluster are similar in some way. It sounds easy, and we humans usually do it effortlessly. However, it can be an ambiguous task. Let’s consider grouping the following four test images into two clusters:


Was it obvious how to do it? Here is a hint: there are four possible criteria (color, pose, species, and size), and the general task of grouping these items does not specify which to use.

Having multiple valid answers probably made the task confusing. Now imagine that you are taking an exam and the following example clustering is given to you first. What criterion would you choose for the test images now?


The test becomes much easier when context is given and you can practice: the examples let you learn the criterion, and you can then apply that rule to group new images. The ICLR 2018 paper “Learning to cluster in order to transfer across domains and tasks” (L2C) describes how to mimic this process using neural networks.

Why does this method of learning to cluster matter? It is a way to use prior knowledge to discover new things. We rely on this ability to abstract what we have seen, in order to reason about new things without being overwhelmed by the innumerable details of the physical world. A similar capability to “learn how to cluster” would therefore be useful when building an artificial intelligence (A.I.) system, as it would allow the system to reason about unseen things (moving away from being able to reason only about things it was already trained on).

Let’s look again at the steps necessary to solve the exam question:

  1. Learn the concepts necessary to group things from example clustering.
  2. Transfer the concepts to cluster new test images.

The L2C paper adopts the same two steps, both implemented with deep neural networks. In the following sections, we will dive a little into the technical parts of the paper.

Learning the concepts for categorization

The most common way to do categorization is to formulate it as a supervised classification task. The classifiers are learned from labeled examples and applied to the test images. In such a setting, the categories in the examples and the tests must be the same, although they may look different across datasets. This is known as transfer learning across domains (i.e., the same set of categories has to be classified in different datasets). However, the knowledge in the learned classifiers is bound to specific categories and thus is not suitable for discovering new knowledge. The more general setting is transfer learning across tasks, which considers different sets of categories in different datasets, and is the focus of the discussion here.

What kind of knowledge representation can be transferred across tasks? The L2C paper presents an idea that first converts the categorical knowledge “{Image:Label}” to pairwise relationships “{(Image A, Image B): Similarity Label}”, where similarity means whether the labels of Image A and B are the same or not. The pairwise relationship is category-agnostic; therefore it can be naturally transferred to a new dataset of unseen categories and used to group them.
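The conversion above can be sketched in a few lines. This is a minimal illustration (not the paper's code): given per-image class labels, it emits category-agnostic pairwise similarity labels.

```python
# Minimal sketch: converting categorical labels "{Image: Label}" into
# category-agnostic pairwise labels "{(A, B): Similarity}".
from itertools import combinations

def to_pairwise_labels(labels):
    """Turn per-item class labels into ((i, j), similar) pairs.

    The pair labels carry no category identity, so they can be
    transferred to a dataset with entirely different classes.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        pairs.append(((i, j), 1 if labels[i] == labels[j] else 0))
    return pairs

# Example: three images from two (arbitrary) categories.
print(to_pairwise_labels(["cat", "cat", "dog"]))
# → [((0, 1), 1), ((0, 2), 0), ((1, 2), 0)]
```

Note that nothing in the output mentions "cat" or "dog": only same/different survives, which is exactly what makes the representation transferable.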

How can we learn the pairwise relationships? One way is to learn a function to predict whether a pair of images belongs to the same category (similar) or not (dissimilar). The function can be a neural network, an SVM, or any other classifier. The learned function contains the knowledge that we want to carry to another dataset. Here is the roadmap for using it:

  1. Learn the similarity prediction function with examples.
  2. Get the predictions by applying the similarity prediction function on test images.
  3. Reconstruct the clusters of test images with the predictions in step 2.
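The three-step roadmap can be illustrated end to end with a toy sketch. Here the hypothetical `predict_similar` stands in for the learned function G (the paper trains a neural network on labeled example pairs instead of thresholding distances), and the cluster reconstruction uses simple transitive merging, whereas the paper uses a clustering network that is robust to noisy predictions.

```python
import numpy as np

def predict_similar(a, b, threshold=1.5):
    """Step 2: apply a toy similarity function G to a pair of items.
    (Stand-in for the learned neural-network predictor.)"""
    return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)) < threshold

def reconstruct_clusters(items):
    """Step 3: group items whose pairwise predictions say 'similar',
    via union-find over the predicted-similar pairs."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if predict_similar(items[i], items[j]):
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Four 1-D "images"; the first two are close, the last two are close.
print(reconstruct_clusters([[0.0], [0.5], [5.0], [5.2]]))  # → [[0, 1], [2, 3]]
```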

The following diagram summarizes the process. The knowledge is transferred in a counter-clockwise direction. Here, XA denotes the labeled examples and XT the test images. The notations {YA} and {YT} represent the categories in the two datasets, while G is the pairwise similarity prediction function.


Reconstructing the clusters

With the pairwise similarity predictions, we now have more clues about how images should be grouped together. However, there are a few potential issues. First, there may be many mistakes in the similarity predictions (since they are transferred across different datasets). How badly could they affect the clustering results? Many clustering algorithms are vulnerable to such noise. The L2C paper proposes a neural-network-based constrained clustering, which shows strong robustness against the noisy similarity predictions. The designed network directly outputs the probability of assigning an image to each cluster. The network is trained to make a pair of cluster distributions similar if the images are predicted to be a similar pair, and dissimilar otherwise. A figure from the previous ICLR workshop paper illustrates how this clustering network works:
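The training objective described above can be sketched as a pairwise loss on cluster distributions. This is a simplified illustration of that idea, not the paper's exact loss (the paper uses a KL-divergence-based objective; the margin value here is an arbitrary choice for the sketch): a predicted-similar pair is penalized for having divergent cluster distributions, while a predicted-dissimilar pair is penalized, via a hinge, for having distributions that are too close.

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete cluster distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def pair_loss(p, q, similar, margin=2.0):
    # Symmetric divergence between the two cluster distributions.
    d = kl(p, q) + kl(q, p)
    if similar:
        return d                     # similar pair: pull distributions together
    return max(0.0, margin - d)      # dissimilar pair: hinge pushes them apart

p = np.array([0.9, 0.05, 0.05])      # image A: confident about cluster 0
q = np.array([0.1, 0.8, 0.1])        # image B: confident about cluster 1
print(pair_loss(p, p, similar=True))   # identical distributions → 0.0
print(pair_loss(p, q, similar=True))   # large: these should be pulled together
print(pair_loss(p, q, similar=False))  # zero: already far enough apart
```

Summing this loss over all predicted pairs and backpropagating through the cluster-probability network is what lets the predicted similarities shape the final clusters.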


Discovering new categories by grouping things

Once the neural network is trained to minimize the conflicts between the predicted similarities and the predicted cluster distributions, we can feed new images into the network to obtain the final cluster assignments. How well does this method discover new categories? A figure from the report “Deep Image Category Discovery using a Transferred Similarity Function” visualizes the results. In that work, the similarity prediction function is learned from about 900 categories in the ImageNet dataset, and clustering is performed on 10 randomly held-out categories. Images are placed close to each other if their chance of being in the same cluster is high. Looking at the pictures of birds, the neural network figured out how to distinguish species! The ICLR 2018 paper demonstrates state-of-the-art results across many different datasets, compared to a number of clustering algorithms. It also shows that the method helps with the first problem we discussed (domain shift), where the object categories are the same but look different across datasets, even though it did not directly target that problem.



“Learning to Cluster” is a data-driven clustering strategy. It learns, from data, the meta-knowledge necessary for defining categories, and then uses this knowledge to discover new ones. In an era where unlabeled data is increasing exponentially, transferring knowledge from labeled data is crucial for data mining and many other applications. We hope that this line of work provides a new perspective that redefines the classical idea of clustering.

Acknowledgments and Further Reading

This work was supported by the National Science Foundation and National Robotics Initiative (grant # IIS-1426998). Further details and experiments are available at

[Post written by Yen-Chang Hsu and Zsolt Kira]

Monday, April 30, 2018 - 10:51

From Object Interactions to Fine-grained Video Understanding

Video understanding tasks such as action recognition and caption generation are crucial for various real-world applications in surveillance, video retrieval, human behavior understanding, etc. In this work, we present a generic recurrent module to detect relationships and interactions between arbitrary object groups for fine-grained video understanding, applicable to various open-domain video understanding problems. We validate our method on two video understanding tasks with new challenging datasets: fine-grained action recognition on Kinetics and visually grounded video captioning on ActivityNet Captions.

In the following post, we will first introduce the concept and motivation of the proposed method for human action recognition. Second, we will show how the same concept can be further extended to generate a sentence description of a video. For details of the proposed method, please refer to our paper here.

From object interactions to human action recognition

Recent approaches to video understanding have demonstrated significant improvements over public datasets such as UCF101, HMDB51, Sports1M, THUMOS, ActivityNet, and YouTube8M. They often focus on representing the overall visual scene (coarse-grained) as a sequence of inputs that are combined with temporal pooling methods, e.g. CRF, LSTM, 1D convolution, attention, and NetVLAD. With these state-of-the-art methods, it is relatively easy for machines to distinguish playing tennis from playing basketball by relying on the overall scene representation.

State-of-the-art video understanding methods can easily distinguish two different human activities simply by relying on the background scene representations.

However, human actions often involve complex interactions across several objects in the scene. These approaches ignore the fine-grained details of the scene and do not infer interactions between the various objects in the video. For example, in the figure below, the two snapshots of video frames share similar background scene representations and similar representations of the person; the difference between skiing and snowboarding lies in how the person interacts with the skis or the snowboard.

The difference between human actions lies in how humans interact with certain objects, rather than in the scene representation. For instance, these two video frames have similar scene representations, but their human activities are semantically different.

A question that naturally arises from the example above is: can this problem be solved if machines can detect the objects being interacted with?

The answer is no, since there can be many different possible interactions between humans and common objects. For instance, precisely distinguishing dribbling, dunking, and shooting a basketball requires the model to identify how the basketball interacts with the player. Therefore, the goal of this work is not only to detect the objects being interacted with but also to identify how they are being interacted with.

We want even more than detecting pairwise object interactions!

Typically, object interaction methods (in the image domain) focus on pairwise interactions (left). In this work, we efficiently model the interactions between arbitrary subgroups of objects: we detect the inter-object relationships within a group and attentively select the objects with significant relationships, i.e. those that serve to improve action recognition or captioning in the end (right). We define these interactions between groups of selected object relationships as higher-order interactions.

We go beyond pairwise interactions to higher-order interactions: interactions between groups of objects with inter-relationships.

Why are object interactions and temporal reasoning challenging?

We first define objects to be certain regions in the scene that might be used to determine visual relationships and interactions. An object can be a rigid object, a person, or even a region of the background scene.

Unfortunately, we only have features, not the classes of the objects

To understand the relationships/interactions between potential objects, ideally we would first identify what these objects are. Running state-of-the-art object detectors will, however, fail to identify them reliably because of the cross-domain problem. Furthermore, we are limited to the object classes of the dataset the detector was pre-trained on, e.g. the 80 classes in MS-COCO. As a result, detected objects are very likely to be labeled as the most common classes, like person or car, and the detector may miss a potentially interesting object entirely simply because it was not trained to detect it.

Our objective is to efficiently model the relationships/interactions between arbitrary (groups of) objects in space and integrate with temporal reasoning.

Limited by these constraints, we use only the feature representations obtained from a Region Proposal Network (RPN). Note that we do not track corresponding objects across time, since linking objects through time can be computationally expensive and may not be feasible when the video sequence is long.

As a result, we have variable-length sets of objects residing in a high-dimensional space, spanning across time. Our objective is to efficiently detect higher-order interactions from these rich yet unordered object representation sets.

Recurrent Higher-Order Interaction (HOI)

Toward this end, we propose a Recurrent Higher-Order Interaction module that dynamically selects K groups of arbitrary objects with detected inter-object relationships via a learnable attention mechanism. This attentive selection uses the overall image context representation, the current set of (projected) objects, and the previous object interactions to perform K attentive selections via efficient dot-product operations. The higher-order interaction between the groups of selected objects is then modeled via concatenation and a following LSTM cell. Please refer to our paper for further details of the proposed method.

Our proposed Recurrent Higher-Order Interaction module dynamically selects K groups of arbitrary objects with detected inter-object relationships via a learnable attention mechanism.
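The core of the attentive selection can be sketched as scaled dot-product attention over object features. This is a minimal illustration under simplifying assumptions, not the module itself: the real module performs K such selections, conditions on the previous interactions, and feeds the result into an LSTM cell; here we show a single selection, with the mean object feature standing in for the image context representation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def attentive_select(objects, context):
    """One attentive selection over a set of object features.

    objects: (num_objects, dim) ROI features from the RPN
    context: (dim,) overall image context representation
    """
    scores = objects @ context / np.sqrt(objects.shape[1])  # scaled dot products
    weights = softmax(scores)                               # importance per object
    group = weights @ objects                               # attended group feature
    return group, weights

objs = np.random.default_rng(0).normal(size=(30, 8))  # e.g. 30 object proposals
ctx = objs.mean(axis=0)                               # stand-in for image context
group, w = attentive_select(objs, ctx)
print(group.shape, float(w.sum()))                    # (8,) and weights sum to 1
```

The attention weights `w` are exactly the per-object importance values visualized in the qualitative figures below.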

What objects and interactions are detected?

Since the proposed method explicitly selects the objects whose interactions it detects, we can qualitatively show which objects and interactions are detected when predicting human actions.

Qualitative analysis for action recognition on Kinetics: Tobogganing.

In the figure above, the top row shows the original video frames with the selected objects (ROIs). The border of each object's bounding box is weighted by its importance to the correct action prediction. We visualize the regions the machine attends to by using the weights as the transparency of the corresponding regions: the brighter a region, the more important it is. The third row shows the weight distribution over the objects (30 objects in this example); the y-axis indicates the importance of each object.

In this figure, the proposed method correctly predicts Tobogganing.

Identifying Tobogganing essentially requires three elements: a toboggan, a snow scene, and a human sitting on top of the toboggan. These three key elements are accurately identified and their interactions highlighted, as we can see from t = 1 to t = 3. Note that the model continues to track the person and toboggan throughout the video, even though they appear extremely small towards the end. We can also see that our method completely ignores the background scene in the last several frames, since it is uninformative: it could easily be confused with the 18 other action classes involving snow and ice, e.g. Making snowman, Ski jumping, Skiing cross-country, Snowboarding, etc.

From Object interactions to video captioning

In the second part of the blog, we will discuss how the method proposed for modeling object interactions can be extended for generating a sentence description for a video.

Video captions are composed of multiple visual relationships and interactions. We detect higher-order object interactions and use them as the basis for video captioning.

Our motivation is quite straightforward. We argue that a sentence description of a scene (for images or videos) can be decomposed into several relationship components. Therefore, we hypothesize that, given a set of detected object relationships and interactions, we can compose them into a complete sentence description.

Our model efficiently explores and grounds caption generation over interactions between arbitrary subgroups of objects, the members of which are determined by a learned attention mechanism, as shown for recognizing human actions.

Overview of the proposed model for video captioning.

We first attentively model object inter-relationships and discover the higher-order interactions in a video. The detected higher-order object interactions (fine-grained) and the overall image representation (coarse-grained) are then temporally attended to as the visual cue for generating each word.

Qualitative analysis for video captioning on ActivityNet Captions: The man is then shown on the water skiing.

Just as we showed how the model focuses on objects and interactions for action recognition, we can demonstrate how it uses objects and interactions to generate each word. In the figure above, timestep t indicates the video timestep. We can see that the proposed method often focuses on the person and the wakeboard, and most importantly it highlights the interaction between the two, i.e. the person stepping on the wakeboard. It then progressively generates: The man is then shown on the water skiing.

Distinguishing interactions when common objects are present

A common problem with state-of-the-art captioning models is that they often lack an understanding of the relationships and interactions between objects, often as a result of training data bias. For instance, when the model detects both a person and a horse, the predicted caption is very likely to be A man is riding on a horse, regardless of what kind of interaction the person actually has with the horse.

We are thus interested in whether the proposed method can distinguish different types of interactions when common objects are present in the scene. In the example figure below, each video shares a common object – a horse. We show the verb (interaction) extracted from the complete sentence captured by our proposed method.

(a) People are riding horses.
(b) A woman is brushing a horse.
(c) People are playing polo on a field.
(d) The man ties up the calf.

Our proposed method is able to distinguish different types of interactions when common objects (horses) are present.

While all the videos involve horses, our method successfully distinguishes the interactions between the human and the horse by grounding both the objects and the interactions.

To summarize, we introduce a computationally efficient, fine-grained video understanding approach for discovering higher-order object interactions. Our experiments on large-scale action recognition and video captioning datasets demonstrate that learning higher-order object relationships yields higher accuracy than existing methods at low computational cost. To the best of our knowledge, this is the first work to model object interactions on open-domain, large-scale video datasets.

This post is based on the following paper:

Attend and Interact: Higher-Order Object Interactions for Video Understanding. Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf. CVPR 2018. (PDF)

[Blog post by Chih-Yao Ma, re-posted with permission from]

Tuesday, April 3, 2018 - 00:13