Visual-Semantic-Pose Graph Mixture Networks for Human-Object Interaction Detection

(Figure: dual attention graph)

Human-Object Interaction (HOI) Detection infers the action predicate on a <subject, predicate, object> triplet. Whilst contextual information has been found critical in this task, even with the advent of deep learning, researchers still grapple to understand how to best leverage contextual cues for inference. What is the best way to integrate visual, spatial, semantic, and pose information? Many works have used only a subset of cues or limited their analysis to a single subject-object pair for inference. Few works have studied the disambiguating contribution of subsidiary relations made available via graph networks. In this work, we contribute a two-stream (multi-branched) network that effectively aggregates a series of contextual cues.

In this work, we first propose a dual graph attention network that dynamically aggregates the visual, instance-spatial, and semantic cues from primary subject-object relations as well as subsidiary ones to enhance inference. Subsequently, we incorporate human pose features and propose a second network stream that runs a pose-based modular network. The result is a graph mixture network that processes a wide set of contextual cues effectively. We call our model: Visual-Semantic-Pose Graph Mixture Networks (VSP-GMNs). Our final model outperforms the state-of-the-art on the challenging HICO-DET dataset by significant margins of almost 10%, especially in long-tail cases that are harder to interpret. We also achieve competitive performance on the smaller V-COCO dataset.

Method

Visual-Semantic Graph Attention Networks (VS-GATs)

How subsidiary relations facilitate HOI detection (Insight): On the left, with the features from [human-knife], the model can easily infer the “hold” and “lick” predicates for this tuple, while the message (from spatial features) from the subsidiary relation [knife-table] inhibits the model from choosing “cut”. On the right, if we just focus on the features from [human-cake], the model may output similar scores for the “cut” and “light” predicates since they share similar embedding features. However, messages from the subsidiary relations [human-knife] and [knife-cake] promote <human,cut,cake>.

Proposed Model: In our first study, we explore the disambiguating power of subsidiary scene relations and intrinsic semantic regularities via a dual Graph Attention Network that aggregates visual-spatial and semantic information in parallel. This graph-based attention structure explicitly enables the model to leverage rich contextual information by integrating and broadcasting cues through the attention mechanism. Our work is the first to use dual attention graphs. We call our system: Visual-Semantic Graph Attention Networks (VS-GATs).

(Figure: dual attention graph)
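To make the parallel aggregation concrete, below is a minimal PyTorch sketch of the idea. The names and dimensions are illustrative assumptions (e.g. `vis_dim=1024` visual-spatial features, `sem_dim=300` word embeddings, single-head attention over a fully connected instance graph) rather than the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttention(nn.Module):
    """Single-head attention over a fully connected graph of detected instances."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.attn = nn.Linear(2 * out_dim, 1)

    def forward(self, x):                       # x: [N, in_dim], N detected instances
        h = self.proj(x)                        # [N, out_dim]
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)    # h_i repeated along dim 1 (receiving node)
        hj = h.unsqueeze(0).expand(n, n, -1)    # h_j repeated along dim 0 (neighbouring node)
        e = torch.tanh(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)  # edge scores [N, N]
        alpha = F.softmax(e, dim=-1)            # attention over all relations, incl. subsidiary ones
        return torch.relu(alpha @ h)            # aggregated node features [N, out_dim]


class DualGAT(nn.Module):
    """Parallel visual-spatial and semantic attention graphs (the VS-GATs idea)."""

    def __init__(self, vis_dim=1024, sem_dim=300, out_dim=512):
        super().__init__()
        self.vis_gat = GraphAttention(vis_dim, out_dim)   # visual + instance-spatial cues
        self.sem_gat = GraphAttention(sem_dim, out_dim)   # word-embedding (semantic) cues

    def forward(self, vis_feats, sem_feats):
        v = self.vis_gat(vis_feats)             # aggregate each cue in its own graph
        s = self.sem_gat(sem_feats)
        return torch.cat([v, s], dim=-1)        # fused node features [N, 2 * out_dim]
```

In this sketch, a candidate human-object pair would then be scored by reading out and concatenating its two fused node vectors (together with the pairwise spatial feature) before the predicate classifier, so that messages from subsidiary relations are already embedded in the pair representation.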

Results

Quantitative Results and Comparisons

Experiments show that VS-GATs sets state-of-the-art results in all three categories on HICO-DET, with gains of +0.98 mAP (5.1%), +0.89 mAP (5.7%), and +0.62 mAP (2.96%). Our model also obtains a comparable result of 50.6 mAP on V-COCO.

In Fig. 3, we also visualize the performance distribution of our model across objects for a given interaction. As mentioned in [13], it still holds that interactions that occur with just a single object (e.g. ‘kick ball’ or ‘flip skateboard’) are easier to detect than predicates that interact with various objects. Compared to [13], the median AP of interactions like ‘cut’ and ‘clean’ shown in Fig. 3 outperforms that of [13] by a considerable margin because our model uses not only single-relation features but subsidiary ones as well.

Qualitative Results

Fig. 4 shows some <subject,predicate,object> triplet detection results on the HICO-DET test set. From the results, our proposed model is able to detect various kinds of HOIs such as: single person-single object, multiple persons-same object, and multiple persons-multiple objects.

Pose-based Modular Network (PMN)

In our second study, we investigate the effect of fine-grained human pose on HOI detection via relative spatial pose features and absolute pose features.

Relative spatial pose features:

To provide the model with more detailed spatial information, we construct relative spatial pose features, which consist of the coordinate offsets between each person’s keypoints and the center of the object bounding box.
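As a sketch of this construction (the keypoint layout and the scale normalisation are assumptions for illustration, not taken verbatim from the paper):

```python
import numpy as np

def relative_spatial_pose_feature(keypoints, obj_box):
    """Offsets between each human keypoint and the object-box centre.

    keypoints: [K, 2] array of (x, y) image coordinates (e.g. K = 17 COCO joints)
    obj_box:   (x1, y1, x2, y2) object bounding box
    """
    x1, y1, x2, y2 = obj_box
    obj_center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    offsets = keypoints - obj_center                       # [K, 2] coordinate offsets
    # normalising by the object size keeps the feature scale-invariant (an assumption here)
    scale = np.array([max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)])
    return offsets / scale                                 # one 2-D offset per joint
```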

Absolute pose features:

Generally, a person will adopt different postures when performing different actions. For instance, the human pose when sitting is very different from the pose when standing. At other times, similar postures may occur when a person acts on different objects (e.g., riding a horse or a bicycle). These intuitions indicate that the intrinsic properties of a human’s pose are also useful for HOI detection. We therefore also use absolute pose features, which consist of the keypoint coordinates normalized to the center of the human bounding box.
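A matching sketch for the absolute feature, again with the normalisation details assumed rather than quoted from the paper:

```python
def absolute_pose_feature(keypoints, human_box):
    """Keypoint coordinates normalised to the human bounding-box centre.

    keypoints: [K, 2] (x, y) image coordinates
    human_box: (x1, y1, x2, y2) human bounding box
    """
    x1, y1, x2, y2 = human_box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    size = np.array([max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)])
    # the resulting pose shape is independent of where the person appears in the image
    return (keypoints - center) / size                     # [K, 2]
```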


Model:

Furthermore, we propose a Pose-based Modular Network (PMN) which explores the constructed pose features (Fig. 5) and is fully compatible with existing networks for HOI detection. The module consists of one branch that processes the relative spatial pose features of each joint independently and another branch which uses graph convolutions to update the absolute pose features of each joint. We then fuse the processed features followed by an action classifier as depicted in Fig. 6.
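The following PyTorch sketch illustrates this two-branch design. The joint count, hidden width, skeleton adjacency (here a uniform placeholder instead of the real joint connectivity), and classifier head are all assumptions for illustration; in the full model the PMN output is fused with the VS-GATs stream rather than classified in isolation:

```python
import torch
import torch.nn as nn


class PMN(nn.Module):
    """Two-branch sketch of the Pose-based Modular Network."""

    def __init__(self, num_joints=17, hidden=64, num_actions=117):
        super().__init__()
        # branch 1: a shared MLP applied to the relative spatial feature of each joint
        self.rel_mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        # branch 2: two graph-convolution layers over the skeleton for the absolute pose
        self.gcn1 = nn.Linear(2, hidden)
        self.gcn2 = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(2 * hidden * num_joints, num_actions)
        # adjacency with self loops; a uniform placeholder for the human skeleton graph
        self.register_buffer("adj", torch.ones(num_joints, num_joints) / num_joints)

    def forward(self, rel_pose, abs_pose):               # both: [B, num_joints, 2]
        r = self.rel_mlp(rel_pose)                       # per-joint processing, [B, K, H]
        a = torch.relu(self.adj @ self.gcn1(abs_pose))   # graph convolution, layer 1
        a = torch.relu(self.adj @ self.gcn2(a))          # graph convolution, layer 2
        fused = torch.cat([r, a], dim=-1).flatten(1)     # fuse the two branches
        return self.classifier(fused)                    # action scores for one human-object pair
```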

Results

Quantitative Results and Comparisons

Results show that we surpass all SOTA metrics on HICO-DET, improving on VS-GATs by 0.98 mAP (~4.6%), 1.57 mAP (~9.8%), and 0.75 mAP (~3.5%) for the Full, Rare, and Non-Rare categories respectively. On V-COCO, our 51.8 mAP improves on VS-GATs by 2 mAP (~4.0%) and yields a competitive result.

Qualitative Results

The visualization results in Fig. 7 compare VSP-GMN (VS-GATs + PMN) with VS-GATs alone on the V-COCO test set. For instance, VS-GATs alone may output false positives when multiple persons and objects are close to each other. In the first image, the 2nd, 4th, and 6th persons from left to right are predicted to perform the action of “skiing” on their neighbors’ skis. Not so with VSP-GMNs.

arxiv | code (vs-gat | PMN) | bib | youtube | youku