OccludeNet: A Causal Journey into Mixed-View Actor-Centric Action Recognition under Occlusions (2024)


Guanyu Zhou1, Wenxuan Liu2,1,∗, Wenxin Huang3, Xuemei Jia4, Xian Zhong1,∗, and Chia-Wen Lin5
1Wuhan University of Technology  2Peking University  3Hubei University
4Wuhan University  5National Tsing Hua University
lwxfight@163.com, zhongx@whut.edu.cn

Abstract

The lack of occlusion data in commonly used action recognition video datasets limits model robustness and impedes sustained performance improvements. We construct OccludeNet, a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos under various natural environments. OccludeNet features dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion, addressing existing gaps in data. Our analysis reveals that occlusion impacts action classes differently, with actions involving low scene relevance and partial body visibility experiencing greater accuracy degradation. To overcome the limitations of current occlusion-focused approaches, we propose a structural causal model for occluded scenes and introduce the Causal Action Recognition (CAR) framework, which employs backdoor adjustment and counterfactual reasoning. This framework enhances key actor information, improving model robustness to occlusion. We anticipate that the challenges posed by OccludeNet will stimulate further exploration of causal relations in occlusion scenarios and encourage a reevaluation of class correlations, ultimately promoting sustainable performance improvements. The code and full dataset will be released soon.


1 Introduction

Action recognition is essential for understanding human behavior[16, 21] and has shown impressive results on closed-set datasets like Kinetics[19], UCF101[48], and HMDB51[23]. However, these datasets primarily focus on ideal conditions, filtering out ambiguous samples, which limits their effectiveness in real-world scenarios. This poses challenges, especially when actors are occluded, leading to degraded representations.

Current occlusion video datasets lack multi-view perspectives and effective motion information. To improve the robustness of action recognition in occluded environments, a preliminary exploration has been conducted with K-400-O[14]. However, it applies large-scale occlusions to the entire video frame under a single view. This strategy of adding occlusion is akin to blind video reconstruction, neglecting the interrelations among scene elements and actions. Diverse contextual elements and class correlations under occlusion have not received the necessary attention.

Based on this, we present OccludeNet, a comprehensive occlusion video dataset that spans 424 action classes and incorporates diverse natural environments and occlusion types (see Fig.1). These occlusion types include dynamic tracking occlusions, static scene occlusions, and interactive occlusions. OccludeNet-D captures object-tracking occlusions involving actors, while OccludeNet-S includes static occlusions where actors are partially occluded by scene elements. OccludeNet-I features single-view interactive occlusions under variable lighting, and OccludeNet-M offers multi-view interactive occlusions. By integrating both synthetic and real-world data, we enhance the dataset's validity and scalability.


Existing de-occlusion models typically focus on minimizing the impact of occluded regions. However, the boundaries between occluders, backgrounds, and actors are often ambiguous (see Fig.2). Traditional methods tend to rely on statistical correlations[68], thereby failing to capture the causal relations between occluders, backgrounds, visible actor parts, and predictions[24, 11].

We hypothesize that occlusion features serve as a confounder in the causal relation between actor attributes and model predictions[41]. To address this, we introduce a Causal Action Recognition (CAR) method using a structural causal model that incorporates occlusion elements. Through counterfactual reasoning, our approach redirects the model’s focus to the causal influence of unoccluded actor features. We quantify this causal effect by calculating the difference between predictions under occlusion and counterfactual predictions, interpreting the improvement in correct class probability as a positive treatment effect[41]. This method counteracts the accuracy loss due to occlusion by introducing an additional supervised signal that combines relative entropy loss with cross-entropy, thereby enhancing the model’s robustness to occlusions across various datasets.

In summary, our contributions are threefold:

  • We construct an innovative dataset, OccludeNet, designed specifically for occlusion-aware action recognition. It uniquely covers diverse occlusion scenarios, including dynamic tracking occlusions, static scene occlusions, and multi-view interactive occlusions. OccludeNet derives its strength from its comprehensive coverage of occlusion types, making it highly effective in addressing real-world problems with significant research value and practical applicability.

  • We propose an insightful causal action recognition (CAR) framework that directs models to focus on unoccluded actor parts, enhancing robustness against occlusions.

  • Our study reveals that occlusion strategies centered on actors influence action class recognition to varying degrees, highlighting the necessity of tailored approaches for addressing occlusions.

Table 1: Comparison of our OccludeNet with other occluded datasets, detailing their scale, occlusion dynamics, characteristics, and sample details.

| Dataset | Type | Clips | Classes | Occluders | O-D | View | Dynamic | Duration (s) | FPS | Uncropped | Inter-class |
| K-400-O [14] | Syn. | 40,000 | 400 | 50 | 100% | Single-View | Geometric | 12.00 | 24.00 | × | × |
| UCF-101-O [14] | Syn. | 3,783 | 101 | 50 | 100% | Single-View | Geometric | - | Raw | × | × |
| UCF-19-Y-OCC [14] | Real | 570 | 19 | - | - | Single-View | - | 4.00-5.00 | 23.98-30.00 | × | × |
| OccludeNet-D | Syn. | 233,769 | 400 | 2,788 | 0%-100% | Single-View | Tracking | 0.50-10.15 | 6.00-30.00 | ✓ | ✓ |
| OccludeNet-S | Real | 256 | 5 | Variant | 0%-100% | Single-View | Static | 8.00-10.00 | 25.00-30.00 | ✓ | ✓ |
| OccludeNet-I | Real | 345 | 7 | Variant | 0%-100% | Single-View | Interactive | 5.00-10.00 | 30.00 | ✓ | ✓ |
| OccludeNet-M | Real | 1,242 | 12 | Variant | 0%-100% | Multi-View | Interactive | 5.00-10.00 | 30.00 | ✓ | ✓ |

2 Related Works

2.1 Action Recognition under Occlusion

Recent developments in action recognition under occlusion primarily focus on improving model robustness. Approaches include training classifiers on independently processed HOG blocks from multiple viewpoints[58] and addressing blurred perspectives in multi-view settings[35]. Shi et al.[47] propose a multi-stream graph convolutional network to handle diverse occlusion scenarios. Chen et al.[6] demonstrate that pre-training on occluded skeleton sequences, followed by k-means clustering of embeddings, is effective for self-supervised skeleton-based action recognition. Liu et al.[36] introduce a pseudo-occlusion strategy to address real-world, ill-posed problems. Occlusion Skeleton[61] provides a valuable resource for research in this domain. Benchmarking studies evaluate the performance of action recognizers under occlusion[14]. Despite these advancements, challenges remain, such as labor-intensive annotation processes and the absence of explicit causal modeling between actions and occlusion effects. There is an urgent need for large-scale occluded datasets. Tab.1 compares our OccludeNet with other occluded datasets, detailing their scale, occlusion dynamics, characteristics, and sample details.

2.2 Causal Inference in Computer Vision

Causal inference has gained significant attention in computer vision, with applications in autonomous driving[65] and robotics[41, 68, 62]. Causal reasoning is increasingly integrated into deep learning tasks such as explainable artificial intelligence (XAI)[64], fairness[63, 67], and reinforcement learning[72, 38]. In compositional action recognition, counterfactual debiasing techniques help mitigate dataset biases[52]. Counterfactual reasoning has also been shown to enhance visual question answering, producing unbiased predictions by reducing the influence of uncontrolled variables in statistical correlations[39, 70]. In this work, we explore causal inference to improve robustness in action recognition under occlusions, opening a new direction in computer vision research.

3 OccludeNet Dataset

We introduce OccludeNet, a mixed-view (multi-view and single-view) video dataset for action recognition, which includes three common types of occlusion (see Fig.1(a)). This section provides an overview of the dataset's construction, statistics, and features.

3.1 Dataset Construction

We construct a large-scale dataset that includes mixed-view occlusion videos captured under diverse natural lighting conditions, combining both synthetic and real-world data for enhanced validity and scalability. This dataset centers on key motion information, encompassing dynamic tracking occlusion (OccludeNet-D), static real occlusion (OccludeNet-S), and mixed-view interactive occlusions (OccludeNet-I and OccludeNet-M) (see Fig.1(b)).


OccludeNet-D. For dynamic tracking occlusion, we use videos from Kinetics-400[19], introducing occlusions as primary elements to challenge recognition while maintaining identifiable actions. The construction process for OccludeNet-D is illustrated in Fig.3. We further examine and evaluate the generalization capacity of synthetic data in Sec.5.1.

OccludeNet-S. To focus on static scene occlusion, we select videos from UCF-Crime[50] featuring clearly occluded actors and segment them into clips whose durations align with the rest of OccludeNet, providing realistic scenarios.

OccludeNet-I. For single-view video data across diverse lighting conditions, we use an RGB camera to record continuous action sequences from fixed camera positions. We invite individuals of varying body types and clothing styles to perform actions while wearing masks for anonymity. All videos are recorded at 1920×1080 resolution, with audio retained for potential multi-modal analysis.

OccludeNet-M. For multi-view video data, we use both RGB and near-infrared cameras, positioned at three angles spaced 120° apart, to record the same action sequence. The setup and conditions are consistent with OccludeNet-I, ensuring uniformity in data collection.

Figure 4: (a) Occlusion Degree, (b) Occlusion Area Ratio, (c) Occlusion Duration Ratio.
Figure 5: (a) Parent Classes, (b) Top 10 Easiest and Hardest Classes.

Quality Control.

All videos are carefully screened to ensure clips accurately represent the intended action class with sufficient occlusion, excluding those that are overly blurry or unrecognizable. The recorded data also ensures diversity by including the visual characteristics of the actors. All contributors are the authors of this paper and are professionally accountable for maintaining video quality for the collected data. For synthetic data, we use occluders with resolutions exceeding 30,000 pixels to maintain visual clarity. The entire dataset is uniformly annotated, facilitating efficient experimentation across various models.


3.2 Diversity and Statistics of Occluded Dataset

OccludeNet-D. To balance diversity in the occlusion scale with dataset size, we randomly select one-third of the original videos and generate three versions of each by adding occluders at scales of 0.25, 0.50, and 0.75 relative to the actor's bounding box. Occluders, centered on the actor's bounding box using a detection model, include items such as backpacks (153), handbags (118), suitcases (888), and dogs (1,629), selected from Microsoft COCO[31]. This process results in a dataset comprising 233,769 video clips across 400 action classes, offering more extensive coverage than existing occluded datasets.
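For illustration, a minimal compositing routine in the spirit of this construction might look as follows; the file names, the bounding-box source, and the helper name are our own assumptions rather than the released generation pipeline.

```python
# Minimal sketch: resize an RGBA occluder to a fraction of the detected actor box
# and paste it at the box center. Not the authors' released pipeline.
from PIL import Image

def composite_occluder(frame, occluder_rgba, bbox, scale):
    """frame: RGB PIL image; occluder_rgba: RGBA PIL image;
    bbox: (x1, y1, x2, y2) actor box from a detector; scale: 0.25 / 0.50 / 0.75."""
    x1, y1, x2, y2 = bbox
    bw, bh = x2 - x1, y2 - y1
    # Occluder size is defined relative to the actor's bounding box.
    ow, oh = max(1, int(bw * scale)), max(1, int(bh * scale))
    occ = occluder_rgba.resize((ow, oh))
    # Center the occluder on the box and paste it using its alpha channel as the mask.
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    out = frame.copy()
    out.paste(occ, (cx - ow // 2, cy - oh // 2), mask=occ)
    return out

# Hypothetical usage (paths and box are placeholders):
# frame = Image.open("frame_0001.jpg").convert("RGB")
# occluder = Image.open("coco_backpack.png").convert("RGBA")
# occluded_frame = composite_occluder(frame, occluder, bbox=(120, 80, 360, 420), scale=0.50)
```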

OccludeNet-S. Static occlusions, commonly seen in surveillance footage, result from restricted viewpoints. We select clips with prominent occlusions from UCF-Crime[50], aligning their durations with those in Kinetics-400[19]. These clips feature static occluders, typically scene objects, that introduce varying degrees of visual obstruction. The dataset includes 256 clips spanning five classes: abuse (34), burglary (54), fighting (82), robbery (54), and stealing (32), with audio preserved to support multi-modal analysis.

OccludeNet-I. We engage a diverse group of actors with varying body shapes and clothing styles to ensure data diversity. In OccludeNet-I, the interactive occlusion modes, occluded body parts, occlusion levels, and durations are unique across all clips. This subset includes seven action classes: affix poster, armed, basketball dribble, basketball shot, bike ride, indoor run, and walk. The dataset is split into training (80%), validation (5%), and testing (15%) sets.

OccludeNet-M. The actor setup in OccludeNet-M mirrors that of OccludeNet-I. Likewise, interactive occlusion methods, occluded body parts, occlusion levels, and durations differ entirely across video clips taken from different viewing angles. OccludeNet-M includes 12 action classes: armed, basketball dribble, basketball shot, bike ride, distribute leaflets, fall, indoor run, outdoor run, steal, throw things, walk, and wander. The dataset is split into training (80%), validation (5%), and testing (15%) sets.

3.3 Dataset Characteristics Analysis

Preliminary Analysis of OccludeNet.

OccludeNet is designed to advance action recognition research under occlusion conditions. It covers diverse occlusion types, including object occlusion, scene occlusion, and viewpoint variation, with occlusion levels ranging from light to heavy. The dataset captures dynamic occlusion behaviors, which can occur abruptly, gradually intensify, or change at various points within a video. These diverse occlusion scenarios create a challenging and varied testing environment. Furthermore, OccludeNet emphasizes the complex impact of multiple occlusion factors on model performance, highlighting the need for robust models capable of handling such scenarios. Enhancing model robustness requires a deep understanding of how various scene features interact and influence action recognition.

Impact of Individual Occlusion Factors.

Our dataset encompasses a wide range of occlusion conditions (see Fig.4):

  • Occlusion Degree: The ratio of the occluder's size to the actor's bounding box, assessing the impact of occluder size on each sample.

  • Occlusion Area Ratio: The percentage of the actor’s bounding box that is occluded, providing a precise measure of occlusion extent.

  • Occlusion Duration Ratio: The fraction of time the actor is occluded in each video clip, examining its effect on action recognition performance and increasing temporal occlusion diversity.
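As a rough illustration, these three statistics could be computed from per-frame actor and occluder bounding boxes as sketched below; the box format and function names are assumptions, not the dataset's annotation API.

```python
# Illustrative computation of the three per-clip occlusion statistics listed above,
# assuming per-frame boxes in (x1, y1, x2, y2) format.

def box_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def intersection_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def occlusion_stats(actor_boxes, occluder_boxes):
    """Per-frame boxes; occluder_boxes[t] is None for frames without an occluder."""
    degrees, area_ratios, occluded_frames = [], [], 0
    for actor, occ in zip(actor_boxes, occluder_boxes):
        if occ is None:
            continue
        occluded_frames += 1
        degrees.append(box_area(occ) / max(box_area(actor), 1))                      # occlusion degree
        area_ratios.append(intersection_area(actor, occ) / max(box_area(actor), 1))  # occlusion area ratio
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(degrees), avg(area_ratios), occluded_frames / max(len(actor_boxes), 1)  # duration ratio
```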

Class Correlation Profiling.

As shown in Fig.5, the drop in recognition accuracy on OccludeNet-D varies across action classes. Occlusions applied to actions with low background relevance or limited human visibility (e.g., upper body only) result in significant accuracy reductions. Further analysis indicates that actions involving specific body parts are particularly vulnerable to occlusion, leading to the most pronounced accuracy declines.

4 Causal Action Recognition

We introduce the Causal Action Recognition (CAR) framework, depicted in Fig.6. This framework models the interactions among actors, occlusions, and backgrounds from the perspective of diverse contextual elements to enhance action recognition performance. This section details CAR, covering occlusion scene modeling (Sec.4.1), intervention techniques (Sec.4.2), and counterfactual reasoning (Sec.4.3).

4.1 Modeling Occluded Scenes

We model occluded scenes using a structural causal model, represented by a directed acyclic graph, where nodes correspond to variables and edges represent causal relations. The prediction $P$ is defined as $P = f(A, B, F, O, \epsilon_P)$, where $A$ denotes actor features, $B$ represents background features, $F$ stands for contextual features, $O$ captures occlusion features, and $\epsilon_P$ is a noise term. The background, contextual, and actor features are modeled as $B = g(O, \epsilon_B)$, $F = h(O, \epsilon_F)$, and $A = k(O, \epsilon_A)$; each feature is influenced by the occlusion $O$ and a respective noise component. This structure captures the impact of occlusion on the prediction $P$, highlighting the complex interplay between occlusion and other variables. This causal framework is integral to CAR, designed to enhance occlusion robustness.
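To make the graph concrete, here is a toy numerical rendering of the structural causal model. Only the edge structure (O influencing A, B, F, and all four feeding P) follows the paper; the linear functional forms and noise scales are arbitrary placeholders.

```python
# Toy SCM sketch: illustrates the dependency structure only, not the learned model.
import numpy as np

rng = np.random.default_rng(0)

def sample_scm():
    O = rng.normal()                          # occlusion features (treated as exogenous here)
    B = 0.5 * O + 0.1 * rng.normal()          # B = g(O, eps_B)
    F = 0.3 * O + 0.1 * rng.normal()          # F = h(O, eps_F)
    A = -0.8 * O + 0.1 * rng.normal()         # A = k(O, eps_A)
    P = 1.2 * A + 0.4 * B + 0.2 * F + 0.1 * O + 0.1 * rng.normal()  # P = f(A, B, F, O, eps_P)
    return {"O": O, "A": A, "B": B, "F": F, "P": P}
```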

4.2 Causal Intervention Strategies

To isolate the direct causal effect of actor features $A$ on predictions $P$, we employ causal interventions using Pearl's do-calculus[41], treating occlusion $O$ as a confounding factor. By intervening on $A$, we compute the counterfactual prediction $P(\mathrm{do}(A=a))$, which represents the predicted outcome when $A$ is set to a specific value, independent of changes in $O$.

To mitigate the confounding influence of $O$ on the relation between $A$ and $P$, we apply a refined back-door adjustment[41]. This defines a counterfactual prediction $C$ where $A$ is manipulated while $O$ remains fixed. Specifically, we intervene in the actor's feature node through segmentation and feature erasure, thereby blocking the causal path leading to this node.

For implementation, we utilize advanced segmentation and tracking models, such as Grounded-SAM[33, 43], Segment-and-Track-Anything[8, 20], and EfficientVit-SAM[2]. These models are integrated into the Grounded-Segment-and-Track-Anything module, ensuring precise actor segmentation across frames and enabling controlled manipulation of actor features for causal analysis.
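A minimal sketch of the feature-erasure intervention is given below: the segmented actor region is blanked out while the occluder and background are left untouched. The mask source, the fill strategy (per-frame mean), and the tensor layout are illustrative assumptions, not the released implementation.

```python
# Sketch of the intervention do(A = a): erase the segmented actor region while
# keeping occluder and background fixed. The mask is assumed to come from an
# external segmentation/tracking module; calling that module is not shown here.
import torch

def erase_actor(clip: torch.Tensor, actor_mask: torch.Tensor) -> torch.Tensor:
    """
    clip:       (T, C, H, W) float tensor of video frames.
    actor_mask: (T, 1, H, W) binary tensor, 1 where the (unoccluded) actor is visible.
    Returns the counterfactual clip with actor pixels replaced by the per-frame mean.
    """
    fill = clip.mean(dim=(2, 3), keepdim=True)          # per-frame, per-channel mean value
    return clip * (1 - actor_mask) + fill * actor_mask  # keep background/occluder, erase actor
```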

4.3 Counterfactual Reasoning and Learning

Counterfactual reasoning allows us to compare model predictions under normal and manipulated conditions, quantifying the causal effect of modifying $A$. Specifically, the causal effect is computed as:

\[
P_{\mathrm{effect}} = E_{A \sim \tilde{A}}\big[P(A=\bm{A},\, O=\bm{O}) - P(\mathrm{do}(A=\bm{a}),\, O=\bm{O})\big], \tag{1}
\]

where $\bm{A}$ is the observed actor state, $\bm{a}$ is the counterfactual state, and $A \sim \tilde{A}$ represents the distribution of the manipulated actor state.
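As a small illustration of Eq. (1), a per-clip estimate of the treatment effect is simply the gap between the class probabilities of the original clip and those of the counterfactual clip; the expectation over the manipulated actor states is omitted in this sketch, and the helper name and tensor shapes are assumptions.

```python
# Per-clip estimate of Eq. (1): P(A, O) - P(do(A=a), O), class-wise.
import torch.nn.functional as F

def treatment_effect(p_logits, c_logits):
    """p_logits, c_logits: (B, num_classes) logits for the original and counterfactual clips.
    A positive value for the correct class indicates a positive treatment effect of
    the visible actor features."""
    return F.softmax(p_logits, dim=-1) - F.softmax(c_logits, dim=-1)
```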

We compute a corrected prediction $Y$ by comparing the original prediction $P$ with its counterfactual counterpart $C$. Let $p_i$ and $c_i$ represent the logits (pre-softmax outputs) for the original and counterfactual predictions, respectively. The original prediction probabilities are calculated as:

\[
P = \left(\frac{\exp(p_1)}{\sum_{i=1}^{n}\exp(p_i)},\, \frac{\exp(p_2)}{\sum_{i=1}^{n}\exp(p_i)},\, \dots,\, \frac{\exp(p_n)}{\sum_{i=1}^{n}\exp(p_i)}\right), \tag{2}
\]

ensuring that $P$ sums to 1 across all classes.

The corrected prediction $Y$ is derived from the difference between the logits:

\[
Y = \left(\frac{\exp(p_i - c_i)}{\sum_{j=1}^{n}\exp(p_j - c_j)}\right)_{i=1}^{n}. \tag{3}
\]

We quantify the causal effect between $P$ and $Y$ using the relative entropy loss (i.e., the Kullback-Leibler divergence), combined with the cross-entropy loss and balanced by a hyper-parameter $\alpha$:

\[
\mathcal{L} = -\sum_{i=1}^{n} P_i \log \hat{P}_i + \alpha \sum_{i=1}^{n} P_i \log\left(\frac{P_i}{Y_i}\right), \tag{4}
\]

where $\hat{P}_i$ represents the label of class $i$.
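Below is a minimal PyTorch sketch of Eqs. (2)-(4). The corrected prediction Y is the softmax of the logit difference, and the loss combines a cross-entropy term (implemented here in the usual label-versus-prediction form) with the KL term weighted by α. Function and variable names are ours, not the released implementation.

```python
# Sketch of the counterfactual-supervised loss, assuming batched class logits.
import torch
import torch.nn.functional as F

def car_loss(p_logits: torch.Tensor, c_logits: torch.Tensor,
             labels: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """
    p_logits, c_logits: (B, num_classes) logits for the original and counterfactual clips.
    labels:             (B,) ground-truth class indices.
    """
    log_p = F.log_softmax(p_logits, dim=-1)                   # log P, Eq. (2)
    log_y = F.log_softmax(p_logits - c_logits, dim=-1)        # log Y, Eq. (3)
    ce = F.nll_loss(log_p, labels)                            # cross-entropy term
    kl = F.kl_div(log_y, log_p.exp(), reduction="batchmean")  # KL(P || Y) = sum P log(P / Y)
    return ce + alpha * kl
```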

Our framework incorporates counterfactual supervision during fine-tuning, enhancing robustness to occlusions without additional inference steps and ensuring practical efficiency in model deployment.

Table 2: Benchmarking results on K-400-O (K-O), OccludeNet-D with occluder scales 0.25/0.50/0.75 (D-25/D-50/D-75), OccludeNet-S (O-S), OccludeNet-I (O-I), OccludeNet-M (O-M), and Kinetics-400 (K-400).

| Method | Venue | Backbone | K-O | D-25 | D-50 | D-75 | O-S | O-I | O-M | K-400 |
| I3D [4] | CVPR'17 | ResNet-50 | 33.6 | 68.48 | 64.26 | 57.26 | 15.09 | 10.19 | 9.86 | 71.28 |
| X3D [13] | CVPR'20 | ResNet-50 | 34.3 | 67.60 | 63.34 | 55.43 | 16.98 | 19.44 | 2.90 | 70.47 |
| MViTv2 [29] | CVPR'22 | Transformer | 56.7 | 77.35 | 74.64 | 69.83 | 26.42 | 24.07 | 1.16 | 79.01 |
| VideoMAE [53] | NeurIPS'22 | Transformer | 58.1 | 77.67 | 75.72 | 71.14 | 32.08 | 3.70 | 4.06 | 79.34 |
| UniFormerv2 [25] | ICCV'23 | Transformer | - | 80.82 | 77.68 | 71.79 | 20.75 | 7.41 | 11.30 | 82.60 |
| VideoMAEv2 [55] | CVPR'23 | Transformer | - | 80.81 | 78.85 | 74.66 | 7.55 | 2.78 | 17.39 | 82.15 |
| VideoMamba [27] | ECCV'24 | Mamba | - | 74.54 | 71.68 | 66.43 | 22.96 | 22.53 | 10.55 | 76.28 |
| InternVideo2 [57] | arXiv'24 | Transformer | - | 83.33 | 81.61 | 78.04 | 20.12 | 16.19 | 3.61 | 84.34 |

5 Experimental Results

5.1 Benchmarking and Analysis

Experimental Setup.

We evaluate eight state-of-the-art models on Kinetics-400[19], K-400-O[14], and our OccludeNet testing sets. The models tested include InternVideo2[57], VideoMamba[27], UniFormerv2[25], VideoMAEv2[55], VideoMAE[53], MViTv2[29], X3D[13], and I3D[4]. Kinetics-400 consists of original videos used to generate OccludeNet-D, ensuring consistent video counts across different occlusion levels. For evaluation consistency, we use the mmaction2[9] framework and pretrained model weights for all models except VideoMamba and InternVideo2. Tab.2 summarizes the experimental results across all models.

Influencing Factors Analysis.

Examining performance across various occlusion factors, we find that accuracy decreases for all models as the occlusion degree increases (see Fig.4(a)), with CNN-based models such as I3D[4] and X3D[13] experiencing the steepest declines. Fig.4(b) shows that accuracy falls as the occlusion area ratio rises, indicating that the occluder's coverage of the actor's bounding box obscures critical features. Additionally, Fig.4(c) demonstrates that higher occlusion duration ratios, reflecting a greater proportion of occluded frames, further reduce accuracy, as models have fewer visible actor features to recognize actions.

Action Classes under Different Parent Classes.

Parent classes categorize similar actions (e.g., playing drums, trombone, and violin under “Music”)[19]. Fig.5(a) shows that occlusion has the greatest impact on actions related to body motion, eating & drinking, head & mouth, and touching persons. These actions often involve body parts occupying large portions of the frame with minimal background cue, making them particularly vulnerable to occlusion effects.

Action Classes with Different Characteristics.

Fig.5(b) highlights performance variations across different action classes on OccludeNet-D. Classes with greater accuracy drops often feature low background relevance and partial body visibility, including diet-related actions (eating burgers, eating watermelon), head-focused actions (brushing teeth, curling hair, gargling), hand-related actions (nail clipping, clapping, sign language interpretation), and instrumental music actions (playing flute, saxophone). In these cases, models rely heavily on actor features rather than background cues. Conversely, actions with significant background relevance (e.g., swimming, surfing, rock climbing) are less impacted by occlusion.

Table 3: Top-1 and Top-5 accuracy (%) under unoccluded, slight, moderate, and heavy occlusion.

| Accuracy | Unoccluded | Slight | Moderate | Heavy |
| Top-1 | 77.30 | 75.32 | 72.31 | 66.65 |
| Top-5 | 93.00 | 91.89 | 89.87 | 85.69 |

Generalization Ability of Synthetic Data.

To assess the impact of our occluded dataset on action recognition, we perform an ablation study, with results shown in Tab.3. The findings show that while occluded video data reduces model accuracy, human observers can still recognize the action class. This highlights the increased difficulty of action recognition with state-of-the-art models in the presence of diverse occlusion conditions.


To assess the generalization ability of synthetic data, we train models on OccludeNet-D and evaluate them on real-world occluded data. As shown in Fig.7, models trained on OccludeNet-D exhibit superior accuracy and show performance gains across various datasets. These results demonstrate that OccludeNet-D significantly enhances model robustness to occlusion and improves generalization.

| Method | Backbone | D-25 | D-50 | D-75 | Kinetics-400 | K-400-O | EPIC |
| StillMix [24] | Swin Transformer | 55.35 / 79.59 | 53.72 / 78.55 | 51.63 / 76.27 | 57.63 / 81.21 | - | - |
| FAME [11] | Swin Transformer | 55.13 / 79.88 | 54.21 / 78.72 | 51.64 / 76.33 | 57.41 / 80.98 | - | - |
| UniFormerv2 [25] | CNN + Transformer | 88.37 / 97.69 | 87.16 / 97.25 | 85.13 / 96.50 | 88.31 / 97.71 | 57.75 / 77.47 | 35.81 / 79.05 |
| UniFormerv2 + CAR | CNN + Transformer | 88.63 / 97.77 | 87.61 / 97.74 | 85.50 / 96.66 | 88.98 / 97.99 | 57.75 / 77.80 | 37.16 / 81.76 |

Dataset Sustainability Discussion.

Fig.8 shows that models exhibit fewer inter-class accuracy differences on K-400-O than on OccludeNet-D. The heavy occlusion strategy on K-400-O[14] occludes most actor and background information, rendering action recognition nearly impossible, even for humans. This approach resembles blind prediction rather than recognition based on meaningful visual cues. In contrast, OccludeNet-D retains sufficient visual information for humans to recognize action classes despite occlusion. This strategy supports sustainable improvements in action recognition accuracy under occlusion and contributes to enhanced model robustness.

5.2 Causal Action Recognition

To investigate the causal relation between actors and occlusion backgrounds, we compare our approach with other background modeling techniques, including StillMix[24] and FAME[11]. We fine-tune UniFormerv2-B/16[25], pre-trained on CLIP-400M+K710, for 5 epochs on our dataset to ensure a fair comparison. For OccludeNet-S, we use pre-trained weights from three models: X3D-S[13] (CNN architecture), UniFormerv2-B (Transformer architecture), and VideoMamba-M[27] (Mamba architecture), and further fine-tune them on the dataset for 5 and 10 epochs using both the baseline and CAR.

| Method | S-5 | S-10 | O-I | O-M |
| X3D [13] | 26.42 | 20.75 | 52.78 | 6.09 |
| X3D + CAR | 26.42 | 28.30 | 54.63 | 8.70 |
| Improvement | ↑0.00 | ↑7.55 | ↑1.85 | ↑2.61 |
| UniFormerv2 [25] | 79.25 | 84.91 | 14.81 | 19.13 |
| UniFormerv2 + CAR | 81.13 | 86.78 | 15.74 | 23.77 |
| Improvement | ↑1.88 | ↑1.87 | ↑0.93 | ↑4.64 |
| VideoMamba [27] | 26.25 | 33.64 | 22.53 | 32.46 |
| VideoMamba + CAR | 27.04 | 34.75 | 24.92 | 34.49 |
| Improvement | ↑0.79 | ↑1.11 | ↑2.39 | ↑2.03 |

Results and Generalization Ability Analysis.

We evaluate the effect of CAR on occlusion robustness by applying it to OccludeNet, Kinetics-400, K-400-O, and Epic-Kitchens[10]. Results are shown in Tab.4 and Tab.5. Our CAR achieves the highest accuracy on all datasets, surpassing the baseline without causal reasoning. These results indicate that our method effectively mitigates performance degradation in occluded scenes.


As visualized in Fig.9, our CAR effectively shifts the model’s attention away from occluders and irrelevant backgrounds, focusing more on the actor’s action characteristics.

Table 6: Effect of the hyper-parameter α on OccludeNet-D (D-25/D-50/D-75).

| α | D-25 | D-50 | D-75 |
| 0 | 88.37 / 97.69 | 87.16 / 97.25 | 85.13 / 96.50 |
| 0.5 | 88.18 / 97.72 | 87.36 / 97.46 | 85.23 / 96.56 |
| 1.0 | 88.63 / 97.77 | 87.61 / 97.74 | 85.50 / 96.66 |
| 2.0 | 88.24 / 97.75 | 73.02 / 93.65 | 85.19 / 96.57 |

Causal Inference Ablation Study.

We conduct ablation experiments to validate the influence of causal factors. Since the ratio $\alpha$ is empirically derived and does not correspond to conventional magnitudes, we evaluate various $\alpha$ values to assess their impact on the final results (see Tab.6). The findings demonstrate that selecting an optimal range for $\alpha$ significantly improves the model's robustness to occlusions.

6 Conclusion

We introduce OccludeNet, a large-scale video dataset designed to simulate dynamic tracking, static scene, and multi-view interactive occlusions. Our analysis shows that occlusion strategies affect action classes differently, with greater accuracy degradation in actions with low scene relevance and partial body visibility. To address this, we propose the CAR framework, which uses a structural causal model of occluded scenes. By incorporating counterfactual reasoning and backdoor adjustment, CAR enhances robustness by focusing on unoccluded actor features.

Acknowledgments

We extend our sincere gratitude to all individuals who contributed to the creation and recording of the dataset used in this research. Special thanks to Shilei Zhao, Zhengwei Yang, Zhuo Zhou, Mengdie Wang, Liuzhong Ma, Yi Zhang, and Huantao Zheng, as well as the volunteers from XIAN Group, for their invaluable assistance in data collection and recording. Their dedication and hard work were instrumental to the success of this project.

This work was supported in part by the National Natural Science Foundation of China under Grant 62271361 and the Hubei Provincial Key Research and Development Program under Grant 2024BAB039.

References

  • Angelini et al. [2020] Federico Angelini, Zeyu Fu, Yang Long, Ling Shao, and Syed Mohsen Naqvi. 2D pose-based real-time human action recognition with occlusion-handling. IEEE Trans. Multim., 22(6):1433–1446, 2020.
  • Cai et al. [2022] Han Cai, Chuang Gan, and Song Han. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv:2205.14756, 2022.
  • Carreira and Zisserman [2017a] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4724–4733, 2017a.
  • Carreira and Zisserman [2017b] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4724–4733, 2017b.
  • Cen et al. [2024] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, et al. Segment anything in 3D with NeRFs. In Adv. Neural Inf. Process. Syst., 36, 2024.
  • Chen et al. [2023] Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, and Rainer Stiefelhagen. Unveiling the hidden realm: Self-supervised skeleton-based action recognition in occluded environments. arXiv:2309.12029, 2023.
  • Cheng et al. [2023a] Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv:2305.06558, 2023a.
  • Cheng et al. [2023b] Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv:2305.06558, 2023b.
  • Contributors [2020] MMAction Contributors. OpenMMLab's next generation video understanding toolbox and benchmark, 2020.
  • Damen et al. [2018] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proc. Eur. Conf. Comput. Vis., 2018.
  • Ding et al. [2022a] Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 9706–9716, 2022a.
  • Ding et al. [2022b] Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, and Hongkai Xiong. Motion-aware contrastive video representation learning via foreground-background merging. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 9706–9716, 2022b.
  • Feichtenhofer [2020] Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 200–210, 2020.
  • Grover et al. [2023] Shresth Grover, Vibhav Vineet, and Yogesh S. Rawat. Revealing the unseen: Benchmarking video action recognition under occlusion. In Adv. Neural Inf. Process. Syst., 2023.
  • Herath et al. [2017a] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Porikli. Going deeper into action recognition: A survey. Image Vis. Comput., 60:4–21, 2017a.
  • Herath et al. [2017b] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Porikli. Going deeper into action recognition: A survey. Image Vis. Comput., 60:4–21, 2017b.
  • Hwang et al. [2023] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Ho Yuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adaptive mixture-of-experts at scale. In Proc. Conf. Mach. Learn. Syst., 2023.
  • Jocher et al. [2020] Glenn Jocher, Alex Stoken, Jirka Borovec, Liu Changyu, Adam Hogan, Laurentiu Diaconu, Francisco Ingham, Jake Poznanski, Jiacong Fang, Lijun Yu, et al. ultralytics/yolov5: v3.1 - bug fixes and performance improvements. Zenodo, 2020.
  • Kay et al. [2017] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 3992–4003, 2023.
  • Kong and Fu [2022a] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. Int. J. Comput. Vis., 130(5):1366–1401, 2022a.
  • Kong and Fu [2022b] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. Int. J. Comput. Vis., 130(5):1366–1401, 2022b.
  • Kuehne et al. [2011] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 2556–2563, 2011.
  • Li et al. [2023a] Haoxin Li, Yuan Liu, Hanwang Zhang, and Boyang Li. Mitigating and evaluating static bias of action representations in the background and the foreground. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 19854–19866, 2023a.
  • Li et al. [2023b] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Unlocking the potential of image ViTs for video understanding. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 1632–1643, 2023b.
  • Li et al. [2023c] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Unlocking the potential of image ViTs for video understanding. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 1632–1643, 2023c.
  • Li et al. [2024] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In Proc. Eur. Conf. Comput. Vis., 2024.
  • Li et al. [2019] Tianhong Li, Lijie Fan, Mingmin Zhao, Yingcheng Liu, and Dina Katabi. Making the invisible visible: Action recognition through walls and occlusions. In Proc. IEEE/CVF Int. Conf. Comput. Vis., pages 872–881, 2019.
  • Li et al. [2022a] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4794–4804, 2022a.
  • Li et al. [2022b] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4794–4804, 2022b.
  • Lin et al. [2014a] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comput. Vis., pages 740–755, 2014a.
  • Lin et al. [2014b] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comput. Vis., pages 740–755, 2014b.
  • Liu et al. [2024] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proc. Eur. Conf. Comput. Vis., 2024.
  • Liu et al. [2023a] Wenxuan Liu, Xian Zhong, Zhuo Zhou, Kui Jiang, Zheng Wang, and Chia-Wen Lin. Dual-recommendation disentanglement network for view fuzz in action recognition. IEEE Trans. Image Process., 32:2719–2733, 2023a.
  • Liu et al. [2023b] Wenxuan Liu, Xian Zhong, Zhuo Zhou, Kui Jiang, Zheng Wang, and Chia-Wen Lin. Dual-recommendation disentanglement network for view fuzz in action recognition. IEEE Trans. Image Process., 32:2719–2733, 2023b.
  • Liu et al. [2025] Wenxuan Liu, Xuemei Jia, Xian Zhong, Kui Jiang, Xiaohan Yu, and Mang Ye. Dynamic and static mutual fitting for action recognition. Pattern Recognit., 157:110948, 2025.
  • Ma et al. [2024] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024.
  • Madumal et al. [2020] Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. Explainable reinforcement learning through a causal lens. In Proc. AAAI Conf. Artif. Intell., pages 2493–2500, 2020.
  • Niu et al. [2021a] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 12700–12710, 2021a.
  • Niu et al. [2021b] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 12700–12710, 2021b.
  • Pearl [2010] Judea Pearl. Causal inference. In Adv. Neural Inf. Process. Syst. Workshop, pages 39–58, 2010.
  • Rao et al. [2021] Yongming Rao, Guangyi Chen, Jiwen Lu, and Jie Zhou. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.
  • Ren et al. [2024] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv:2401.14159, 2024.
  • Saleh et al. [2021] Kaziwa Saleh, Sándor Szénási, and Zoltán Vámossy. Occlusion handling in generic object detection: A review. In 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 000477–000484. IEEE, 2021.
  • Selvaraju et al. [2020] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis., 128(2):336–359, 2020.
  • Shi et al. [2023a] Wuzhen Shi, Dan Li, Yang Wen, and Wu Yang. Occlusion-aware graph neural networks for skeleton action recognition. IEEE Trans. Ind. Informatics, 19(10):10288–10298, 2023a.
  • Shi et al. [2023b] Wuzhen Shi, Dan Li, Yang Wen, and Wu Yang. Occlusion-aware graph neural networks for skeleton action recognition. IEEE Trans. Ind. Informatics, 19(10):10288–10298, 2023b.
  • Soomro et al. [2012a] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012a.
  • Soomro et al. [2012b] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012b.
  • Sultani et al. [2018] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 6479–6488, 2018.
  • Sun et al. [2021a] Pengzhan Sun, Bo Wu, Xunsong Li, Wen Li, Lixin Duan, and Chuang Gan. Counterfactual debiasing inference for compositional action recognition. In Proc. ACM Int. Conf. Multimedia, pages 3220–3228, 2021a.
  • Sun et al. [2021b] Pengzhan Sun, Bo Wu, Xunsong Li, Wen Li, Lixin Duan, and Chuang Gan. Counterfactual debiasing inference for compositional action recognition. In Proc. ACM Int. Conf. Multimedia, pages 3220–3228, 2021b.
  • Tong et al. [2022a] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Adv. Neural Inf. Process. Syst., 2022a.
  • Tong et al. [2022b] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Adv. Neural Inf. Process. Syst., 2022b.
  • Wang et al. [2023a] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 14549–14560, 2023a.
  • Wang et al. [2023b] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 14549–14560, 2023b.
  • Wang et al. [2024] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv:2403.15377, 2024.
  • Weinland et al. [2010a] Daniel Weinland, Mustafa Özuysal, and Pascal Fua. Making action recognition robust to occlusions and viewpoint changes. In Proc. Eur. Conf. Comput. Vis., pages 635–648, 2010a.
  • Weinland et al. [2010b] Daniel Weinland, Mustafa Özuysal, and Pascal Fua. Making action recognition robust to occlusions and viewpoint changes. In Proc. Eur. Conf. Comput. Vis., pages 635–648, 2010b.
  • Wu et al. [2020a] Yuan Wu, Haoyue Qiu, Jing Wen, and Rui Feng. OSD: An occlusion skeleton dataset for action recognition. In Proc. IEEE Int. Conf. Big Data, pages 3355–3360, 2020a.
  • Wu et al. [2020b] Yuan Wu, Haoyue Qiu, Jing Wen, and Rui Feng. OSD: An occlusion skeleton dataset for action recognition. In Proc. IEEE Int. Conf. Big Data, pages 3355–3360, 2020b.
  • Xie et al. [2020] ZongWu Xie, Qi Zhang, ZaiNan Jiang, and Hong Liu. Robot learning from demonstration for path planning: A review. Sci. China Tech. Sci., 63(8):1325–1334, 2020.
  • Xu et al. [2019] Depeng Xu, Yongkai Wu, Shuhan Yuan, Lu Zhang, and Xintao Wu. Achieving causal fairness through generative adversarial networks. In Proc. Int. Joint Conf. Artif. Intell., pages 1452–1458, 2019.
  • Xu et al. [2020] Guandong Xu, Tri Dung Duong, Qian Li, Shaowu Liu, and Xianzhi Wang. Causality learning: A new perspective for interpretable machine learning. CoRR, abs/2006.16789, 2020.
  • Yang et al. [2018] DianGe Yang, Kun Jiang, Ding Zhao, ChunLei Yu, Zhong Cao, ShiChao Xie, ZhongYang Xiao, XinYu Jiao, SiJia Wang, and Kai Zhang. Intelligent and connected vehicles: Current status and future perspectives. Sci. China Tech. Sci., 61:1446–1471, 2018.
  • Zhang et al. [2023a] Chaoning Zhang, Fachrina Dewi Puspitasari, Sheng Zheng, Chenghao Li, Yu Qiao, Taegoo Kang, Xinru Shan, Chenshuang Zhang, Caiyan Qin, Francois Rameau, et al. A survey on Segment Anything Model (SAM): Vision foundation model meets prompt engineering. arXiv:2306.06211, 2023a.
  • Zhang and Bareinboim [2018] Junzhe Zhang and Elias Bareinboim. Fairness in decision-making - the causal explanation formula. In Proc. AAAI Conf. Artif. Intell., pages 2037–2045, 2018.
  • Zhang et al. [2023b] Kexuan Zhang, Qiyu Sun, Chaoqiang Zhao, and Yang Tang. Causal reasoning in typical computer vision tasks. arXiv:2307.13992, 2023b.
  • Zhong et al. [2023] Xian Zhong, Aoyu Yi, Wenxuan Liu, Wenxin Huang, Chengming Zou, and Zheng Wang. Background-weakening consistency regularization for semi-supervised video action detection. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pages 1–5, 2023.
  • Zhou et al. [2024] Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, and Xuming Hu. Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. arXiv:2410.04780, 2024.
  • Zhu et al. [2019] Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. arXiv:1906.04477, 2019.
  • Zhu et al. [2020] Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. In Proc. Int. Conf. Learn. Represent., 2020.

Appendix A Overview of Appendix

This supplementary material provides the following for a comprehensive understanding of our main paper:

  • Construction of OccludeNet-M

  • Data Samples

  • Data Annotation Details

  • Additional Results and Visualizations

  • Plug-and-Play Integration

  • Potential Applications of OccludeNet

  • Action Classes under Different Parent Categories

  • Author Statement

  • Hosting, Licensing, and Maintenance Plans

Appendix B Construction of OccludeNet-M

Fig.10 presents a schematic diagram illustrating the filming process of the multi-view interactive occlusion dataset OccludeNet-M. We employ both RGB and near-infrared cameras, positioning three cameras at angles of 120° apart around the actor to record the same action sequence. Occlusion conditions include varying degrees of single-view occlusion and simultaneous occlusions across multiple views, spanning different environments, motion amplitudes, and scales. Each clip ranges from 5 seconds to more than 10 seconds in length, providing temporal diversity. Videos are recorded at 1920×1080 resolution to maintain consistent viewpoint data. Audio is included to support multi-modal analysis.


Appendix C Data Samples

C.1 Video Samples

Fig.11 showcases occluded video examples from the OccludeNet dataset, highlighting varying occlusion degrees, duration ratios, and dynamics to demonstrate the dataset's diversity.


C.2 Occluder Samples

We select four common objects from the Microsoft COCO dataset[31] to create dynamic tracking occlusions on the human body. Using a segmentation model, we extract their foregrounds. Each occluder sample retains its file name for traceability and includes pixel annotations for future research. Examples are shown in Fig.12.


Appendix D Data Annotation Details

We provide comprehensive annotations for the OccludeNet dataset. Specifically, for dynamic tracking occlusion (OccludeNet-D), each sample is annotated with the action class, file name, occluder type, occluder file name, occluder pixel ratio, occluder size ratio, occlusion duration, video duration, FPS, and clip generation time. For static scene occlusion (OccludeNet-S), annotations include the action class, file name, FPS, and video duration for each video sample. In the case of interactive occlusions (OccludeNet-I and OccludeNet-M), each video sample is annotated with the action class and file name. The annotated CSV files are illustrated in Fig.13.
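As a hypothetical example, the OccludeNet-D annotations could be loaded and filtered as below; the file name and column names are placeholder stand-ins for the fields listed above, not the released schema.

```python
# Loading the annotation CSV with assumed column names (illustrative only).
import pandas as pd

ann = pd.read_csv("occludenet_d_annotations.csv")  # placeholder path
# Assumed columns: action_class, file_name, occluder_type, occluder_file_name,
# occluder_pixel_ratio, occluder_size_ratio, occlusion_duration, video_duration,
# fps, clip_generation_time
heavy = ann[ann["occluder_size_ratio"] >= 0.75]    # e.g. select heavily occluded clips
print(len(heavy), "clips with occluder scale >= 0.75")
```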


Appendix E Additional Results and Visualizations

E.1 Benchmarking and Analysis

Fig.14 illustrates that, on the OccludeNet dataset, the Top-5 accuracy of all seven models decreases significantly as occlusion increases. Notably, CNN-based models exhibit the least robustness to occlusion, whereas Transformer-based models demonstrate slightly greater resilience.


E.2 Class Correlation Profiling

We conduct further class-related analyses, as depicted in Fig.15 and Fig.16. Actions involving specific body parts, such as eating & drinking, body motions, head & mouth activities, and interacting with persons, exhibit limited dynamic range, with actors occupying large portions of the frame and displaying low correlation with background information. Consequently, occlusions have a more pronounced impact on the recognition accuracy of these classes. In contrast, action classes like water sports, high-altitude activities, juggling, and racquet & bat sports are more closely associated with their backgrounds (e.g., water environments and mountainous terrains). For these classes, models can leverage background information to correctly identify actions, making them relatively less susceptible to occlusion.


E.3 Causal Action Recognition

Fig.17 provides additional visualizations using the Grad-CAM method[45] to interpret the model's attention distribution. On the OccludeNet dataset, our Causal Action Recognition (CAR) method directs more attention to the unoccluded regions of the actor and the key aspects of the action compared to the baseline model. In contrast, the baseline model exhibits weaker focus and erroneously emphasizes occluders. Additionally, on the Kinetics-400 dataset without occlusions, CAR exhibits a more focused attention distribution on the actor's action features. These results demonstrate that our CAR method effectively enhances the model's robustness to occlusion and improves attention distribution.


E.4 Plug-and-Play Integration

Our Causal Action Recognition (CAR) method can be seamlessly integrated into existing action recognition models with minimal environmental dependencies. The CAR framework comprises two main components:

  1. Training Process Intervention: This involves generating counterfactual samples, modifying prediction values, and adjusting the loss function.

  2. Grounded-Segment-and-Track-Anything Module: This module handles precise actor segmentation and tracking.

For the training process intervention, it is sufficient to add a function interface for counterfactual sample generation and modify the structure of the prediction values and loss function accordingly. Regarding the Grounded-Segment-and-Track-Anything module, users need to install the environment dependencies for Grounded-SAM[43] and EfficientVit[2] using a package management tool. After installing these dependencies, users can directly import the provided code package and the necessary model weights. Detailed instructions are available in our open-source repository.
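To illustrate the plug-and-play idea, a training step wrapped with CAR might look like the following sketch. It reuses the illustrative erase_actor and car_loss helpers sketched in Sec. 4; segment_actor stands in for the Grounded-Segment-and-Track-Anything module, and the tensor layout and overall loop are our assumptions, not the released interface.

```python
# Sketch of bolting the two CAR components onto an existing recognizer's training step.
import torch

def car_training_step(model, clip, labels, segment_actor, optimizer, alpha=1.0):
    """clip: (B, T, C, H, W) video batch; labels: (B,) class indices. Returns the loss."""
    with torch.no_grad():
        masks = segment_actor(clip)                      # (B, T, 1, H, W) actor masks (external module)
        cf_clip = torch.stack([erase_actor(c, m) for c, m in zip(clip, masks)])  # counterfactual clips
    p_logits = model(clip)                               # original prediction
    c_logits = model(cf_clip)                            # counterfactual prediction
    loss = car_loss(p_logits, c_logits, labels, alpha)   # combined loss, Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```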

Appendix F Potential Applications of OccludeNet

F.1 Security and Surveillance Video

Surveillance and security videos frequently feature occlusions due to limited viewing angles, intentional obstructions, and other factors. The OccludeNet dataset provides essential data for improving the robustness of surveillance video models to occluded scenes.

F.2 Occlusion in Virtual Scenes

The rise of short-form and live videos has led to an exponential increase in virtual occlusions, such as subtitles, stickers, and dynamic tracking effects. OccludeNet can simulate these types of occlusions, enabling research that directly addresses the challenges they pose.

F.3 Occlusion Localization

Incorporating occlusions allows future models to achieve a deeper understanding of video elements, facilitating tasks like occlusion localization, where models identify and pinpoint occluded regions within frames.

F.4 Multi-modal Analysis

All samples in the OccludeNet dataset include audio, making them suitable for multi-modal tasks and analysis. This supports the growing trend in video analysis toward integrating multiple modalities to enhance performance.

Appendix G Action Classes under Different Parent Categories

We adopt the parent class hierarchy of action classes from the Kinetics-400 dataset, as detailed below:

Arts and Crafts

  • arranging flowers

  • blowing glass

  • brush painting

  • carving pumpkin

  • clay pottery making

  • decorating the Christmas tree

  • drawing

  • getting a tattoo

  • knitting

  • making jewelry

  • spray painting

  • weaving basket

Athletics – Jumping

  • high jump

  • hurdling

  • long jump

  • parkour

  • pole vault

  • triple jump

Athletics – Throwing + Launching

  • archery

  • catching or throwing frisbee

  • disc golfing

  • hammer throw

  • javelin throw

  • shot put

  • throwing axe

  • throwing ball

  • throwing discus

Auto Maintenance

  • changing oil

  • changing wheel

  • checking tires

  • pumping gas

Ball Sports

  • bowling

  • catching or throwing baseball

  • catching or throwing softball

  • dodgeball

  • dribbling basketball

  • dunking basketball

  • golf chipping

  • golf driving

  • golf putting

  • hurling (sport)

  • juggling soccer ball

  • kicking field goal

  • kicking soccer ball

  • passing American football (in game)

  • passing American football (not in game)

  • playing basketball

  • playing cricket

  • playing kickball

  • playing squash or racquetball

  • playing tennis

  • playing volleyball

  • shooting basketball

  • shooting goal (soccer)

  • shot put

Body Motions

  • air drumming

  • applauding

  • baby waking up

  • bending back

  • clapping

  • cracking neck

  • drumming fingers

  • finger snapping

  • headbanging

  • headbutting

  • pumping fist

  • shaking head

  • stretching arm

  • stretching leg

  • swinging legs

Cleaning

  • cleaning floor

  • cleaning gutters

  • cleaning pool

  • cleaning shoes

  • cleaning toilet

  • cleaning windows

  • doing laundry

  • making bed

  • mopping floor

  • setting table

  • shining shoes

  • sweeping floor

  • washing dishes

Clothes

  • bandaging

  • doing laundry

  • folding clothes

  • folding napkins

  • ironing

  • making bed

  • tying bow tie

  • tying knot (not on a tie)

  • tying tie

Communication

  • answering questions

  • auctioning

  • bartending

  • celebrating

  • crying

  • giving or receiving award

  • laughing

  • news anchoring

  • presenting weather forecast

  • sign language interpreting

  • testifying

Cooking

  • baking cookies

  • barbequing

  • breading or breadcrumbing

  • cooking chicken

  • cooking egg

  • cooking on campfire

  • cooking sausages

  • cutting pineapple

  • cutting watermelon

  • flipping pancake

  • frying vegetables

  • grinding meat

  • making a cake

  • making a sandwich

  • making pizza

  • making sushi

  • making tea

  • peeling apples

  • peeling potatoes

  • picking fruit

  • scrambling eggs

  • tossing salad

Dancing

  • belly dancing

  • breakdancing

  • capoeira

  • cheerleading

  • country line dancing

  • dancing ballet

  • dancing charleston

  • dancing gangnam style

  • dancing macarena

  • jumpstyle dancing

  • krumping

  • marching

  • robot dancing

  • salsa dancing

  • swing dancing

  • tango dancing

  • tap dancing

  • zumba

Eating + Drinking

  • bartending

  • dining

  • drinking

  • drinking beer

  • drinking shots

  • eating burger

  • eating cake

  • eating carrots

  • eating chips

  • eating doughnuts

  • eating hotdog

  • eating ice cream

  • eating spaghetti

  • eating watermelon

  • opening bottle

  • tasting beer

  • tasting food

Electronics

  • assembling computer

  • playing controller

  • texting

  • using computer

  • using remote controller (not gaming)

Garden + Plants

  • blowing leaves

  • carving pumpkin

  • chopping wood

  • climbing tree

  • decorating the Christmas tree

  • egg hunting

  • mowing lawn

  • planting trees

  • trimming trees

  • watering plants

Golf

  • golf chipping

  • golf driving

  • golf putting

Gymnastics

  • bouncing on trampoline

  • cartwheeling

  • gymnastics tumbling

  • somersaulting

  • vault

Hair

  • braiding hair

  • brushing hair

  • curling hair

  • dying hair

  • fixing hair

  • getting a haircut

  • shaving head

  • shaving legs

  • trimming or shaving beard

  • washing hair

  • waxing back

  • waxing chest

  • waxing eyebrows

  • waxing legs

Hands

  • air drumming

  • applauding

  • clapping

  • cutting nails

  • doing nails

  • drumming fingers

  • finger snapping

  • pumping fist

  • washing hands

Head + Mouth

  • balloon blowing

  • beatboxing

  • blowing nose

  • blowing out candles

  • brushing teeth

  • gargling

  • headbanging

  • headbutting

  • shaking head

  • singing

  • smoking

  • smoking hookah

  • sneezing

  • sniffing

  • sticking tongue out

  • whistling

  • yawning

Heights

  • abseiling

  • bungee jumping

  • climbing a rope

  • climbing ladder

  • climbing tree

  • diving cliff

  • ice climbing

  • jumping into pool

  • paragliding

  • rock climbing

  • skydiving

  • slacklining

  • springboard diving

  • swinging on something

  • trapezeing

Interacting with Animals

  • bee keeping

  • catching fish

  • feeding birds

  • feeding fish

  • feeding goats

  • grooming dog

  • grooming horse

  • holding snake

  • ice fishing

  • milking cow

  • petting animal (not cat)

  • petting cat

  • riding camel

  • riding elephant

  • riding mule

  • riding or walking with horse

  • shearing sheep

  • training dog

  • walking the dog

Juggling

  • contact juggling

  • hula hooping

  • juggling balls

  • juggling fire

  • juggling soccer ball

  • spinning poi

Makeup

  • applying cream

  • doing nails

  • dying hair

  • filling eyebrows

  • getting a tattoo

Martial Arts

  • arm wrestling

  • capoeira

  • drop kicking

  • high kick

  • punch

  • punching bag

  • punching person

  • side kick

  • sword fighting

  • tai chi

  • wrestling

Miscellaneous

  • digging

  • extinguishing fire

  • garbage collecting

  • laying bricks

  • moving furniture

  • spraying

  • stomping grapes

  • tapping pen

  • unloading truck

Mobility – Land

  • crawling baby

  • driving car

  • driving tractor

  • faceplanting

  • hoverboarding

  • jogging

  • motorcycling

  • parkour

  • pushing car

  • pushing cart

  • pushing wheelchair

  • riding a bike

  • riding mountain bike

  • riding scooter

  • riding unicycle

  • roller skating

  • running on treadmill

  • skateboarding

  • surfing crowd

  • using segway

  • waiting in line

Mobility – Water

  • crossing river

  • diving cliff

  • jumping into pool

  • scuba diving

  • snorkeling

  • springboard diving

  • swimming backstroke

  • swimming breast stroke

  • swimming butterfly stroke

  • water sliding

Music

  • beatboxing

  • busking

  • playing accordion

  • playing bagpipes

  • playing bass guitar

  • playing cello

  • playing clarinet

  • playing cymbals

  • playing didgeridoo

  • playing drums

  • playing flute

  • playing guitar

  • playing harmonica

  • playing harp

  • playing keyboard

  • playing organ

  • playing piano

  • playing recorder

  • playing saxophone

  • playing trombone

  • playing trumpet

  • playing ukulele

  • playing violin

  • playing xylophone

  • recording music

  • singing

  • strumming guitar

  • tapping guitar

  • whistling

Paper

  • bookbinding

  • counting money

  • folding napkins

  • folding paper

  • opening present

  • reading book

  • reading newspaper

  • ripping paper

  • shredding paper

  • unboxing

  • wrapping present

  • writing

Personal Hygiene

  • brushing teeth

  • taking a shower

  • trimming or shaving beard

  • washing feet

  • washing hair

  • washing hands

Playing Games

  • egg hunting

  • flying kite

  • hopscotch

  • playing cards

  • playing chess

  • playing monopoly

  • playing paintball

  • playing poker

  • riding mechanical bull

  • rock scissors paper

  • shuffling cards

  • skipping rope

  • tossing coin

Racquet + Bat Sports

  • catching or throwing baseball

  • catching or throwing softball

  • hitting baseball

  • hurling (sport)

  • playing badminton

  • playing cricket

  • playing squash or racquetball

  • playing tennis

Snow + Ice

  • biking through snow

  • bobsledding

  • hockey stop

  • ice climbing

  • ice fishing

  • ice skating

  • making snowman

  • playing ice hockey

  • shoveling snow

  • ski jumping

  • skiing (not slalom or crosscountry)

  • skiing crosscountry

  • skiing slalom

  • sled dog racing

  • snowboarding

  • snowkiting

  • snowmobiling

  • tobogganing

Swimming

  • swimming backstroke

  • swimming breast stroke

  • swimming butterfly stroke

Touching Person

  • carrying baby

  • hugging

  • kissing

  • massaging back

  • massaging feet

  • massaging legs

  • massaging person’s head

  • shaking hands

  • slapping

  • tickling

Using Tools

  • bending metal

  • blasting sand

  • building cabinet

  • building shed

  • changing oil

  • changing wheel

  • checking tires

  • plastering

  • pumping gas

  • sanding floor

  • sharpening knives

  • sharpening pencil

  • welding

Water Sports

  • canoeing or kayaking

  • jetskiing

  • kitesurfing

  • parasailing

  • sailing

  • surfing water

  • water skiing

  • windsurfing

Waxing

  • waxing back

  • waxing chest

  • waxing eyebrows

  • waxing legs
