https://piazza.com/uoit.ca/fall2024/csci4440u/home
Lab times and locations are available here.
Labs and in-class exercises will be submitted through the course Canvas site.
This course explores the techniques and methodologies for the recognition and analysis of human actions in images and video sequences. Designed for senior-year computer science students, the course covers both classical and contemporary approaches to action recognition, integrating theory with hands-on application. This course is particularly well-suited for students who are interested in pursuing research or careers in computer vision, artificial intelligence, or related fields.
The students will be exposed to a curated selection of influential papers that cover a range of methodologies for action recognition. Each week, students will read assigned papers, critically analyze them, and participate in in-depth class discussions to explore the strengths, weaknesses, and potential applications of the proposed methods.
In addition to the discussions, students will choose one or more papers to implement. This hands-on project will allow them to reproduce key results, experiment with variations of the methods, and possibly propose and test their own improvements. Through this process, students will gain a deeper understanding of the technical challenges and considerations involved in human action recognition, as well as experience in implementing and evaluating research ideas.
By the end of the course, students will have not only gained knowledge of the latest trends and techniques in action recognition but also developed practical skills in reading, critiquing, and implementing complex research papers.
A working knowledge of machine learning and deep learning (CSCI 4050, CSCI 4052, or equivalent) and familiarity with basic computer vision techniques (CSCI 3240U, preferably CSCI 4220U, or equivalent).
Fundamentals of Action Recognition: Introduction to the field, including definitions, challenges, and applications in areas such as surveillance, human-computer interaction, and sports analytics.
Feature Extraction and Representation: Study of feature extraction methods, such as optical flow, spatio-temporal interest points, and deep feature representations using Convolutional Neural Networks (CNNs) and Transformers (see the optical-flow sketch after this list).
Machine Learning and Deep Learning Techniques: Application of machine learning models, including Support Vector Machines (SVMs) and Random Forests, as well as deep learning architectures like RNNs, LSTMs, and 3D CNNs for action recognition tasks.
Spatio-Temporal Modeling: Techniques for capturing and modeling the spatial and temporal dimensions of action in videos, including the use of spatio-temporal graphs, attention mechanisms, and multi-stream networks.
Datasets and Evaluation: Overview of popular datasets used in the field, such as UCF101, Kinetics, and AVA, and discussion of evaluation metrics like accuracy, F1 score, and mean Average Precision (mAP); a short metric-computation sketch follows this list.
Applications and Case Studies: Exploration of real-world applications and case studies in action recognition, from autonomous vehicles to entertainment and beyond.
Emerging Trends: Discussion of the latest trends in the field, such as zero-shot learning, self-supervised learning, and the integration of action recognition with other computer vision tasks like object detection and scene understanding.
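As a concrete example of a classical motion feature from the feature-extraction topic above, the following sketch uses OpenCV's Farneback dense optical flow to turn a short clip into a simple direction-histogram descriptor. It is a minimal illustration, not course-provided code; the file name and the histogram design are assumptions.

    # Minimal sketch: dense optical flow as a hand-crafted motion feature.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture("example_clip.mp4")  # hypothetical input clip
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    flow_features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow: an (H, W, 2) field of per-pixel motion vectors.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Summarize each frame pair by a histogram of flow directions weighted by magnitude.
        hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
        flow_features.append(hist / (hist.sum() + 1e-8))
        prev_gray = gray

    cap.release()
    clip_descriptor = np.mean(flow_features, axis=0)  # crude clip-level motion descriptor
    print(clip_descriptor.shape)  # (8,)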
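The evaluation metrics listed under Datasets and Evaluation can be computed with scikit-learn. The labels and scores below are made up, and the mAP shown is the classification-style mean of per-class average precision; detection benchmarks such as AVA instead compute mAP over spatio-temporal detections.

    # Minimal sketch of clip-level metrics: accuracy, macro F1, and mAP.
    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score, average_precision_score

    # Hypothetical 3-class problem: ground-truth labels and per-class scores for 5 clips.
    y_true = np.array([0, 2, 1, 2, 0])
    scores = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.2, 0.5, 0.3],
                       [0.5, 0.3, 0.2],
                       [0.4, 0.4, 0.2]])
    y_pred = scores.argmax(axis=1)

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))

    # Classification-style mAP: mean per-class average precision over one-hot labels.
    y_onehot = np.eye(3)[y_true]
    print("mAP:", average_precision_score(y_onehot, scores, average="macro"))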
By the end of this course, students will be able to:
A student must earn at least 50% on the course project to pass the course. Furthermore, a student must earn at least 50% on the two midterms to pass the course. Class attendance is not optional.
Ontario Tech University’s academic calendar, which lists important dates and deadlines, is available here.
The list presented below is by no means complete.
Each week, a paper will be assigned to one or more students, who will lead the discussion on that paper.
The course project is an independent exploration of a specific problem within the context of this course.
The topic of the project will be decided in consultation with the instructor.
The project grade will depend on your ideas, how well you present them in the report, how well you position your work in the related literature, how thorough your experiments are, and how thoughtful your conclusions are.
Teams of up to two students are allowed.
You are required to prepare a three-minute video that provides an overview of your project. You may frame this video as a pitch to investors, with a broad understanding of the computer science, information technology, and artificial intelligence landscape, who are considering investing in a business built around the technology you developed in your project.
For your final project write-up you must use the ACM SIG Proceedings template (available on the ACM website). The project report may be at most 12 pages long, plus extra pages for references.
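A minimal LaTeX skeleton, assuming the current acmart class distributed with the ACM template; the title, author, section names, and bibliography file are placeholders.

    % Minimal skeleton assuming the ACM "acmart" class; the sigconf option
    % gives the two-column SIG proceedings layout.
    \documentclass[sigconf]{acmart}

    \begin{document}

    \title{Your Project Title}
    \author{Student One}
    \affiliation{\institution{Ontario Tech University}\country{Canada}}

    \begin{abstract}
    One paragraph summarizing the problem, method, and findings.
    \end{abstract}

    \maketitle

    \section{Introduction}
    \section{Related Work}
    \section{Method}
    \section{Experiments}
    \section{Conclusion}

    \bibliographystyle{ACM-Reference-Format}
    \bibliography{references}

    \end{document}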
Alternatively, you can use the following template (from “Tech Report à la MIT AI Lab (1981)”):
We will use the following textbook to cover the fundamentals needed to understand the material in this course.