Online quality assessment of human movements from Kinect skeleton data

The objective of this project is to evaluate the quality of human movements from visual information, a capability with applications in a broad range of areas, from diagnosis and rehabilitation to movement optimisation in sports science. Observed movements are compared with a model of normal movement, and the amount of deviation from normality is quantified automatically.

Description of the proposed method

The figure below illustrates the pipeline of our proposed method.

 

Figure 1: Pipeline of the proposed method

 

Skeleton extraction

Figure 2: Example of skeleton extracted from a depth map

We use a Kinect camera, which measures distances and provides a depth map of the scene (see Fig. 2) rather than a classic RGB image. A skeleton tracker [1] uses this depth map to fit a skeleton to the person being filmed. We then normalise the skeleton to compensate for differences in subject height. This normalised skeleton is the basis of our movement analysis technique.
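
To make the normalisation concrete, below is a minimal sketch assuming the tracker returns the 15 joints as a (15, 3) NumPy array; the joint indices and the choice of torso length as the scale factor are illustrative assumptions, not the exact scheme of our system.

```python
import numpy as np

def normalise_skeleton(joints, torso_idx=1, neck_idx=0):
    """Scale a skeleton so that subjects of different heights are comparable.

    joints: (15, 3) array of joint coordinates from the tracker.
    torso_idx, neck_idx: hypothetical indices of the torso and neck joints.
    """
    # Centre the skeleton on the torso joint.
    centred = joints - joints[torso_idx]
    # Use the neck-to-torso distance as a height proxy for the scale factor.
    scale = np.linalg.norm(centred[neck_idx])
    return centred / scale  # coordinates in units of "torso length"
```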

Robust dimensionality reduction

A skeleton contains 15 joints, forming a vector of 45 coordinates. Such a vector has a rather high dimensionality and contains redundant information. We use a manifold learning method, Diffusion Maps [2], to reduce the dimensionality and extract the significant information from this skeleton.

Skeletons extracted from depth maps tend to suffer from high levels of noise and outliers. We therefore modify the original Diffusion Maps [2] with the extension that Gerber et al. [3] proposed for handling outliers in Laplacian Eigenmaps, another manifold learning method.
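
For illustration, here is a minimal sketch of the standard Diffusion Maps embedding [2]; the robust extension of [3] is omitted for brevity, and the kernel bandwidth and output dimensionality are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def diffusion_maps(X, n_dims=3, eps=1.0):
    """Embed N normalised skeletons X (N x 45) into n_dims dimensions."""
    # Gaussian affinities between all pairs of skeletons.
    W = np.exp(-cdist(X, X, 'sqeuclidean') / eps)
    d = W.sum(axis=1)
    # The symmetric matrix S = D^{-1/2} W D^{-1/2} shares its eigenvalues
    # with the Markov transition matrix P = D^{-1} W.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = eigh(S)
    # Keep the largest non-trivial eigenpairs and recover P's eigenvectors.
    order = np.argsort(vals)[::-1][1:n_dims + 1]
    psi = vecs[:, order] / np.sqrt(d)[:, None]
    return psi * vals[order]  # diffusion coordinates, one row per skeleton
```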

Our manifold provides us with a new representation \(\mathbf{Y}\) of the pose, derived from the normalised skeleton, with a much lower dimensionality (typically 3 dimensions instead of the initial 45) and significantly less noise and fewer outliers. We use this new pose feature \(\mathbf{Y}\) to assess the quality of the movement.

Assessment against a statistical model of normal movement

We build two statistical models from our new pose feature, which describe respectively normal poses and normal dynamics. These models are represented by probability density functions (pdf) which are learnt, using Parzen window estimators, from training sequences that contain only normal instances of the movement.

The pose model takes the form of the pdf \(f_{Y}\left(y\right)\) of a random variable \(Y\) whose value is our pose feature vector, \(y=\mathbf{Y}\). The quality of a new pose \(y_t\) at frame \(t\) is then assessed as its log-likelihood under the pose model, i.e. $$\mbox{llh}_{\mbox{pose}}= \log f_{Y}\left(y_t\right) \; .$$
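
As an illustration, the sketch below builds such a pose model with SciPy's gaussian_kde, a Parzen window estimator with Gaussian kernels; the training data, test pose and threshold value are synthetic placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
Y_train = rng.normal(size=(500, 3))   # stand-in for embedded normal poses
y_new = np.array([0.1, -0.2, 0.3])    # stand-in for a new pose feature

# Parzen window (kernel density) estimate of f_Y; gaussian_kde expects
# the data as (dims, N).
pose_model = gaussian_kde(Y_train.T)

def llh_pose(y_t):
    """Log-likelihood of a new pose under the normal-pose model."""
    return float(np.log(pose_model(np.atleast_2d(y_t).T)))

# A pose is flagged as abnormal when the log-likelihood drops below an
# empirically chosen threshold (the value here is illustrative).
is_abnormal = llh_pose(y_new) < -8.0
```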

The dynamics model is represented by the pdf \(f_{Y_t}\left(y_t|y_1,\ldots,y_{t-1}\right)\), which describes the likelihood of a pose \(y_t\) at a new frame \(t\) given the poses at the previous frames. In order to compute it, we introduce \(X_t\) with value \(x_t \in \left[0,1\right]\), the stage of the (periodic or non-periodic) movement at frame \(t\). Note that, in the case of periodic movements, this movement stage can also be seen as the phase of the movement’s cycle. Based on Markovian assumptions, we find that $$ f_{Y_t}\left(y_t|y_1,\ldots,y_{t-1}\right) \approx f_{Y_t}\left(y_t|\hat{x}_t\right) f_{X_t}\left(\hat{x}_t|\hat{x}_{t-1}\right) \; ,$$ with \(\hat{x}_t\) an approximation of \(x_t\) that maximises the posterior \(f_{\left\{X_0,\ldots,X_t\right\}}\left(x_0,\ldots,x_t|y_1,\ldots,y_t\right)\). \(f_{Y_t}\left(y_t|x_t\right)\) is learnt from training sequences using Parzen window estimation, while \(f_{X_t}\left(x_t|x_{t-1}\right)\) is set analytically so that \(x_t\) evolves steadily during a movement. The dynamics quality is then assessed as the log-likelihood of the model describing a sequence of poses within a window of size \(\omega\): $$\mbox{llh}_{\mbox{dyn}} \approx \frac{1}{\omega} \sum_{i=t-\omega+1}^{t} \log\left( f_{Y_i}\left(y_i|x_i\right) f_{X_i}\left(x_i|x_{i-1}\right) \right)\; .$$
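
To make the windowed score concrete, here is a sketch assuming a learnt conditional pose density (e.g. a Parzen window estimator, as above) and an illustrative Gaussian stage-transition density; neither function is the exact form used in our system.

```python
import numpy as np

def f_x_given_x(x, x_prev, step=0.02, sigma=0.01):
    """Illustrative analytic transition density: the movement stage is
    expected to advance steadily by `step` per frame (values made up)."""
    return np.exp(-(x - x_prev - step) ** 2 / (2 * sigma ** 2)) \
        / (sigma * np.sqrt(2 * np.pi))

def llh_dyn(y_win, x_win, x_before, f_pose, f_stage):
    """Windowed dynamics log-likelihood from the equation above.

    y_win, x_win: the omega poses and estimated stages in the window;
    x_before: the estimated stage at the frame just before the window;
    f_pose(y, x): learnt density of pose y at stage x (Parzen window);
    f_stage(x, x_prev): stage-transition density, e.g. f_x_given_x above.
    """
    x_prev = np.concatenate([[x_before], x_win[:-1]])
    terms = [np.log(f_pose(y, x) * f_stage(x, xp))
             for y, x, xp in zip(y_win, x_win, x_prev)]
    return np.mean(terms)  # equals (1/omega) * sum of the log terms
```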

Two empirically determined thresholds on the two log-likelihoods are used to classify the gait as normal or abnormal. Thresholds on the derivatives of the log-likelihoods allow the detections of abnormalities, and of returns to normality, to be refined.

Results

Gait on stairs

In order to analyse the quality of gait of subjects walking up stairs, we build our model of normal movement using 17 training sequences from 6 healthy subjects with no injury or disability, from which we extract 42 gait cycles.

We first demonstrate the ability of our model to generalise to the gait of new subjects by evaluating the 13 normal gait sequences of 6 new subjects. As illustrated in Figs. 3 and 4, the normal gaits of new persons are well represented by the model, with the two likelihoods (middle and bottom rows) staying above the thresholds (dotted lines). In only one sequence out of the 13 did the likelihood drop slightly under the threshold (frames 45–47 of Fig. 4), due to particularly noisy skeletons.

Figure 3: Example 1 of normal gait – The model of normal movement represents the gait of a new subject well, with the two likelihoods (middle and bottom rows) staying above the thresholds (dotted lines). Green: Normal, Red: Abnormal.

Figure 4: Example 2 of normal gait – In frames 45–47, a particularly noisy skeleton leads to the likelihood dropping slightly under the thresholds; as a result, this part of the gait is classified as abnormal. Green: Normal, Red: Abnormal.

Next, we apply our proposed method to three types of abnormal gaits:

  • “Left leg Lead” (LL) abnormal gait: the subjects always lead with their left leg when stepping up to the next step (illustrated in Fig. 5).
  • “Right leg Lead” (RL) abnormal gait: the subjects always lead with their right leg when stepping up to the next step (illustrated in Fig. 6).
  • “Freeze of Gait” (FoG): the subjects freeze at some stage of the movement (illustrated in Fig. 7).

In all three cases, the subject's pose is always normal, but the dynamics are affected either by the use of the unexpected leg (LL and RL) or by the (temporary) complete stop of the movement (FoG).

In our tests, our method detects these abnormal events at a rate of 0.93, with the likelihood dropping at all but 2 gait cycles in each of the LL and RL cases, and during the stops in the FoG case. Table 1 summarises the detection rates of abnormal events by our method.

Figure 5: Example of “Left leg Lead” abnormal gait – Every time the subject uses the unexpected leg, the movement’s stage stops evolving steadily and the dynamics likelihood (bottom row) drops below its threshold (dotted line). Green: Normal, Red: Abnormal, Blue: Refined detection of normal, Orange: Refined detection of abnormal. Manual detections are presented as shaded blue areas.

Figure 6: Example of “Right leg Lead” abnormal gait – Every time the subject uses the unexpected leg, the movement’s stage stops evolving steadily and the dynamics likelihood (bottom row) drops below its threshold (dotted line). Green: Normal, Red: Abnormal, Blue: Refined detection of normal, Orange: Refined detection of abnormal. Manual detections are presented as shaded blue areas.

Figure 7: Example of “Freeze of Gait” – The subject freezes twice during the sequence, so the movement’s stage stops evolving at these times and the dynamics likelihood drops dramatically. Green: Normal, Red: Abnormal, Blue: Refined detection of normal, Orange: Refined detection of abnormal. Manual detections are presented as shaded blue areas.
Table 1: Results on detection of abnormal events

Type of event   Number of occurrences   False Positives   True Positives   False Negatives   Proportion missed
LL              21                      0                 19                2                 0.10
RL              25                      0                 23                2                 0.08
FoG             12                      2                 12                0                 0.00
All             58                      2                 54                4                 0.07

Sitting and standing

We also apply our proposed method to the analysis of sitting and standing movements. Two separate (bi-component) models are built, representing sitting and standing movements respectively. They run concurrently, and their analyses are triggered when their respective starting conditions are detected. We use a very simple starting condition: the first coordinate of \(\mathbf{Y}\) stays at its starting value for a few frames and then deviates. Our stopping condition is similar.
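
A minimal sketch of such a starting-condition test is given below; the tolerance and hold duration are illustrative values, not those used in our experiments.

```python
import numpy as np

def movement_started(y1_history, tol=0.05, hold=10):
    """Detect the starting condition described above: the first coordinate
    of Y stays near its initial value for `hold` frames, then deviates by
    more than `tol` (both values are illustrative).

    y1_history: first coordinate of Y at every frame so far.
    """
    y1 = np.asarray(y1_history)
    if len(y1) <= hold:
        return False
    held = np.all(np.abs(y1[:hold] - y1[0]) < tol)
    return held and abs(y1[-1] - y1[0]) > tol
```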

For our experiments, a qualified physiotherapist simulates abnormal sitting and standing movements, such as a loss of balance while standing up that leads to an exaggerated inclination of the torso, as illustrated in Figs. 9 and 10.

Figure 8: Example of normal sitting and standing movements – The two sitting and standing models are used iteratively and are triggered automatically when their starting conditions are detected.

Figure 9: Example of abnormal standing movement – The subject loses their balance and leans forward. Green: Normal, Red: Abnormal, Orange: Refined detection of abnormal.

Figure 10: Example of difficult standing movement – The subject fails on their first attempt to stand up. This failure is detected and the tracking stops. It resumes on the second attempt, and detects the torso leaning forward exaggeratedly. Green: Normal, Red: Abnormal, Orange: Refined detection of abnormal.

 

Sport boxing

We analyse boxing movements consisting of a cross left punch (a straight punch thrown from the back hand in a southpaw stance) and a return to the initial position. We use the same strategy as for sitting and standing movements, with two separate models that are triggered iteratively and automatically when their respective starting conditions are observed.

In our testing sequence, the subject alternates between 3 normal and 3 abnormal punches. A different type of abnormality, each a typical beginner mistake, is simulated in each set of 3 abnormal punches. The results, presented in Fig. 11, show that, as in the previous experiments, abnormal movements are correctly detected, as are returns to normality. Note that in this experiment most abnormal movements are due to a wrong pose of the subject and therefore trigger strong responses from the pose model. The level of abnormality can also be quantified through the variations of \(\mbox{llh}_{\mbox{pose}}\) and \(\mbox{llh}_{\mbox{dyn}}\), which correspond to different amplitudes of pose mistakes. For example, non-rotating hips (first 2 sets of anomalies) affect the whole body and thus trigger a stronger response than a punching elbow held too high (fourth set of anomalies).

Figure 11: Example of analysis of sport movements: cross left punch in boxing.

Publications and datasets

Our proposed method for assessing movement quality is presented in our paper “Online quality assessment of human movement from skeleton data”, published at BMVC 2014.

The dataset used in this article can be downloaded in full (depth videos + skeleton) here, and a lighter version with skeletons only here. It may be used on condition of citing our paper and the SPHERE project.

References

[1] OpenNI skeleton tracker. URL: http://www.openni.org/documentation.

[2] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[3] S. Gerber, T. Tasdizen, and R. Whitaker. Robust non-linear dimensionality reduction using successive 1-dimensional Laplacian eigenmaps. In Proceedings of the 24th International Conference on Machine Learning, pages 281–288. ACM, 2007.

Automated Driver Assistance Systems

Real Time Detection and Recognition of Road Traffic Signs

Researchers

Dr. Jack Greenhalgh and Prof. Majid Mirmehdi

Overview

We researched the automatic detection and recognition of text in traffic signs. Scene structure is used to define search regions within the image, in which traffic sign candidates are then found. Maximally stable extremal regions (MSERs) and hue-saturation-value (HSV) colour thresholding are used to locate a large number of candidates, which are then reduced by applying constraints based on temporal and structural information. A recognition stage interprets the text contained within the detected candidate regions. Individual text characters are detected as MSERs and grouped into lines, before being interpreted using optical character recognition (OCR). Recognition accuracy is greatly improved through the temporal fusion of text results across consecutive frames.
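
The sketch below illustrates the detection-plus-OCR idea, with OpenCV's MSER implementation and Tesseract standing in for the components of the published system; the HSV thresholds and OCR configuration are illustrative, not the tuned values used in the papers.

```python
import cv2
import pytesseract  # requires a local Tesseract installation

def detect_text_candidates(frame_bgr):
    """Propose candidate sign regions from MSERs and an HSV colour mask."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, boxes = cv2.MSER_create().detectRegions(gray)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Illustrative mask for light sign backgrounds; real thresholds would
    # be tuned per sign type.
    mask = cv2.inRange(hsv, (0, 0, 120), (180, 60, 255))
    # Keep boxes whose centre falls on the colour mask.
    return [b for b in boxes
            if mask[b[1] + b[3] // 2, b[0] + b[2] // 2] > 0]

def read_sign(frame_bgr, box):
    """OCR on one candidate region (--psm 7 treats it as a single line)."""
    x, y, w, h = box
    return pytesseract.image_to_string(frame_bgr[y:y + h, x:x + w],
                                       config='--psm 7').strip()
```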

Publications

Jack Greenhalgh, Majid Mirmehdi, Detection and Recognition of Painted Road Markings. 4th International Conference on Pattern Recognition Applications and Methods, January 2015, Lisbon, Portugal. [pdf]

Jack Greenhalgh, Majid Mirmehdi, Recognizing Text-Based Traffic Signs. IEEE Transactions on Intelligent Transportation Systems, 16(3), 1360-1369, 2015. [pdf]

Jack Greenhalgh, Majid Mirmehdi, Automatic Detection and Recognition of Symbols and Text on the Road Surface. Pattern Recognition: Applications and Methods, 124-140, 2015.

Jack Greenhalgh, Majid Mirmehdi, Real Time Detection and Recognition of Road Traffic Signs. IEEE Transactions on Intelligent Transportation Systems, 13(4), 1498-1506, December 2012. [pdf]

Jack Greenhalgh, Majid Mirmehdi, Traffic Sign Recognition Using MSER and Random Forests. 20th European Signal Processing Conference, pages 1935-1939. EURASIP, August 2012, Bucharest, Romania. [pdf]

 

Data

Here is some data for the detection and recognition of text-based road signs. This dataset consists of 9 video sequences, with a total of 23,130 frames, at a resolution of 1920 × 1088 pixels. Calibration parameters for the camera used to capture the data are also provided.

https://www.cs.bris.ac.uk/home/greenhal/textdataset.html

Hardware-accelerated Video Fusion

This project aims to produce a low-power demonstrator for real-time video fusion using a hybrid SoC device that combines a low-power Cortex A9 multi-core processor with an FPGA fabric. The methodology uses a fusion algorithm developed at Bristol based on complex dual-tree wavelet transforms. These transforms work in forward and inverse modes, together with configurable fusion rules, to offer high-quality fusion output.
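
Below is a software sketch of this forward-transform / fusion-rule / inverse-transform pipeline, using the open-source dtcwt Python package in place of the FPGA accelerator; the maximum-magnitude rule is just one example of the configurable fusion rules mentioned above.

```python
import numpy as np
import dtcwt

def fuse(visible, thermal, nlevels=4):
    """Fuse two registered single-channel float images of the same size."""
    t = dtcwt.Transform2d()
    a = t.forward(visible, nlevels=nlevels)
    b = t.forward(thermal, nlevels=nlevels)
    # Example fusion rule: keep the complex coefficient with the larger
    # magnitude in every highpass subband, and average the lowpass bands.
    highs = tuple(np.where(np.abs(ha) >= np.abs(hb), ha, hb)
                  for ha, hb in zip(a.highpasses, b.highpasses))
    low = 0.5 * (a.lowpass + b.lowpass)
    return t.inverse(dtcwt.Pyramid(low, highs))
```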

The complex dual-tree wavelet transform represents around 70% of the application's total complexity. The wavelet accelerator designed at Bristol removes this bottleneck and speeds up the whole application by a factor of 4. It also has a significant positive impact on overall energy consumption. The increase in power is negligible because the fabric works in parallel with the main processor. Note that if the optimisation criterion is power rather than performance or energy, the processor and fabric could reduce their clock frequency and voltage to obtain a significant reduction in power at the same energy and performance levels.

This project has built a system extended with frame-capturing capabilities using thermal and visible-light cameras. You can see the system working in our labs at this link: hardware accelerated video fusion.

This project has been funded by the Technology Strategy Board under their energy-efficient computers program with Qioptiq Ltd as industrial collaborator.

This research will be presented and demonstrated at FPL 2015, London in September.

Video super-resolution

Motion-compensated video super-resolution is a technique that uses the sub-pixel shifts between multiple low-resolution images of the same scene to create higher-resolution frames with improved quality. An important concept is that, thanks to the sub-pixel displacements of picture elements across the low-resolution frames, it is possible to recover high-frequency content beyond the Nyquist limit of the sampling equipment. Super-resolution algorithms exploit the fact that, as objects move in front of the camera sensor, a picture element captured in one frame may not be visible in the next if its movement does not extend to the next pixel. Super-resolution algorithms track these additional picture elements and position them in the high-resolution frame. The resulting video quality is significantly improved compared with techniques that use only the information in a single low-resolution frame to create each high-resolution frame.
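
For intuition, here is a minimal shift-and-add sketch of this idea, assuming the sub-pixel shifts are already known; it is far simpler than the Bayesian method referenced below and only shows how sub-pixel displacements populate a finer grid.

```python
import numpy as np

def shift_and_add(frames, shifts, scale=2):
    """Naive super-resolution from low-resolution frames with known shifts.

    frames: list of (H, W) images of the same scene;
    shifts: per-frame sub-pixel (dy, dx) displacements.
    """
    H, W = frames[0].shape
    acc = np.zeros((H * scale, W * scale))
    cnt = np.zeros_like(acc)
    for f, (dy, dx) in zip(frames, shifts):
        # Map every low-resolution pixel to its sub-pixel position on the
        # high-resolution grid.
        ys = np.clip(np.round((np.arange(H)[:, None] + dy) * scale), 0,
                     H * scale - 1).astype(int)
        xs = np.clip(np.round((np.arange(W)[None, :] + dx) * scale), 0,
                     W * scale - 1).astype(int)
        np.add.at(acc, (ys, xs), f)
        np.add.at(cnt, (ys, xs), 1)
    cnt[cnt == 0] = 1  # empty cells stay zero; a real method interpolates
    return acc / cnt
```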

Super-resolution techniques can be applied in many areas, including intelligent personal identification, medical imaging, security and surveillance, and they are of special interest in applications that demand low-power, low-cost sensors. The key idea is that increasing the pixel size improves the signal-to-noise ratio and reduces the cost and power of the sensor. Larger pixels collect more light, and the blur introduced by diffraction is also reduced. Diffraction is a bigger issue with smaller pixels, so sensors with larger pixels perform better, giving sharper images with higher contrast in the fine details, especially in low-light conditions.

A further benefit is that increasing the pixel size means fewer pixels fit on the sensor, which reduces the sensor resolution. The low-resolution sensor then needs to process and transmit less information, resulting in lower power and cost. Super-resolution algorithms running at the receiver side can then recover high-quality, high-resolution video while maintaining a constant frame rate.

Overall, super-resolution enables the system that captures and transmits the video data to be based on low-power and low-cost components while the receiver still obtains a high-quality video stream.

This project has been sponsored by the Centre for Defence Enterprise and DSTL under the Generic Enablers for Low-Size, Weight, Power and Cost (SWAPC) Intelligence, Surveillance, Target Acquisition and Reconnaissance (ISTAR) program.

Click to see some examples:

1: car number plate, before and after super-resolution

2: vehicles, before and after super-resolution

and learn about the theory behind the algorithm: J. Chen, J. L. Nunez-Yanez and A. Achim, “Bayesian video super-resolution with heavy-tailed prior models”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, pp. 905-914, 2014.

Perceptual Quality Metrics (PVM)

RESEARCHERS

Dr. Fan (Aaron) Zhang

INVESTIGATOR

Prof. David Bull, Dr. Dimitris Agrafiotis and Dr. Roland Baddeley

DATES

2012-2015

FUNDING

ORSAS and EPSRC

SOURCE CODE 

PVM Matlab code: Download.

INTRODUCTION

It is known that the human visual system (HVS) employs independent processes to assess video quality at different distortion levels (distortion detection and artefact perception, often referred to as near-threshold and supra-threshold distortion perception). Visual masking effects also play an important role in video distortion perception, especially within spatial and temporal textures.

Figure: Algorithmic diagram for PVM.
It is well known that small differences in textured content can be tolerated by the HVS. In this work, we employ the dual-tree complex wavelet transform (DT-CWT) in conjunction with motion analysis to characterise this tolerance within spatial and temporal textures. The DT-CWT is particularly powerful in this context thanks to its shift invariance and orientation selectivity. In highly distorted, compressed video content, blurring is one of the most commonly occurring artefacts. Our approach detects it by comparing high-frequency subband coefficients of the reference and distorted frames, again facilitated by the DT-CWT. This measure is motion-weighted in order to simulate the tolerance of the HVS to blurring in content with high temporal activity. Inspired by the previous work of Chandler and Hemami, and of Larson and Chandler, thresholded differences (defined as noticeable distortion) and blurring artefacts are combined non-linearly using a modified geometric mean model, in which the proportion of each component is adaptively tuned. The performance of the proposed video metric is assessed and validated on the VQEG FRTV Phase I and LIVE video databases, showing clear improvements in correlation with subjective scores over existing metrics such as PSNR, SSIM, VIF, VSNR, VQM and MOVIE, and in many cases over STMAD.
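
To illustrate the final combination step, the sketch below applies a modified geometric mean to the two terms; in PVM the exponent is tuned adaptively, whereas a fixed value is used here purely for illustration.

```python
import numpy as np

def pvm_combine(noticeable_distortion, blur, alpha=0.8):
    """Combine the thresholded DT-CWT difference term and the
    motion-weighted blur term with a geometric-mean-style model.
    alpha is fixed here; PVM adapts the proportion of each component."""
    d = np.asarray(noticeable_distortion, dtype=float)
    b = np.asarray(blur, dtype=float)
    return d ** alpha * b ** (1.0 - alpha)
```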

RESULTS

Figure: Scatter plots of subjective DMOS versus different video metrics on the VQEG database.
Figure: Scatter plots of subjective DMOS versus different video metrics on the LIVE video database.

REFERENCE

  1. A Perception-based Hybrid Model for Video Quality Assessment. F. Zhang and D. Bull, IEEE T-CSVT, June 2016.
  2. Quality Assessment Methods for Perceptual Video Compression. F. Zhang and D. Bull, ICIP, Melbourne, Australia, September 2013.

 

Parametric Video Coding

RESEARCHERS

Dr. Fan (Aaron) Zhang

INVESTIGATOR

Prof. David Bull, Dr. Dimitris Agrafiotis and Dr. Roland Baddeley

DATES

2008-2015

FUNDING

ORSAS and EPSRC

INTRODUCTION

In most cases, the goal of video compression is to provide good subjective quality rather than simply to produce pictures that are as similar as possible to the originals. Based on this premise, it is possible to conceive of a compression scheme where an analysis/synthesis framework is employed rather than the conventional energy-minimisation approach. If such a scheme were practical, it could offer lower bitrates through reduced residual and motion vector coding, using a parametric approach to describe texture warping and/or synthesis.


Instead of encoding whole images or prediction residuals after translational motion estimation, our algorithm employs a perspective motion model to warp static textures and utilises texture synthesis to create dynamic textures. Texture regions are segmented using features derived from the complex wavelet transform and further classified according to their spatial and temporal characteristics. Moreover, a compatible artefact-based video metric (AVM) is proposed to evaluate the quality of the reconstructed video; it is also employed in-loop to prevent warping and synthesis artefacts. The proposed algorithm has been integrated into an H.264 video coding framework, and the results show significant bitrate savings of up to 60% compared with H.264 at the same objective quality (based on AVM) and subjective scores.
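
As a sketch of the static-texture path, the snippet below warps a region of the previous frame with a perspective (homography) motion model using OpenCV; this is an illustration of the idea, not the integrated H.264 implementation.

```python
import cv2
import numpy as np

def warp_static_texture(prev_frame, H_mat):
    """Re-create a static texture region by warping the previous frame with
    a perspective motion model, instead of coding a prediction residual."""
    h, w = prev_frame.shape[:2]
    return cv2.warpPerspective(prev_frame, H_mat, (w, h))

# The encoder only transmits the eight parameters of the homography for the
# region, e.g. estimated from tracked feature correspondences:
#   H_mat, _ = cv2.findHomography(pts_prev, pts_cur, cv2.RANSAC)
```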

RESULTS

REFERENCE

  1. Perception-oriented Video Coding based on Image Analysis and Completion: a Review. P. Ndjiki-Nya, D. Doshkov, H. Kaprykowsky, F. Zhang, D. Bull, T. Wiegand, Signal Processing: Image Communication, July 2012.
  2. A Parametric Framework For Video Compression Using Region-based Texture Models. F. Zhang and D. Bull, IEEE J-STSP, November 2011.