In-situ interactive model building

intmodThe current system allows online building of 3D wireframe models through a combination of user interaction and automated methods from a handheld camera-mouse. Crucially, the model being built is used to concurrently compute camera pose, permitting extendable tracking while enabling the user to edit the model interactively. In contrast to other model building methods that are either off-line and/or automated but computationally intensive, the aim here is to have a system that has low computational requirements and that enables the user to define what is relevant (and what is not) at the time the model is being built. OutlinAR hardware is also developed which simply consists of the combination of a camera with a wide field of view lens and a wheeled computer mouse.

 

Situation awareness through hand behaviour analysis

sitawareActivity recognition and event classification are of prime relevance to any intelligent system designed to assist on the move. There have been several systems aimed at the capturing of signals from a wearable computer with the aim of establishing a relationship between what is being perceived now and what should be happening. Assisting people is indeed one of the main championed potentials of wearable sensing and therefore of significant research interest.

Our work currently focuses on higher-level activity recognition that processes very low resolution motion images (160×120 pixels) to classify user manipulation activity. For this work, we base our test environment on supervised learning of the user’s behaviour from video sequences. The system observes interaction between the user’s hands and various objects, in various locations of the environment, from a wide angle shoulder-worn camera. The location and object being interacted with are indirectly deduced on the fly from the manipulation motions. Using this low-level visual information user activity is classified as one from a set of previously learned classes.

Towards Robust Real-time Visual SLAM

Our project investigates how to improve feature matching within a single camera Real-time Visual SLAM system. SLAM stands for Simultaneous Localisation and Mapping, when a camera position is estimated simultaneously with sparse point-wise representation of a surrounding environment. The camera is hand-held in our case, hence it is important to maintain camera track during or quickly recover after unpredicted and erratic motions. The range of scenarios we would like to deal with includes severe shake, partial or total occlusion and camera kidnapping.

One of the directions of our research is an adaptation of distinctive but in the same time robust image feature descriptors. These descriptors are the final stage of the Scale Invariant Feature Transform (SIFT). This descriptor forms a vector which describes a distribution of local image gradients through specially positioned orientation histograms. Such representation was inspired by advances in understanding the human vision system. In our implementation a scale selection is stochastically guided by the estimates from the SLAM filter. This allows to omit a relatively expensive scale invariant detector of the SIFT scheme.

When the camera is kidnapped or unable to perform any reliable measurement a special relocalisation mode kicks in. It attempts to find a new correct camera position by performing many-to-many feature search and use robust geometry verification procedure to ensure that a pose and found set of matches are in consensus. We investigate a way of speeding up the feature search by splitting the search space based on feature appearances.

The software based on our findings is incorporated into the Real-time Visual SLAM system which is used extensively within the Visual Information Laboratory.

 

Place Recognition From Disparate Views

Visual place recognition methods which use image matching techniques have shown success in recent years, however their reliance on local features restricts their use to images which are visually similar and which overlap in viewpoint. We suggest that a semantic approach to the problem would provide a more meaningful relationship between views of a place and so allow recognition when views are disparate and database coverage is sparse. As initial work towards this goal we present a system which uses detected objects as the basic feature and demonstrate promising ability to recognise places from arbitrary viewpoints. We build a 2D place model of object positions and extract features which characterise a pair of models. We then use distributions learned from training examples to compute the probability that the pair depict the same place and also an estimate of the relative pose of the cameras. Results on a dataset of 40 urban locations show good recognition performance and pose estimation, even for highly disparate views.

systemdiagramsmall-822x386

Notable Results

To assess the performance of our system, we collected a dataset of 40 locations, each with between 2 and 4 images from widely different viewpoints. Since we are simply learning distributions over comparisons of places, not about the places themselves, we decided to train the system on a subset of the test dataset to maximise use of the data. To verify that the results were not biased, we tried repeatedly training the system on a random 50% subset of the dataset and running the test again. We found that the learned probability distributions were very similar each iteration, and that the recognition performance did not change by more than about 2%.

A place recognition experiment was then performed. Each image from the dataset was compared against every other image to compute the the posterior probability that the images depict the same place. The table below states the performance of our system under several conditions. The “grouped” score is simply the percentage of test images for which an image from the same place was chosen as the most likely match, simulating a place recognition scenario in which we have made a small number of previous observations of each place. It is interesting however to consider a harder case in which, for each test image, there is only a single matching image in the database. The “pairwise” score simulates this situation by removing all but one of the matching images for each test image.

We also observed that some discriminative ability of the system is provided by the different object classes – so a place with objects of class “sign” and “bollard” cannot possibly match with a place containing only “traffic light” objects. Whilst this is a legitimate place recognition scenario, we wanted to observe the discriminative ability of the features alone. Thus, we also tested the system on a “restricted class” subset of the dataset with 30 locations, all of which contained the same two object classes, meaning that almost every image was capable of valid object correspondences with every other image. Clearly this is a harder case, however the table shows that performance was still reasonable.

Grouped Pairwise
Restricted class dataset 67.9% 54.5%
Full dataset 73.1% 61.8%
GIST (Oliva and Torralba, 2001) 19.2% 21.4%

ViewNet

Context enhanced networked services by fusing mobile vision and location

vnet

ViewNet is a 1.5M GBP project jointly funded by the UK Technology Strategy Board, the EPSRC and industrial partners. The aim is to develop the next generation of distributed localisation and user-assisted mapping systems, based on the fusion of multiple sensing technologies, including visual SLAM, inertial devices, UWB and GPS.

The target application is the rapid mapping and visualisation of previously unseen environments. It is a multidisciplinary collaboration between the University and a consortium of market leading technology companies and government agencies led by 3C Research. The project is being led by Andrew Calway and Walterio Mayol-Cuevas from the Computer Vision Group and Angela Doufexi and Mark Beach from the Centre for Communications Research in Electrical and Electronic Engineering.

More information