University of Bristol

# Weakly Supervised Learning of Visual Semantic Attributes

David Hanwell

Many of the adjectives we use in everyday language denote visual attributes of objects. Examples include colours (‘red’, ‘ruby’, ‘vermilion’), patterns (‘stripy’, ‘chequered’), and others (‘leopard skin’, ‘dark’). We refer to these as “visual semantic attributes” – they define associations between concepts of thought and language (semantic), and the visual appearance of objects in our physical world (visual).

Automating the recognition of such attributes in photos and video would allow rich textual descriptions to be generated, especially when combined with object detection. This would be useful for various applications: image archiving, generating descriptions for products from their photos (for example on eBay), and perhaps even surveillance (for example “find the man in the blue stripy shirt in these CCTV feeds”). Typically, models of such attributes are trained using image data which has been specifically chosen or acquired for the task, and has been manually labelled, annotated and segmented.

There are already billions of images available on the internet, with many more being added every day. Although these images have not been manually annotated or segmented, many of them have text associated with them; for example, items for sale on eBay have both an image and a text description. We call such images ‘weakly labelled’ – each has a corresponding set of words, some of which may describe some, all or none of the image content.

The question we are trying to answer is: Can we use weakly labelled images to train computer models to recognise visual semantic attributes?

There would be several potential benefits to this approach:

• Without the need to acquire bespoke images, or to manually label, annotate or segment them, the process of learning an attribute would be faster, easier and cheaper.
• Since the training data would be acquired from many different sources, and the corresponding text written by many different people, the resulting models would be unbiased by any one person’s notion of an attribute.
• Due to the huge range of images available, with different objects, lighting, and variations on each attribute, each model may better capture the possible variation of an attribute, and thus be better able to recognise it in novel images.

### Weakly Supervised Learning of Semantic Colour Terms

In order to take advantage of such easily obtainable data, there are a number of challenges which must be overcome. If we acquire a set of images from a web search, using an attribute term, not all of the returned images will contain the attribute, and even in those which do, there will be regions which do not. Here are some examples of images returned from a search for ‘Blue’:

We have proposed a method of learning colour terms from weakly labelled data such as this. Below are some examples of images from the test-set of [1], which have been segmented into regions labelled with semantic colour terms using the proposed method. This work has been accepted for publication in IET Computer Vision [2].

### QUAC: Quick Unsupervised Anisotropic Clustering

We would like to extend the above work on the weakly supervised learning of colour terms to more complex attributes using other features. To do this, we propose to first cluster the features extracted from each image, before performing weakly supervised learning using parameterisations of the clusters. This way the volume of data per image will be greatly reduced, allowing many more images to be used for learning.
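As a rough illustration of why cluster parameterisations reduce the data volume, the sketch below summarises a cluster of 2D feature vectors by its size, mean and covariance. The choice of mean and covariance as the parameterisation, and the names `ClusterSummary` and `summarise_cluster`, are assumptions made for this example; the text above does not specify which parameters are used.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class ClusterSummary:
    """Hypothetical compact description of one cluster of 2D features."""
    count: int
    mean: Point
    cov: Tuple[float, float, float]  # (var_x, cov_xy, var_y)


def summarise_cluster(points: Sequence[Point]) -> ClusterSummary:
    """Replace a cluster's points by six numbers: count, mean and covariance."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Unbiased sample covariance (divide by n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    return ClusterSummary(n, (mx, my), (sxx, sxy, syy))
```

Under this assumption, a cluster of any number of pixels is stored as six numbers, so the per-image cost depends on the number of clusters and not on the number of pixels.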

To this end, we propose QUAC – Quick Unsupervised Anisotropic Clustering. The algorithm works by finding elliptical (or hyper-elliptical) clusters one at a time, removing the data corresponding to each cluster after it is found. It has several advantages over other clustering algorithms:

• It does not assume isotropy, and is capable of finding highly elongated clusters.
• It is unsupervised, having only a single parameter, and not requiring the number of clusters to be set.
• Its complexity is linear in the number of data points, resulting in good performance on larger data sets.
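The general idea of finding elliptical clusters one at a time and removing their data can be sketched as follows. This is not the published QUAC algorithm: it is a simplified greedy scheme written for illustration, restricted to 2D data, and it neither has a single parameter nor runs in linear time (it sorts the remaining points for each cluster). The function names and the parameters `radius`, `seed_size`, `min_size` and `max_iter` are all assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[float, float, float]  # (var_x, cov_xy, var_y)


def fit_gaussian(points: Sequence[Point]) -> Tuple[Point, Cov]:
    """Mean and slightly regularised covariance of a set of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # The small constant keeps the covariance invertible for collinear points.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n + 1e-9
    syy = sum((y - my) ** 2 for _, y in points) / n + 1e-9
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(p: Point, mean: Point, cov: Cov) -> float:
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def greedy_elliptical_clusters(
    points: Sequence[Point],
    radius: float = 3.0,   # ellipse size in standard deviations
    seed_size: int = 20,   # neighbours used to initialise each ellipse
    min_size: int = 10,    # smaller groups are discarded as noise
    max_iter: int = 50,
) -> List[List[Point]]:
    remaining = list(points)
    clusters: List[List[Point]] = []
    while len(remaining) >= seed_size:
        # Start from the first remaining point and its nearest neighbours.
        sx, sy = remaining[0]
        members = sorted(
            remaining, key=lambda p: (p[0] - sx) ** 2 + (p[1] - sy) ** 2
        )[:seed_size]
        # Alternate between fitting an ellipse and re-collecting the points
        # inside it, until the membership stops changing.
        for _ in range(max_iter):
            mean, cov = fit_gaussian(members)
            grown = [
                p for p in remaining
                if mahalanobis_sq(p, mean, cov) <= radius ** 2
            ]
            if not grown or grown == members:
                break
            members = grown
        if len(members) >= min_size:
            clusters.append(members)
        # Remove the data belonging to this cluster before finding the next.
        taken = set(members)
        remaining = [p for p in remaining if p not in taken]
    return clusters
```

Because membership is decided by Mahalanobis distance and not Euclidean distance, the fitted ellipse can stretch along an elongated cluster without absorbing nearby points that lie off its main axis.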

Below is an image of five pairs of differently coloured trousers against a white background. We have taken the Green and Blue channels of this image to create a 2D data set, also shown below. The five elongated clusters corresponding to the five pairs of trousers are clearly visible, as is the less elongated cluster in the bottom right corner of the feature space, corresponding to the white background. An animation shows how QUAC finds the clusters. This work has been accepted for publication in Pattern Recognition [3]. Source code is available below.
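Separately from the released source code, the construction of the 2D data set described above can be illustrated with a short sketch. It assumes the image is held as rows of `(red, green, blue)` tuples; the function name `green_blue_features` and this in-memory representation are assumptions of the example, not part of the published work.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue)


def green_blue_features(image: Sequence[Sequence[Pixel]]) -> List[Tuple[int, int]]:
    """Turn every pixel into a 2D feature vector (green, blue), dropping red."""
    return [(g, b) for row in image for (_r, g, b) in row]


# A tiny 2x2 image: two white background pixels and two coloured pixels.
tiny_image = [
    [(255, 255, 255), (10, 20, 30)],
    [(0, 128, 64), (255, 255, 255)],
]
features = green_blue_features(tiny_image)
```

Each pixel contributes one point to the feature space, so white background pixels all land at the same corner, (255, 255), of the Green-Blue plane.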