Available Thesis Topics



Object Detection

Example of company logo detection.

Introduction

Object detection is a discipline with a long and rich tradition in the computer vision community. At the Multimedia and Computer Vision Lab we are interested in applying these principles to the specific task of detecting company logos. This task draws considerable commercial interest, for example from companies performing market research.

This specific application of object detection has its own set of unique challenges: company logos are usually not the intended subject of a photograph and, with few exceptions, end up in the image by accident. As a result, company logo detection has to deal with many small object instances and a large variation in scale.

At the Multimedia and Computer Vision Lab, we use deep neural networks for this task, building our own algorithms on top of object detection pipelines such as Faster R-CNN and SSD.


Available Topics

Feature Pyramid Networks with SSD

Most network architectures consist of a series of learnable convolutions with intermediate downsampling (pooling) layers. This results in a hierarchy of features of different resolutions.

SSD starts with a dense grid of candidate boxes of multiple scales and aspect ratios. For each candidate box, SSD predicts class membership and performs a bounding box regression which allows deformations of these candidate boxes to better fit an object. These predictions are based on a single layer of this feature hierarchy suitable for the object size.
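
To make the anchor layout concrete, here is a minimal NumPy sketch of such a dense grid of candidate boxes for a single feature map. The function name and the exact parametrization are illustrative and not taken from the SSD reference implementation:

```python
import numpy as np

def anchor_grid(fmap_size, image_size, scale, aspect_ratios):
    """Dense grid of SSD-style candidate boxes (cx, cy, w, h) for one feature map."""
    step = image_size / fmap_size                  # stride of one feature-map cell
    centers = (np.arange(fmap_size) + 0.5) * step  # one box center per cell
    boxes = []
    for cy in centers:
        for cx in centers:
            for ar in aspect_ratios:               # one box shape per aspect ratio
                w = scale * np.sqrt(ar)
                h = scale / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

# e.g. a 38x38 feature map of a 300x300 input with three aspect ratios
anchors = anchor_grid(fmap_size=38, image_size=300.0, scale=30.0,
                      aspect_ratios=[1.0, 2.0, 0.5])
print(anchors.shape)  # (4332, 4) = 38 * 38 * 3 boxes
```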

We would like to use the information from multiple layers of this hierarchy for these predictions. In order to do that, we want to apply an approach called "Feature Pyramid Networks" (FPNs). FPNs introduce an additional top-down path which combines the information from multiple feature maps and have been successfully used in the context of the Faster R-CNN pipeline. Your task would be to investigate whether these ideas can also be applied to SSD.
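
The core idea of the top-down path can be sketched in a few lines of NumPy: a coarse feature map is upsampled and merged with the next finer one through a lateral 1x1 convolution. All shapes and names below are hypothetical and only meant to convey the mechanism:

```python
import numpy as np

def lateral(fmap, weights):
    """A 1x1 convolution is just a matrix multiply over the channel axis:
    (H, W, C_in) -> (H, W, C_out)."""
    return fmap @ weights

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling along both spatial axes."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

# Hypothetical backbone outputs: finer maps have fewer channels
c3 = np.random.randn(38, 38, 256)   # fine feature map
c4 = np.random.randn(19, 19, 512)   # coarse feature map

d = 256                             # common channel depth of the pyramid
w3 = np.random.randn(256, d)
w4 = np.random.randn(512, d)

# Top-down path: start at the coarsest level, upsample, add the lateral connection
p4 = lateral(c4, w4)
p3 = lateral(c3, w3) + upsample2x(p4)  # (38, 38, 256), combines both levels
```

The SSD prediction layers would then operate on the merged maps (p3, p4, ...) instead of the raw backbone outputs.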

Your tasks will be:

  1. Familiarize yourself with the Caffe Deep Learning Framework and SSD
  2. Bring FPNs into an existing SSD implementation (provided by us)
  3. Train and evaluate your implementation on the FlickrLogos-47 dataset
  4. Compare and analyze performance, runtime, and memory usage against the non-FPN approach

Basic knowledge of C++ is advantageous since both the Caffe Framework and our SSD implementation are based on it.

If you are interested and want more information, please contact Christian Eggert.

Class-specific bounding box regression in SSD

SSD starts with a dense grid of candidate boxes (often called anchors) of multiple scales and aspect ratios. For each candidate box, SSD predicts class membership and performs a bounding box regression which allows deformations of these candidate boxes to better fit an object.

Currently, this regression is learned on a per-anchor basis: for every combination of scale and aspect ratio, a different regression is learned. For this topic, we would like to extend the bounding box regression to a per-anchor and per-class basis.
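
The change mainly affects the shape of the prediction head. Here is a short sketch of the difference, with illustrative sizes (the class count of 48 assumes the FlickrLogos-47 classes plus a background class, which is our assumption, not a specification):

```python
import numpy as np

num_anchors = 6      # anchor shapes per feature-map cell (illustrative)
num_classes = 48     # assumption: FlickrLogos-47 classes + background
H, W = 19, 19        # spatial size of one feature map

# Current SSD head: one regression shared by all classes, per anchor shape
shared_reg = np.zeros((H, W, num_anchors * 4))

# Proposed head: an independent regression per anchor shape *and* per class
per_class_reg = np.zeros((H, W, num_anchors * num_classes * 4))

# At test time, the offsets are looked up with the predicted class
reg = per_class_reg.reshape(H, W, num_anchors, num_classes, 4)
offsets = reg[5, 7, 2, 13]  # cell (5, 7), anchor 2, class 13 -> (dx, dy, dw, dh)
```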

Your tasks will be:

  1. Familiarize yourself with the Caffe Deep Learning Framework and SSD
  2. Extend our existing SSD implementation (provided by us) with the capability to predict independent bounding box regression for each anchor and each class
  3. Train and evaluate your implementation on the FlickrLogos-47 dataset
  4. Compare and analyze performance, runtime, and memory usage against the class-independent approach

Solid knowledge of C++ is required since you will be extending an existing codebase.

If you are interested and want more information, please contact Christian Eggert.

Image Captioning

Image Captioning in computer vision. Image taken from [1].

Introduction

Generating captions that describe the content of an image is an emerging task in computer vision. In recent years, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. In contrast to traditional tasks like image classification or object detection, this task is more challenging: a model not only needs to identify a main class but also needs to recognize relationships between objects and describe them in a natural language like English. Recently, an encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.
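
To give an impression of the encoder/decoder structure, here is a minimal Keras-style sketch in the spirit of [1]. It is not the reference Show and Tell implementation, and all sizes are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, lstm_units = 10000, 512, 512  # illustrative sizes

# Encoder: a pretrained CNN pooled to one feature vector per image
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
image = tf.keras.Input(shape=(299, 299, 3))
img_embed = layers.Dense(embed_dim)(cnn(image))        # project into embedding space
img_embed = layers.Reshape((1, embed_dim))(img_embed)  # treat the image as first token

# Decoder: an LSTM that continues the sequence word by word
caption = tf.keras.Input(shape=(None,), dtype="int32")
words = layers.Embedding(vocab_size, embed_dim)(caption)
sequence = layers.Concatenate(axis=1)([img_embed, words])
states = layers.LSTM(lstm_units, return_sequences=True)(sequence)
logits = layers.TimeDistributed(layers.Dense(vocab_size))(states)

# Trained with cross-entropy against the next ground-truth word at each step
model = tf.keras.Model([image, caption], logits)
```

At inference time, the generated words are fed back into the decoder one at a time until an end-of-sentence token is produced.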

Available Topics

Generating Image Captions for Unknown Objects

Figure 1: Examples of generated captions (LRCN) and improved sentences (LSTM-C) obtained with a copying mechanism that replaces unknown words in the caption with detected objects. Figure taken from [2].

As seen in [1], deep neural networks are capable of generating simple captions for given images when trained on a large dataset in the order of a million labeled image-caption pairs. However, like every data-driven machine learning approach, these models cannot generate captions for objects they have never seen before, i.e., objects for which the dataset contains neither images nor captions. Training such a model takes days to weeks on modern GPUs, yet one may require that the model can describe new object categories afterwards with little to no extra effort.

Yao et al. [2] use a hybrid model that combines an object detection model with a captioning model. In figure 1 you can see a picture and its ground-truth caption. A standard captioning model (LRCN) may produce a caption that is not accurate (e.g. "a red fire hydrant is parked in the middle of a city"). Combined with an object detection pipeline, which in this case detects a bus, the hybrid model can generate a much more accurate caption: "a bus driving down a street next to a building".
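
The following toy example sketches the effect of such a copying mechanism for a single decoding step. The gate is fixed here, whereas in [2] the balance between generating and copying is learned; none of the names or numbers come from the paper:

```python
import numpy as np

vocab = ["a", "bus", "fire", "hydrant", "street", "driving"]
gen_probs = np.array([0.30, 0.05, 0.25, 0.20, 0.15, 0.05])  # decoder softmax (made up)

# Objects found by the detector in this image, with their confidences (made up)
detected = {"bus": 0.9, "street": 0.6}
copy_scores = np.array([detected.get(w, 0.0) for w in vocab])
copy_probs = copy_scores / copy_scores.sum()

# A gate decides between generating from the vocabulary and copying a detected
# object word; here it is a fixed constant for illustration
p_copy = 0.5
final = (1 - p_copy) * gen_probs + p_copy * copy_probs
print(vocab[int(final.argmax())])  # "bus" now beats the hallucinated "fire"
```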

Your tasks will be:

  1. Familiarize yourself with the Tensorflow Framework and the Show and Tell model
  2. Integrate the LSTM copying mechanism introduced in [2] into a given Tensorflow Model (Show and Tell)
  3. Train and evaluate your implementation on the MSCOCO dataset, holding out 8 object classes from the dataset
  4. Compare and analyze the performance of your implementation against the results reported in [2].

Python and Numpy knowledge is advantageous as Tensorflow models are implemented in Python.

If you are interested and want more information, please contact Philipp Harzig.

Visual Question Answering

Figure 2: Examples of question/image input pairs and generated answers. Picture taken from [3].

Building on top of general image captioning, another, more challenging task has recently emerged in computer vision: visual question answering (VQA). Here, a question referencing some of the input image's contents is part of the input, and the model tries to answer the question as accurately as possible. See figure 2 for image-question-answer examples. Most publications have converged on one approach to this problem: the question and the image are both embedded into vector representations, the two representations are combined in some way, and the answer is selected as the most likely out of 3,000 to 5,000 possible answers. The problem is thus modeled as a classification task, i.e., every possible answer is assigned a probability and the most likely one is chosen.
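
A minimal Keras-style sketch of this embed-combine-classify pattern, using element-wise multiplication as one common fusion choice. All sizes are assumptions, not values from [3]:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, num_answers = 15000, 3000  # assumptions for the sketch

image_feat = tf.keras.Input(shape=(2048,))  # precomputed CNN features, e.g. ResNet pooling
question = tf.keras.Input(shape=(None,), dtype="int32")

q = layers.Embedding(vocab_size, 300)(question)
q = layers.LSTM(512)(q)                               # question embedding
v = layers.Dense(512, activation="tanh")(image_feat)  # image embedding of the same size

joint = layers.Multiply()([q, v])          # element-wise fusion of both representations
logits = layers.Dense(num_answers)(joint)  # classification over the fixed answer set

model = tf.keras.Model([image_feat, question], logits)
```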

Your tasks will be:

  1. Familiarize yourself with the Tensorflow Framework and a classification pipeline already implemented in Tensorflow (e.g. the Slim Framework)
  2. Implement a model that embeds an image with a deep convolutional neural network (DCNN) like the ResNet or Inception network and a question with an LSTM network. This model should combine both representations and use a classification model that selects the most probable answer for this combination.
  3. Train and evaluate your implementation on the VQA-v2 dataset.
  4. Compare and analyze the performance vs. state-of-the-art approaches.

Python and Numpy knowledge is advantageous as Tensorflow models are implemented in Python. If you are interested and want more information, please contact Philipp Harzig.

Literature for Image Captioning

[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[2] Yao, Ting, et al. "Incorporating copying mechanism in image captioning for learning novel objects." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

[3] Teney, Damien, et al. "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge." arXiv preprint arXiv:1708.02711 (2017).



Computer Aided Diagnosis

Introduction

Computer Aided Diagnosis (CADx) helps doctors interpret medical images such as two-dimensional X-rays, MRI scans, CT scans, and ultrasound images. CADx has made major progress thanks to the big advances in computer vision in recent years. Tasks range from image preprocessing, segmentation, detection, and classification to the diagnosis of diseases.

Available Topics

Pneumonia Diagnosis

Example of pneumonia diagnosis on a chest X-ray.

The National Institutes of Health (NIH) has released a dataset of 112,120 frontal-view thorax X-rays of 30,805 unique patients. The images were labeled with up to 14 thoracic diseases. The disease labels were generated by automated text processing using a combination of machine learning algorithms and rule-based approaches. One of the labeled diseases is pneumonia, an inflammatory condition of the lung caused by infection with viruses or bacteria. It affects about 7% of the worldwide population and results in about 4 million deaths per year. According to the WHO, chest X-rays are the best method of detecting pneumonia. The diagnosis depends on expert radiologists, who are scarce in developing countries. Developing efficient CADx algorithms for pneumonia detection could therefore have a great positive impact on these countries.
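
As a rough starting point, a binary pneumonia classifier could look like the following Keras-style sketch. CheXNet [1] uses a DenseNet-121 backbone, which is reused here; the tasks below name the SLIM framework, so this sketch only illustrates the model structure, not the required tooling:

```python
import tensorflow as tf
from tensorflow.keras import layers

# ImageNet-pretrained DenseNet-121 backbone, pooled to one vector per X-ray
backbone = tf.keras.applications.DenseNet121(include_top=False, pooling="avg",
                                             input_shape=(224, 224, 3))
xray = tf.keras.Input(shape=(224, 224, 3))
prob = layers.Dense(1, activation="sigmoid")(backbone(xray))  # P(pneumonia | image)

model = tf.keras.Model(xray, prob)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```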

Your tasks will be:

  1. Familiarize yourself with the Tensorflow Framework and SLIM
  2. Implement a deep neural net using SLIM to detect pneumonia
  3. Train and evaluate your implementation on the ChestXray-NIHCC dataset
  4. Compare and analyze the performance of your implementation against the results reported in [1].

Python and Numpy knowledge is advantageous as Tensorflow models are implemented in Python. If you are interested and want more information, please contact Stephan Brehm.

Literature

[1] Rajpurkar, Pranav, et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017). https://arxiv.org/abs/1711.05225


Multilabel Chest X-ray Diagnosis

Example of multi-label chest X-ray diagnosis.

The National Institutes of Health (NIH) has released a dataset of 112,120 frontal-view thorax X-rays of 30,805 unique patients. The images were labeled with up to 14 thoracic diseases. The disease labels were generated by automated text processing using a combination of machine learning algorithms and rule-based approaches. The diseases range from atelectasis and cardiomegaly to pneumonia and pneumothorax. One X-ray can show multiple diseases, so you will be facing a multi-label, multi-class classification task.
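
Compared to a standard single-label classifier, the key difference lies in the output layer and the loss: one independent sigmoid per disease with a per-label cross-entropy, so that several diseases can be predicted for the same image. A minimal Keras-style sketch (the tasks below use SLIM; the backbone choice here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_DISEASES = 14  # one independent label per thoracic disease

backbone = tf.keras.applications.DenseNet121(include_top=False, pooling="avg",
                                             input_shape=(224, 224, 3))
xray = tf.keras.Input(shape=(224, 224, 3))
# Multi-label head: sigmoids instead of a single softmax, so the per-disease
# probabilities do not compete with each other
probs = layers.Dense(NUM_DISEASES, activation="sigmoid")(backbone(xray))

model = tf.keras.Model(xray, probs)
model.compile(optimizer="adam", loss="binary_crossentropy")  # per-label cross-entropy
```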

Your tasks will be:

  1. Familiarize yourself with the Tensorflow Framework and SLIM
  2. Implement a deep neural net using SLIM to detect all 14 diseases
  3. Train and evaluate your implementation on the ChestXray-NIHCC dataset

Python and Numpy knowledge is advantageous as Tensorflow models are implemented in Python. If you are interested and want more information, please contact Stephan Brehm.