Multimedia Computing and Computer Vision Lab


Research



Deep Image Captioning

Generating captions that describe the content of an image is an emerging task in computer vision. Lately, Recurrent Neural Networks (RNNs) in the form of Long Short-Term Memory (LSTM) networks have shown great success in generating captions that match an image's content. In contrast to traditional tasks like image classification or object detection, this task is more challenging: a model not only needs to identify a main class but also has to recognize relationships between objects and describe them in a natural language like English. Recently, an encoder/decoder network presented by Vinyals et al. [1] won the Microsoft COCO 2015 captioning challenge.
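
The following is a minimal sketch of this encoder/decoder idea in PyTorch; the architecture and all sizes are illustrative and deliberately simplified, not the model of Vinyals et al.:

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in encoder: any CNN producing one feature vector per image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed as the first "word" of the sequence.
        feat = self.encoder(images).unsqueeze(1)           # (N, 1, E)
        words = self.embed(captions[:, :-1])               # teacher forcing
        h, _ = self.lstm(torch.cat([feat, words], dim=1))  # (N, T, H)
        return self.out(h)                                 # word logits

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```

Training minimizes the cross-entropy between the predicted word distributions and the ground-truth caption; at test time the decoder emits one word at a time, e.g. greedily or via beam search.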

[1] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

Image Captioning of Branded Products

In a collaboration with the GfK Verein, we introduced a pipeline capable of automatically generating captions for images from social media. In particular, we look at images that contain an object which is related to a brand by depicting the brand's logo.

Test images from our dataset. Our model generates “a female hand holds a can of cocacola above a tiled floor.”, “a hand is holding a kinderriegel bar.”, “a hand is holding a can of heinz.”, and “a young woman is holding a nutella jar in front of her face.” for the top left, top right, bottom left, and bottom right image, respectively.

In this project, we focus on correctly identifying the brand contained in the image, whereas state-of-the-art models like that of Vinyals et al. [1] tend to produce rather generic descriptions. In contrast, we want our model to correctly mention the name of the brand contained in the image within the sentence. Simultaneously, we predict attributes that describe the involvement of the human with the brand, whether the branded product appears in a positive or negative context, and whether the interaction is functional or emotional.
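
A hedged sketch of how caption generation and attribute prediction can be trained jointly; the attribute head and the loss weighting shown here are illustrative stand-ins, not our exact training setup:

```python
import torch
import torch.nn.functional as F

def multitask_loss(word_logits, word_targets, attr_logits, attr_targets,
                   attr_weight=1.0):
    """Joint loss: caption cross-entropy plus attribute prediction.

    word_logits: (N, T, V), word_targets: (N, T) token ids.
    attr_logits/attr_targets: (N, A), one logit/label per binary attribute
    (e.g. positive/negative context, functional/emotional interaction).
    """
    caption_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        word_targets.reshape(-1),
        ignore_index=0,  # assume token id 0 is padding
    )
    attribute_loss = F.binary_cross_entropy_with_logits(
        attr_logits, attr_targets.float())
    return caption_loss + attr_weight * attribute_loss
```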

References:

Philipp Harzig, Stephan Brehm, Rainer Lienhart, Carolin Kaiser, René Schallner. Multimodal Image Captioning for Marketing Analysis. IEEE MIPR 2018, Miami, FL, USA, April 2018. [PDF]


For more information please contact Philipp Harzig.

Medical Image Captioning

Image captioning has also become popular for automatically generating doctor's reports for thorax x-ray images. Annotating chest x-rays is a tedious and time-consuming job that involves a lot of domain knowledge. In recent years, more and more approaches have been introduced that try to automatically generate paragraphs of text which read like a doctor's report. However, data is scarce, and annotations cannot be gathered as easily as for tasks like generic image captioning or image classification, because domain experts are needed to write a textual impression of a patient's chest x-ray. In addition, real medical data has to conform to privacy laws and must therefore be anonymized. The only publicly available dataset that combines chest x-ray images with doctor's reports contains just 7,470 samples, of which only about half have a unique doctor's report (there are mostly two chest x-ray images showing different views per report).

Two examples from the Indiana University Chest X-Ray collection. The upper row shows a normal case without findings, while the bottom row shows a case with findings. We highlighted the sentences with our human abnormality annotation, i.e., normal sentences are highlighted in blue and abnormal sentences in green.

In our research, we focus on correctly identifying abnormalities, as sentences describing abnormalities are very rare. We want to improve captioning quality with respect to the correct identification of abnormalities, not with respect to a machine translation metric like BLEU.
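
As a toy illustration of what an abnormality-centered evaluation could look like (a deliberately simple stand-in, not our actual criterion):

```python
def abnormality_recall(generated, references):
    """Fraction of abnormal reference sentences covered by the generated report.

    references: per report, a list of (sentence, is_abnormal) pairs, using
    the sentence-level abnormality annotation described above. A reference
    sentence counts as found if most of its words appear in the generated
    report -- a crude matching rule chosen purely for illustration.
    """
    hit = total = 0
    for gen_report, ref_report in zip(generated, references):
        gen_words = set(gen_report.lower().split())
        for sentence, abnormal in ref_report:
            if not abnormal:
                continue
            total += 1
            words = set(sentence.lower().split())
            if len(words & gen_words) >= 0.5 * len(words):
                hit += 1
    return hit / max(total, 1)
```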


For more information please contact Philipp Harzig.

Visual Question Answering

Building on general image captioning, another, more challenging task has recently emerged in computer vision: visual question answering (VQA). Here, a question referencing some of the input image's contents is part of the input, and the model tries to answer the question as accurately as possible. Most publications have agreed on one approach to tackle this problem: the question and image are both embedded in a vector representation, then combined in some way, and the answer is selected as the most likely one out of 3,000 to 5,000 possible answers. The problem is thus modeled as a classification problem, i.e., all possible answers are assigned a probability and the most likely answer is selected. In our research, we work on a model that does not rely on answering the question from a predefined answer set, i.e., an answer to a question has a higher variability than only 3,000 possible answers. We employ an LSTM to dynamically generate answers. These answers show greater variability than the ground truth, and, in addition, new, previously unseen answers are generated that correctly answer the question.
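
A minimal sketch of the widely used classification formulation (embedding sizes and the fusion step are illustrative):

```python
import torch
import torch.nn as nn

class VQAClassifier(nn.Module):
    """Common VQA approach: embed image and question, fuse both vectors,
    then classify over a fixed answer set (sizes illustrative)."""
    def __init__(self, vocab_size, n_answers=3000, dim=512):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, dim)
        self.q_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.img_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.classify = nn.Linear(dim, n_answers)

    def forward(self, img_feat, question):
        _, (h, _) = self.q_lstm(self.q_embed(question))
        fused = h[-1] * self.img_proj(img_feat)  # element-wise fusion
        return self.classify(fused)              # one score per answer
```

Our model replaces the fixed classification layer with an LSTM decoder that emits the answer word by word, so the answer space is not limited to a predefined list.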

Images with associated questions and answers generated by our model. All answers shown are new ones not contained in the training set. Figures (a)-(d) show correct answers not detected by the official evaluation script. The second row shows wrong answers. In particular, (e) and (f) show sentences where the end-of-sentence token was generated too early (dataset bias towards short answers); (g) and (h) show wrong answers.

References:

Philipp Harzig, Christian Eggert, Rainer Lienhart. Visual Question Answering With a Hybrid Convolution Recurrent Model. ACM International Conference on Multimedia Retrieval 2018 (ACM ICMR 2018), Yokohama, June 2018. [PDF]


For more information please contact Philipp Harzig.

Deep Image Augmentation and Manipulation

Right: learned semantic transformation of the object in the left image.

Today's world is hugely driven by data. In certain scenarios, however, data is extremely scarce. A common solution in these scenarios is data augmentation, which in general means transforming data to increase the total amount of data available. Classic data augmentation approaches for images include cropping, resizing, rotating, illumination changes, etc. In this line of research we focus on learning image augmentation techniques. This allows for semantically meaningful changes to an image; for example, we could change the color of an object of interest like the car shown above.
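
For illustration, a classic (non-learned) augmentation pipeline as it might look with torchvision; the parameter values are arbitrary examples:

```python
import torchvision.transforms as T

# Classic, hand-crafted augmentation: each training epoch sees a slightly
# different crop, flip, rotation, and illumination of the same image.
augment = T.Compose([
    T.RandomResizedCrop(224),                      # crop + resize
    T.RandomHorizontalFlip(),
    T.RandomRotation(degrees=15),                  # small rotations
    T.ColorJitter(brightness=0.3, contrast=0.3),   # illumination changes
    T.ToTensor(),
])
# augmented = augment(pil_image)  # apply to a PIL image
```

Learned augmentation replaces these fixed geometric and photometric operations with transformations produced by a model, which is what enables semantic changes such as recoloring an object.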

Due to recent advances in the development of deep generative models, we are nowadays able to perform semantic data augmentation and produce new photorealistic examples.

For more information please contact Stephan Brehm.

Deep Image Synthesis

Images created from scratch by a Deep Neural Network

Today's world is hugely driven by data. In certain scenarios, however, data is extremely scarce. This line of research focuses on creating new data from scratch.

For decades, realistic image generation from scratch has been nothing more than a computer vision scientist's dream. Today, with recent advances in deep generative modelling, we are moving closer to the goal of creating photorealistic images.

For more information please contact Stephan Brehm.

Deep Sports Pose

Automatically estimated poses of a swimmer during the start phase (left) and a long jump athlete (right).

Video recordings of athletes are an important tool in many sport types, including swimming and long/triple jump, to evaluate performance and assess possible improvements. For a quantitative evaluation the video material often has to be annotated manually, leading to a vast workload overhead. This limits such an analysis to top-tier athletes only. In this joint project with the Olympic Training Centers (OSPs) Hamburg/Schleswig-Holstein and Hessen we research deep neural network based human pose estimation and video event detection that can be applied to various sport types and environments. We evaluate our research using the very different examples of start phases in swimming and long/triple jump recordings. Our main focus lies on time-continuous predictions and the fusion of multiple synchronous camera streams. The goal of the project is to provide a reliable and automatic pose and event detection system that makes quantitative performance evaluation accessible to more athletes more frequently.
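
As a toy illustration of why temporal context matters for time-continuous predictions (our actual models are learned, not hand-crafted filters), even simple smoothing of per-frame joint estimates removes much of their jitter:

```python
import numpy as np

def smooth_trajectories(poses, window=5):
    """Moving-average smoothing of per-frame joint estimates.

    poses: array of shape (T, J, 2) -- (x, y) for J joints over T frames.
    Per-frame detections are noisy; temporal filtering exploits the fact
    that joints move continuously from frame to frame.
    """
    kernel = np.ones(window) / window
    out = np.empty_like(poses, dtype=float)
    for j in range(poses.shape[1]):
        for c in range(poses.shape[2]):
            out[:, j, c] = np.convolve(poses[:, j, c], kernel, mode="same")
    return out
```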


This joint project is funded by the Federal Institute for Sports Science (Bundesinstitut für Sportwissenschaft, BISp) based on a resolution of the German Bundestag, starting January 2018.

For more information please visit the project page or contact Moritz Einfalt.

References:

  • Moritz Einfalt, Dan Zecha, Rainer Lienhart.
    Activity-conditioned continuous human pose estimation for performance analysis of athletes using the example of swimming.
    IEEE Winter Conference on Applications of Computer Vision 2018 (WACV18), Lake Tahoe, NV, USA, March 2018. [arXiv][PDF]
  • Rainer Lienhart, Moritz Einfalt, Dan Zecha. Mining Automatically Estimated Poses from Video Recordings of Top Athletes. IJCSS, Dec. 2018. [arXiv]

Deep Skijump Pose

Top: Continuously estimated joint trajectories are synchronized with force measurements in an effort to train a deep learning based force prediction network. Bottom: Original joint detections over multiple camera views on the left, rectified poses on the right.


In a joint effort with the Institute of Applied Training Science in Leipzig (Institut für angewandte Trainingswissenschaften, IAT) we develop a training feedback system for improving the jump posture of professional ski jumpers. In this project, we research deep learning algorithms for a continuous athlete pose and ski pose estimation and for tracking the body gravity center of ski jumpers. We use this tracking information to infer kinematic and ballistic flight parameters and to approximate external force sensor measurements, allowing for immediate training feedback with a large set of performance relevant parameters.
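
For illustration, the body's center of gravity can be approximated from estimated joint positions as a mass-weighted sum over body segments. The segment definitions, joint names, and mass fractions below are placeholders, not the validated biomechanical tables used in practice:

```python
import numpy as np

# Illustrative segment model: each segment is defined by two joints and a
# relative mass (placeholder values, masses sum to 1.0).
SEGMENTS = [
    ("head", "neck", 0.08),
    ("neck", "hip_center", 0.45),   # trunk
    ("hip_center", "knee", 0.22),   # both legs, lumped together
    ("knee", "ankle", 0.15),
    ("shoulder", "wrist", 0.10),    # both arms, lumped together
]

def center_of_gravity(joints):
    """joints: dict mapping joint name -> np.array([x, y]).
    The CoG is the mass-weighted mean of the segment midpoints."""
    cog = np.zeros(2)
    for a, b, mass in SEGMENTS:
        cog += mass * 0.5 * (joints[a] + joints[b])
    return cog
```

Tracking this point over the flight phase is what allows kinematic and ballistic parameters to be derived from pure video data.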

This joint project was funded by the Federal Institute for Sports Science (Bundesinstitut für Sportwissenschaft, BISp) based on a resolution of the German Bundestag.

For more information please visit the project page or contact Dan Zecha.

References:

  • Dan Zecha, Christian Eggert, Moritz Einfalt, Stephan Brehm, Rainer Lienhart.
    A Convolutional Sequence to Sequence Model for Multimodal Dynamics Prediction in Ski Jumps.
    First International ACM Workshop on Multimodal Content Analysis in Sports (ACM MMSports'18), part of ACM Multimedia 2018. Seoul, Korea, October 2018. [PDF]
  • Dan Zecha, Moritz Einfalt, Christian Eggert, Rainer Lienhart.
    Kinematic Pose Rectification for Performance Analysis and Retrieval in Sports.
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2018. Salt Lake City, USA, June 2018. [PDF]


Deep Swim Pose

Sample detection and extracted stroke rate.

The success of a professional athlete depends quite strongly on the assessment and active improvement of his or her technique. In the field of competitive swimming, a quantitative evaluation is highly desirable to supplement the typical qualitative analysis. However, quantitative (manual) evaluations are very time consuming and therefore only used in individual cases.

In a joint project with the Institute of Applied Training Science in Leipzig (Institut für angewandte Trainingswissenschaften, IAT), we are developing a system for detecting a swimmer in a swimming channel and continuously estimating his or her pose in order to capture (inner-)cyclic structures and derive kinematic parameters for a biomechanical analysis. Human pose recovery in aquatic environments faces a lot of challenges, from heavily cluttered fore- and background to partial occlusion.

The purpose of this project is to build a human pose detector based on recent advancements in the field of deep learning. Accurately estimated joint positions are used for a precise and reliable derivation of different kinematic parameters.
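
As a simplified example of such a derivation, the stroke rate can be read off the dominant frequency of a cyclic joint trajectory (a sketch, not our exact procedure):

```python
import numpy as np

def stroke_rate(wrist_x, fps):
    """Estimate strokes per minute from the horizontal wrist trajectory.

    Swimming strokes are cyclic, so the dominant frequency of a joint
    trajectory directly estimates the stroke rate.
    """
    x = np.asarray(wrist_x, dtype=float)
    x = x - x.mean()                               # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the zero bin
    return 60.0 * dominant                         # cycles/s -> strokes/min
```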


For more information please visit the project page or contact Dan Zecha.


References:

  • Dan Zecha, Christian Eggert, Rainer Lienhart. Pose Estimation for Deriving Kinematic Parameters of Competitive Swimmers. Computer Vision Applications in Sports, part of IS&T Electronic Imaging 2017, Burlingame, California, January 2017. [PDF]
  • Dan Zecha and Rainer Lienhart. Key-Pose Prediction in Cyclic Human Motion. IEEE Winter Conference on Applications of Computer Vision 2015 (WACV15), Waikoloa Beach, HI, January 6-9, 2015 [PDF]
  • Dan Zecha, Thomas Greif, and Rainer Lienhart. Swimmer Detection and Pose Estimation for Continuous Stroke Rate Determination. Multimedia Content Access: Algorithms and Systems VI, part of IS&T/SPIE Electronic Imaging, 23 January 2012, Burlingame, California, USA
    Also Technical Report 2011-13, University of Augsburg, Institute of Computer Science, July 2011. [PDF] [Video]
  • Dan Zecha and Rainer Lienhart. Bestimmung intrazyklischer Phasengeschwindigkeiten von Schwimmern im Schwimmkanal mittels vollautomatischer Videoanalyse. Technical Report 2014-04, University of Augsburg, Institute of Computer Science, July 2014. [PDF]


Company Logo Detection

Company logos tend to appear rather small in images which poses a challenge for detection.


Social media are an important source of information for market research companies. People uploading images from their daily lives allow market research companies to gain insights into their consumption patterns. Of particular interest are indicators such as brand popularity or public brand perception. The detection of company logos, a classic object detection task, is an important building block in these analyses.

However, company logos are usually not the intended subject when taking a picture. Instead, they tend to get caught in the image by accident. As a result, company logos are often very small and suffer from low resolution or extreme viewing angles. Reliably detecting such objects is a challenging task -- even for modern deep learning-based pipelines.

We work on improving the reliability of the detection for these hard-to-detect logo instances. At the same time we need to keep the computational overhead low to allow the efficient analysis of large image datasets.

We also maintain the FlickrLogos-47 dataset, which we use as a benchmark for our algorithms. For more information please contact Christian Eggert.

References:

  • Christian Eggert, Stephan Brehm, Anton Winschel, Dan Zecha, Rainer Lienhart. A Closer Look: Small Object Detection in Faster R-CNN. IEEE ICME 2017, Hong Kong, China, July 2017. [PDF]
  • Christian Eggert, Stephan Brehm, Dan Zecha, Rainer Lienhart. Improving Small Object Proposals for Company Logo Detection. ACM ICMR 2017, Bucharest, Romania, June 2017. [arXiv] [PDF]
  • Christian Eggert, Anton Winschel, Dan Zecha, Rainer Lienhart. Saliency-guided Selective Magnification for Company Logo Detection. International Conference on Pattern Recognition 2016 (ICPR 2016), Cancun, December 2016. [PDF]


fertilized forests library

The fertilized forests project aims to provide an easy-to-use, easy-to-extend, yet fast library for decision forests. It summarizes the research in this field and provides a solid platform for extending it.

The library is thoroughly tested and highly flexible. It is available under the permissive 2-clause BSD license.

Feature highlights are:

  • Object oriented model of the unified decision forest model of Antonio Criminisi and Jamie Shotton, as well as extensions (e.g., Hough forests).
  • Templated C++ classes for maximum memory and calculation efficiency.
  • Compatible with the Microsoft Visual C++, the GNU, and the Intel compiler.
  • Platform independent serialization: train forests and trees on a Linux cluster and use them on a Windows PC.
  • Documented and consistent interfaces in C++, Python and Matlab.

First research results include the development of the newly introduced induced entropy and a successful application to uncertainty sampling in the context of self-organizing adaptive systems.
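
For orientation, the classic Shannon-entropy-based information gain that these induced entropies generalize looks as follows (a generic sketch, not the library's API):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a class label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, left_mask):
    """Gain of splitting `labels` into left/right by a boolean mask --
    the quantity a decision tree maximizes when choosing a split."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) \
                           - len(right) / n * entropy(right)
```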

References:

  • Christoph Lassner and Rainer Lienhart. Norm-induced entropies for decision forests. IEEE Winter Conference on Applications of Computer Vision 2015 (WACV15), Waikoloa Beach, HI, January 6-9, 2015.

For more information, see the project homepage or contact Christoph Lassner.


Swimmer Detection and Pose Estimation for Continuous Stroke Rate Determination

The success of a professional athlete depends quite strongly on the assessment and active improvement of his or her technique. In the field of competitive swimming, a quantitative evaluation is highly desirable to supplement the typical qualitative analysis. However, quantitative (manual) evaluations are very time consuming and therefore only used in individual cases.

In a joint project with the Institute of Applied Training Science in Leipzig (Institut für angewandte Trainingswissenschaften, IAT), we are developing a system for detecting a swimmer in a swimming channel and continuously estimating his or her pose in order to capture (inner-)cyclic structures and derive kinematic parameters for a biomechanical analysis. Human pose recovery in aquatic environments faces a lot of challenges, from heavily cluttered fore- and background to partial occlusion.

The purpose of this work is two-fold: firstly, we are developing a robust method for accurately detecting individual key poses with specifically trained object detectors. The procedure is fully automatic and retrieves stroke frequency, stroke length, and inner-cycle intervals. Secondly, we optimize our approach in terms of time consumption through algorithmic optimizations, parallelization, and GPU programming, allowing for real-time application of our system.

Sample detection and extracted stroke rate.



For more information please visit the project page or contact Dan Zecha.

References:

  • Dan Zecha, Thomas Greif, and Rainer Lienhart. Swimmer Detection and Pose Estimation for Continuous Stroke Rate Determination. Multimedia Content Access: Algorithms and Systems VI, part of IS&T/SPIE Electronic Imaging, 23 January 2012, Burlingame, California, USA
    Also Technical Report 2011-13, University of Augsburg, Institute of Computer Science, July 2011. [PDF] [Video]
  • Dan Zecha and Rainer Lienhart. Bestimmung intrazyklischer Phasengeschwindigkeiten von Schwimmern im Schwimmkanal mittels vollautomatischer Videoanalyse. Technical Report 2014-04, University of Augsburg, Institute of Computer Science, July 2014. [PDF]


2D and 3D Human Pose Estimation in Single Images

We address the task of unconstrained 2D and 3D human pose estimation in single images. Both have a wide field of applicability, ranging from video indexing over security and safety applications to entertainment purposes and markerless motion capture. The recovery of a human pose from a single image, however, is still a challenging problem. Highly articulated human poses, cluttered background, and partial or complete occlusions require robust methods. The absence of a temporal model makes this particularly challenging.

Our research goal is to develop robust algorithms for this task. We aim to design methods that are on the one hand generic and robust, but on the other hand make use of very simple techniques so that the overall complexity of the models stays low. This is extremely important in order to reach real-time capable pose estimation in images.

Examples of recovered 2D and 3D body poses in single images.



For more information please contact Thomas Greif.

References:

  • Thomas Greif, Debabrata Sengupta, Rainer Lienhart. Monocular 3D Human Pose Estimation by Classification. IEEE International Conference on Multimedia and Expo 2011 (ICME11), Barcelona, July 2011. [PDF]
  • Thomas Greif and Rainer Lienhart. A kinematic model for Bayesian tracking of cyclic human motion. IS&T/SPIE Electronic Imaging, San Jose, USA, January 2010. [PDF]

Software:
KAET (Kinect Annotation and Evaluation Tool)


Image Classification using Different Levels of Quality in Representation and Feedback

In this project, we consider the field of image classification with the help of a human expert. Image classification deals with the problem of determining the occurrence of known objects and concepts in an image. We want to extend the classic image classification approach significantly by introducing new paradigms of image representation and active learning with a human expert (i.e., suitable user interaction) in order to make it applicable to real-world image databases.

The problem of most image classification tasks today lies in the high complexity of calculating the object features as well as in the high number of possible classes and the costly annotation by a human expert. This complexity strongly influences both the training phase and the application phase. The goal of this project is to extend the conventional image classification approach by using different levels of quality in the description of an object/concept and different levels of quality in the feedback from the human expert.

This requires new algorithms that are designed to automatically determine the best level for the object description and the best form of feedback from the human expert. A central aspect is the balance of complexity and gain. The main advantage of the methods that will be developed is the intelligent and adaptive use of resources, which is superior to static methods. The savings in memory and CPU resources will have a great impact on resource-intensive and time-critical tasks (e.g., real-time image classification in a robot). With new forms of feedback from a human expert, the interaction with a classification system will be simplified, which increases the speed and robustness of the training process.

Different levels of quality in image representation and human feedback.


For more information please contact Nicolas Cebron.

References:

  • Nicolas Cebron. Active Improvement of Hierarchical Object Features under Budget Constraints, 10th IEEE International Conference on Data Mining (ICDM), Dec. 2010, Sydney, Australia. DOI: 10.1109/ICDM.2010.74
    Also Technical Report 2011-01, University of Augsburg, Institute of Computer Science, Feb. 2011. [PDF]

Unsupervised One-class Image Classification

We are developing a classification framework for digital images which is capable of identifying images that belong to a certain class. In other words, we want to design filters which find images in a given database that feature certain content (e.g. brand logos).
However, our framework should learn class models in an unsupervised manner. The user is only required to provide images which contain some common object or concept as positive training examples, without further annotation or knowledge.
Our framework then finds common properties of the positive training images based on color and visual words. Thus it consists of two main stages: a color-based pre-filter (or region-of-interest detector) and a classifier trained on histograms of visual words ("bags of words").

If we want to apply color-based filters, we have to assume that the objects we want to identify have a distinctive color distribution; that is, all instances of the object appear in a reasonably small number of different colors.
Since we want the learning process of the color model to be unsupervised, we are confronted with two major problems: first, we have to identify the colors of the object without manual annotation; second, we have to deal with color deviations due to different lighting conditions.
Besides, it is not straightforward to classify images or localize objects based on color models.

Unsupervised detection of region of interest for brand logo based on color histogram.

The second stage of our framework uses bag-of-words models to classify images. We compute spatial histograms of visual words for positive and negative training images and then train a binary classifier on these histograms. Since we want to find positive images in large-scale databases, we aim for a very low false positive rate; thus, for classification we opt for a cascade of AdaBoost classifiers.
Obviously, there is a vast number of choices to be made which influence the classification performance. For instance, many different local feature descriptors exist that can be used for the bag-of-words model. Also, the clustering process which yields our visual vocabulary and the AdaBoost classifier depend on many parameters. Therefore, our main research focus is on finding the optimal configuration and evaluating novel enhancements.
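
A minimal sketch of the bag-of-words representation itself (vocabulary size and clustering settings are arbitrary; spatial binning and the AdaBoost cascade are omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=1000):
    """Cluster local descriptors (e.g. SIFT) from all training images
    into a visual vocabulary."""
    return KMeans(n_clusters=n_words, n_init=3).fit(all_descriptors)

def bag_of_words(descriptors, vocabulary):
    """Represent one image as a normalized visual-word histogram."""
    words = vocabulary.predict(descriptors)            # nearest visual word
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)
```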

For more information please contact Christian Ries.

Automatic Detection of Offensive or Illegal Images

In a joint project with Advanced U.S. Technology Group (ATG) we are working on filtering and detection techniques for offensive and illegal images. This is a one-class image classification problem and thus closely related to the project on Unsupervised One-class Image Classification.

The purpose of this work is two-fold. Our first goal is to reliably and quickly filter offensive images from large databases of images, for instance in order to prevent minors from being exposed to such images.

The second application is the automatic detection of illegal image content. In this project we work jointly with Swiss authorities. For example, we want to facilitate the work of police officers and prosecutors who have to search a suspect's storage device for illegal content.

For more information please contact Christian Ries.

Feature Bundling for Object Retrieval / Logo Recognition

Computer vision and image retrieval are inherently linked with methods that describe visual information and, by this, the spatial layout of image intensities and colors. Analogous to sentences, where the position of single words is subject to grammar rules, the position of visual structures in images is not arbitrary but depends on the depicted content. In other words, the spatial distribution of individual visual features has a semantic meaning.
Derivation and storage of feature bundles for a logo brand.

In this project we explore feature bundling techniques suitable for object retrieval and logo recognition. Discriminative visual signatures that include both visual and spatial information are formed by bundling local features within a combined representation. Each bundle is stored in a hash-based index and associated with the underlying object class. Multi-class recognition of objects in unknown test images is then performed by testing if bundles of the test image are contained in this index.

Several examples of feature bundles.
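
To make the hash-based indexing concrete, here is a toy bundle index; the bundling details (what constitutes a bundle and how it is hashed) are illustrative, not our exact scheme:

```python
from collections import defaultdict
from itertools import combinations

# Toy bundle: the visual-word id of a central feature combined with pairs
# of its neighbors' word ids. The sorted tuple serves as the hash key.
index = defaultdict(set)

def add_bundles(features, label):
    """features: list of (center_word, [neighbor_words]) per image."""
    for center, neighbors in features:
        for pair in combinations(sorted(neighbors), 2):
            index[(center,) + pair].add(label)   # bundle -> object classes

def query(features):
    """Vote over matching bundles; return the best-supported class."""
    votes = defaultdict(int)
    for center, neighbors in features:
        for pair in combinations(sorted(neighbors), 2):
            for label in index.get((center,) + pair, ()):
                votes[label] += 1
    return max(votes, key=votes.get) if votes else None
```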

For more information please contact Stefan Romberg.

References:

  • Stefan Romberg, Lluis Garcia Pueyo, Rainer Lienhart, Roelof van Zwol. Scalable Logo Recognition in Real-World Images. ACM International Conference on Multimedia Retrieval 2011 (ICMR11), Trento, April 2011.
    Also Technical Report 2011-04, University of Augsburg, Institute of Computer Science, March 2011 [PDF] [Slides] [Dataset]

Learning to Reassemble Shredded Documents

All images are taken from 'Bild der Wissenschaft 08/2010'
The problem of having to reconstruct shredded documents is often faced by historians and forensic investigators. For instance, there is currently ongoing work on reassembling documents related to the Stasi, the secret police of the GDR.

However, reconstructing documents is a difficult and laborious job due to the large number of permutations of fragment arrangements. For this reason, this project deals with the automation of the reassembly process, which incorporates the use of various local image features as well as combinatorial optimization strategies.
Our approach is evaluated on a real world dataset consisting of magazine pages that have been shredded by hand.

For more information please contact Fabian Richter.

References:

  • Fabian Richter, Christian X. Ries, Rainer Lienhart. A Graph Algorithmic Framework for the Assembly of Shredded Documents. IEEE International Conference on Multimedia and Expo 2011 (ICME11), Barcelona, July 2011
    Also Technical Report 2011-05, University of Augsburg, Institute of Computer Science, March 2011 [PDF]

Image Retrieval on Large Scale Image Databases

Nowadays there exist online image repositories containing hundreds of millions of images of all kinds of quality, size and content.

These image repositories grow day by day, making techniques for navigating, indexing, and searching them essential. Currently, indexing is mainly based on manually entered tags and/or individual and group usage patterns. Manually entered tags, however, are very subjective and do not necessarily refer to the shown image content. This subjectivity and ambiguity of tags makes image retrieval based on manually entered tags difficult.

In this project we employ the image content as the source of information to retrieve images and study the representation of images by topic models. The developed approaches are evaluated on real world, large scale image databases.
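
For illustration, the EM updates of a basic pLSA model on a document-word count matrix can be written in a few lines of numpy (a didactic sketch; our multilayer and multimodal variants extend this base model):

```python
import numpy as np

def plsa(n_dw, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a document-word count matrix n_dw of shape (D, W).
    Returns p(w|z) of shape (Z, W) and p(z|d) of shape (D, Z)."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w), shape (D, W, Z).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]   # p(z|d) p(w|z)
        joint /= joint.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts.
        nz = n_dw[:, :, None] * joint                     # n(d,w) p(z|d,w)
        p_w_z = nz.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

For images, the "words" are visual words, and the learned topic mixture p(z|d) serves as a compact image representation for retrieval.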
References:
  • Rainer Lienhart, Eva Hörster, Stefan Romberg. Multilayer pLSA for Multimodal Image Retrieval. ACM International Conference on Image and Video Retrieval (CIVR 2009), July 8-10, 2009.
    Also Technical Report 2009-02, University of Augsburg, Institute of Computer Science Apr. 2009 [PDF]
  • Eva Hörster, Rainer Lienhart and Malcolm Slaney. Image Retrieval on Large-Scale Image Databases. ACM International Conference on Image and Video Retrieval (CIVR) 2007 pp. 17-24, Amsterdam, Netherlands, July 2007. also Technical Report Apr. 2007 [PDF]
  • Eva Hörster and Rainer Lienhart. Fusing Local Image Descriptors for Large-Scale Image Retrieval. International Workshop on Semantic Learning Applications in Multimedia (SLAM), Minneapolis, USA, June 2007. also as Technical Report [PDF]
  • Rainer Lienhart and Malcolm Slaney. PLSA on Large Scale Image Databases. IEEE International Conference on Acoustics, Speech and Signal Processing 2007 (ICASSP 2007), Hawaii, USA, April 2007. also Technical Report Dec. 2006 [PDF]

An annotated data set for pose estimation of swimmers

In this work we present an annotated data set for two-dimensional pose estimation of swimmers. The data set contains fifteen cycles of swimmers swimming backstroke, with more than 1200 annotated video frames. A wide variety of subjects was used to create this data set, ranging from adult to teenage swimmers, both male and female. For each frame of a cycle, the absolute positions of fourteen points corresponding to human joints were manually labeled.

The data set proves to be very challenging with respect to partial occlusions and a high amount of background noise; however, it does not contain any out-of-plane motions that would further complicate the task of full-body pose estimation. It thus aims at pose estimation and pose tracking algorithms trying to advance the field of recovering human poses in videos with frequently missing parts and under difficult conditions.

We explain in detail the creation of the data set, discuss the difficulties we faced, and finally demonstrate how it is used to create a training data set containing normalized cycles for action-specific pose tracking.



The data set is available for download.

References:
Thomas Greif and Rainer Lienhart. An annotated data set for pose estimation of swimmers. Technical Report, 2009. [PDF]

For more information please contact Thomas Greif.

On the Optimal Placement of Multiple Visual Sensors

Visual sensor arrays are used in many novel multimedia applications such as video surveillance, sensing rooms, assisted living or immersive conference rooms. Often several different types of cameras are available. They differ in their ranges of view, intrinsic parameters, image sensor resolutions, optics, and costs.

Most of the above-mentioned applications require the layout of video sensors to assure a minimum level of image quality or image resolution. Thus, an important issue in designing visual sensor arrays is the appropriate placement of the cameras such that they achieve one or multiple predefined goals. As video sensor arrays are getting larger, efficient camera placement strategies need to be developed.
Resulting configuration computed by the greedy approach.
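
Such a greedy strategy can be sketched in a few lines: given precomputed visibility between candidate camera poses and sample points, repeatedly pick the camera that adds the most new coverage. The setup is illustrative; the real placement problem additionally incorporates the quality and resolution constraints described above:

```python
import numpy as np

def greedy_placement(coverage, n_cameras):
    """coverage[i, j] = True if candidate camera pose i covers sample
    point j (precomputed from field of view, range, etc.). Greedily pick
    the pose with the largest marginal gain -- a standard heuristic for
    this NP-hard selection problem.
    """
    covered = np.zeros(coverage.shape[1], dtype=bool)
    chosen = []
    for _ in range(n_cameras):
        gains = (coverage & ~covered).sum(axis=1)   # newly covered points
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                                   # nothing left to gain
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered.mean()   # selected poses, coverage ratio
```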
For more information on optimal camera placement please contact Eva Hörster.

Audio Brush: What You See is What You Hear

Hearing, analyzing, and evaluating sounds is possible for everyone. The reference sensor for audio, the human ear, has amazing capabilities and high quality. In contrast, editing and synthesizing audio is an indirect and non-intuitive task requiring great expertise.

To overcome these limitations we are creating Audio Brush, a smart visual audio editing tool. Audio Brush allows editing the spectrogram of a sound in the visual domain, similar to editing bitmaps. At its core is a very flexible audio spectrogram based on Gabor analysis and synthesis. It gives maximum accuracy of the representation, is fully invertible, and enables manipulating the signal at any chosen time-frequency resolution.
Audio Brush screenshot.
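
The analysis/modification/synthesis loop can be illustrated with a plain STFT round trip (a simple stand-in for the flexible Gabor representation used in Audio Brush):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)        # 1 s of a 440 Hz tone

# Analysis: complex time-frequency representation of the signal.
f, ts, Z = stft(x, fs=fs, nperseg=1024)

# "Paint" in the time-frequency plane: silence everything above 1 kHz.
Z[f > 1000, :] = 0

# Synthesis: invert the modified spectrogram back to a waveform.
_, y = istft(Z, fs=fs, nperseg=1024)
```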
For more information on Audio Brush please contact Gregor van den Boogaart.


Real-Time Event Detection and Control in Live Video Streams

It is nowadays very common that public places such as pubs, restaurants, and fitness clubs have large TV screens to entertain their customers -- especially during national or international sports championships. For the venue owners it would be desirable if they could control which commercials are shown to their audience. In other words, they may have the desire to replace untargeted commercials with targeted commercials of their choice.

In this joint project with Half Minute Media Ltd. we research algorithms for robust real-time commercial detection and control (such as replacement) in live streams. In particular, we are developing fast and extremely reliable algorithms for
  • Mining video channels automatically in order to extract all commercials, and
  • Detecting known commercials in live streams using highly compact, but discriminative clip descriptors.
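
As a toy illustration of clip-descriptor matching (real clip descriptors are far more compact and discriminative than this gray-level histogram):

```python
import numpy as np

def frame_signature(frame, bins=8):
    """Compact per-frame descriptor: a coarse gray-level histogram of a
    grayscale frame (purely illustrative)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def find_commercial(stream_sigs, clip_sigs, threshold=0.05):
    """Slide the known commercial's signature sequence over the live
    stream and report positions where the mean per-frame distance is
    small enough to count as a detection."""
    n = len(clip_sigs)
    hits = []
    for start in range(len(stream_sigs) - n + 1):
        window = np.asarray(stream_sigs[start:start + n])
        dist = np.abs(window - np.asarray(clip_sigs)).mean()
        if dist < threshold:
            hits.append(start)
    return hits
```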

References:

  • Rainer Lienhart, Christoph Kuhmünch and Wolfgang Effelsberg. On the Detection and Recognition of Television Commercials. Proc. IEEE Conf. on Multimedia Computing and Systems, Ottawa, Canada, pp. 509-516, June 1997. Also Technical Report TR-96-016, University of Mannheim, December 1996.

Bayesian Face Recognition on Infrared Image Data

The availability of high-performance, low-cost desktop computing systems and digital camera equipment has given rise to public interest in applications that include the visual identification of human individuals. Examples of such applications are surveillance, biometric identification, and human-computer interaction.

Research in biometric technologies thus follows naturally. Among other modalities, images of human faces offer a non-intrusive and easy-to-use means of identification. Although the recognition of faces is a problem that is effortlessly solved by human beings during their daily routine, it poses a challenge for researchers and scientists. Boundary conditions like illumination and occlusion, as well as the pose and expression of an individual, lead to intrapersonal variations that often exceed those between images of different persons under similar conditions.

In association with Falcontrol Security GmbH we are researching reliable face recognition algorithms by using Bayesian methods on infrared image data.

Parallel Algorithms for Fast Machine Learning

Machine learning applications are emerging as the most promising approaches to many current problems in computer science. However, machine learning algorithms typically require the processing of large data sets and thus long training times (sometimes on the order of several days or even weeks). Especially for newly developed approaches, high-performance implementations are not available; most implementations are designed with a serial model of execution in mind.
At the same time, shared-memory multiprocessing architectures are becoming more and more commonplace. The computational power of these machines could be used to solve machine learning problems much faster and in parallel, if we only knew how to properly exploit it.

The goal of our research is to reduce training times and speed up machine learning algorithms by developing design patterns and strategies for parallelizing them on multiprocessor computers.
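
As a simple example of the coarse-grained parallelism we target, independent training runs (cross-validation folds, hyperparameter settings) map directly onto the cores of a shared-memory machine; the workload below is a placeholder:

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def train_and_score(seed):
    """Stand-in for one expensive, independent training run
    (e.g. one fold of cross-validation or one parameter setting)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=1000)
    for _ in range(1000):          # dummy compute-bound workload
        w = 0.999 * w
    return float(np.abs(w).mean())

if __name__ == "__main__":
    # Independent jobs achieve near-linear speedup on a multiprocessor,
    # since no communication is needed until the results are collected.
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(train_and_score, range(8)))
    print(scores)
```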