Audio Captioning On-line Demo

This page is an on-line demo of our recent research results on audio captioning.

A full presentation of the results and method is given in our paper, "Automated Audio Captioning with Recurrent Neural Networks", available from here and presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.

Below you can find three columns. In each column you can see an audio player with three textual descriptions (captions) beneath it. The three captions correspond to the sound that you can hear from the audio player and are:

Predicted caption:
the exact caption predicted by our method. The maximum length of a predicted caption is 10 words.
Processed caption:
a processed version of the original caption, used as the target output. The processing consisted of trimming the caption to a maximum of 10 words, converting letters to lower case, and removing infrequent words, punctuation, and words not appearing in UK or US English dictionaries (according to the GNU Aspell dictionaries).
Original caption:
the original caption, as given in the metadata associated with each sound. Original captions were used only to create the Processed captions and are listed here only for reference. To keep the page readable, the original caption is hidden until you click on "Original caption:".
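
The processing steps listed for the Processed caption can be sketched roughly as follows. This is only an illustrative sketch and not the code used in the paper: the function name `process_caption` and the two word sets `frequent_words` and `dictionary_words` are hypothetical stand-ins for the frequency filter and the GNU Aspell dictionaries, and the order of the steps (filter first, trim last) is an assumption.

```python
import string


def process_caption(caption, frequent_words, dictionary_words, max_words=10):
    """Turn an original caption into a processed caption (illustrative sketch).

    frequent_words and dictionary_words are hypothetical sets of allowed
    lower-case words; the step order is an assumption.
    """
    # Convert letters to lower case.
    lowered = caption.lower()
    # Remove punctuation characters.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Keep only words that are both frequent and present in the dictionary.
    kept = [
        word
        for word in no_punct.split()
        if word in frequent_words and word in dictionary_words
    ]
    # Trim to the maximum caption length.
    return " ".join(kept[:max_words])


# Toy word set, used here for both filters.
vocab = {"music", "guitar", "rock", "distortion"}
print(process_caption("Music, Guitar, Rock, Distortion", vocab, vocab))
```

With the toy word set above, the call prints "music guitar rock distortion", which matches the first example in the Good column.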

The columns categorize the predicted captions according to the employed metrics.

The Good
Predicted captions in the first column scored well in the metrics. This means that the words appearing in the Predicted caption are the same (or almost the same) as the ones in the Processed caption, and appear in the same order.
The (not so) Bad
Predicted captions in the second column do not have a good metric score, but adequately describe the sound of the audio player. This means that the Predicted caption adequately describes the sound but might not contain any of the words in the Processed caption.
(and) The Ugly
Predicted captions in the third column neither have a good metric score nor adequately describe the sound of the audio player.
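
To illustrate why a caption can describe a sound adequately and still score poorly, here is a minimal sketch of a word-overlap score in the spirit of BLEU1. It computes clipped unigram precision only, without the brevity penalty or higher-order n-grams, and it is not the evaluation code behind the reported results. The function name `unigram_precision` is hypothetical.

```python
from collections import Counter


def unigram_precision(predicted, reference):
    """Fraction of predicted words that also occur in the reference caption.

    Each reference word can be matched at most as many times as it occurs
    in the reference (clipping), as in BLEU-style precision.
    """
    predicted_words = predicted.split()
    if not predicted_words:
        return 0.0
    reference_counts = Counter(reference.split())
    predicted_counts = Counter(predicted_words)
    matched = sum(
        min(count, reference_counts[word])
        for word, count in predicted_counts.items()
    )
    return matched / len(predicted_words)


# A pair from the Good column: every predicted word is in the processed caption.
print(unigram_precision("music guitar distortion", "music guitar rock distortion"))
# A pair from the (not so) Bad column: no word overlap at all.
print(unigram_precision("chime musical", "cartoon"))
```

The first pair scores 1.0 and the second 0.0, even though "chime musical" is arguably a reasonable description of a cartoon "boing" sound.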

All sounds and original descriptions are drawn from the ProSounds Effect Library, available here!

Our method was tested on 2960 audio files with their associated captions. The resulting metric scores for our method are:

BLEU1         BLEU2         BLEU3         BLEU4         ROUGEL        METEOR        CIDEr
0.191 ±0.004  0.129 ±0.003  0.106 ±0.003  0.094 ±0.003  0.149 ±0.002  0.092 ±0.002  0.526 ±0.012


The Good

Predicted caption:
music guitar distortion
Processed caption:
music guitar rock distortion
Music, Guitar, Rock, Distortion
Predicted caption:
water splash fast
Processed caption:
water splash 12
Water, Splash, Lunge 12
Predicted caption:
college clock striking
Processed caption:
college clock striking five
Balliol College clock striking five o'clock.
Predicted caption:
trailer dark 2
Processed caption:
trailer dark feedback
Trailer, Dark, Feedback
Predicted caption:
police radio signal
Processed caption:
three radio signal
Morse Code, "Three", Radio, Weak Signal
Predicted caption:
voice clip male police dispatch radio radio
Processed caption:
voice clip male police dispatch radio
Voice Clip, Male, Police Dispatch Radio, "I Need Transport"
Predicted caption:
bullet shell drop cement surface
Processed caption:
bullet shell 45 caliber drop cement surface
Bullet Shell, .45 Caliber, Drop, Cement Surface
Predicted caption:
impact hit heavy hard metal
Processed caption:
impact hit heavy hard metal debris
Impact, Hit, Heavy, Hard, Metal, Debris
Predicted caption:
ambience city region station
Processed caption:
ambience city traffic
Ambience, City, Traffic, Europe
Predicted caption:
gunshot revolver single shot 44 distant perspective microphone microphone
Processed caption:
gunshot pistol springfield single shot caliber distant perspective exterior microphone
Gunshot, Pistol, Springfield XDM40, Single Shot, .40 Caliber, Distant Perspective, Exterior, Microphone Towards Down Range

The (not so) Bad

Predicted caption:
chime musical
Processed caption:
cartoon
boing cartoon 15
Predicted caption:
crowd general with speech
Processed caption:
in hall mixed chatter quiet people
Audience in hall, mixed chatter, fairly quiet. (30 people)
Predicted caption:
water stream
Processed caption:
loop
Creek Flow, Loop
Predicted caption:
monster growl
Processed caption:
zombie creature heavy monster
Zombie Creature Breathe, Heavy, Monster, Creature
Predicted caption:
bullet wood empty
Processed caption:
hand
Suction Cup, Hand
Predicted caption:
monster alien
Processed caption:
door
stone, drag, stone door, manhole, dragging
Predicted caption:
industrial ambience room
Processed caption:
house monster
haunted, house, Halloween, spooky, ghost, ghosts, scary, fright, haunt, spook, fear, monster, monsters
Predicted caption:
body fall crack hit roll
Processed caption:
rock drop forest heavy
Rock Drop, Forest, Heavy
Predicted caption:
on motor
Processed caption:
and
JCB excavator levelling and excavating.
Predicted caption:
power
Processed caption:
crescendo
Crescendo

(and) The Ugly

Predicted caption:
open doors
Processed caption:
siren
Ferry: CrossChannel, `Dover', bridge, siren sounded.
Predicted caption:
debris car break
Processed caption:
typewriter bell
Typewriter Return, Bell
Predicted caption:
girl horror
Processed caption:
musical
Whooshes, Musical
Predicted caption:
wood shake
Processed caption:
plastic bag carpet surface metal
Plastic Bag Drag, Carpet Surface, Metal Objects
Predicted caption:
car door slide central wood slow
Processed caption:
distortion
Whooshes, Distortion
Predicted caption:
new chimes clock and
Processed caption:
fire engine with horn
Fire engine departs with horn
Predicted caption:
laugh clip
Processed caption:
animals birds crow bird
Animals, Birds, Crow, Caw, Bird
Predicted caption:
flying horses walk voice grass loud
Processed caption:
cu from being by adult food at and bird island
Wandering Albatross. CU frenetic bill tapping from chick being fed by adult. Food transfer at 2m25s and 3m28s. Top Meadows, Bird Island
Predicted caption:
beeps start up system
Processed caption:
cat
cat meow 17
Predicted caption:
church atmosphere species low
Processed caption:
interior whistle moves off and speed
Interior, guard's whistle, moves off and gathers speed.