ANR Project

Robovox Dataset

This challenge introduces a novel benchmark that complements previous work and aims to foster research in far-field single-channel and multi-channel speaker verification. The evaluation benchmark consists of voice dialogues recorded by a robot in various acoustic conditions.

Robovox is a French corpus recorded by a mobile robot (E4) in the framework of the ANR project RoboVox. The robot is equipped with a speaker recognition system intended to operate in noisy environments. Three microphones are mounted on the corners of the robot (Micro #1, Micro #2, Micro #3), and a fourth is embedded inside the robot (Micro #4). A fifth microphone (Micro #5), placed close to the speaker's mouth, serves as the ground truth microphone. The microphones are depicted in Fig. 1. The speech files are recorded from conversations between Robovox and the speakers. Robovox plays its own utterances through a loudspeaker positioned beneath the robot.

The dataset includes 78 speakers. Each speaker took part in between 24 and 36 conversations with the robot, which results in 2219 conversations in total. Each conversation contains 5 dialogues (speaker turns) on average, so the total number of recorded dialogues is approximately 11,000. The average length of a dialogue is 3.6 seconds.
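The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the rounded averages quoted in the text; the total duration it prints is an estimate derived from those averages, not a figure published with the dataset.

```python
# Figures quoted in the dataset description.
n_speakers = 78
n_conversations = 2219
turns_per_conversation = 5   # average number of speaker turns
mean_turn_seconds = 3.6      # average length of one turn

# Average number of conversations per speaker (stated range: 24 to 36).
conversations_per_speaker = n_conversations / n_speakers

# Estimated number of speaker turns and total speech duration.
n_turns = n_conversations * turns_per_conversation
total_hours = n_turns * mean_turn_seconds / 3600

print(f"conversations per speaker: {conversations_per_speaker:.1f}")
print(f"estimated speaker turns:   {n_turns}")
print(f"estimated speech duration: {total_hours:.1f} hours")
```

The estimate of roughly 11,000 turns matches the stated total, and it suggests on the order of 11 hours of speaker turns per channel.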

Each recording has 8 channels. The channel information is as follows:

  • Channels 1 to 3: microphones on the corners of the robot;
  • Channel 4: microphone embedded inside the robot;
  • Channel 5: ground truth microphone, which is close to the speaker;
  • Channel 6: unused channel;
  • Channels 7 and 8: the robot’s dialogue turns.
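The channel layout above can be sketched in code as follows. This is a minimal sketch that assumes the recordings are stored as 16-bit PCM WAV files, which the description does not state; the role labels in CHANNEL_ROLES and the helper names split_channels and synthetic_recording are hypothetical and used only for illustration. It relies on the Python standard library only.

```python
import array
import io
import sys
import wave

# Channel layout from the list above (1-based). The labels are hypothetical.
CHANNEL_ROLES = {
    1: "robot_corner_mic_1",
    2: "robot_corner_mic_2",
    3: "robot_corner_mic_3",
    4: "robot_internal_mic",
    5: "ground_truth_mic",
    6: "unused",
    7: "robot_turns_a",
    8: "robot_turns_b",
}


def split_channels(wav_file):
    """Return {channel_number: int16 samples} for a 16-bit PCM WAV.

    `wav_file` is a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        n_channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Samples are interleaved frame by frame: ch1, ch2, ..., chN, ch1, ...
    return {ch + 1: samples[ch::n_channels] for ch in range(n_channels)}


def synthetic_recording(n_frames=3, n_channels=8):
    """Build an in-memory WAV where sample value = channel * 10 + frame index."""
    frames = array.array(
        "h",
        [ch * 10 + f for f in range(n_frames) for ch in range(1, n_channels + 1)],
    )
    if sys.byteorder == "big":
        frames.byteswap()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(n_channels)
        wav.setsampwidth(2)
        wav.setframerate(16000)  # arbitrary rate for the demo
        wav.writeframes(frames.tobytes())
    buf.seek(0)
    return buf


channels = split_channels(synthetic_recording())
far_field = [channels[n] for n in (1, 2, 3, 4)]  # robot microphones
reference = channels[5]                          # ground truth microphone
```

Keeping the far-field channels and the ground truth channel separate in this way makes it straightforward to score the same utterance on each microphone and compare the results.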

It is worth noting that the clean signal recorded by Channel 5 provides a best-case baseline system and makes it possible to measure how much performance degrades on the far-field microphones. An example of a recorded signal spectrum is depicted in the following figure:


The files are recorded at different distances and in different acoustic environments, with the following main settings:

  • 1m, 2m, and 3m: distance of the speaker from the robot (1, 2, or 3 meters).
  • hall, open space, small room (open/closed), and medium room (open/closed): the sessions are recorded in different rooms/environments; in the meeting rooms, the door is either open or closed.
  • wall, center, and corner: the robot is placed close to a wall (or window), in the center of the room, or in a corner, respectively. Severe reverberation can be observed.
  • calm or noisy: Level of noise in the environment.
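The settings above can be represented as a small validated record, for example to group results by condition. This is a minimal sketch: the RecordingCondition class, its field names, and the string labels are hypothetical and do not reflect the dataset's actual metadata format or file naming. It also does not assume that every combination of settings occurs in the corpus.

```python
from dataclasses import dataclass
from typing import Optional

DISTANCES_M = (1, 2, 3)
ENVIRONMENTS = ("hall", "open space", "small room", "medium room")
ROOMS_WITH_DOOR = ("small room", "medium room")  # door state applies here only
POSITIONS = ("wall", "center", "corner")
NOISE_LEVELS = ("calm", "noisy")


@dataclass(frozen=True)
class RecordingCondition:
    distance_m: int
    environment: str
    position: str
    noise: str
    door: Optional[str] = None  # "open" or "closed", meeting rooms only

    def __post_init__(self):
        if self.distance_m not in DISTANCES_M:
            raise ValueError(f"unknown distance: {self.distance_m}")
        if self.environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment}")
        if self.position not in POSITIONS:
            raise ValueError(f"unknown position: {self.position}")
        if self.noise not in NOISE_LEVELS:
            raise ValueError(f"unknown noise level: {self.noise}")
        has_door = self.environment in ROOMS_WITH_DOOR
        if has_door and self.door not in ("open", "closed"):
            raise ValueError("meeting rooms need door='open' or door='closed'")
        if not has_door and self.door is not None:
            raise ValueError("door state only applies to meeting rooms")


# Example: small meeting room, door closed, robot in a corner, speaker at 3 m.
hard_case = RecordingCondition(3, "small room", "corner", "noisy", door="closed")
```

Because the record is frozen, instances are hashable and can serve directly as dictionary keys when aggregating scores per condition.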


This audio database is made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means that you are free to share (copy, distribute, and transmit the work) and remix (adapt the work), as long as you credit the original authors, do not use this work for commercial purposes, and share any derivative works under a similar license.

Download the dataset and evaluation protocol

After you register, the dataset will be accessible from the files tab.

Download dataset and evaluation protocols