ANIMAL-10N Dataset Description
Data Collection: To introduce human error into the image labeling process, we first defined five pairs of "confusing" animals:
{(cat, lynx), (jaguar, cheetah), (wolf, coyote), (chimpanzee, orangutan), (hamster, guinea pig)}, where the two animals in each pair look very similar. We then crawled 6,000 images for each of the ten animals from Google and Bing, using the animal name as the search keyword. In total, 60,000 images were collected.
Data Labeling: For human labeling, we recruited 15 participants (ten undergraduate and five graduate students) through the KAIST online community. They were trained for one hour on the characteristics of each animal before the labeling process, and each of them was asked to annotate 4,000 images with the animal names within a week, where an equal number of images (i.e., 400) was drawn from each animal. More specifically, we combined the images for each pair of animals into a single set and provided each participant with five sets; hence, a participant categorized 800 images into one of two animals, five times. After the labeling process was complete, we paid each participant about US $150. Finally, excluding irrelevant images, the participants generated labels for 55,000 images. Note that these labels may contain human mistakes because we intentionally mixed confusing animals.
Data Organization: We randomly selected 5,000 images for the test set and used the remaining 50,000 images for the training set. Because the test set should be free from noisy labels, only the images whose label matches the search keyword were considered for the test set (a minimal sketch of this split is given after the table). In addition, the images are almost evenly distributed across the ten classes (or animals) in both the training and test sets, as shown in the table below.
Label | Number of Samples in Training | Number of Samples in Testing
0: Cat | 5,466 | 557
1: Lynx | 4,608 | 485
2: Wolf | 5,091 | 423
3: Coyote | 4,841 | 410
4: Cheetah | 4,981 | 509
5: Jaguar | 4,913 | 524
6: Chimpanzee | 5,322 | 620
7: Orangutan | 4,999 | 557
8: Hamster | 4,970 | 440
9: Guinea pig | 4,809 | 475
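For reference, the split can be reproduced with a procedure like the one below. This is a minimal sketch only: the record fields human_label and search_keyword are illustrative names, not part of the released files.
# Minimal sketch of the train/test split described above; the fields human_label and
# search_keyword are illustrative, not part of the released files.
import random

def split_animal10n(records, test_size=5000, seed=0):
    rng = random.Random(seed)
    # Only images whose human label matches the crawl keyword may enter the test set,
    # which keeps the test set free from noisy labels.
    clean_candidates = [r for r in records if r['human_label'] == r['search_keyword']]
    test_set = rng.sample(clean_candidates, test_size)
    test_ids = {id(r) for r in test_set}
    train_set = [r for r in records if id(r) not in test_ids]
    return train_set, test_set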
Noise Rate Estimation by Accuracy: Because the ground-truth labels are unknown, we estimated the noise rate τ by cross-validation with a grid search. We trained DenseNet (L=25, k=12) using SELFIE on the 50,000 training images and evaluated the performance on the 5,000 test images. Searching the grid τ ∈ {0.06, 0.07, ..., 0.13} in increments of 0.01, we found that τ = 0.08 achieved the best performance. Therefore, we set the noise rate to τ = 0.08 for ANIMAL-10N.
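A minimal sketch of this grid search is given below; train_and_evaluate is a hypothetical helper (not part of the released code) that trains DenseNet (L=25, k=12) with SELFIE at a given τ and returns the accuracy on the 5,000 test images.
# Hedged sketch of the noise-rate grid search; train_and_evaluate is a hypothetical
# helper that trains DenseNet (L=25, k=12) with SELFIE at the given noise rate and
# returns the accuracy on the 5,000 test images.
def estimate_noise_rate(train_and_evaluate):
    candidates = [round(0.06 + 0.01 * i, 2) for i in range(8)]  # 0.06, 0.07, ..., 0.13
    accuracies = {tau: train_and_evaluate(tau) for tau in candidates}
    return max(accuracies, key=accuracies.get)  # the search described above selected tau = 0.08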
Noise Rate Estimation by Human Inspection: We also estimated the noise rate τ by human inspection to verify the result of the grid search. To this end, we randomly sampled 6,000 images and acquired two more labels for each of them in the same way, so that every sampled image had three human labels. Meanwhile, human experts different from the 15 participants carefully examined the 6,000 images to obtain the ground-truth labels. In the figure below, which compares the human labels with the ground-truth labels, the first number in each legend entry is the number of votes for the true label, and the second is the number of votes for the other label. Because three votes were available for each image, for a conservative estimate, the final human label was decided by majority: the two cases of 3:0 and 2:1 were regarded as correct labeling, and the other two cases of 1:2 and 0:3 were regarded as incorrect labeling.
Overall, the proportion of incorrect human labels in the sample was 4.08% + 2.36% = 6.44%, which is fairly close to the τ = 0.08 obtained by the grid search.
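The corresponding computation can be sketched as follows; the input format (one count per inspected image of how many of its three human labels match the expert label) is illustrative.
# Sketch of the vote-based estimate: an image counts as correctly labeled when a
# majority (3:0 or 2:1) of its three human labels agrees with the expert ground truth.
def estimate_noise_from_votes(votes_for_true_label):
    # votes_for_true_label: one integer in {0, 1, 2, 3} per inspected image.
    incorrect = sum(1 for v in votes_for_true_label if v <= 1)  # the 1:2 and 0:3 cases
    return incorrect / len(votes_for_true_label)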
Result with Realistic Noise: The table below summarizes the best test errors (%) of the four training methods using the two architectures on ANIMAL-10N. With both architectures, SELFIE achieved the lowest test error. Specifically, SELFIE reduced the absolute test error by up to 0.9pp with DenseNet (L=25, k=12) and 2.4pp with VGG-19. Thus, SELFIE maintained its dominance over the other methods under realistic noise, though the performance gain was smaller because of the light noise rate (i.e., 8%).
Method | DenseNet (L=25, k=12) | VGG-19
Default | 17.9±0.02 | 20.6±0.14
ActiveBias | 17.6±0.17 | 19.5±0.26
Coteaching (τ = 0.08) | 17.5±0.17 | 19.8±0.13
SELFIE (τ = 0.08) | 17.0±0.10 | 18.2±0.09
Data Format:
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., as well as test_batch.bin.
Each of these files is formatted as follows:
<id (4 bytes)><label (4 bytes)><image (depth x height x width bytes)>
...
<id (4 bytes)><label (4 bytes)><image (depth x height x width bytes)>
The reading procedure is similar to that of a popular CIFAR-10 tutorial.
# You can read our binary files as below (TensorFlow 1.x queue-based input pipeline):
import tensorflow as tf

HEIGHT, WIDTH, DEPTH = 64, 64, 3   # each image is a 64x64 RGB image
ID_BYTES = 4                       # 4-byte image id
LABEL_BYTES = 4                    # 4-byte label (0-9)
RECORD_BYTES = ID_BYTES + LABEL_BYTES + HEIGHT * WIDTH * DEPTH

filename_queue = tf.train.string_input_producer(['data_batch_1.bin'])
reader = tf.FixedLengthRecordReader(record_bytes=RECORD_BYTES)
file_name, value = reader.read(filename_queue)

# Decode the fixed-length record into raw bytes and slice out each field.
byte_record = tf.decode_raw(value, tf.uint8)
image_id = tf.strided_slice(byte_record, [0], [ID_BYTES])
image_label = tf.strided_slice(byte_record, [ID_BYTES], [ID_BYTES + LABEL_BYTES])
array_image = tf.strided_slice(byte_record, [ID_BYTES + LABEL_BYTES], [RECORD_BYTES])

# The image bytes are stored depth-major; reshape and transpose to [height, width, depth].
depth_major_image = tf.reshape(array_image, [DEPTH, HEIGHT, WIDTH])
image = tf.transpose(depth_major_image, [1, 2, 0])
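If you prefer not to use the deprecated TF 1.x queue API, the same record layout can be parsed directly with NumPy. The sketch below assumes 64x64x3 images and little-endian 4-byte id/label fields; please verify these assumptions against the official reader.
# NumPy sketch for the same <id><label><image> record layout; assumes 64x64x3 images
# and little-endian 4-byte id/label fields (verify against the official reader).
import numpy as np

HEIGHT, WIDTH, DEPTH = 64, 64, 3
ID_BYTES, LABEL_BYTES = 4, 4
RECORD_BYTES = ID_BYTES + LABEL_BYTES + HEIGHT * WIDTH * DEPTH

def read_batch(path):
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, RECORD_BYTES)
    ids = raw[:, :ID_BYTES].copy().view('<u4').ravel()
    labels = raw[:, ID_BYTES:ID_BYTES + LABEL_BYTES].copy().view('<u4').ravel()
    # Images are stored depth-major; reorder each record to [height, width, depth].
    images = raw[:, ID_BYTES + LABEL_BYTES:].reshape(-1, DEPTH, HEIGHT, WIDTH)
    return ids, labels, images.transpose(0, 2, 3, 1)

ids, labels, images = read_batch('data_batch_1.bin')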
For more information, please refer to our official GitHub page.