Skip to content

Spectrogram preprocessing

ESO trains on two datasets per species. A preprocessed dataset, used to train the baseline CNN. An unprocessed dataset, used to train the per-chromosome CNNs and to evaluate the final ESO chromosome. Both are derived from the same audio files.

The preprocessed dataset applies a low-pass filter and downsampling. The unprocessed dataset does not. The motivation is that ESO is meant to replace these preprocessing steps with band selection.

flowchart LR
    A[Audio .wav files] --> B[Annotations<br/>SVL or XML]
    A --> P1[Preprocessed branch]
    A --> P2[Unprocessed branch]
    P1 --> F1[Low-pass filter<br/>+ downsample]
    F1 --> S1[Segment + mel-spectrogram]
    P2 --> S2[Segment + mel-spectrogram]
    B --> S1
    B --> S2
    S1 --> T1[Train / val / test<br/>baseline CNN]
    S2 --> T2[Train / val / test<br/>ESO chromosomes]

Annotations

AnnotationReader parses Sonic Visualiser SVL files (and equivalent XML formats) into per-event records.

The labelled events define presence segments. Absence segments are sampled from the remaining acoustic material, including biophony, geophony, and anthropophony. The paper used a 60/20/20 split for training, validation, and test, computed before preprocessing so that no audio file appears in two splits.

Spectrograms

Mel-spectrograms are produced with a Hann window of size n_fft samples and a stride of hop_length samples, followed by a mel filter bank of n_mels bands.

The paper's per-species values for the two datasets are reproduced below for reference. They illustrate that ESO's unprocessed datasets retain the full available bandwidth and use the original sampling rate, while the baseline's preprocessed datasets are filtered and downsampled to twice the species' Nyquist rate.

Field Hainan gibbon Thyolo Alethe Pin-tailed Whydah
Recording rate (Hz) 9 600 32 000 48 000
Low-pass cut-off (Hz) 2 000 3 100 9 000
Downsample rate (Hz) 4 800 6 400 18 400
Segment duration (s) 4 1 2
n_fft 1 024 1 024 1 024
hop_length (preprocessed) 256 256 256
hop_length (unprocessed) 256 256 512
Preprocessed mel-spectrogram 128 x 76 128 x 26 128 x 144
Unprocessed mel-spectrogram 128 x 151 128 x 126 128 x 188

For the Pin-tailed Whydah, the unprocessed hop_length is doubled to 512 to keep the spectrogram width comparable to the other datasets, given the higher original sampling rate. The same logic applies to any high-sample-rate dataset.

Class balancing

The presence and absence training sets are typically imbalanced. The paper uses three augmentation methods to expand the minority class, all applied proportionally.

Method Description
Time shift Pick a time point inside the segment. Shift the segment so it starts there. Wrap the tail to the beginning so the duration is preserved.
Blend Pick a minority-class segment and a negative-class segment. Combine them as \(\alpha x_{s1} + (1 - \alpha) x_{s2}\) with \(\alpha = 0.2\).
Noise Add Gaussian noise (mean 0, std 1) scaled by 0.0009.

For the splits used in the paper, the augmented training sets contain roughly equal numbers of presence and absence segments.

Field reference

See PreprocessingConfig for the full list of fields and their defaults.