Spectrogram preprocessing
ESO trains on two datasets per species: a preprocessed dataset, used to train the baseline CNN, and an unprocessed dataset, used to train the per-chromosome CNNs and to evaluate the final ESO chromosome. Both are derived from the same audio files.
The preprocessed dataset applies a low-pass filter and downsampling. The unprocessed dataset does not. The motivation is that ESO is meant to replace these preprocessing steps with band selection.
flowchart LR
A[Audio .wav files] --> B[Annotations<br/>SVL or XML]
A --> P1[Preprocessed branch]
A --> P2[Unprocessed branch]
P1 --> F1[Low-pass filter<br/>+ downsample]
F1 --> S1[Segment + mel-spectrogram]
P2 --> S2[Segment + mel-spectrogram]
B --> S1
B --> S2
S1 --> T1[Train / val / test<br/>baseline CNN]
S2 --> T2[Train / val / test<br/>ESO chromosomes]
Annotations
AnnotationReader parses Sonic Visualiser SVL files (and equivalent XML formats) into per-event records.
The labelled events define presence segments. Absence segments are sampled from the remaining acoustic material, including biophony, geophony, and anthropophony. The paper used a 60/20/20 split for training, validation, and test, computed before preprocessing so that no audio file appears in two splits.
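A file-level split of this kind can be sketched as follows. This is a minimal illustration, not the project's implementation: `split_files`, the seed, and the file names are hypothetical, and the sketch assumes the 60/20/20 proportions are applied to the list of audio files.

```python
import random


def split_files(files, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole audio files to train/val/test (hypothetical helper).

    Splitting the file list, rather than the segments cut from it,
    means no file can contribute segments to more than one split.
    """
    files = sorted(files)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(files)
    n_train = round(fractions[0] * len(files))
    n_val = round(fractions[1] * len(files))
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],   # remainder goes to test
    }


# Illustrative file names only.
splits = split_files([f"rec_{i:02d}.wav" for i in range(10)])
print({name: len(members) for name, members in splits.items()})
```

Segments and spectrograms are then generated per split, so the filtering and downsampling steps never see material from another split.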
Spectrograms
Mel-spectrograms are produced with a Hann window of size n_fft samples and a stride of hop_length samples, followed by a mel filter bank of n_mels bands.
The paper's per-species values for the two datasets are reproduced below for reference. They illustrate that ESO's unprocessed datasets retain the full available bandwidth and use the original sampling rate, while the baseline's preprocessed datasets are low-pass filtered and then downsampled to a rate slightly above twice the cut-off frequency, the minimum the Nyquist criterion allows for the retained band.
| Field | Hainan gibbon | Thyolo Alethe | Pin-tailed Whydah |
|---|---|---|---|
| Recording rate (Hz) | 9 600 | 32 000 | 48 000 |
| Low-pass cut-off (Hz) | 2 000 | 3 100 | 9 000 |
| Downsample rate (Hz) | 4 800 | 6 400 | 18 400 |
| Segment duration (s) | 4 | 1 | 2 |
| n_fft | 1 024 | 1 024 | 1 024 |
| hop_length (preprocessed) | 256 | 256 | 256 |
| hop_length (unprocessed) | 256 | 256 | 512 |
| Preprocessed mel-spectrogram | 128 x 76 | 128 x 26 | 128 x 144 |
| Unprocessed mel-spectrogram | 128 x 151 | 128 x 126 | 128 x 188 |
For the Pin-tailed Whydah, the unprocessed hop_length is doubled to 512 to keep the spectrogram width comparable to the other datasets, given the higher original sampling rate. The same logic applies to any high-sample-rate dataset.
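The spectrogram widths in the table can be checked with a little arithmetic. The sketch below assumes centred framing, where the frame count is 1 + floor(n_samples / hop_length). The source does not state the framing convention, but this formula reproduces every width listed. The function and dictionary names are illustrative.

```python
def n_frames(sample_rate_hz, duration_s, hop_length):
    """Frame count under centred framing: 1 + floor(n_samples / hop_length)."""
    n_samples = int(sample_rate_hz * duration_s)
    return 1 + n_samples // hop_length


# Values copied from the table above.
datasets = {
    "Hainan gibbon": dict(rec=9600, cutoff=2000, down=4800, dur=4, hop_pre=256, hop_raw=256),
    "Thyolo Alethe": dict(rec=32000, cutoff=3100, down=6400, dur=1, hop_pre=256, hop_raw=256),
    "Pin-tailed Whydah": dict(rec=48000, cutoff=9000, down=18400, dur=2, hop_pre=256, hop_raw=512),
}

widths = {}
for name, d in datasets.items():
    # The downsample rate must be at least twice the low-pass cut-off.
    assert d["down"] >= 2 * d["cutoff"]
    widths[name] = (
        n_frames(d["down"], d["dur"], d["hop_pre"]),  # preprocessed width
        n_frames(d["rec"], d["dur"], d["hop_raw"]),   # unprocessed width
    )
    print(name, widths[name])
```

Under the same assumption, leaving the Pin-tailed Whydah hop_length at 256 would give 376 frames for the unprocessed segments, roughly twice the width of any other dataset.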
Class balancing
The presence and absence training sets are typically imbalanced. The paper uses three augmentation methods to expand the minority class, all applied proportionally.
| Method | Description |
|---|---|
| Time shift | Pick a time point inside the segment. Shift the segment so it starts there. Wrap the tail to the beginning so the duration is preserved. |
| Blend | Pick a minority-class segment and a negative-class segment. Combine them as \(\alpha x_{s1} + (1 - \alpha) x_{s2}\) with \(\alpha = 0.2\). |
| Noise | Add Gaussian noise (mean 0, std 1) scaled by 0.0009. |
For the splits used in the paper, the augmented training sets contain roughly equal numbers of presence and absence segments.
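The three methods can be sketched on a segment represented as a plain list of values. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the source does not say whether the paper applies these operations to the waveform or to the spectrogram, nor how the shift point is chosen.

```python
import random


def time_shift(segment, start):
    """Start the segment at index `start`; the samples before it wrap to the end."""
    return segment[start:] + segment[:start]


def blend(x_s1, x_s2, alpha=0.2):
    """Element-wise alpha * x_s1 + (1 - alpha) * x_s2."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(x_s1, x_s2)]


def add_noise(segment, scale=0.0009, rng=None):
    """Add Gaussian noise (mean 0, std 1) scaled by `scale` to every value."""
    rng = rng or random.Random()
    return [x + scale * rng.gauss(0.0, 1.0) for x in segment]


segment = [0.1, 0.2, 0.3, 0.4, 0.5]
other = [0.5, 0.4, 0.3, 0.2, 0.1]

shifted = time_shift(segment, start=2)   # duration is unchanged
mixed = blend(segment, other)            # alpha = 0.2 as in the table
noisy = add_noise(segment, rng=random.Random(0))
```

All three functions return a segment of the same length as their input, so augmented segments can pass through the same spectrogram step as the originals.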
Field reference
See PreprocessingConfig for the full list of fields and their defaults.