XIMENA FERNANDEZ
Durham University
Joint work with
WOJCIECH REISE (Paris-Saclay), MARIA DOMINGUEZ, MARIANO BEGUERISSE-DIAZ (Spotify) & HEATHER HARRINGTON (Oxford).
*Music Obfuscator by Ben Grosser (2015)
Given two audio recordings, identify whether they correspond to the same audio content.
Case study: Shazam (2003)
Let $\mathcal S$ denote the mel-spectrogram of an audio track $s:[0,T]\to \mathbb{R}$.
[Figure: time axis marked at $t_0, t_1, t_2, t_3, t_4, t_5, \dots$]
[Figure: two time axes, marked at $t_0, t_1, t_2, t_3, t_4, \dots$ and at $t'_0, t'_1, t'_2, t'_3, t'_4, \dots$]
For every homological dimension $d=0,1$, the $d$-Betti distance matrix $M_d$ between $s$ and $s'$ is defined as \begin{equation*} (M_d)_{i,j} = \Vert \beta_{i,d} - \beta'_{j,d} \Vert_{L^1}. \end{equation*}
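A minimal sketch of how such a matrix could be computed, assuming each $\beta_{i,d}$ is available as a curve sampled on a common uniform grid, so that the $L^1$ norm can be approximated by a Riemann sum. The function names, the `step` parameter and the toy curves are illustrative assumptions, not taken from the talk.

```python
def l1_distance(curve_a, curve_b, step=1.0):
    """Riemann-sum approximation of the L^1 distance between two sampled curves."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must be sampled on the same grid")
    return step * sum(abs(a - b) for a, b in zip(curve_a, curve_b))


def betti_distance_matrix(curves_s, curves_s_prime, step=1.0):
    """Entry (i, j) is the L^1 distance between curve i of s and curve j of s'."""
    return [
        [l1_distance(beta, beta_prime, step) for beta_prime in curves_s_prime]
        for beta in curves_s
    ]


# Toy curves for one homological dimension (made-up values, one list per window).
betti_s = [[1, 2, 2, 1], [3, 1, 0, 0]]
betti_s_prime = [[1, 2, 2, 1], [2, 2, 1, 0], [0, 0, 0, 0]]

M = betti_distance_matrix(betti_s, betti_s_prime)
for row in M:
    print(row)
```

In this toy input the first window of each track has the same curve, so the entry `M[0][0]` is zero.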
We summarize the distance between every pair of windows $W_i$ and $W_j'$ as \begin{equation*} C_{i,j} = \lambda (M_0)_{i,j} + (1-\lambda) (M_1)_{i,j} \end{equation*} for a parameter $0\leq \lambda\leq 1$.
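The convex combination above can be sketched as follows, assuming $M_0$ and $M_1$ are given as nested lists of equal shape. The function name `combined_cost` and the toy matrices are hypothetical and only illustrate the formula.

```python
def combined_cost(m0, m1, lam=0.5):
    """Return C = lam * M_0 + (1 - lam) * M_1, computed entry by entry."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(m0) != len(m1) or any(len(r0) != len(r1) for r0, r1 in zip(m0, m1)):
        raise ValueError("M_0 and M_1 must have the same shape")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(m0, m1)
    ]


# Toy distance matrices (made-up values).
M0 = [[0.0, 4.0], [2.0, 0.0]]
M1 = [[2.0, 0.0], [6.0, 4.0]]

C = combined_cost(M0, M1, lam=0.25)
print(C)
```

Setting `lam=1.0` uses only the dimension-0 distances, and `lam=0.0` only the dimension-1 distances.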
For $m\geq 1$, compute $\bar t'_{j_i} = \mathrm{median} \{t'_{j_{i-m}},\dots, t'_{j_{i-1}}, t'_{j_i}, t'_{j_{i+1}}, \dots, t'_{j_{i+m}}\}$, the moving median at $t'_{j_i}$. Consider $\bar P=\{( t_{i}, \bar t'_{j_i}): i =1,\dots,k\}$.
For $m\geq 1$, compute $\bar t'_{j_i} = \mathrm{median} \{t'_{j_{i-m}}, t'_{j_{i-m+1}}, \dots, t'_{j_{i-1}}, t'_{j_i}\}$, the moving median at $t'_{j_i}$. Consider $\bar P=\{( t_{i}, \bar t'_{j_i}): i =1,\dots,k\}$.
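A small sketch of the trailing moving median over the matched times, using only the Python standard library. The boundary handling is an assumption: for the first $m$ points, where fewer than $m$ predecessors exist, the window is simply truncated. The function name and the toy sequence are illustrative.

```python
from statistics import median


def trailing_moving_median(values, m):
    """Median of each value together with its (up to) m predecessors."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # Assumption: near the start, the window is truncated to the available points.
    return [median(values[max(0, i - m): i + 1]) for i in range(len(values))]


# Toy matched times with one outlier (9.0) breaking the increasing trend.
matched_times = [0.0, 1.0, 9.0, 3.0, 4.0, 5.0]

smoothed = trailing_moving_median(matched_times, m=2)
print(smoothed)
```

On this toy input the outlier no longer appears in the smoothed sequence, which is the reason for taking a median rather than a mean.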
We assess the increasing functional dependency of the points in $\bar P$ via the Pearson correlation \begin{equation} \rho_{\bar P} = \mathrm{Pearson}\{(t_i), (\bar{t}'_{j_i})\}. \end{equation}
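For illustration, the Pearson coefficient between the two coordinate sequences can be computed directly from its definition. This is a generic sketch with made-up inputs, not the authors' code; a value close to 1 indicates that the smoothed matched times increase roughly linearly with $t_i$.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data: window times of s against smoothed matched times in s'.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
aligned = [10.0, 11.0, 12.0, 13.0, 14.0]   # same content, shifted by 10
shuffled = [12.0, 10.0, 14.0, 11.0, 13.0]  # no consistent alignment

rho_aligned = pearson(t, aligned)
rho_shuffled = pearson(t, shuffled)
print(rho_aligned, rho_shuffled)
```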
Music Obfuscator by Ben Grosser
Song | Shazam (60 sec) |
---|---|
Smells Like Teen Spirit | No |
Get Lucky | No |
Giant Steps | No |
Stairway to Heaven | Yes |
Headlines | Yes |
Blue in Green | No |
You’re Gonna Leave | No |
Blue Ocean Floor | No |
Music Obfuscator by Ben Grosser
Song | Shazam (60 sec) | Correlation (30-30 sec) | Correlation (60-30 sec) |
---|---|---|---|
Smells Like Teen Spirit | No | 0.98960 | 0.73208 |
Get Lucky | No | 0.99819 | 0.99906 |
Giant Steps | No | 0.97728 | 0.83904 |
Stairway to Heaven | Yes | 0.99802 | 0.68533 |
Headlines | Yes | 0.96832 | 0.91173 |
Blue in Green | No | 0.99970 | 0.59276 |
You’re Gonna Leave | No | 0.98778 | 0.71766 |
Blue Ocean Floor | Yes | 0.93724 | 0.51332 |
Spotify Database + PySOX Transformer
Obfuscation type | Degree |
---|---|
Low Pass Filter | 200, 400, 800, 1600, 2000 |
High Pass Filter | 50, 100, 200, 400, 800, 1200 |
White Noise | 0.05, 0.10, 0.20, 0.40 |
Pink Noise | 0.05, 0.10, 0.20, 0.40 |
Reverberation | 25, 50, 75, 100 |
Tempo | 0.50, 0.80, 1.05, 1.2, 1.50, 2.00 |
Pitch | -4, -2, 2, 4, 8 |
Spotify Database + PySOX Transformer
obfuscation type | degree | accuracy |
---|---|---|
highpass | 50 | 0.996 |
highpass | 100 | 0.612 |
highpass | 200 | 0.184 |
highpass | 400 | 0.032 |
highpass | 800 | 0.004 |
highpass | 1200 | 0.000 |
lowpass | 200 | 0.272 |
lowpass | 400 | 0.660 |
lowpass | 800 | 0.896 |
lowpass | 1600 | 0.964 |
lowpass | 2000 | 0.976 |
pinknoise | 0.05 | 1.000 |
pinknoise | 0.10 | 0.964 |
pinknoise | 0.20 | 0.872 |
pinknoise | 0.40 | 0.668 |
whitenoise | 0.05 | 0.984 |
whitenoise | 0.10 | 0.964 |
whitenoise | 0.20 | 0.804 |
whitenoise | 0.40 | 0.416 |
reverb | 25 | 0.996 |
reverb | 50 | 0.956 |
reverb | 75 | 0.736 |
reverb | 100 | 0.208 |
tempo | 0.50 | 0.128 |
tempo | 0.80 | 0.732 |
tempo | 1.05 | 0.932 |
tempo | 1.20 | 0.924 |
tempo | 1.50 | 0.892 |
tempo | 2.00 | 0.976 |
pitch | -4.00 | 0.724 |
pitch | -2.00 | 0.936 |
pitch | 2.00 | 0.936 |
pitch | 4.00 | 0.836 |
pitch | 8.00 | 0.472 |
Given an audio track $s$ and a database $\mathcal D$, identify an element $s'\in \mathcal D$ with the same audio content as $s$.