Studies comparing acoustic signals often rely on pixel-wise differences between spectrograms, for example the mean squared error (MSE). Pixel-wise errors do not reflect perceptual sensitivity, however, and can be highly sensitive to small local signal changes that may be imperceptible. In computer vision, high-level visual features extracted with convolutional neural networks (CNNs) can be used to quantify the fidelity of computer-generated images. Here, we propose the auditory perceptual distance (APD), a metric based on acoustic features extracted with an unsupervised CNN and validated against perceptual behavior. Using complex vocal signals from songbirds, we trained a Siamese CNN on a self-supervised task with spectrograms rescaled to match the auditory frequency sensitivity of European starlings, Sturnus vulgaris. We define APD for any pair of sounds as the cosine distance between the corresponding feature vectors extracted by the trained CNN. We show that APD is more robust to temporal and spectral translation than MSE and captures the sigmoidal shape of typical behavioral psychometric functions over complex acoustic spaces. When fine-tuned on starlings' behavioral judgments of naturalistic song syllables, the APD model yields even more accurate predictions of perceptual sensitivity, discrimination, and categorization along novel complex (high-dimensional) acoustic dimensions, including diverging decisions for identical stimuli following different training conditions. Thus, the APD model outperforms MSE in robustness and perceptual accuracy, and can be tuned to match experience-dependent perceptual biases.
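
As a rough illustration (not the authors' implementation), the APD between two sounds reduces to the cosine distance between their CNN embeddings. The sketch below assumes the feature vectors have already been extracted by the trained Siamese CNN; the function and variable names, and the embedding dimensionality, are hypothetical.

```python
import numpy as np

def auditory_perceptual_distance(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine distance between two CNN feature vectors.

    feat_a, feat_b: embeddings of two spectrograms, assumed to come from
    the trained Siamese CNN (the extraction pipeline is not shown here).
    Returns a value in [0, 2]; 0 means identical direction in feature space.
    """
    cos_sim = np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return 1.0 - cos_sim

# Hypothetical usage with placeholder 128-dimensional embeddings
f1 = np.random.rand(128)
f2 = np.random.rand(128)
print(auditory_perceptual_distance(f1, f2))
```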