September 13, 2023 | Research

Zero-Shot Audio Captioning via Audibility Guidance

This paper introduces a novel zero-shot method for audio captioning using audibility guidance, demonstrating improved caption quality.

The paper "Zero-Shot Audio Captioning via Audibility Guidance", by Tal Shaharabany, Ariel Shaulov, and Lior Wolf, proposes a new approach to audio captioning, a task that resembles image and video captioning but has been explored far less.

The authors propose three guiding principles for audio captioning: fluency, faithfulness to the audio input, and audibility, which is the quality of being perceivable based solely on audio. The method is zero-shot: no captioning model is trained, and captions are produced purely at inference time. The inference process draws on three networks, one for each of the three qualities.

First, a large language model, GPT-2, keeps the generated text fluent. GPT-2 is built from stacked Transformer layers whose attention lets each token interact with the others, which helps it produce fluent, coherent sentences.

Second, a multimodal matching network, ImageBind, provides a matching score between an audio file and a text by aligning the representation of the audio segment with that of its textual counterpart. ImageBind is trained primarily to align video frames with the soundtrack of the same video.
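To make the idea of a matching score concrete, here is a minimal sketch. It assumes the score is a cosine similarity between an audio embedding and a text embedding, a common choice for joint-embedding models; the summary above does not say which similarity the paper uses. The vectors are hand-written stand-ins, not real ImageBind outputs, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Assumption: toy embeddings standing in for the audio and text encoders.
audio_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a dog barks": [0.8, 0.2, 0.4],
    "a red car": [0.1, 0.9, 0.2],
}

# Score every candidate caption against the audio clip and keep the best one.
scores = {
    text: cosine_similarity(audio_embedding, embedding)
    for text, embedding in caption_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

In a real pipeline the embeddings would come from the model's audio and text encoders; only the comparison step is illustrated here.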

Third, a text classifier is trained on an automatically collected dataset. The dataset consists of captions generated by ChatGPT, with prompts designed to elicit both audible and inaudible sentences. The classifier, based on the DistilBERT network, considers factors such as the coherence of words, grammatical correctness, context, and the likelihood that the sentence represents a meaningful auditory scenario.
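One way to picture how three such signals could interact is a simple weighted reranking of candidate captions. The sketch below is an illustration only: this summary does not describe the paper's actual inference procedure, and the function names, weights, and per-candidate scores are all hypothetical values chosen by hand.

```python
def combined_score(fluency, faithfulness, audibility, w_faith=1.0, w_aud=1.0):
    """Weighted sum of the three guidance signals (hypothetical weighting)."""
    return fluency + w_faith * faithfulness + w_aud * audibility


def rank(candidates, w_faith, w_aud):
    """Return candidate captions sorted from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda text: combined_score(
            **candidates[text], w_faith=w_faith, w_aud=w_aud
        ),
        reverse=True,
    )


# Assumption: made-up scores. "fluency" stands in for a language-model
# log-probability, "faithfulness" for an audio-text matching score, and
# "audibility" for a classifier probability that the sentence is audible.
candidates = {
    "a dog barks loudly": {"fluency": -2.0, "faithfulness": 0.9, "audibility": 0.95},
    "a brown dog sits": {"fluency": -1.5, "faithfulness": 0.7, "audibility": 0.10},
}

without_audibility = rank(candidates, w_faith=1.0, w_aud=0.0)
with_audibility = rank(candidates, w_faith=1.0, w_aud=1.0)
print(without_audibility[0])
print(with_audibility[0])
```

With these toy numbers, the caption describing a visual property ranks first when the audibility term is switched off, and the caption describing a sound ranks first once it is switched on.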

The experiments show that adding audibility guidance significantly improves performance over a baseline method that lacks this objective. On the AudioCaps dataset, the pipeline improves on the baseline across a wide range of parameter values, as measured by BLEU, METEOR, CIDEr, and SPICE.
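For readers unfamiliar with these metrics, the following is a minimal sketch of the simplest one, BLEU restricted to unigrams with a single reference caption. It is a simplification for illustration: standard BLEU combines several n-gram orders and supports multiple references, and this is not the evaluation code used in the paper. The function name and example sentences are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty (single reference)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    precision = clipped / len(cand_tokens)
    # Candidates shorter than the reference are penalized.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity_penalty * precision


exact_match = bleu1("a dog barks", "a dog barks")
partial_match = bleu1("a cat meows", "a dog barks")
print(exact_match, partial_match)
```

The clipping step is what keeps a caption from scoring well by repeating one correct word many times.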

The paper opens up new avenues in audio captioning by showing that audibility guidance can improve the quality of generated captions. Because the method is zero-shot, it also offers a way to approach the task without learning the captioning itself.

Read the full paper here: http://arxiv.org/abs/2309.03884v1
