Using knowledge graphs for audio retrieval: a case study on copyright infringement detection
World Wide Web
(2024) 27:37
https://doi.org/10.1007/s11280-024-01277-0
Marco Montanaro1 · Antonio Maria Rinaldi1 · Cristiano Russo1 · Cristian Tommasino1,2
Received: 23 December 2023 / Revised: 5 April 2024 / Accepted: 14 May 2024
© The Author(s) 2024
Abstract
Identifying cases of intellectual property violation in multimedia files poses significant challenges for the Internet infrastructure, especially when dealing with extensive document
collections. Techniques used to tackle such issues typically fall into one
of two groups: proactive and reactive approaches. This article introduces an approach combining both proactive and reactive solutions to remove illegal uploads from a platform while
preventing the removal of legal uploads or modified versions of audio tracks, such as parodies, remixes,
or other types of edits. To achieve this, we have developed a rule-based focused crawler
specifically designed to detect copyright infringement in audio files, coupled with a visualization environment that maps the retrieved data onto a knowledge graph to represent information
extracted from audio files. Our system automatically scans multimedia files that are uploaded
to a public collection when a user submits a search query, performing an audio information
retrieval task only on files deemed legal. We present experimental results obtained from tests
conducted by performing user queries on a large music collection, a subset of 25,000 songs
and audio snippets obtained from the Free Music Archive library. The returned audio tracks
have an associated Similarity Score, a metric we use to determine the quality of the adversarial searches executed by the system. We then discuss the effectiveness
and efficiency of different settings of our proposed system.
Keywords Web crawling · Audio retrieval · Information retrieval · Deep neural networks ·
Knowledge graphs
1 Introduction
The internet has made sharing multimedia content across devices nearly effortless.
However, it often neglects certain fundamental aspects of data handling. One of these is the sharing of copyright-protected files, which is
challenging to monitor due to the absence of comprehensive and universally adopted Intellectual Property protection laws [1], as well as the complexity of overseeing vast amounts
Marco Montanaro, Antonio Maria Rinaldi, Cristiano Russo and Cristian Tommasino contributed equally to
this work
Extended author information available on the last page of the article
of data transmitted in a decentralized manner. One effective strategy for addressing Intellectual
Property infringement is to embed a digital signature or watermark [2–4] into multimedia files using unique encryption keys. However, incorporating a digital signature into multimedia files
presents a set of challenges that span both methodological and technical considerations. A
pivotal obstacle lies in the development of robust algorithms capable of securely embedding
watermarks without compromising the integrity of multimedia content. Crucially, visible
watermarks should minimally impact user experience, whereas invisible digital signatures
necessitate resilience against diverse forms of manipulation. Moreover, compatibility poses
another hurdle, as different file formats and compression methods may react differently to
watermark embedding techniques. Practical challenges involve the delicate balance between
copyright protection and user experience. Intrusive watermarks might deter users from engaging with the content, while insufficiently secure signatures may fail to safeguard intellectual
property. In contrast, our approach uses the original data of protected multimedia files while
preserving the integrity of the original content. It stores only the information needed for
indexing audio files, leaving the core data untouched. To achieve this, we employ a focused
Web crawler to gather data pertaining to multimedia files hosted on the internet. Focused
crawlers [5] are specialized Web crawlers developed to monitor specific topics or segments
of the Web. During their crawling process, they have the capability to selectively filter pages
by identifying content relevant to predefined topics in accordance with their On-Line Selection Policy. This enables them to meticulously analyze particular segments of the Web. Our
version of the crawler operates in the multimodal content-based information retrieval domain,
which encompasses searching for information in multimedia data formats such as audio files,
typed text, or the metadata contained within a file [6]. The primary objective of our crawler
is to ascertain whether the retrieved set of documents is relevant to the end user’s objectives while ensuring that the obtained data does not contain any potentially illegal documents.
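The selective behaviour of a focused crawler described above can be illustrated with a minimal sketch. It is not the crawler developed for this work: the in-memory `TOY_WEB` graph, the `mentions_audio` keyword rule, and the function name `focused_crawl` are hypothetical stand-ins chosen for illustration, and a real selection policy would be considerably richer. The sketch only shows the core idea that off-topic pages are neither kept nor expanded.

```python
from collections import deque

# Hypothetical toy "web": page id -> page text and outgoing links.
TOY_WEB = {
    "a": {"text": "free music archive mp3 tracks", "links": ["b", "c"]},
    "b": {"text": "cooking recipes", "links": ["d"]},
    "c": {"text": "audio remix collection", "links": ["d", "a"]},
    "d": {"text": "live audio recordings", "links": []},
}

# Hypothetical topic vocabulary used by the rule below.
AUDIO_TERMS = {"audio", "music", "mp3"}


def mentions_audio(page):
    """Rule-based relevance test: does the page text contain a topic term?"""
    return bool(AUDIO_TERMS & set(page["text"].split()))


def focused_crawl(pages, seed, is_relevant, max_pages=100):
    """Breadth-first crawl that keeps and expands only relevant pages."""
    frontier = deque([seed])
    seen = {seed}
    kept = []
    while frontier and len(kept) < max_pages:
        url = frontier.popleft()
        page = pages.get(url)
        if page is None or not is_relevant(page):
            continue  # off-topic page: drop it and do not follow its links
        kept.append(url)
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept


result = focused_crawl(TOY_WEB, "a", mentions_audio)
print(result)
```

In this toy graph, page `b` is discarded by the rule, yet page `d` is still reached through the relevant page `c`.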
Because our focus is on detecting illegal content, the crawler needs a way to distinguish
between legal and illegal uploads. To achieve this, it utilizes deep neural networks (DNNs)
to filter out audio tracks that do not comply with copyright laws, referencing a collection of
copyrighted tracks as a benchmark. Intellectual property laws present numerous challenges
due to their typically vague legal language¹ [1], necessitating the exploration of less stringent
methods for verifying online data integrity. This task can be likened to a recommendation
task [7] where illegal content is “recommended” to a legality verification bot by finding
similarity with a reference collection of illicit documents. Additionally, the focused crawler
implemented for this paper engages in adversarial information retrieval, meaning that only a
portion of the retrieved data is shown to the user as a response to their query, hiding part of the
result set. This is beneficial both to the system administrators, as the automatic detection of
potentially compromising content frees them from the obligation of manually checking every
IP violation, and to the rights holders, as they only need to provide their legitimate copy of
the audio track to the repository to prevent illegal uploads from showing up in the result set.
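As a rough illustration of hiding part of a result set based on similarity to a reference collection, the sketch below compares feature vectors with cosine similarity and splits the results into shown and hidden items. All specifics are assumptions made for illustration: the hand-written vectors stand in for features that the system would obtain from its DNNs, cosine similarity is a swapped-in measure and not necessarily the Similarity Score used in this work, and the `threshold` value of 0.95 is arbitrary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def partition_results(results, reference, threshold=0.95):
    """Split results into (shown, hidden) by best match against the reference.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    shown, hidden = [], []
    for name, vec in results.items():
        score = max(
            (cosine_similarity(vec, ref) for ref in reference.values()),
            default=0.0,
        )
        target = hidden if score >= threshold else shown
        target.append((name, score))
    return shown, hidden


# Hypothetical feature vectors; real ones would come from a trained network.
REFERENCE = {"protected_song": [1.0, 0.0, 2.0]}
RESULTS = {
    "reupload": [2.0, 0.0, 4.0],   # same direction as the protected track
    "unrelated": [0.0, 3.0, 0.0],  # orthogonal to the protected track
}

shown, hidden = partition_results(RESULTS, REFERENCE)
print("shown:", [name for name, _ in shown])
print("hidden:", [name for name, _ in hidden])
```

Hidden items are returned rather than dropped, so they remain available to the system even though the user never sees them.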
Another crucial facet of our approach is its capacity for results obfuscation: rather
than simply removing illegal files, our system retains an abstract representation of the subset of files that must be concealed from the user [8]. All
of the retrieved files violating copyright law are then used to enhance the detection capabilities of the DNNs, allowing us to identify more sophisticated modifications of legal audio
data. Consequently, server administrators overseeing extensive collections of audio files and
1 http://www.oecd.org/sti/ieconomy/KBC2-IP.Final.pdf
(...truncated)