Using knowledge graphs for audio retrieval: a case study on copyright infringement detection

World Wide Web (2024) 27:37
https://doi.org/10.1007/s11280-024-01277-0

Marco Montanaro1 · Antonio Maria Rinaldi1 · Cristian Tommasino1,2 · Cristiano Russo1

Received: 23 December 2023 / Revised: 5 April 2024 / Accepted: 14 May 2024
© The Author(s) 2024

Abstract

Identifying cases of intellectual property violation in multimedia files poses significant challenges for the Internet infrastructure, especially when dealing with extensive document collections. Typically, techniques used to tackle such issues can be categorized into either of two groups: proactive and reactive approaches. This article introduces an approach combining both proactive and reactive solutions to remove illegal uploads on a platform while preventing legal uploads or modified versions of audio tracks, such as parodies, remixes or further types of edits. To achieve this, we have developed a rule-based focused crawler specifically designed to detect copyright infringement on audio files coupled with a visualization environment that maps the retrieved data on a knowledge graph to represent information extracted from audio files. Our system automatically scans multimedia files that are uploaded to a public collection when a user submits a search query, performing an audio information retrieval task only on files deemed legal. We present experimental results obtained from tests conducted by performing user queries on a large music collection, a subset of 25,000 songs and audio snippets obtained from the Free Music Archive library. The returned audio tracks have an associated Similarity Score, a metric we use to determine the quality of the adversarial searches executed by the system. We then proceed with discussing the effectiveness and efficiency of different settings of our proposed system.
Keywords Web crawling · Audio retrieval · Information retrieval · Deep neural networks · Knowledge graphs

Marco Montanaro, Antonio Maria Rinaldi, Cristiano Russo and Cristian Tommasino contributed equally to this work. Extended author information available on the last page of the article.

1 Introduction

The internet has made sharing multimedia content across devices far easier and more efficient. However, it frequently fails to prioritize certain fundamental aspects of data. These include the handling of copyright-protected files, whose sharing is challenging to monitor due to the absence of comprehensive and collectively followed Intellectual Property protection laws [1], as well as the complexity of overseeing vast amounts of data transmitted in a decentralized manner.

One effective strategy to address Intellectual Property infringement is to incorporate a digital signature or watermark [2–4] into multimedia files using unique encryption keys. Incorporating a digital signature into multimedia files presents a set of challenges that span both methodological and technical considerations. A pivotal obstacle lies in the development of robust algorithms capable of securely embedding watermarks without compromising the integrity of multimedia content. Crucially, visible watermarks should minimally impact user experience, whereas invisible digital signatures must be resilient against diverse forms of manipulation. Compatibility poses another hurdle, as different file formats and compression methods may react differently to watermark embedding techniques. Practical challenges involve the delicate balance between copyright protection and user experience: intrusive watermarks might deter users from engaging with the content, while insufficiently secure signatures may fail to safeguard intellectual property.
Our approach involves utilizing the original data of protected multimedia files while preserving the integrity of the original content. It selectively stores information related to indexing audio files, leaving the core data untouched. To achieve this, we employ a focused Web crawler to gather data pertaining to multimedia files hosted on the internet. Focused crawlers [5] are specialized Web crawlers developed to monitor specific topics or segments of the Web. During their crawling process, they selectively filter pages by identifying content relevant to predefined topics in accordance with their On-Line Selection Policy. This enables them to meticulously analyze particular segments of the Web. Our version of the crawler operates in the multimodal content-based information retrieval domain, which encompasses searching for information in multimedia data formats such as audio files, typed text or the metadata contained within a file [6]. The primary objective of our crawler is to ascertain whether the retrieved set of documents is relevant to the end user's objectives while ensuring that the obtained data does not contain any potentially illegal document. Because we are focused on detecting illegal content, our crawler needs a way to distinguish between legal and illegal uploads. To achieve this, it utilizes deep neural networks (DNNs) to filter out audio tracks that do not comply with copyright laws, referencing a collection of copyrighted tracks as a benchmark. Intellectual property laws present numerous challenges due to their typically vague legal language [1] (see footnote 1), necessitating the exploration of less stringent methods for verifying online data integrity. This task can be likened to a recommendation task [7] where illegal content is "recommended" to a legality verification bot by finding similarity with a reference collection of illicit documents.
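The idea of filtering uploads by their similarity to a reference collection can be illustrated with a minimal sketch. It is not the system described in this article: it assumes each track has already been reduced to a fixed-length feature vector (in our setting such features would come from the DNNs), and the names `Track`, `partition_uploads` and the threshold value of 0.9, as well as the choice of cosine similarity, are hypothetical and used for illustration only.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Track:
    name: str
    features: list[float]  # hypothetical fixed-length audio feature vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity(track: Track, reference: list[Track]) -> float:
    """Highest similarity between a track and any track in the reference collection."""
    return max(
        (cosine_similarity(track.features, ref.features) for ref in reference),
        default=0.0,
    )


def partition_uploads(
    uploads: list[Track], reference: list[Track], threshold: float = 0.9
) -> tuple[list[Track], list[Track]]:
    """Split uploads into (legal, flagged) using an assumed similarity threshold."""
    legal, flagged = [], []
    for track in uploads:
        if max_similarity(track, reference) >= threshold:
            flagged.append(track)  # too close to a protected reference track
        else:
            legal.append(track)
    return legal, flagged


# Toy usage with made-up two-dimensional feature vectors.
reference = [Track("protected_song", [1.0, 0.0])]
uploads = [
    Track("near_copy", [0.99, 0.05]),
    Track("unrelated", [0.0, 1.0]),
]
legal, flagged = partition_uploads(uploads, reference)
```

In this toy setting, an upload is flagged when its best match in the reference collection reaches the threshold; a real deployment would need to tune that threshold empirically and would likely use learned similarity rather than plain cosine distance.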
Additionally, the focused crawler implemented for this paper engages in adversarial information retrieval, meaning that only a portion of the retrieved data is shown to the user as a response to their query, hiding part of the result set. This is beneficial both to the system administrators, as the automatic detection of potentially compromising content frees them from the obligation of manually checking every IP violation, and to the rights holders, as they only need to provide their legitimate copy of the audio track to the repository to prevent illegal uploads from showing up in the result set. Another crucial facet of our approach pertains to its capacity for results obfuscation: rather than simply removing illegal files, our system employs a mechanism to retain an abstract representation of the pertinent subset of files that must be concealed from the user [8]. All of the retrieved files violating copyright law are then utilized to enhance the detection capabilities of the DNNs, allowing us to identify more sophisticated modifications of legal audio data. Consequently, server administrators overseeing extensive collections of audio files and (...truncated)

1 http://www.oecd.org/sti/ieconomy/KBC2-IP.Final.pdf