ARCHITECTURE OF CNN-TRANSFORMER HYBRID WITH MASKED TIME SERIES AUTO-CODING FOR BEHAVIORAL BIOMETRICS ON MOBILE DEVICES

SYSTEM ANALYSIS AND DATA SCIENCE

DOI: https://doi.org/10.20535/kpisn.2025.4.344357
UDC 004.032.26:004.93

M.P. Havrylovych*
Igor Sikorsky Kyiv Polytechnic Institute, Kyiv, Ukraine
* Corresponding author

ARCHITECTURE OF HYBRID CNN-TRANSFORMER WITH MASKED TIME SERIES AUTO-CODING FOR BEHAVIOURAL BIOMETRICS ON MOBILE DEVICES

Background. Continuous behavioural authentication (keystroke dynamics, touch/swipe, motion sensors) verifies identity without requiring extra actions from the user. However, models degrade under device, session and activity shifts, are sensitive to noise and often require substantial labelled data. As passwordless logins spread, demand rises for post-login risk control and for models that are robust, compute-efficient and stable in real-world conditions.

Objective. The paper aims to develop and empirically study a compact CNN-Transformer hybrid with lightweight self-supervised masked time-series autoencoding (MAE-style) for mobile behavioural biometrics on the HMOG and WISDM datasets.

Methods. A 1D-CNN front end extracts local cues from smartphone motion signals, while a Transformer encoder captures longer-range dependencies. We use masked reconstruction on unlabelled HMOG sessions for self-supervised pretraining under a limited computational budget and then fine-tune the same hybrid architecture for user identification. We evaluate three hybrid variants on HMOG (trained from scratch, with masked pretraining, and with masked pretraining plus CORAL domain adaptation) and three models on WISDM (a Transformer baseline, a hybrid trained from scratch and a hybrid initialised from the HMOG-pretrained weights). Performance is measured per user and reported as the mean and median Equal Error Rate (EER) and AUC across users.

Results. On HMOG, the hybrid model trained from scratch achieves the best user-level metrics (EER 21.51 % mean, 18.63 % median; AUC 0.854 mean, 0.905 median), while the lightweight MAE and CORAL variants do not yet surpass this baseline.
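
The abstract does not state how the user-level EER figures were computed. As a minimal sketch only, assuming each user has a set of genuine and impostor similarity scores where a higher score means "more likely the legitimate user", the per-user EER and its mean and median across users could be approximated as follows. The function names `equal_error_rate` and `user_level_summary` and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, median


def equal_error_rate(genuine, impostor):
    """Approximate EER for one user (assumption: higher score = more genuine)."""
    thresholds = sorted(set(genuine) | set(impostor))
    # Add one threshold above every score so "reject everything" is also tried.
    thresholds.append(thresholds[-1] + 1.0)
    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance rate: impostor scores accepted at threshold t.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine scores rejected at threshold t.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        # Keep the operating point where FAR and FRR are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def user_level_summary(scores_by_user):
    """Return (mean EER, median EER) over users.

    scores_by_user maps a user id to a (genuine_scores, impostor_scores) pair.
    """
    eers = [equal_error_rate(g, i) for g, i in scores_by_user.values()]
    return mean(eers), median(eers)


# Toy scores for illustration only; these are not data from HMOG or WISDM.
toy_scores = {
    "user_a": ([0.8, 0.9], [0.1, 0.2]),
    "user_b": ([0.4, 0.6, 0.8, 0.9], [0.1, 0.2, 0.3, 0.7]),
}
mean_eer, median_eer = user_level_summary(toy_scores)
```

Reporting both statistics is informative because the mean is sensitive to a few hard-to-verify users, whereas the median is not; a median below the mean, as in the figures reported here, is what one would expect when a minority of users have much higher error rates.
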
On WISDM, the hybrid model substantially outperforms a pure Transformer baseline (EER 9.41 % vs 51.25 % mean; AUC 0.902 vs 0.488 mean), and cross-dataset initialisation from the HMOG MAE-pretrained weights provides an additional improvement (EER 8.42 % mean, 2.07 % median; AUC 0.907 mean, 0.959 median).

Conclusions. The results indicate that a compact CNN-Transformer hybrid is effective for sensor-based mobile behavioural biometrics and that even lightweight masked pretraining can be helpful for cross-dataset transfer. At the same time, the benefits of MAE and CORAL on HMOG depend strongly on the pretraining budget and masking configuration, suggesting that further tuning is needed to fully exploit self-supervised pretraining in this setting.

Keywords: behavioural biometrics; continuous authentication; smartphone sensors; CNN-Transformer hybrid; masked autoencoding; self-supervised pretraining; domain adaptation.

Suggested citation: M.P. Havrylovych, “Architecture of hybrid CNN-transformer with masked time series auto-coding for behavioural biometrics on mobile devices”, KPI Science News, No. 4, pp. 55–62, 2025. doi: https://doi.org/10.20535/kpisn.2025.4.344357
© The Author(s). The article is distributed under the terms of the license CC BY 4.0.

Introduction

The widespread use of smartphones and wearables has turned them into primary access points for services, including systems that explicitly explore behavioural biometrics on everyday activities [1] and continuous sensing on smartphones [2, 3]. This, in turn, imposes stricter requirements on the underlying information security mechanisms. Traditional one-shot authentication methods such as passwords, PIN codes and fingerprint scans verify the user only at login time. Once a device is unlocked, anyone who physically gains access to it can continue to work under the legitimate user’s identity. This is particularly critical when smartphones are used to access financial services, corporate resources and personal communications, as demonstrated both in general-purpose smartphone biometrics [1, 2, 3] and in our earlier work on continuous authentication for security-critical services [4].

Continuous behavioural authentication offers an alternative paradigm: the user’s identity is verified in the background throughout device usage, based on behavioural signals [1, 2, 3]. These signals include keystroke dynamics on the virtual keyboard, which have been extensively reviewed for both fixed-text and free-text scenarios [8, 9] and modelled with deep networks on desktop and mobile platforms [10, 11], as well as touch/swipe patterns and inertial sensor data such as accelerometer and gyroscope signals, which underpin smartphone and smartwatch biometrics [1, 2, 3]. Together, they form a behavioural “fingerprint” that can be used to distinguish one user from others without requiring explicit re-authentication. This class of methods is closely related to behavioural biometrics and continuous authentication frameworks used for post-login risk control in high-stakes applications [1, 4].

However, building robust behavioural biometric models is challenging. Unlike static biometrics, behavioural patterns are highly context-dependent. They vary with posture (sitting, standing, walking), activity, device model and UI layout, and can also drift over time. Sensor data is noisy and often contains missing values. Changes in hardware, operating system version or user habits can cause domain shifts that degrade the performance of models trained on earlier data.
Collecting large labelled datasets per user is expensive and often impractical, especially at scale, a limitation repeatedly highlighted in smartphone and sensor-based continuous authentication studies [1, 2, 3] and confirmed in our own experiments on motion-based verification and wearable sensing [4].

Recent advances in deep learning, particularly convolutional and recurrent architectures in continuous authentication [4, 6] and Transformer-based models for keystroke and time-series data [10, 11, 13, 14], have significantly improved the state of the art in signal and sequence modelling. CNNs are effective at capturing local patterns and invariances, while Transformers use self-attention to model long-range dependencies. In parallel, self-supervised learning methods such as masked autoencoders (MAE) have demonstrated that useful representations can be learned from large unlabelled datasets by reconstructing masked parts of the input [13, 14]. Despite these advances, many mobile behavioural biometric systems still rely on purely convolutional or recurrent architectures [1, 2, 11] or on traditional keystroke pipelines surveyed in [ (...truncated)