SYSTEM ANALYSIS AND DATA SCIENCE
DOI: https://doi.org/10.20535/kpisn.2025.4.344357
UDC 004.032.26:004.93
M.P. Havrylovych
Igor Sikorsky Kyiv Polytechnic Institute, Kyiv, Ukraine
Corresponding author:
*
ARCHITECTURE OF HYBRID CNN-TRANSFORMER WITH MASKED TIME
SERIES AUTO-CODING FOR BEHAVIOURAL BIOMETRICS ON MOBILE DEVICES
Background. Continuous behavioural authentication (keystroke dynamics, touch/swipe, motion sensors) verifies identity
without extra actions. However, models degrade under device, session and activity shifts, are sensitive to noise and often
require significant labelling. As passwordless logins spread, demand rises for post-login risk control and for models that
are robust, compute-efficient and stable in real-world conditions.
Objective. The paper aims to develop and empirically study a compact CNN-Transformer hybrid with lightweight
self-supervised masked time-series autoencoding (MAE-style) for mobile behavioural biometrics on the HMOG and
WISDM datasets.
Methods. A 1D-CNN front end extracts local cues from smartphone motion signals, while a Transformer encoder
captures longer-range dependencies. We use masked reconstruction on unlabelled HMOG sessions for self-supervised
pretraining under a limited computational budget and then fine-tune the same hybrid architecture for user identification. We evaluate three hybrid variants on HMOG (trained from scratch, with masked pretraining, and with masked
pretraining plus CORAL domain adaptation) and three models on WISDM (a Transformer baseline, a hybrid trained
from scratch and a hybrid initialised from the HMOG-pretrained weights). Performance is measured by the
mean and median of the per-user Equal Error Rate (EER) and AUC.
Results. On HMOG, the hybrid model trained from scratch achieves the best user-level metrics (EER 21.51 % mean,
18.63 % median; AUC 0.854 mean, 0.905 median), while the lightweight MAE and CORAL variants do not yet surpass
this baseline. On WISDM, the hybrid model substantially outperforms a pure Transformer baseline (EER 9.41 % vs
51.25 % mean; AUC 0.902 vs 0.488 mean), and cross-dataset initialisation from the HMOG MAE-pretrained weights
provides an additional improvement (EER 8.42 % mean, 2.07 % median; AUC 0.907 mean, 0.959 median).
Conclusions. The results indicate that a compact CNN-Transformer hybrid is effective for sensor-based mobile behavioural biometrics and that even lightweight masked pretraining can be helpful for cross-dataset transfer. At the same
time, the benefits of MAE and CORAL on HMOG depend strongly on the pretraining budget and masking configuration, suggesting that further tuning is needed to fully exploit self-supervised pretraining in this setting.
Keywords: behavioural biometrics; continuous authentication; smartphone sensors; CNN-Transformer hybrid; masked
autoencoding; self-supervised pretraining; domain adaptation.
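The user-level EER reported above can be illustrated with a minimal sketch. It assumes that each user has a list of similarity scores for genuine samples and a list for impostor samples, and that a sample is accepted when its score reaches the threshold. The function names, the threshold sweep and the toy scores are illustrative assumptions, not the evaluation code used in the paper.

```python
from statistics import mean, median


def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping every observed score as a threshold.

    Scores are similarity scores: higher means "more likely the genuine user".
    A sample is accepted when score >= threshold (assumed convention).
    """
    best_gap, eer = None, None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)  # false accept rate
        frr = sum(s < thr for s in genuine) / len(genuine)     # false reject rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # EER is taken where FAR and FRR are closest to each other.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def summarise_user_eer(scores_by_user):
    """Compute one EER per user, then aggregate with mean and median."""
    eers = [equal_error_rate(g, i) for g, i in scores_by_user.values()]
    return mean(eers), median(eers)


# Hypothetical scores for three users: (genuine scores, impostor scores).
scores_by_user = {
    "user_a": ([0.9, 0.8], [0.1, 0.2]),
    "user_b": ([0.6, 0.4], [0.5, 0.3]),
    "user_c": ([0.9, 0.7, 0.6], [0.8, 0.2, 0.1]),
}
mean_eer, median_eer = summarise_user_eer(scores_by_user)
```

Reporting both aggregates is useful because a few hard-to-verify users can raise the mean well above the median, which is the pattern visible in the results quoted in the abstract.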
Introduction
The widespread use of smartphones and wearables has turned them into primary access points
for services, including systems that explicitly explore
behavioural biometrics on everyday activities [1] and
continuous sensing on smartphones [2, 3]. This, in
turn, creates stricter requirements for the underlying
information security mechanisms. Traditional one-shot
authentication methods such as passwords, PIN
codes and fingerprint scans verify the user only at
login time. Once a device is unlocked, anyone who
physically gains access to it can continue to work
under the legitimate user’s identity. This is particularly
critical when smartphones are used to access
financial services, corporate resources and personal
communications, as demonstrated both in general-purpose
smartphone biometrics [1, 2, 3] and in our
earlier work on continuous authentication for security-critical services [4].
Continuous behavioural authentication offers
an alternative paradigm: the user’s identity is verified
Suggested citation for this article: M.P. Havrylovych, “Architecture of hybrid CNN-transformer with masked time
series auto-coding for behavioural biometrics on mobile devices”, KPI Science News, No. 4, pp. 55–62, 2025.
doi: https://doi.org/10.20535/kpisn.2025.4.344357
© The Author(s).
The article is distributed under the terms of the license CC BY 4.0
in the background throughout device usage, based on
behavioural signals [1, 2, 3]. These signals include
keystroke dynamics on the virtual keyboard, which
have been extensively reviewed for both fixed-text
and free-text scenarios [8, 9] and modelled with deep
networks on desktop and mobile platforms [10, 11], as well
as touch/swipe patterns and inertial sensor data such
as accelerometer and gyroscope signals, which underpin
smartphone and smartwatch biometrics [1, 2, 3].
Together, they form a behavioural “fingerprint” that
can be used to distinguish one user from others without requiring explicit re-authentication. This class of
methods is closely related to behavioural biometrics
and continuous authentication frameworks used for
post-login risk control in high-stakes applications
[1, 4].
However, building robust behavioural biometric models is challenging. Unlike static biometrics,
behavioural patterns are highly context-dependent.
They vary with posture (sitting, standing, walking),
activity, device model and UI layout, and can also
drift over time. Sensor data is noisy and often contains missing values. Changes in hardware, operating
system version or user habits can cause domain shifts
that degrade the performance of models trained on
earlier data. Collecting large labelled datasets per
user is expensive and often impractical, especially at
scale, a limitation repeatedly highlighted in smartphone and sensor-based continuous authentication
studies [1, 2, 3] and confirmed in our own experiments
on motion-based verification and wearable
sensing [4].
Recent advances in deep learning, particularly
convolutional and recurrent architectures in continuous
authentication [4, 6] and Transformer-based
models for keystroke and time-series data [10, 11,
13, 14], have significantly improved the state of the
art in signal and sequence modelling. CNNs are
effective at capturing local patterns and invariances,
while Transformers use self-attention to model
long-range dependencies. In parallel, self-supervised
learning methods such as masked autoencoders
(MAE) have demonstrated that useful representations can be learned from large unlabelled datasets
by reconstructing masked parts of the input [13, 14].
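The masked-reconstruction idea can be sketched in a few lines. The example below hides a random subset of time steps in a multichannel sensor window and scores a reconstruction only on the hidden steps. The 50 % mask ratio, zero-filling of hidden steps, the mean-squared-error loss and all function names are assumptions chosen for illustration; the text here does not specify the masking configuration, and no neural network is involved in this sketch.

```python
import random


def mask_time_steps(window, mask_ratio=0.5, seed=0):
    """Hide a random subset of time steps in a multichannel window.

    window: list of time steps, each a list of channel values
            (for example [ax, ay, az] from an accelerometer).
    Returns the masked copy and the sorted indices that were hidden.
    """
    rng = random.Random(seed)
    n_steps = len(window)
    n_masked = max(1, int(round(mask_ratio * n_steps)))
    masked_idx = sorted(rng.sample(range(n_steps), n_masked))
    hidden = set(masked_idx)
    # Hidden steps are zero-filled (an assumed placeholder for a mask token).
    masked = [[0.0] * len(step) if t in hidden else list(step)
              for t, step in enumerate(window)]
    return masked, masked_idx


def masked_mse(reconstruction, target, masked_idx):
    """Mean squared error computed only over the hidden time steps."""
    total, count = 0.0, 0
    for t in masked_idx:
        for r, x in zip(reconstruction[t], target[t]):
            total += (r - x) ** 2
            count += 1
    return total / count


# Toy window: 8 time steps x 3 channels (values are made up for illustration).
window = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(8)]
masked, masked_idx = mask_time_steps(window, mask_ratio=0.5)

# A perfect reconstruction gives zero loss; echoing the masked input does not.
loss_perfect = masked_mse(window, window, masked_idx)
loss_trivial = masked_mse(masked, window, masked_idx)
```

Restricting the loss to hidden steps means a model cannot reduce it by copying the visible input, so it has to infer the missing signal from the surrounding context.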
Despite these advances, many mobile behavioural biometric systems still rely on purely convolutional or recurrent architectures [1, 2, 11] or
on traditional keystroke pipelines surveyed in [ (...truncated)