
introduction

This research demonstrates that LLMs, combined with readily available voice synthesis tools, enable fully automated vishing attacks at unprecedented scale and minimal cost. Through the development and evaluation of two proof-of-concept architectures, a modular pipeline and a Realtime system built on OpenAI’s API, I show that the technical barriers to mass voice-based fraud have effectively collapsed. The project covers the complete attack lifecycle, from automated victim profiling through to real-time conversational manipulation, revealing that sophisticated vishing no longer requires human operators or specialised expertise.

Vishing represents an escalating cybersecurity threat, with scammers stealing an estimated $16.6 billion in 2024 alone [1]. Unlike text-based phishing, which victims can scrutinise at their own pace, vishing exploits real-time voice interaction to manipulate emotions and establish trust before critical thinking engages [2]. This threat landscape is being fundamentally transformed by advances in AI. Current LLMs exceed 95% accuracy on reasoning benchmarks that challenged systems just years ago [3], while voice-native architectures now achieve human-like latency and naturalness [4]. The convergence of these capabilities with plummeting costs creates conditions for automated fraud at scales previously unimaginable.

This work builds upon and extends recent research demonstrating AI’s capacity for social engineering. While Figueiredo et al. establish that LLMs can conduct vishing with manually crafted personas [5], and Toapanta et al. show voice cloning’s effectiveness with static scripts [6], my research addresses three critical gaps. First, I demonstrate end-to-end automation without human intervention, from data scraping through attack execution. Second, I systematically evaluate mainstream LLM compliance with malicious requests, finding 66–100% success rates with basic prompting. Third, I provide an economic analysis showing that attacks cost as little as $0.074 per call, making mass deployment financially viable even at low success rates, as the worked example below illustrates.
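To make the economics concrete, the snippet below divides the per-call cost by an assumed conversion rate to get the expected spend per successful victim. The $0.074 figure comes from this work; the conversion rates are hypothetical placeholders used only to illustrate the scaling argument.

```python
# Back-of-the-envelope vishing economics. COST_PER_CALL is the per-call
# figure reported in this work; the conversion rates are hypothetical.
COST_PER_CALL = 0.074  # USD per automated call (modular pipeline)

for conversion_rate in (0.001, 0.01, 0.05):
    cost_per_success = COST_PER_CALL / conversion_rate
    print(f"{conversion_rate:6.1%} conversion -> ${cost_per_success:7.2f} per successful victim")
```

Even at one success per thousand calls, the expected cost per victim is only $74, which is why mass deployment remains rational at conversion rates that would bankrupt a human-operated call centre.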

The technical implementation involves two distinct architectures that illustrate different trade-offs in the attack design space. The modular system achieves 4.7-second perceived response latency at minimal cost by orchestrating separate components for speech recognition, LLM processing, and voice synthesis. The Realtime architecture leverages OpenAI’s native voice capabilities to achieve 639ms latency (indistinguishable from human conversation), representing the first documented adversarial use of this API.
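A minimal sketch of one turn of the modular pipeline appears below, assuming the official OpenAI Python SDK. The model names, system prompt, and audio handling are illustrative placeholders rather than the exact components used in the proof of concept.

```python
# One conversational turn of the modular STT -> LLM -> TTS pipeline.
# Assumes the OpenAI Python SDK; models, prompt, and audio I/O are
# illustrative placeholders, not the project's exact components.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a polite phone agent."}]

def handle_turn(caller_wav_path: str) -> bytes:
    # Stage 1 (speech recognition): transcribe the caller's utterance.
    with open(caller_wav_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    history.append({"role": "user", "content": transcript.text})

    # Stage 2 (LLM processing): generate the agent's reply in context.
    completion = client.chat.completions.create(model="gpt-4o", messages=history)
    reply_text = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply_text})

    # Stage 3 (voice synthesis): render the reply as playable audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    return speech.content  # MP3 bytes by default
```

Because each turn chains three sequential network round trips, per-turn latency is the sum of the three stages, which is the structural reason the modular design sits in the multi-second range. The Realtime architecture instead streams audio directly in and out of a single speech-native model over one persistent connection, removing the intermediate text hops entirely.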

The ethical dimensions of this research warrant explicit acknowledgement. While demonstrating these vulnerabilities creates risks of misuse, the alternative (allowing these capabilities to develop in criminal contexts without public awareness) poses greater harm. All testing was conducted entirely by the author, using synthetic data and controlled environments, with no actual victims targeted. By documenting these threats transparently, this research aims to catalyse defensive developments before widespread criminal adoption occurs, recognising that the democratisation of AI capabilities has rendered prevention through obscurity untenable.