Research papers on privacy, security, and machine learning
5 conferences • 1 journal • 2 preprints • 4 top-tier
8 results
2025 • Preprint
Every Keystroke You Make: A Tech-Law Measurement and Analysis of Event Listeners for Wiretapping
Shaoor Munir, Nurullah Demir, Qian Li, Konrad Kollnig, Zubair Shafiq
arXiv (2025)•arXiv Preprint
TL;DR: 38.52% of top websites install third-party keystroke listeners. We connect this invasive tracking to U.S. wiretapping laws.
We conduct a technical and legal analysis connecting JavaScript event listeners used by third-party trackers to U.S. wiretapping laws. Using an instrumented web browser to analyze the top-million websites, we discovered that 38.52% of websites installed third-party event listeners to intercept keystrokes, and that at least 3.18% of websites transmitted intercepted information to a third-party server. We demonstrate that captured data, such as email addresses entered in form fields, are leveraged for unsolicited marketing campaigns. We map this invasive tracking technique against federal and California wiretapping statutes, bridging the gap between emerging technical practices and decades-old legal frameworks designed to protect electronic communications privacy.
Yash Vekaria, Yohan Beugin, Shaoor Munir, Gunes Acar, Nataliia Bielova, Steven Englehardt, Umar Iqbal, Alexandros Kapravelos, Pierre Laperdrix, Nick Nikiforakis, Jason Polakis, Franziska Roesner, Zubair Shafiq, Sebastian Zimmeck
arXiv (2025)•arXiv Preprint
TL;DR: Comprehensive systematization of web tracking research: techniques, defenses, and regulations shaping the evolving privacy landscape.
This paper consolidates research on web tracking by examining technical mechanisms, countermeasures, and regulations that shape the modern and rapidly evolving web tracking landscape. We synthesize fragmented literature across tracking techniques, defenses against tracking, and regulatory compliance. The field is experiencing transformative change due to industry shifts in advertising, browser adoption of anti-tracking features, and increased privacy regulation enforcement. We identify open research challenges and propose future directions for researchers, practitioners, and policymakers studying web tracking practices.
FP-Inconsistent: Detecting Evasive Bots using Browser Fingerprint Inconsistencies
Hari Venugopalan, Shaoor Munir, Shuaib Ahmed, Tangbaihe Wang, Samuel T. King, Zubair Shafiq
IMC (2024)•ACM Internet Measurement Conference
TL;DR: Bots evade detection by altering fingerprints but leave inconsistencies. Our rules reduce evasion rates by ~48%.
As browser fingerprinting is increasingly being used for bot detection, bots have started altering their fingerprints for evasion. We conduct the first large-scale evaluation of evasive bots to investigate whether and how altering fingerprints helps bots evade detection. To systematically investigate evasive bots, we deploy a honey site incorporating two anti-bot services (DataDome and BotD) and solicit bot traffic from 20 different bot services that purport to sell "realistic and undetectable traffic". Across half a million requests from 20 different bot services on our honey site, we find an average evasion rate of 52.93% against DataDome and 44.56% evasion rate against BotD. Our comparison of fingerprint attributes from bot services that evade each anti-bot service individually as well as bot services that evade both shows that bot services indeed alter different browser fingerprint attributes for evasion. Further, our analysis reveals the presence of inconsistent fingerprint attributes in evasive bots. Given evasive bots seem to have difficulty in ensuring consistency in their fingerprint attributes, we propose a data-driven approach to discover rules to detect such inconsistencies across space (two attributes in a given browser fingerprint) and time (a single attribute at two different points in time). These rules, which can be readily deployed by anti-bot services, reduce the evasion rate of evasive bots against DataDome and BotD by 48.11% and 44.95% respectively.
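The two kinds of inconsistency described above, across space (two attributes in one fingerprint) and across time (one attribute at two points in time), can be illustrated with a minimal Python sketch. The attribute names (`user_agent`, `platform`, `max_touch_points`, `hardware_concurrency`) and the hand-written checks are assumptions chosen for illustration only; the paper discovers its rules from data, and those rules are not reproduced here.

```python
from __future__ import annotations

from typing import Mapping


def spatial_inconsistencies(fp: Mapping[str, str]) -> list[str]:
    """Flag pairs of attributes within ONE fingerprint that contradict each other.

    The two checks below are hypothetical examples, not rules from the paper.
    """
    issues: list[str] = []
    user_agent = fp.get("user_agent", "")
    platform = fp.get("platform", "")

    # A browser claiming to be an iPhone should not report a Windows platform.
    if "iPhone" in user_agent and platform.startswith("Win"):
        issues.append("user_agent says iPhone but platform says Windows")

    # A browser claiming to be an iPhone should report at least one touch point.
    if "iPhone" in user_agent and fp.get("max_touch_points") == "0":
        issues.append("user_agent says iPhone but max_touch_points is 0")

    return issues


def temporal_inconsistencies(
    earlier: Mapping[str, str],
    later: Mapping[str, str],
    stable_attrs: tuple[str, ...] = ("platform", "hardware_concurrency"),
) -> list[str]:
    """Flag attributes of the SAME client that changed between two observations.

    `stable_attrs` is an assumed list of attributes expected to stay constant.
    """
    return [
        name
        for name in stable_attrs
        if name in earlier and name in later and earlier[name] != later[name]
    ]
```

A request would be treated as suspicious when either function returns a non-empty list. In practice the set of rules would come from analyzing labeled traffic rather than being written by hand as above.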
2024 • Conference • Top-Tier • 19% acceptance rate • Best Artifact Award
Blocking Tracking JavaScript at the Function Granularity
Abdul Haddi Amjad, Shaoor Munir, Zubair Shafiq, Muhammad Ali Gulzar
CCS (2024)•ACM SIGSAC Conference on Computer and Communications Security
TL;DR: Not.js blocks tracking JavaScript at function-level granularity with 94% precision and 98% recall, without breaking websites.
Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy-enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose Not.js, a fine-grained JavaScript blocking tool that operates at function-level granularity. Not.js's strengths lie in analyzing the dynamic execution context, including the call stack and calling context of each JavaScript function, and then encoding this context to build a rich graph representation. Not.js trains a supervised machine learning classifier on a webpage's graph representation to first detect tracking at the JavaScript function level and then automatically generate surrogate scripts that preserve functionality while removing tracking. Our evaluation of Not.js on the top 10K websites demonstrates that it achieves high precision (94%) and recall (98%) in detecting tracking JavaScript functions, outperforming the state of the art while being robust against off-the-shelf JavaScript obfuscation. Fine-grained detection of tracking functions allows Not.js to automatically generate surrogate scripts that remove tracking JavaScript functions without causing major breakage. Our deployment of Not.js shows that mixed scripts are present on 62.3% of the top 10K websites, with 70.6% of the mixed scripts being third-party scripts that engage in tracking activities such as cookie ghostwriting. We share with filter list authors a sample of the tracking functions that Not.js detected within mixed scripts not currently on filter lists; the authors confirm that these scripts are not blocked due to potential functionality breakage, despite being known to implement tracking.
PURL: Safe and Effective Sanitization of Link Decoration
Shaoor Munir, Patrick Lee, Umar Iqbal, Zubair Shafiq, Sandra Siby
USENIX Security (2024)•USENIX Security Symposium
TL;DR: PURL uses ML to sanitize tracking information from URL parameters while preserving website functionality.
While privacy-focused browsers have taken steps to block third-party cookies and browser fingerprinting, novel tracking methods that bypass existing defenses continue to emerge. Since trackers need to exfiltrate information from the client- to server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. We present PURL, a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. We use PURL to perform a measurement study on top-million websites. We find that link decorations are widely abused by well-known advertisers and trackers to exfiltrate user information collected from browser storage, email addresses, and scripts involved in fingerprinting.
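To make the idea of sanitizing a decorated link concrete, here is a minimal Python sketch that removes flagged query parameters from a URL and leaves everything else untouched. It is not PURL's method: PURL decides what to remove with a machine-learning classifier over a graph representation of webpage execution, whereas this sketch swaps in a fixed, hypothetical set of parameter names (`ASSUMED_TRACKING_PARAMS`) as a stand-in for that decision. It also handles only query parameters, not identifiers carried in the path or fragment.

```python
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for illustration: a tiny fixed list standing in for a classifier.
ASSUMED_TRACKING_PARAMS = frozenset({"gclid", "fbclid"})


def default_is_tracking(name: str, value: str) -> bool:
    """Hypothetical stand-in for a learned classifier: match on parameter name."""
    return name in ASSUMED_TRACKING_PARAMS


def sanitize_link(
    url: str,
    is_tracking: Callable[[str, str], bool] = default_is_tracking,
) -> str:
    """Return `url` with every query parameter flagged by `is_tracking` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking(name, value)
    ]
    # Rebuild the URL with the same scheme, host, path, and fragment.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `sanitize_link("https://example.com/item?id=42&gclid=abc123")` keeps the functional `id` parameter and drops `gclid`. Passing a different `is_tracking` callable is the point where a real classifier's decision would be plugged in.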
Shaoor Munir, Konrad Kollnig, Anastasia Shuba, Zubair Shafiq
JETLAW, Vol. 27, No. 3 (2024)•Vanderbilt Journal of Entertainment & Technology Law
TL;DR: Chrome is central to Google's market dominance. We propose behavioral, structural, and divestment remedies.
This article examines Google's dominance of the browser market, highlighting how Google's Chrome browser plays a critical role in asserting Google's dominance in other markets. While Google perpetuates the perception that Google Chrome is a neutral platform built on open-source technologies, we argue that Chrome is instrumental in Google's strategy to reinforce its dominance in online advertising, publishing, and the browser market itself. Our examination of Google's strategic acquisitions, anti-competitive practices, and the implementation of so-called "privacy controls" shows that Chrome is far from a neutral gateway to the web. Rather, it serves as a key tool for Google to maintain and extend its market power, often to the detriment of competition and innovation. We examine how Chrome not only bolsters Google's position in advertising and publishing through practices such as coercion and self-preferencing, but also helps Google leverage its advertising clout to engage in a "pay-to-play" paradigm, which serves as a cornerstone of Google's larger strategy of market control. We also discuss potential regulatory interventions and remedies, drawing on historical antitrust precedents. We propose a triad of solutions motivated by our analysis of Google's abuse of Chrome: behavioral remedies targeting specific anti-competitive practices, structural remedies involving an internal separation of Google's divisions, and divestment of Chrome from Google. Despite Chrome's dominance and its critical role in Google's ecosystem, it has escaped antitrust scrutiny, a gap our article aims to bridge. Addressing this gap is instrumental to solving current market imbalances and future challenges brought on by increasingly hegemonizing technology firms, ensuring a competitive digital environment that nurtures innovation and safeguards consumer interests.
COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies
Shaoor Munir, Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso
CCS (2023)•ACM SIGSAC Conference on Computer and Communications Security
TL;DR: First-party tracking cookies exist on 89.86% of websites. CookieGraph detects them with 90% accuracy without breaking SSO.
As third-party cookie blocking is becoming the norm in mainstream web browsers, advertisers and trackers have started to use first-party cookies for tracking. To understand this phenomenon, we conduct a differential measurement study with versus without third-party cookies. We find that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked. As opposed to third-party cookie blocking, first-party cookie blocking is not practical because it would result in major breakage of website functionality. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect and block first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.18% accuracy, outperforming the state-of-the-art CookieBlock by 17.31%. We show that CookieGraph is robust against cookie name manipulation, while CookieBlock's accuracy drops by 15.87%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites. Our deployment of CookieGraph shows that first-party tracking cookies are used on 89.86% of the top-million websites. We find that 96.61% of these first-party tracking cookies are in fact ghostwritten by third-party scripts embedded in the first-party context. We also find evidence of first-party tracking cookies being set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok.
EACL (2021)•European Chapter of the Association for Computational Linguistics
TL;DR: We can attribute AI-generated text to its source language model with 91-98% accuracy using subtle stylistic signatures.
Given the potential misuse of recent advances in synthetic text generation by language models (LMs), it is important to have the capacity to attribute authorship of synthetic text. While stylometric attribution of organic (i.e., human-written) text has been quite successful, it is unclear whether similar approaches can be used to attribute a synthetic text to its source LM. We address this question with the key insight that synthetic texts carry subtle distinguishing marks inherited from their source LM and that these marks can be leveraged by machine learning (ML) algorithms for attribution. We propose and test several ML-based attribution methods. Our best attributor, built using a fine-tuned version of XLNet (XLNet-FT), consistently achieves excellent accuracy scores (91% to a near-perfect 98%) in attributing the parent pre-trained LM behind a synthetic text. Our experiments show promising results across a range of settings in which the synthetic text may be generated using pre-trained LMs, fine-tuned LMs, or by varying text generation parameters.