
Machine learning

10 results across all content

Publications (4)

2024 · Conference · Top-Tier · Best Artifact Award

Blocking Tracking JavaScript at the Function Granularity

ACM SIGSAC Conference on Computer and Communications Security (CCS) · 19% acceptance

Abdul Haddi Amjad, Shaoor Munir, Zubair Shafiq, Muhammad Ali Gulzar

TL;DR: Not.js blocks tracking JavaScript at function-level granularity with 94% precision and 98% recall, without breaking websites.

Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy-enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose Not.js, a fine-grained JavaScript blocking tool that operates at function-level granularity. Not.js's strengths lie in analyzing the dynamic execution context, including the call stack and calling context of each JavaScript function, and then encoding this context to build a rich graph representation. Not.js trains a supervised machine learning classifier on a webpage's graph representation to first detect tracking at the JavaScript function level and then automatically generate surrogate scripts that preserve functionality while removing tracking. Our evaluation of Not.js on the top 10K websites demonstrates that it achieves high precision (94%) and recall (98%) in detecting tracking JavaScript functions, outperforming the state of the art while being robust against off-the-shelf JavaScript obfuscation. Fine-grained detection of tracking functions allows Not.js to automatically generate surrogate scripts that remove tracking JavaScript functions without causing major breakage. Our deployment of Not.js shows that mixed scripts are present on 62.3% of the top 10K websites, with 70.6% of the mixed scripts being third-party scripts that engage in tracking activities such as cookie ghostwriting. We share a sample of the tracking functions detected by Not.js within mixed scripts not currently on filter lists with filter list authors, who confirm that these scripts are not blocked due to potential functionality breakage, despite being known to implement tracking.
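For readers who want a single figure combining the two reported metrics, the following is a minimal sketch that computes the F1 score (the harmonic mean of precision and recall) from the 94% precision and 98% recall stated above. The abstract does not report an F1 score; the value here is derived only from those two rounded numbers, and the function name `f1_score` is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the abstract; the F1 score itself is derived
# here and is not a figure reported by the authors.
f1 = f1_score(0.94, 0.98)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9596"
```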

2024 · Conference · Top-Tier

PURL: Safe and Effective Sanitization of Link Decoration

USENIX Security Symposium (USENIX Security) · 17% acceptance

Shaoor Munir, Patrick Lee, Umar Iqbal, Zubair Shafiq, Sandra Siby

TL;DR: PURL uses ML to sanitize tracking information from URL parameters while preserving website functionality.

While privacy-focused browsers have taken steps to block third-party cookies and browser fingerprinting, novel tracking methods that bypass existing defenses continue to emerge. Since trackers need to exfiltrate information from the client side to the server side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. We present PURL, a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. We use PURL to perform a measurement study on the top-million websites. We find that link decorations are widely abused by well-known advertisers and trackers to exfiltrate user information collected from browser storage, email addresses, and scripts involved in fingerprinting.
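To make "link decoration" concrete, here is a minimal sketch of what sanitizing a decorated link looks like at the URL level. It is not PURL's method: PURL uses a machine-learning classifier over a graph representation, whereas this sketch swaps in a fixed list of parameter names. The `TRACKING_PARAMS` set, the `sanitize_link` function, and the example URL are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative parameter names only; a static list like this is the kind of
# baseline a learned approach aims to improve on.
TRACKING_PARAMS = {"gclid", "fbclid", "utm_source", "utm_medium", "utm_campaign"}


def sanitize_link(url: str, tracking_params=TRACKING_PARAMS) -> str:
    """Drop query parameters whose names appear in tracking_params."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in tracking_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


decorated = "https://shop.example/item?id=42&gclid=abc123&utm_source=newsletter"
print(sanitize_link(decorated))  # prints "https://shop.example/item?id=42"
```

The functional parameter `id` survives while the decoration is removed. A name-based list cannot tell whether an unfamiliar parameter carries tracking data or is needed by the site, which is the gap the abstract says PURL addresses.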

2023 · Conference · Top-Tier

COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies

ACM SIGSAC Conference on Computer and Communications Security (CCS) · 19% acceptance

Shaoor Munir, Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso

TL;DR: First-party tracking cookies exist on 89.86% of websites. CookieGraph detects them with 90% accuracy without breaking SSO.

As third-party cookie blocking is becoming the norm in mainstream web browsers, advertisers and trackers have started to use first-party cookies for tracking. To understand this phenomenon, we conduct a differential measurement study with versus without third-party cookies. We find that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked. As opposed to third-party cookie blocking, first-party cookie blocking is not practical because it would result in major breakage of website functionality. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect and block first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.18% accuracy, outperforming the state-of-the-art CookieBlock by 17.31%. We show that CookieGraph is robust against cookie name manipulation, while CookieBlock's accuracy drops by 15.87%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites. Our deployment of CookieGraph shows that first-party tracking cookies are used on 89.86% of the top-million websites. We find that 96.61% of these first-party tracking cookies are in fact ghostwritten by third-party scripts embedded in the first-party context. We also find evidence of first-party tracking cookies being set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok.

2021 · Conference

Through the Looking Glass: Learning to Attribute Synthetic Text Generated by Language Models

European Chapter of the Association for Computational Linguistics (EACL) · 25% acceptance

Shaoor Munir, Brishna Batool, Zubair Shafiq, Padmini Srinivasan, Fareed Zaffar

TL;DR: We can attribute AI-generated text to its source language model with 91-98% accuracy using subtle stylistic signatures.

Given the potential misuse of recent advances in synthetic text generation by language models (LMs), it is important to have the capacity to attribute authorship of synthetic text. While stylometric organic (i.e., human-written) authorship attribution has been quite successful, it is unclear whether similar approaches can be used to attribute a synthetic text to its source LM. We address this question with the key insight that synthetic texts carry subtle distinguishing marks inherited from their source LM and that these marks can be leveraged by machine learning (ML) algorithms for attribution. We propose and test several ML-based attribution methods. Our best attributor, built using a fine-tuned version of XLNet (XLNet-FT), consistently achieves excellent accuracy scores (91% to a near-perfect 98%) in attributing the parent pre-trained LM behind a synthetic text. Our results are promising across a range of experiments in which the synthetic text is generated using pre-trained LMs, fine-tuned LMs, or varying text generation parameters.

Talks (4)

PURL: Safe and Effective Sanitization of Link Decoration

USENIX Security 2024 · August 2024

Presenting a machine-learning approach that uses cross-layer graph representation of webpage execution to safely and effectively sanitize tracking information in decorated links.


COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies

ACM CCS 2023 · November 2023

Presenting a machine learning-based approach that accurately detects and blocks first-party tracking cookies that are increasingly used as third-party cookies become blocked by browsers.


COOKIEGRAPH: Measuring and Countering First Party Tracking Cookies

Ad-Filtering Dev Summit 2022 · October 2022

Early presentation of CookieGraph research showing how first-party tracking cookies are used on 89.86% of top websites, with 96.61% being ghostwritten by third-party scripts.


Attribution of Text Generated by Language Models

EACL 2021 · April 2021

Presenting machine learning methods to attribute synthetic text to its source language model, achieving 91-98% accuracy in identifying the parent pre-trained LM behind generated text.


Academic Service (2)

CCS · 2026

ACM SIGSAC Conference on Computer and Communications Security

Track: Machine Learning and Security

AISec · 2025

ACM Workshop on Artificial Intelligence and Security

Machine learning Research & Content | Shaoor Munir