Blocking Tracking JavaScript at the Function Granularity
ACM SIGSAC Conference on Computer and Communications Security(CCS) · 19% acceptance
Abdul Haddi Amjad, Shaoor Munir, Zubair Shafiq, Muhammad Ali Gulzar
TL;DR:Not.js blocks tracking JavaScript at function-level granularity with 94% precision and 98% recall, without breaking websites.
Modern websites extensively rely on JavaScript to implement both functionality and tracking. Existing privacy enhancing content blocking tools struggle against mixed scripts, which simultaneously implement both functionality and tracking, because blocking the script would break functionality and not blocking it would allow tracking. We propose Not.js, a fine grained JavaScript blocking tool that operates at the function level granularity. Not.js's strengths lie in analyzing the dynamic execution context, including the call stack and calling context of each JavaScript function, and then encoding this context to build a rich graph representation. Not.js trains a supervised machine learning classifier on a webpage's graph representation to first detect tracking at the JavaScript function level and then automatically generate surrogate scripts that preserve functionality while removing tracking. Our evaluation of Not.js on the top 10K websites demonstrates that it achieves high precision (94%) and recall (98%) in detecting tracking JavaScript functions, outperforming the state of the art while being robust against off the shelf JavaScript obfuscation. Fine grained detection of tracking functions allows Not.js to automatically generate surrogate scripts that remove tracking JavaScript functions without causing major breakage. Our deployment of Not.js shows that mixed scripts are present on 62.3% of the top 10K websites, with 70.6% of the mixed scripts being third party that engage in tracking activities such as cookie ghostwriting. We share a sample of the tracking functions detected by Not.js within mixed scripts not currently on filter lists with filter list authors, who confirm that these scripts are not blocked due to potential functionality breakage, despite being known to implement tracking.
Shaoor Munir, Patrick Lee, Umar Iqbal, Zubair Shafiq, Sandra Siby
TL;DR:PURL uses ML to sanitize tracking information from URL parameters while preserving website functionality.
While privacy-focused browsers have taken steps to block third-party cookies and browser fingerprinting, novel tracking methods that bypass existing defenses continue to emerge. Since trackers need to exfiltrate information from the client- to server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. We present PURL, a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. We use PURL to perform a measurement study on top-million websites. We find that link decorations are widely abused by well-known advertisers and trackers to exfiltrate user information collected from browser storage, email addresses, and scripts involved in fingerprinting.
COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies
ACM SIGSAC Conference on Computer and Communications Security(CCS) · 19% acceptance
Shaoor Munir, Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, Carmela Troncoso
TL;DR:First-party tracking cookies exist on 89.86% of websites. CookieGraph detects them with 90% accuracy without breaking SSO.
As third-party cookie blocking is becoming the norm in mainstream web browsers, advertisers and trackers have started to use first-party cookies for tracking. To understand this phenomenon, we conduct a differential measurement study with versus without third-party cookies. We find that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked. As opposed to third-party cookie blocking, first-party cookie blocking is not practical because it would result in major breakage of website functionality. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect and block first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.18% accuracy, outperforming the state-of-the-art CookieBlock by 17.31%. We show that CookieGraph is robust against cookie name manipulation, while CookieBlock's accuracy drops by 15.87%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites. Our deployment of CookieGraph shows that first-party tracking cookies are used on 89.86% of the top-million websites. We find that 96.61% of these first-party tracking cookies are in fact ghostwritten by third-party scripts embedded in the first-party context. We also find evidence of first-party tracking cookies being set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok.
TL;DR:We can attribute AI-generated text to its source language model with 91-98% accuracy using subtle stylistic signatures.
Given the potential misuse of recent advances in synthetic text generation by language models (LMs), it is important to have the capacity to attribute authorship of synthetic text. While stylometric organic (i.e., human written) authorship attribution has been quite successful, it is unclear whether similar approaches can be used to attribute a synthetic text to its source LM. We address this question with the key insight that synthetic texts carry subtle distinguishing marks inherited from their source LM and that these marks can be leveraged by machine learning (ML) algorithms for attribution. We propose and test several ML-based attribution methods. Our best attributor built using a fine-tuned version of XLNet (XLNet-FT) consistently achieves excellent accuracy scores (91% to near perfect 98%) in terms of attributing the parent pre-trained LM behind a synthetic text. Our experiments show promising results across a range of experiments where the synthetic text may be generated using pre-trained LMs, fine-tuned LMs, or by varying text generation parameters.
PURL: Safe and Effective Sanitization of Link Decoration
USENIX Security 2024 · August 2024
Presenting a machine-learning approach that uses cross-layer graph representation of webpage execution to safely and effectively sanitize tracking information in decorated links.
COOKIEGRAPH: Measuring and Countering First-Party Tracking Cookies
ACM CCS 2023 · November 2023
Presenting a machine learning-based approach that accurately detects and blocks first-party tracking cookies that are increasingly used as third-party cookies become blocked by browsers.
COOKIEGRAPH: Measuring and Countering First Party Tracking Cookies
Ad-Filtering Dev Summit 2022 · October 2022
Early presentation of CookieGraph research showing how first-party tracking cookies are used on 89.86% of top websites, with 96.61% being ghostwritten by third-party scripts.
Presenting machine learning methods to attribute synthetic text to its source language model, achieving 91-98% accuracy in identifying the parent pre-trained LM behind generated text.