Customizing OpenWPM & adding urlClassification: Code Modifications
Introduction
OpenWPM (https://github.com/openwpm/OpenWPM) is a web measurement framework designed to automate the collection of web privacy measurements from different websites. While its out-of-the-box features are extensive, my project required additional data points and functionalities not covered by the default OpenWPM setup. Specifically, I needed to introduce new database schemas and extend the data collection capabilities to include URL classifications based on first-party and third-party interactions, enabling each request_id to correspond with its respective urlClassification flags.
Adding Functionality to OpenWPM
First, I wanted to observe the urlClassification (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest) of each request. The main challenges I faced included:
- Understanding OpenWPM's Architecture: Before making any changes, I had to thoroughly understand the existing codebase, which required a deep dive into its components like the instrumentation scripts and database interactions.
- Ensuring Compatibility: Any changes had to be compatible with the existing framework without disrupting its core functionalities.
- Data Integrity and Performance: It was crucial to ensure that the customizations did not introduce data integrity issues or degrade performance.
Once I understood the architecture, I identified the files that needed to be edited:
- schema.sql
- parquet_schema.py
- schema.ts
- http-instrument.ts
- test_values.py
To address my specific needs, I implemented several key customizations:
Schema Modifications
I introduced new database tables to store URL classifications. This involved defining new TypeScript interfaces in schema.ts to describe the structure of the records to be stored. For example, I added the UrlClassification interface:
export interface UrlClassification {
  request_id: number;
  first_party_classification: string;
  third_party_classification: string;
  time_stamp: string; // ISO 8601 string, e.g. from Date.prototype.toISOString()
}
Event Handling Extensions
I modified the HTTP instrumentation file, http-instrument.ts, to include handlers that perform URL classification during different phases of the HTTP request lifecycle. Changes were made to capture and classify URLs when a response completes:
private async handleUrlClassification(
  details: browser.webRequest._OnCompletedDetails,
): Promise<void> {
  const classificationRecord: UrlClassification = {
    // webRequest request IDs are strings; convert to match the schema
    request_id: Number(details.requestId),
    first_party_classification: this.classifyUrl(details.url, "firstParty"),
    third_party_classification: this.classifyUrl(details.url, "thirdParty"),
    time_stamp: new Date().toISOString(),
  };
  await this.dataReceiver.saveRecord("urlclassification", classificationRecord);
}
Changing the Classification Structure
Initially, URL classifications were stored as arrays (e.g. {firstParty: [], thirdParty: []}), which made granular analysis difficult. Note that urlClassification uses the following structure:
- first_party_classification: An array of strings indicating first-party classifications.
- third_party_classification: An array of strings indicating third-party classifications.
The goal was to refactor this structure to store each classification in its own row, making the data more accessible and easier to analyze.
Purpose
The primary purpose of these changes was to:
- Improve Data Granularity: By storing each classification separately, we can perform more detailed analysis and reporting.
- Enhance Usability: Structured data makes it easier to query and analyze specific classifications, improving the overall utility of the collected data.
Changes
The process involved several key steps:
Updating the Instrumentation Code
The first step was to modify the http-instrument.ts file to handle URL classifications separately and assign a class ID to each classification.
const classLegend = [
{ class_id: 1, class: "fingerprinting" },
{ class_id: 2, class: "fingerprinting_content" },
{ class_id: 3, class: "cryptomining" },
{ class_id: 4, class: "cryptomining_content" },
{ class_id: 5, class: "tracking_ad" },
{ class_id: 6, class: "tracking_analytics" },
{ class_id: 7, class: "tracking_social" },
{ class_id: 8, class: "tracking_content" },
{ class_id: 9, class: "any_basic_tracking" },
{ class_id: 10, class: "any_strict_tracking" },
{ class_id: 11, class: "any_social_tracking" },
];
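With the legend in place, a small lookup can resolve a flag string to its numeric ID. A sketch (the lookupClassId name is mine, not OpenWPM's):

```typescript
// Legend mapping urlClassification flag strings to numeric class IDs.
const classLegend = [
  { class_id: 1, class: "fingerprinting" },
  { class_id: 2, class: "fingerprinting_content" },
  { class_id: 3, class: "cryptomining" },
  { class_id: 4, class: "cryptomining_content" },
  { class_id: 5, class: "tracking_ad" },
  { class_id: 6, class: "tracking_analytics" },
  { class_id: 7, class: "tracking_social" },
  { class_id: 8, class: "tracking_content" },
  { class_id: 9, class: "any_basic_tracking" },
  { class_id: 10, class: "any_strict_tracking" },
  { class_id: 11, class: "any_social_tracking" },
];

// Resolve a flag string to its class_id, or undefined for unknown flags.
function lookupClassId(flag: string): number | undefined {
  return classLegend.find((entry) => entry.class === flag)?.class_id;
}
```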
Implementing Classification Handling
We added functions that handle URL classification by assigning a class_id based on the classification flags and storing each flag in its own row:
- handleUrlClassification: handles the classification and assigns a class_id.
- storeClassification: stores the classification with its class_id and status.
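The actual handler bodies are not reproduced in this post; the following is a minimal sketch of the splitting logic, assuming status records whether a flag came from the first-party or third-party array (that encoding is my assumption, as is the inline classIds map standing in for the legend):

```typescript
interface ClassificationRow {
  request_id: number;
  class_id: number;
  status: string; // "first_party" or "third_party" (assumed encoding)
}

// Flag-string-to-class_id mapping, mirroring the class legend.
const classIds: Record<string, number> = {
  fingerprinting: 1, fingerprinting_content: 2, cryptomining: 3,
  cryptomining_content: 4, tracking_ad: 5, tracking_analytics: 6,
  tracking_social: 7, tracking_content: 8, any_basic_tracking: 9,
  any_strict_tracking: 10, any_social_tracking: 11,
};

// Split each party's flag array into one row per flag, dropping
// any flag the legend does not cover.
function handleUrlClassification(
  requestId: number,
  flags: { firstParty: string[]; thirdParty: string[] },
): ClassificationRow[] {
  const rows: ClassificationRow[] = [];
  for (const [status, list] of [
    ["first_party", flags.firstParty],
    ["third_party", flags.thirdParty],
  ] as const) {
    for (const flag of list) {
      const class_id = classIds[flag];
      if (class_id !== undefined) {
        rows.push({ request_id: requestId, class_id, status });
      }
    }
  }
  return rows;
}
```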
Updating the Schemas
To accommodate the new data structure, the schema.sql and parquet_schema.py files were also updated. Taking schema.sql as an example:
Original Schema:
-- Table for URL classifications
CREATE TABLE IF NOT EXISTS urlclassification (
  request_id INTEGER NOT NULL,
  first_party_classification TEXT,
  third_party_classification TEXT,
  time_stamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  visit_id INTEGER NOT NULL,
  FOREIGN KEY(visit_id) REFERENCES site_visits(visit_id)
);
Updated Schema:
-- Table for URL classifications
CREATE TABLE IF NOT EXISTS urlclassification (
  request_id INTEGER NOT NULL,
  class_id INTEGER NOT NULL,
  status TEXT NOT NULL,
  time_stamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  visit_id INTEGER NOT NULL,
  FOREIGN KEY(visit_id) REFERENCES site_visits(visit_id)
);
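The payoff of the row-per-classification layout is that per-class aggregation reduces to a simple group-and-count (in SQL, SELECT class_id, COUNT(*) FROM urlclassification GROUP BY class_id). The same aggregation over in-memory rows, as a sketch:

```typescript
interface ClassificationRow {
  request_id: number;
  class_id: number;
  status: string;
}

// Count how many rows fall under each class_id -- the in-memory analogue of
// SELECT class_id, COUNT(*) FROM urlclassification GROUP BY class_id.
function countByClass(rows: ClassificationRow[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const row of rows) {
    counts.set(row.class_id, (counts.get(row.class_id) ?? 0) + 1);
  }
  return counts;
}
```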
In conclusion, this customization journey has not only tailored OpenWPM more closely to my needs but also deepened my understanding of web tracking technologies.