From af299f6f825e97142d79b7087495214348fefcbe Mon Sep 17 00:00:00 2001 From: Stefan Zabka Date: Thu, 21 Dec 2023 13:39:30 +0100 Subject: [PATCH] Documenting the JS Instrument (#949) * First draft of JS Instrument Documentation * Changed ` to `` * Elaborated on Setting up the instrumentation * docs(JSInstrument): stash --------- Co-authored-by: Stefan Zabka --- docs/Platform-Architecture.md | 22 +++++---- docs/developers/CurrentFlow.svg | 4 ++ docs/developers/JS-Instrument.rst | 78 +++++++++++++++++++++++++++++++ docs/index.rst | 2 + 4 files changed, 96 insertions(+), 10 deletions(-) create mode 100644 docs/developers/CurrentFlow.svg create mode 100644 docs/developers/JS-Instrument.rst diff --git a/docs/Platform-Architecture.md b/docs/Platform-Architecture.md index 7ed554ad1..583fc9aff 100644 --- a/docs/Platform-Architecture.md +++ b/docs/Platform-Architecture.md @@ -21,7 +21,7 @@ In OpenWPM we have a watchdog thread that tries to ensure two things. - `memory_watchdog` - It is part of default manager_params. It is set to false by default which can manually be set to true. - It is a watchdog that tries to ensure that no Firefox instance takes up too much memory. - - It is mostly useful for long running cloud crawls. + - It is mostly useful for long-running cloud crawls. ### Issuing commands @@ -29,21 +29,23 @@ OpenWPM uses the `CommandSequence` as a fundamental unit of work. A `CommandSequence` describes as series of steps that will execute in order on a particular browser. All available Commands are visible by inspecting the `CommandSequence` API. 
-For example you could wire up a `CommandSequence` to go to a given url and take a screenshot of it by writing: +For example, you could wire up a `CommandSequence` to go to a given url and take a screenshot of it by writing: ```python - command_sequence = CommandSequence(url) - # Start by visiting the page - command_sequence.get(sleep=3, timeout=60) - command_sequence.save_screenshot() +from openwpm.command_sequence import CommandSequence +url = "https://example.com" +command_sequence = CommandSequence(url) +# Start by visiting the page +command_sequence.get(sleep=3, timeout=60) +command_sequence.save_screenshot() ``` But this on its own would do nothing, because `CommandSequence`s are not automatically scheduled. Instead, you need to submit them to a `TaskManager` by calling: ```python - manager.execute_command_sequence(command_sequence) - manager.close() +manager.execute_command_sequence(command_sequence) +manager.close() ``` Please note that you need to close the manager, because by default `CommandSequence`s are executed in a non-blocking fashion meaning that you might reach the end of your main function/file before the CommandSequence completed running. @@ -87,7 +89,7 @@ which provides stability in logging data despite the possibility of individual b ## The WebExtension All of our data collection happens in the OpenWPM WebExtension, which can be found under [Extension](../Extension). -The Extension makes heavy use of priviliged APIs and can only be installed on unbranded or custom builds of Firefox with add-on security disabled. +The Extension makes heavy use of privileged APIs and can only be installed on unbranded or custom builds of Firefox with add-on security disabled. 
The currently supported instruments can be found in [Configuration.md](Configuration.md#Instruments) @@ -95,7 +97,7 @@ The currently supported instruments can be found in [Configuration.md](Configura ### Overview -One of the Data Aggregators, contained in `openwpm/DataAggregator`, gets spawned in a separate process and receives data from the WebExtension and the platform alike. We as previously mentioned we support both local as well as remote data saving. +One of the Data Aggregators, contained in `openwpm/DataAggregator`, gets spawned in a separate process and receives data from the WebExtension and the platform alike. As previously mentioned, we support both local and remote data saving. The most useful feature of the Data Aggregator is the fact that it is isolated from the other processes through a network socket interface (see `openwpm/SocketInterface.py`). ### Data Logged diff --git a/docs/developers/CurrentFlow.svg b/docs/developers/CurrentFlow.svg new file mode 100644 index 000000000..8e7b01b4e --- /dev/null +++ b/docs/developers/CurrentFlow.svg @@ -0,0 +1,4 @@ +[SVG markup not shown. Diagram labels: "ENV - WebExtension", "ENV - Web Content", "Content Process", "Main Process"; opt fragment "[instrumentName is part of Config]"; messages "Registration", "Create Tab", "Visit Webpage", "Close Tab", "Call to WebAPI", "Send Event", "WebAPI returns", "Callback with Event", "Registration completed"] \ No newline at end of file diff --git a/docs/developers/JS-Instrument.rst b/docs/developers/JS-Instrument.rst new file mode 100644 index 000000000..db3f21efc --- /dev/null +++ b/docs/developers/JS-Instrument.rst @@ -0,0 +1,78 @@ +JS Instrument technical documentation +===================================== + +Of all the Instruments in the OpenWPM WebExtension, the one that is most likely +to collect the most information is the JavaScript instrument. +It allows users to specify which WebAPI calls they are interested in and +receive a full breadth of information on how websites use the instrumented APIs. 
+ +To allow for this rich data collection, it employs a number of tricks and subtleties +which this document aims to capture. + +TL;DR: We pass the configuration to a content script in the WebExtension. In the content +scope, we generate a string that contains the script we want to execute on the page +and then insert it into the page. +This script is literally a format string in which the configuration gets embedded via +``JSON.stringify``. + +Setting up the instrumentation +------------------------------ + +In the JavascriptInstrument class, which runs in the background script, we register two content +scripts to run at ``document_start``. These are: + +1. A dynamically generated script that sets ``window.openWpmContentScriptConfig`` to the + ``JSON.stringify``-serialized value of the contentScriptConfig. +2. ``content.js``, which is the combination of ``javascript-instrument-page-scope`` and + ``javascript-instrument-content-scope`` as produced by webpack. + +By setting those two in this order, we are able to pass a parameter to the content script. +I currently do not know of another way to dynamically pass config from the background to the +content scope, but this feels hacky. + +In ``javascript-instrument-content-scope`` we then create a massive string that contains +all of the following: + +1. The ``lib/js-instruments.ts`` file, where the actual instrumenting happens +2. The ``jsInstrumentationSettings`` as a JSON object +3. The ``javascript-instrument-page-scope`` which contains the setup and sendMessagesToLogger + functions + +This string is then injected into the page scope where ``javascript-instrument-page-scope`` +starts executing, pulling the testing and event parameters out of data attributes on its +script node. It then calls into ``lib/js-instruments.ts``, which then does the actual +instrumentation. + +See CurrentFlow_ for a diagram I made as part of my bachelor's thesis to demonstrate the flow of +information. + +.. _CurrentFlow: + +.. 
 image:: ./CurrentFlow.svg + :width: 400 + :alt: A UML sequence diagram showing the process of injecting instrumentation code into + the web page and the propagation of events captured by the JavaScript instrument + to OpenWPM's execution platform (on the left) + + + + +Data collection +--------------- + +TL;DR: We wrap each WebAPI that we should instrument and forward all calls made to our wrapper +to the underlying object, while logging the accesses. This is done by the injected +script mentioned above. + +Getting the data into the Database +---------------------------------- + +Since the data collection happens in the website scope, but we care about it +in the database, we had to figure out a way to get it there. + +We do this via the following steps: + +1. Dispatch a custom event via ``document.dispatchEvent`` in ``javascript-instrument-page-scope`` +2. Register a listener for the custom event in ``javascript-instrument-content-scope`` and + call ``runtime.sendMessage`` to pass it from the content scope into the background scope +3. Receive the message in ``javascript-instrument`` (in the background scope) and forward it to the ``loggingdb`` \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index 2a737f960..2d0a6e3fe 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -36,6 +36,8 @@ We're hoping to improve this setup in the future. Release-Checklist + developers/JS-Instrument + Indices and tables ==================
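Note on the "Data collection" section of ``JS-Instrument.rst``: the wrap-and-forward idea it summarizes can be illustrated with a minimal sketch. This is not the code in ``lib/js-instruments.ts``; ``instrumentMethod``, ``AccessRecord``, ``Logger`` and ``fakeNavigator`` are hypothetical names chosen for illustration, and the ``log`` callback merely stands in for the step that would dispatch a custom event towards the content scope.

```typescript
// Hypothetical sketch of wrapping an API method; none of these names are
// taken from OpenWPM's code base.
type AccessRecord = {
  symbol: string; // e.g. "navigator.getBattery"
  operation: "call";
  args: string; // JSON-serialized call arguments
};

// Stand-in for the step that would forward the record out of the page scope.
type Logger = (record: AccessRecord) => void;

function instrumentMethod<T extends object>(
  target: T,
  objectName: string,
  methodName: keyof T & string,
  log: Logger,
): void {
  const original: unknown = target[methodName];
  if (typeof original !== "function") {
    throw new TypeError(`${objectName}.${methodName} is not a function`);
  }
  const fn: Function = original;
  // Replace the method with a wrapper that logs the access first and then
  // forwards the call, with the same receiver and arguments, to the original.
  Object.defineProperty(target, methodName, {
    configurable: true,
    writable: true,
    value: function (this: unknown, ...args: unknown[]): unknown {
      log({
        symbol: `${objectName}.${methodName}`,
        operation: "call",
        args: JSON.stringify(args),
      });
      return Reflect.apply(fn, this, args);
    },
  });
}

// Usage with a fake object, so the sketch also runs outside a browser.
const fakeNavigator = {
  getBattery(): string {
    return "battery-ok";
  },
};
const records: AccessRecord[] = [];
instrumentMethod(fakeNavigator, "navigator", "getBattery", (record) => {
  records.push(record);
});
const result = fakeNavigator.getBattery();
console.log(result, records);
```

The caller still receives the original return value, while ``records`` gains one entry per call. In the flow described in the documentation, the logger would dispatch a custom event instead of pushing to an array.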