From 2303784c956f738aea3c03dda7ec5c1d1e1f3346 Mon Sep 17 00:00:00 2001
From: Stefan Zabka
Date: Sat, 24 Jul 2021 20:03:02 +0200
Subject: [PATCH 1/4] First draft of JS Instrument Documentation

---
 docs/developers/JS-Instrument.rst | 39 +++++++++++++++++++++++++++++++
 docs/index.rst                    |  2 ++
 2 files changed, 41 insertions(+)
 create mode 100644 docs/developers/JS-Instrument.rst

diff --git a/docs/developers/JS-Instrument.rst b/docs/developers/JS-Instrument.rst
new file mode 100644
index 000000000..3b41e6b31
--- /dev/null
+++ b/docs/developers/JS-Instrument.rst
@@ -0,0 +1,39 @@
+JS Instrument technical documentation
+=====================================
+
+Of all the Instruments in the OpenWPM webextension the one that is most likely
+to collect the most information is the Javascript instrument.
+It allows users to specify which WebAPI calls they are interested in and
+receive a full breadth of information on how websites use the instrumented APIs.
+
+To allow for this rich data collection it employs a number of tricks and subtelties
+which this document aims to capture.
+
+Setting up the instrumentation
+------------------------------
+
+TL;DR: We pass the configuration to a content script in the webextension. In the content
+scope generate a string that contains the script we want to execute on the page
+and then insert it in into the page.
+This script is literally a format string in which the configuration gets embedded via
+`JSON.stringify`.
+
+Data collection
+---------------
+
+TL;DR: We wrap each WebAPI that we should instrument and forward all calls to us
+to the underlying object, while logging the accesses. This is done by the injected
+script mentioned above.
+
+Getting the data into the Database
+----------------------------------
+
+Since the data collection happens in the website scope, but we care about it
+in the in the database, we had to figure out a way to get it there.
+
+We do this via the following steps:
+
+1. Dispatch a custom event via `document.dispatchEvent` in `javascript-instrumentat-page-scope`
+2. Register a listener for the custom event in `javascript-instrument-content-scope` and
+   call `runtime.sendMessage` to pass it from the content scope into the background scope
+3. Where `js-instrument` receives the message and forwards it to the `loggingdb`
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index 2a737f960..2d0a6e3fe 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -36,6 +36,8 @@ We're hoping to improve this setup in the future.
 
    Release-Checklist
 
+   developers/JS-Instrument
+
 Indices and tables
 ==================
 
From a9e3cd3be80156dfe1a8d4b77110f91565da4652 Mon Sep 17 00:00:00 2001
From: Stefan Zabka
Date: Sat, 24 Jul 2021 20:57:57 +0200
Subject: [PATCH 2/4] Changed ` to ``

---
 docs/developers/JS-Instrument.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/developers/JS-Instrument.rst b/docs/developers/JS-Instrument.rst
index 3b41e6b31..2bc798e57 100644
--- a/docs/developers/JS-Instrument.rst
+++ b/docs/developers/JS-Instrument.rst
@@ -16,7 +16,7 @@ TL;DR: We pass the configuration to a content script in the webextension. In the
 scope generate a string that contains the script we want to execute on the page
 and then insert it in into the page.
 This script is literally a format string in which the configuration gets embedded via
-`JSON.stringify`.
+``JSON.stringify``.
 
 Data collection
 ---------------
@@ -33,7 +33,7 @@ in the in the database, we had to figure out a way to get it there.
 
 We do this via the following steps:
 
-1. Dispatch a custom event via `document.dispatchEvent` in `javascript-instrumentat-page-scope`
-2. Register a listener for the custom event in `javascript-instrument-content-scope` and
-   call `runtime.sendMessage` to pass it from the content scope into the background scope
-3. Where `js-instrument` receives the message and forwards it to the `loggingdb`
\ No newline at end of file
+1. Dispatch a custom event via ``document.dispatchEvent`` in ``javascript-instrumentat-page-scope``
+2. Register a listener for the custom event in ``javascript-instrument-content-scope`` and
+   call ``runtime.sendMessage`` to pass it from the content scope into the background scope
+3. Where ``javascript-instrument`` (in the background scope) receives the message and forwards it to the ``loggingdb``
\ No newline at end of file
From 88cfc0cb78866d0ff3545b4d76acfe60b4f491a7 Mon Sep 17 00:00:00 2001
From: vringar
Date: Sat, 31 Jul 2021 18:40:15 +0200
Subject: [PATCH 3/4] Elaborated on Setting up the instrumentation

---
 docs/developers/JS-Instrument.rst | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/docs/developers/JS-Instrument.rst b/docs/developers/JS-Instrument.rst
index 2bc798e57..c722f1b19 100644
--- a/docs/developers/JS-Instrument.rst
+++ b/docs/developers/JS-Instrument.rst
@@ -18,6 +18,36 @@ and then insert it in into the page.
 This script is literally a format string in which the configuration gets embedded via
 ``JSON.stringify``.
 
+In the JavascriptInstrument class which runs in the background script, we register two content
+scripts to run at ``document_start`` these are:
+
+1. A dynamically generated script that sets ``window.openWpmContentScriptConfig`` to the
+   ``JSON.stringified`` value of the contentScriptConfig.
+2. ``content.js`` which is the combination of ``javascript-instrumentat-page-scope`` and
+   ``javascript-instrument-content-scope`` as produced by webpack
+
+By setting those two in this order we are able to pass a parameter to the content script.
+I currently do not know of another way to dynamically pass config from the background to the
+content scope but this feels hacky.
+
+In ``javascript-instrument-content-scope`` we then create a massive string that contains
+all of the following:
+
+1. The ``lib/js-instruments.ts`` file, where the actual instrumenting happens
+2. The ``jsInstrumentationSettings`` as a JSON object
+3. The ``javascript-instrument-page-scope`` which contains the setup and sendMessagesToLogger
+   functions
+
+This string is then injected into the page scope where ``javascript-instrument-page-scope``
+starts executing, pulling the testing and event parameter out of data attributes on it's
+script node. It then calls into ``lib/js-instruments.ts`` which then does the actual
+instrumentation.
+
+TODO: Elaborate on how that works.
+
+
+
+
 Data collection
 ---------------
 
From 86bd004bcba75fcc4d9daabe0a864483fcb9927a Mon Sep 17 00:00:00 2001
From: vringar
Date: Tue, 5 Sep 2023 14:49:33 +0200
Subject: [PATCH 4/4] docs(JSInstrument): stash

---
 docs/Platform-Architecture.md     | 22 ++++++++++++----------
 docs/developers/CurrentFlow.svg   |  4 ++++
 docs/developers/JS-Instrument.rst | 27 ++++++++++++++++++---------
 3 files changed, 34 insertions(+), 19 deletions(-)
 create mode 100644 docs/developers/CurrentFlow.svg

diff --git a/docs/Platform-Architecture.md b/docs/Platform-Architecture.md
index 7ed554ad1..583fc9aff 100644
--- a/docs/Platform-Architecture.md
+++ b/docs/Platform-Architecture.md
@@ -21,7 +21,7 @@ In OpenWPM we have a watchdog thread that tries to ensure two things.
 - `memory_watchdog`
   - It is part of default manager_params. It is set to false by default which can manually be set to true.
   - It is a watchdog that tries to ensure that no Firefox instance takes up too much memory.
-  - It is mostly useful for long running cloud crawls.
+  - It is mostly useful for long-running cloud crawls.
 
 ### Issuing commands
 
@@ -29,21 +29,23 @@ OpenWPM uses the `CommandSequence` as a fundamental unit of work.
 A `CommandSequence` describes as series of steps that will execute in order on a particular browser.
 All available Commands are visible by inspecting the `CommandSequence` API.
 
-For example you could wire up a `CommandSequence` to go to a given url and take a screenshot of it by writing:
+For example, you could wire up a `CommandSequence` to go to a given url and take a screenshot of it by writing:
 
 ```python
-    command_sequence = CommandSequence(url)
-    # Start by visiting the page
-    command_sequence.get(sleep=3, timeout=60)
-    command_sequence.save_screenshot()
+from openwpm.command_sequence import CommandSequence
+url = "https://example.com"
+command_sequence = CommandSequence(url)
+# Start by visiting the page
+command_sequence.get(sleep=3, timeout=60)
+command_sequence.save_screenshot()
 ```
 
 But this on its own would do nothing, because `CommandSequence`s are not automatically scheduled.
 Instead, you need to submit them to a `TaskManager` by calling:
 
 ```python
-    manager.execute_command_sequence(command_sequence)
-    manager.close()
+manager.execute_command_sequence(command_sequence)
+manager.close()
 ```
 
 Please note that you need to close the manager, because by default `CommandSequence`s are executed in a non-blocking fashion meaning that you might reach the end of your main function/file before the CommandSequence completed running.
@@ -87,7 +89,7 @@ which provides stability in logging data despite the possibility of individual b
 ## The WebExtension
 
 All of our data collection happens in the OpenWPM WebExtension, which can be found under [Extension](../Extension).
-The Extension makes heavy use of priviliged APIs and can only be installed on unbranded or custom builds of Firefox with add-on security disabled.
+The Extension makes heavy use of privileged APIs and can only be installed on unbranded or custom builds of Firefox with add-on security disabled.
 
 The currently supported instruments can be found in [Configuration.md](Configuration.md#Instruments)
 
@@ -95,7 +97,7 @@ The currently supported instruments can be found in [Configuration.md](Configura
 
 ### Overview
 
-One of the Data Aggregators, contained in `openwpm/DataAggregator`, gets spawned in a separate process and receives data from the WebExtension and the platform alike. We as previously mentioned we support both local as well as remote data saving.
+One of the Data Aggregators, contained in `openwpm/DataAggregator`, gets spawned in a separate process and receives data from the WebExtension and the platform alike. We as previously mentioned we support both local and remote data saving.
 The most useful feature of the Data Aggregator is the fact that it is isolated from the other processes through a network socket interface (see `openwpm/SocketInterface.py`).
 
 ### Data Logged
diff --git a/docs/developers/CurrentFlow.svg b/docs/developers/CurrentFlow.svg
new file mode 100644
index 000000000..8e7b01b4e
--- /dev/null
+++ b/docs/developers/CurrentFlow.svg
@@ -0,0 +1,4 @@
+[SVG markup lost in extraction; surviving text labels of the sequence diagram: ENV - WebExtension, ENV - Web Content, Content Process, Main Process, opt [instrumentName is part of Config], Registration, Create Tab, Visit Webpage, Close Tab, Call to WebAPI, Send Event, WebAPI returns, Registration completed, Callback with Event]
\ No newline at end of file
diff --git a/docs/developers/JS-Instrument.rst b/docs/developers/JS-Instrument.rst
index c722f1b19..db3f21efc 100644
--- a/docs/developers/JS-Instrument.rst
+++ b/docs/developers/JS-Instrument.rst
@@ -1,29 +1,29 @@
 JS Instrument technical documentation
 =====================================
 
-Of all the Instruments in the OpenWPM webextension the one that is most likely
+Of all the Instruments in the OpenWPM WebExtension the one that is most likely
 to collect the most information is the Javascript instrument.
 It allows users to specify which WebAPI calls they are interested in and
 receive a full breadth of information on how websites use the instrumented APIs.
 
-To allow for this rich data collection it employs a number of tricks and subtelties
+To allow for this rich data collection it employs a number of tricks and subtleties
 which this document aims to capture.
 
-Setting up the instrumentation
-------------------------------
-
-TL;DR: We pass the configuration to a content script in the webextension. In the content
+TL;DR: We pass the configuration to a content script in the WebExtension. In the content
 scope generate a string that contains the script we want to execute on the page
 and then insert it in into the page.
 This script is literally a format string in which the configuration gets embedded via
 ``JSON.stringify``.
 
+Setting up the instrumentation
+------------------------------
+
 In the JavascriptInstrument class which runs in the background script, we register two content
 scripts to run at ``document_start`` these are:
 
 1. A dynamically generated script that sets ``window.openWpmContentScriptConfig`` to the
    ``JSON.stringified`` value of the contentScriptConfig.
-2. ``content.js`` which is the combination of ``javascript-instrumentat-page-scope`` and
+2. ``content.js`` which is the combination of ``javascript-instrument-page-scope`` and
    ``javascript-instrument-content-scope`` as produced by webpack
 
 By setting those two in this order we are able to pass a parameter to the content script.
@@ -43,7 +43,16 @@ starts executing, pulling the testing and event parameter out of data attributes
 script node. It then calls into ``lib/js-instruments.ts`` which then does the actual
 instrumentation.
 
-TODO: Elaborate on how that works.
+See CurrentFlow_ for a diagram I made as part of my bachelor's thesis to demonstrate the flow of
+information
+
+.. _CurrentFlow:
+
+.. image:: ./CurrentFlow.svg
+   :width: 400
+   :alt: An UML sequence diagram showing the process of injecting instrumentation code into
+         the web page and the propagation of events captured by the JavaScript instrument
+         to OpenWPM's execution platform (at the left)
 
 
 
@@ -63,7 +72,7 @@ in the in the database, we had to figure out a way to get it there.
 
 We do this via the following steps:
 
-1. Dispatch a custom event via ``document.dispatchEvent`` in ``javascript-instrumentat-page-scope``
+1. Dispatch a custom event via ``document.dispatchEvent`` in ``javascript-instrument-page-scope``
 2. Register a listener for the custom event in ``javascript-instrument-content-scope`` and
    call ``runtime.sendMessage`` to pass it from the content scope into the background scope
 3. Where ``javascript-instrument`` (in the background scope) receives the message and forwards it to the ``loggingdb``
\ No newline at end of file
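
Review note on the final hunk: the three-step relay (page scope, content scope, background scope) may be easier to follow with a small sketch next to it. The block below is only an in-process simulation, not code from the extension. `FakeDocument`, `FakeRuntime`, `AccessRecord`, `EVENT_ID`, `reportAccess` and the `loggingDb` array are hypothetical stand-ins for ``document``, the WebExtension ``runtime``, the logged payload, the event parameter and the ``loggingdb``; the real message shape is not described in this patch.

```typescript
// In-process simulation of the three hops. Every class and constant here is a
// hypothetical stand-in, not code from the OpenWPM extension.

interface AccessRecord {
  symbol: string;
  operation: string;
}

type Handler = (record: AccessRecord) => void;

// Stand-in for the page's `document` (addEventListener / dispatchEvent).
class FakeDocument {
  private handlers: Record<string, Handler[]> = {};

  addEventListener(type: string, handler: Handler): void {
    const existing = this.handlers[type] || [];
    this.handlers[type] = existing.concat([handler]);
  }

  dispatchEvent(type: string, record: AccessRecord): void {
    (this.handlers[type] || []).forEach((handler) => handler(record));
  }
}

// Stand-in for the WebExtension runtime (sendMessage / onMessage).
class FakeRuntime {
  private listeners: Handler[] = [];

  onMessage(listener: Handler): void {
    this.listeners.push(listener);
  }

  sendMessage(record: AccessRecord): void {
    this.listeners.forEach((listener) => listener(record));
  }
}

const EVENT_ID = "demo-event-id"; // hypothetical value for the event parameter
const pageDocument = new FakeDocument();
const runtime = new FakeRuntime();
const loggingDb: AccessRecord[] = []; // stand-in for the real loggingdb

// Step 3, background scope: receive the message and forward it to the database.
runtime.onMessage((record) => {
  loggingDb.push(record);
});

// Step 2, content scope: listen for the custom event and relay it.
pageDocument.addEventListener(EVENT_ID, (record) => runtime.sendMessage(record));

// Step 1, page scope: the instrumented API reports an access.
function reportAccess(symbol: string, operation: string): void {
  pageDocument.dispatchEvent(EVENT_ID, { symbol, operation });
}

reportAccess("window.navigator.userAgent", "get");
```

In this simulation both listeners are registered before the first dispatch; an event dispatched under an id that nobody listens for is simply dropped, so the record only reaches `loggingDb` when all three hops agree on `EVENT_ID`.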