-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Canonical event model v70
🚧 The latest canonical model for Snowplow can be found on the Snowplow documentation site.
This relates to the Canonical Event Model from Snowplow Release 63 Red-Cheeked Cordon-Bleu to 70 Bornean Green Magpie. For other versions:
- Canonical Event Model (Latest)
- Canonical Event Model pre-Release 73
- Canonical Event Model pre-Release 71
- Canonical Event Model pre-Release 63
Table of contents:
- Overview
- The Snowplow canonical data structure: understanding the individual fields
- A note about data storage formats
In order to analyse Snowplow data, it is important to understand how it is structured. We have tried to make the structure of Snowplow data as simple, logical, and easy-to-query as possible.
- Each line represents one event. Each line in the Snowplow events table represents a single event, be that a page view, add to basket, play video, like etc.
- Structured data. Snowplow data is structured: individual fields are stored in their own columns, making writing sophisticated queries on the data easy, and making it straightforward for analysts to plugin any kind of analysis tool into their Snowplow data to compose and execute queries
-
Extensible schema. Snowplow started life as a web analytics data warehousing platform, and has a basic schema suitable for performing web analytics, with a wide range of web-specific dimensions (related to page URLs, browsers, operating systems, devices, IP addresses, cookie IDs) and web-specfic events (page views, page pings, transactions). All of these fields can be found in the
atomic.events
table, which is a "fat" (many columns) table. As Snowplow has evolved into a general purpose event analytics platform, we've enabled Snowplow users to define additional event types (we call these custom unstructured events) and define their own entities (we call these custom contexts) so that they can extend the schema to suit their own businesses. For Snowplow users running Amazon Redshift, each custom unstructured event and custom context will be stored in its own dedicated table, again with one line per event. These additional tables can be joined back to the coreatomic.events
table, by joining on theroot_id
field in the custom unstructured event / custom context table with theevent_id
in theatomic.events
table -
Single table. All the events are effectively stored in a single table, making running queries across the data very easy. Even if you're running Snowplow with Redshift and have extended the schema as described above, you can still query the data as if it were in a single fat table. This is because:
- The joins from the additional tables to the core
atomic.events
table are one-to-one - The field joined on is the distribution and sort key for both tables, so queries are as fast as if the data were in a single table
- The joins from the additional tables to the core
- Immutable log. The Snowplow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from Snowplow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the Snowplow events table when performing any analysis
- 2.1 Common fields (platform and event independent)
- 2.1.1 Application fields
- 2.1.2 Date / time fields
- 2.1.3 Event / transaction fields
- 2.1.4 Snowplow version fields
- 2.1.5 User-related fields
- 2.1.6 Device and operating system fields
- 2.1.7 Location fields
- 2.1.8 IP address-based fields
- 2.2 Platform-specific fields
- 2.2.1 Web-specific fields
- 2.3 Event-specific fields
- 2.3.1 Page views
- 2.3.2 Page pings
- 2.3.3 Ecommerce transations
- 2.3.4 Error tracking
- 2.3.5 Custom structured events
- 2.3.6 Custom unstructured events
- 2.3.7 Custom contexts
- 2.4 Specific unstructured events
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
app_id |
text | Application ID | Yes | 'angry-birds' |
platform |
text | Platform | Yes | 'web' |
The application ID is used to distinguish different applications that are being tracked by the same Snowplow stack.
The platform ID is used to distinguish the same app running on different platforms e.g. iOS
vs web
.
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
collector_tstamp |
timestamp | Time stamp for the event recorded by the collector | Yes | '2013-11-26 00:02:05' |
dvce_tstamp |
timestamp | Timestamp event was recorded on the client device | No | '2013-11-26 00:03:57.885' |
dvce_sent_tstamp |
timestamp | When the event was sent by the client device | No | '2013-11-26 00:03:58.032' |
etl_tstamp |
timestamp | Timestamp event began ETL | No | '2017-01-26 00:01:25.292' |
os_timezone |
text | Client operating system timezone | No | 'Europe/London' |
derived_tstamp |
timestamp | Calculated "true timestamp" for the event | No | '2013-11-26 00:02:04' |
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
event |
text | The type of event recorded | Yes | 'page_view' |
event_id |
text | A UUID for each event | Yes | 'c6ef3124-b53a-4b13-a233-0088f79dcbcb' |
txn_id |
int | Transaction ID set client-side, used to de-dupe records | No | 421828 |
A complete list of event types is given here.
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
v_tracker |
text | Tracker version | Yes | 'no-js-0.1.0' |
v_collector |
text | Collector version | Yes | 'clj-tomcat-0.1.0', 'cf' |
v_etl |
text | ETL version | Yes | 'serde-0.5.2' |
name_tracker |
text | Tracker namespace | No | 'cloudfront-1' |
etl_tags |
text | JSON of tags for this ETL run | No | "['prod']" |
Some Snowplow Trackers allow the user to name each specific Tracker instance. name_tracker
corresponds to this name, and can be used to distinguish which tracker generated which events.
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
user_id |
text | Unique ID set by business | No | '[email protected]' |
domain_userid |
text | User ID set by Snowplow using 1st party cookie | No | 'bc2e92ec6c204a14' |
network_userid |
text | User ID set by Snowplow using 3rd party cookie | No | 'ecdff4d0-9175-40ac-a8bb-325c49733607' |
user_ipaddress |
text | User IP address | No | '92.231.54.234' |
domain_sessionidx |
int | A visit / session index | No | 3 |
domain_sessionid |
text | A visit / session identifier | No | 'c6ef3124-b53a-4b13-a233-0088f79dcbcb' |
domain_sessionidx
is the number of the current user session. For example, an event occurring during a user's first session would have domain_sessionidx
set to 1. The JavaScript Tracker calculates this field by storing a visit count in a first-party cookie. Whenever the Tracker fires an event, if more than 30 minutes have elapsed since the last event, the visitor count is increased by 1. (Whenever an event is fired, a "session cookie" is created and set to expire in 30 minutes. This is how the Tracker can tell whether the visit count should be incremented.) Thirty minutes is the default value and can be changed using the setSessionCookieTimeout
method.
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
useragent |
text | Raw useragent | Yes | |
dvce_type |
text | Type of device | No | 'Computer' |
dvce_ismobile |
boolean | Is the device mobile? | No | 1 |
dvce_screenheight |
int | Screen height in pixels | No | 1024 |
dvce_screenwidth |
int | Screen width in pixels | No | 1900 |
os_name |
text | Name of operating system | No | 'Android' |
os_family |
text | Operating system family | No | 'Linux' |
os_manufacturer |
text | Company responsible for OS | No | 'Apple' |
Back to top.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
geo_country |
text | ISO 3166-1 code for the country the visitor is located in | No | 'GB', 'US' |
geo_region |
text | ISO-3166-2 code for country region the visitor is in | No | 'I9', 'TX' |
geo_city |
text | City the visitor is in | No | 'New York', 'London' |
geo_zipcode |
text | Postcode the visitor is in | No | '94109' |
geo_latitude |
text | Visitor location latitude | No | 37.443604 |
geo_longitude |
text | Visitor location longitude | No | -122.4124 |
geo_region_name |
text | Visitor region name | No | 'Florida' |
geo_timezone |
text | Visitor timezone name | No | 'Europe/London' |
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
ip_isp |
text | Visitor's ISP | No | 'FDN Communications' |
ip_organization |
text | Organization associated with the visitor's IP address - defaults to ISP name if none is found | No | 'Bouygues Telecom' |
ip_domain |
text | Second level domain name associated with the visitor's IP address | No | 'nuvox.net' |
ip_netspeed |
text | Visitor's connection type | No | 'Cable/DSL' |
Currently the only platform supported is web
. However, as we build trackers for more platforms (e.g. iOS, Windows 8) we would expect to add platforms that are specific to each of these platforms.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
Page fields | ||||
page_url |
text | The page URL | Yes | 'http://www.example.com' |
page_urlscheme |
text | Scheme aka protocol | Yes | 'https' |
page_urlhost |
text | Host aka domain | Yes | '“www.snowplowanalytics.com' |
page_urlport |
int | Port if specified, 80 if not | 80 | |
page_urlpath |
text | Path to page | No | '/product/index.html' |
page_urlquery |
text | Querystring | No | 'id=GTM-DLRG' |
page_urlfragment |
text | Fragment aka anchor | No | '4-conclusion' |
page_referrer |
text | URL of the referrer | No | 'http://www.referrer.com' |
page_title |
text | Web page title | No | 'Using ChartIO to visualize and interrogate Snowplow data - Snowplow Analytics' |
refr_urlscheme |
text | Referer scheme | No | 'http' |
refr_urlhost |
text | Referer host | No | 'www.bing.com' |
refr_urlport |
int | Referer port | No | 80 |
refr_urlpath |
text | Referer page path | No | '/images/search' |
refr_urlquery |
text | Referer URL querystring | No | 'q=psychic+oracle+cards' |
refr_urlfragment |
text | Referer URL fragment | No | |
refr_medium |
text | Type of referer | No | 'search', 'internal' |
refr_source |
text | Name of referer if recognised | No | 'Bing images' |
refr_term |
text | Keywords if source is a search engine | No | 'psychic oracle cards' |
refr_domain_userid |
text | The Snowplow domain_userid of the referring website | No | 'bc2e92ec6c204a14' |
refr_dvce_tstamp |
timestamp | The time of attaching the domain_userid to the inbound link | No | '2013-11-26 00:02:05' |
Document fields | ||||
doc_charset |
text | The page’s character encoding | No | , 'UTF-8' |
doc_width |
int | The page's width in pixels | No | 1024 |
doc_height |
int | The page's height in pixels | No | 3000 |
Marketing / traffic source fields | ||||
mkt_medium |
text | Type of traffic source | No | 'cpc', 'affiliate', 'organic', 'social' |
mkt_source |
text | The company / website where the traffic came from | No | 'Google', 'Facebook' |
mkt_term |
text | Any keywords associated with the referrer | No | 'new age tarot decks' |
mkt_content |
text | The content of the ad. (Or an ID so that it can be looked up.) | No | 13894723 |
mkt_campaign |
text | The campaign ID | No | 'diageo-123' |
mkt_clickid |
text | The click ID | No | 'ac3d8e459' |
mkt_network |
text | The ad network to which the click ID belongs | No | 'DoubleClick' |
Browser fields | ||||
user_fingerprint |
int | A user fingerprint generated by looking at the individual browser features | No | 2161814971 |
connection_type |
text | Type of internet connection | No | No |
cookie |
boolean | Does the browser support persistent cookies? | No | 1 |
br_name |
text | Browser name | No | 'Firefox 12' |
br_version |
text | Browser version | No | '12.0' |
br_family |
text | Browser family | No | 'Firefox' |
br_type |
text | Browser type | No | 'Browser' |
br_renderengine |
text | Browser rendering engine | No | 'GECKO' |
br_lang |
text | Language the browser is set to | No | 'en-GB' |
br_features_pdf |
boolean | Whether the browser recognizes PDFs | No | 1 |
br_features_flash |
boolean | Whether Flash is installed | No | 1 |
br_features_java |
boolean | Whether Java is installed | No | 1 |
br_features_director |
boolean | Whether Adobe Shockwave is installed | No | 1 |
br_features_quicktime |
boolean | Whether QuickTime is installed | No | 1 |
br_features_realplayer |
boolean | Whether RealPlayer is installed | No | 1 |
br_features_windowsmedia |
boolean | Whether mplayer2 is installed | No | 1 |
br_features_gears |
boolean | Whether Google Gears is installed | No | 1 |
br_features_silverlight |
boolean | Whether Microsoft Silverlight is installed | No | 1 |
br_cookies |
boolean | Whether cookies are enabled | No | 1 |
br_colordepth |
int | Bit depth of the browser color palette | No | 24 |
br_viewheight |
int | Viewport height | No | 1000 |
br_viewwidth |
int | Viewport width | No | 1000 |
See issue 94 for more details on br_windowheight
and br_windowwidth
.
Back to top.
Snowplow includes specific fields to capture data associated with specific events.
Note that to date, all event types have been defined by Snowplow. Also note that event_vendor
values follow the Java package naming convention.
Snowplow currently supports (or will support in the near future) the following event types:
Event type | Value of event field in model |
|
---|---|---|
2.3.1 | Page views | 'page_view' |
2.3.2 | Page pings | 'page_ping' |
2.3.3 | Ecommerce transactions | 'transaction' and 'transaction_item' |
2.3.4 | Errors | 'error' |
2.3.5 | Custom structured events | 'struct' |
2.3.6 | Custom unstructured events | 'unstruct' |
Details of which fields are available for which events are given below:
There are currently no fields that are specific to page_view
events: all the fields that are required are part of the standard fields available for any web-based event e.g. page_urlscheme
, page_title
.
Back to top.
There are four additional fields included with page pings that indicate how a user has scrolled over a web page since the last page ping:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
pp_xoffset_min |
integer | Minimum page x offset seen in the last ping period | No | 0 |
pp_xoffset_max |
integer | Maximum page x offset seen in the last ping period | No | 100 |
pp_yoffset_min |
integer | Minimum page y offset seen in the last ping period | No | 0 |
pp_yoffset_max |
integer | Maximum page y offset seen in the last ping period | No | 200 |
Back to top.
There are a large number of fields specifically for transaction events.
Fields that start tr_
relate to the transaction as a whole. Fields that start ti_
refer to the specific item included in the transaction. (E.g. a product in the basket.) Single transactions typically span multiple lines of data: there will be a single line where event
= transaction
, where the tr_
fields are set, and multiple lines (one for each product included) where event
= transaction_item
and the ti_
fields are set.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
tr_orderid |
text | Order ID | Yes | '#134' |
tr_affiliation |
text | Transaction affiliation (e.g. store where sale took place) | No | 'web' |
tr_total |
decimal | Total transaction value | Yes | 12.99 |
tr_tax |
decimal | Total tax included in transaction value | No | 3.00 |
tr_shipping |
decimal | Delivery cost charged | No | 0.00 |
tr_total_base * |
decimal | Total in base currency | No | 12.99 |
tr_tax_base * |
decimal | Total tax in base currency | No | 3.00 |
tr_shipping_base * |
decimal | Delivery cost in base currency | No | 0.00 |
tr_city |
text | Delivery address, city | No | 'London' |
tr_state |
text | Delivery address, state | No | 'Washington' |
tr_country |
text | Delivery address, country | No | 'France' |
tr_currency |
text | Currency | No | 'USD' |
ti_orderid |
text | Order ID | Yes | '#134' |
ti_sku |
text | Product SKU | Yes | 'pbz00123' |
ti_name |
text | Product name | No | 'Cone pendulum' |
ti_category |
text | Product category | No | 'New Age' |
ti_price |
text | Product unit price | Yes | 9.99 |
ti_price_base * |
text | Price in base currency | No | 9.99 |
ti_quantity |
text | Number of product in transaction | Yes | 2 |
ti_currency |
text | Currency | No | 'EUR' |
base_currency * |
text | Reporting currency | No | 'GBP' |
* Set exclusively by the Currency conversion enrichment.
Back to top.
This has not been implemented yet.
Back to top.
If you wish to track an event that Snowplow does not recognise as a first class citizen (i.e. one of the events listed above), then you can track them using the generic 'custom structured events'. Currently there are five fields that are available to store data related to custom events: we plant to increase this to 25 in the near future:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
se_category |
text | Category of event | Yes | 'ecomm', 'video' |
se_action |
text | Action performed / event name | Yes | 'add-to-basket', 'play-video' |
se_label |
text | The object of the action e.g. the ID of the video played or SKU of the product added-to-basket | No | 'pbz00123' |
se_property |
text | A property associated with the object of the action | No | 'HD', 'large' |
se_value |
decimal | A value associated with the event / action e.g. the value of goods added-to-basket | No | 9.99 |
See issue 74 for additional information.
Back to top.
Custom unstructured events are a flexible tool that enable Snowplow users to define their own event types and send them into Snowplow.
When a user sends in a custom unstructured event, they do so as a JSON of name-value properties, that conforms to a JSON schema defined for the event earlier.
The complete JSON envelop for the event is loaded into the atomic.events
table, in the unstruct_event
field as shown below:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
unstruct_event |
JSON | Name of the event | Yes | {"schema":"iglu:com.mycompany/product_view/jsonschema/1-0-1", "data":{"sku":"14424"}} |
In addition, for Snowplow users running Redshift, the unstructured event is shredded into its own table in Amazon Redshift. The fields in this table will be determined by the JSON schema defined for the event in advance. Users can query just the table for that particular unstructured event, if that's all that's required for their analysis, or join that table back to the atomic.events
table by atomic.my_example_unstructured_event_table.root_id = atomic.events.root_id
.
Back to top.
Contexts enable Snowplow users to define their own entities that are related to events, and fields that are related to each of those entities. For example, an online retailer may choose to define a user
context, to store information about a particular user, which might include data points like the users Facebook ID, age, membership details etc. In addition, they may also define a product
context, with product data e.g. SKU, name, created date, description, tags etc.
An event can have any number of custom contexts attached. Each context is passed into Snowplow as a JSON. Additionally, the Snowplow Enrichment process can derive additional contexts.
The arrays of contexts JSONs related to an event are loaded directly into the atomic.events
table as described below:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
contexts |
array | Contexts attached to event by Tracker | No | [{"schema":"iglu:com.acme/page_type/jsonschema/1-0-0", "data":{"type":"test"}}] |
derived_contexts |
array | Contexts attached to event by Enrich | No | [{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0", "data":{"deviceFamily":"Windows"}}] |
In addition, for users running on Redshift, Snowplow will shred each context JSON into a dedicated table in the atomic
schema, making it much more efficient for analysts to query data passed in in any one of the contexts. Those contexts can be joined back to the core atomic.events
table on atomic.my_custom_context_table.root_id = atomic.events.event_id
, which is a one-to-one join.
Back to top.
These are also a variety of unstructured events and custom contexts defined by Snowplow. You can find their schemas here.
Back to top.
- Currently, Snowplow data is stored in S3 (for processing in Apache Hive, Pig, and / or Mahout), Redshift and PostgreSQL (for processing by any SQL-compatible analytics package).
- There are minor differences between the structure of data in both formats. These relate to data structures that Hive and PostgreSQL support (e.g. JSONs) that Redshift does not
- Nevertheless, the structure of both is similar: representing a fat table
- Going forwards our intention is to move the storage format for data on S3 from the current flatfile structure to Avro. This will become the 'canonical Snowplow data structure'. Other formats (e.g. Redshift, PostgreSQL etc.) will simply be 'flattened' versions of the same data. We have outlined some of our plans in this blog post.
Back to top.
Home | About | Project | Setup Guide | Technical Docs | Copyright © 2012-2021 Snowplow Analytics Ltd. Documentation terms of use.
HOME » TECHNICAL DOCUMENTATION
1A. Trackers
Overview
ActionScript3 Tracker
Android Tracker
Arduino Tracker
CPP Tracker
Golang Tracker
Google AMP Tracker
iOS Tracker
Java Tracker
JavaScript Tracker
Lua Tracker
.NET Tracker
Node.js Tracker
PHP Tracker
Pixel Tracker
Python Tracker
Ruby Tracker
Scala Tracker
Unity Tracker
Building a Tracker
1B. Webhooks
Iglu webhook adapter
CallRail webhook adapter
MailChimp webhook adapter
Mandrill webhook adapter
PagerDuty webhook adapter
Pingdom webhook adapter
SendGrid webhook adapter
Urban Airship Connect webhook adapter
Mailgun webhook adapter
StatusGator webhook adapter
Unbounce webhook adapter
Olark webhook adapter
Marketo webhook adapter
Vero webhook adapter
2. Collectors
Overview
Cloudfront collector
Clojure collector (Elastic Beanstalk)
Scala Stream Collector
3. Enrich
Overview
EmrEtlRunner
Stream Enrich
Beam Enrich
Snowplow Event Recovery
Hadoop Event Recovery
C. Canonical Snowplow event model
4. Storage
Overview
Relational Database Shredder
Relational Database Loader
S3 Loader
Elasticsearch Loader
Storage in Redshift
Storage in PostgreSQL
Storage in Infobright (deprecated)
D. Snowplow storage formats (to write)
5. Analytics
Analytics-documentation
Event-manifest-populator
Common
Shredding
Artifact repositories