
Email processing

Leigh Dodds edited this page Jan 20, 2021 · 7 revisions

A number of data feeds for Energy Usage are sent to the application as email attachments. Few of the energy data providers offer an API, and those that do vary widely.

Most, if not all, offer the option to receive meter readings via email as a CSV attachment. The formats are reasonably consistent across suppliers, although there are some variations, which are handled during AMR CSV Parsing. Processing email therefore provides a consistent (if not ideal) way to receive data from schools.

This part of the system works as follows:

  • Meter data for a specific school, or group of schools, is forwarded to a pre-defined email address for each school or group, e.g. [email protected]. Each address is tied to configuration that helps us later parse the AMR CSV files. Emails from schools are auto-forwarded to the correct address from a single internal Energy Sparks Gmail account, which has filters set up. These filters forward each email to the address specific to its supplier/file format.
  • Emails are received via Amazon Email Service, which has been configured with rules to place emails into S3 buckets. The school prefix is used to group files received via the different email addresses.
  • As emails are received, they are automatically processed by Amazon Lambda functions, which take them through a simple data pipeline that extracts the attached CSV files (which might be zipped) and places them into a bucket for processing.
  • Unprocessed data files are picked up and loaded into the system as part of the regular Background processes.

Email rules

The email rules are configured in the Amazon SES service. The rules simply indicate that email sent to specific addresses should be placed into S3 for later processing. Copies of the data are made for the development, test and production buckets.
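A receipt rule of this kind might look roughly as follows. This is a hedged sketch in the shape of the SES receipt-rule API; the rule name, recipient address and bucket name are illustrative, not the real configuration:

```json
{
  "Name": "data-feeds-to-s3",
  "Enabled": true,
  "Recipients": ["[email protected]"],
  "Actions": [
    {
      "S3Action": {
        "BucketName": "data-inbox",
        "ObjectKeyPrefix": ""
      }
    }
  ],
  "ScanEnabled": true
}
```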

Data Pipeline (Lambda functions)

The lambda functions work from copies of the emails or attachments that are placed into S3 by the SES service.

There are several buckets:

  • data-inbox - copies of emails received by SES. In MIME format. Populated by the email rule
  • data-process - copies of attachments from emails received. In various formats, usually CSV or ZIP
  • data-compressed (actually called "uncompressed", confusingly) - Zipped attachments for uncompressing
  • data-unprocessable - files with unrecognised formats, or files which triggered errors when uncompressing
  • data-amr-data - CSV files ready for loading. This bucket is read by the background processes that load data into the application

The data-inbox bucket contents are just the emails. The other buckets are organised by email prefix to make it easier to identify the source of the data.

Data moves through these buckets via the lambda functions which listen for events on them:

  • unpack-attachments - reads MIME format files from data-inbox and places attachments in the data-process bucket
  • process-file - copies CSV files straight to data-amr-data and ZIP files to data-compressed; anything else goes to data-unprocessable
  • uncompress - processes ZIP files from data-compressed, placing their contents in data-amr-data; problematic files are moved to data-unprocessable
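The routing decision made by process-file can be sketched as a simple dispatch on file extension. This is an illustrative reconstruction, not the actual lambda code; the bucket names follow the list above:

```python
def destination_bucket(key: str) -> str:
    """Route an attachment to the next pipeline bucket by extension.
    Sketch of the process-file step; the real lambda may differ."""
    suffix = key.lower().rsplit(".", 1)[-1] if "." in key else ""
    if suffix == "csv":
        return "data-amr-data"      # ready for loading
    if suffix == "zip":
        return "data-compressed"    # needs uncompressing first
    return "data-unprocessable"     # unrecognised format
```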

The files themselves are not really processed or interpreted by these functions, other than to remove invalid UTF-8 characters. All interpretation of the CSV files is done at a later stage.
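The UTF-8 clean-up could be as simple as a decode/re-encode round trip that drops invalid byte sequences. A minimal sketch of one way to do it (the actual lambda implementation is not shown here and may differ):

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop byte sequences that are not valid UTF-8, keeping the rest
    of the file intact for later CSV parsing."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```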

The code and configuration for the lambda functions is in the data-pipeline folder of the application.
