
AMR CSV Parsing


This page has some notes about how AMR CSV data is parsed and processed in the application.

Overview

Most of the AMR data is provided as CSV files. The bulk of it is delivered via Email Processing, although some comes from FTP Processing. It all ends up in an AWS S3 bucket specific to the environment, i.e. development, test (or staging) and production.

The import task is one of the Background processes and is triggered via a cron job.

The data picked up from S3 may be in a variety of CSV formats. These are reasonably consistent but have different column orders, headers, etc., depending on the feed.

The configuration information for each feed is stored in the database. The information is viewable by an Administrator in Energy Sparks, but requires developer intervention to amend or update.

During each run of the importer, a log record (AmrDataFeedImportLog) is generated for each CSV file, which indicates which configuration was used. The record captures any error messages and the number of records inserted or updated.
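As a rough illustration only, recording such a log entry might look something like the sketch below. The model name (AmrDataFeedImportLog) comes from this page, but the attribute names are assumptions, not necessarily the real schema.

```ruby
# Hypothetical sketch: attribute names are assumed, only the model name
# (AmrDataFeedImportLog) comes from this page.
AmrDataFeedImportLog.create!(
  amr_data_feed_config: config,        # which feed configuration was used
  file_name: file_name,                # the CSV file picked up from S3
  error_messages: errors.join("\n"),   # any errors hit while processing
  records_imported: inserted_count,    # rows inserted into amr_data_feed_readings
  records_updated: updated_count       # rows updated in amr_data_feed_readings
)
```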

The records from the CSV are parsed according to the configuration and, where necessary, the structure is adjusted to match our internal storage.

It is then stored in the amr_data_feed_readings table, which aims to hold the data in a normalised format, e.g. as a series of 48 half-hourly readings per day, with meter ids added where we can match MPAN/MPRNs from the data to those configured in the system. However, no conversion is applied to the individual data points: everything is stored as strings in whatever format the CSV provided.
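As a rough illustration, a single stored row might look like the following. The column names here are assumptions based on the description above, not a guaranteed match for the actual schema.

```ruby
# Hypothetical shape of one row in amr_data_feed_readings
reading_row = {
  amr_data_feed_config_id: config.id,
  meter_id: meter&.id,              # only set where the MPAN/MPRN could be matched
  mpan_mprn: '1234567890123',
  reading_date: '20/01/2021',       # stored as provided; no conversion applied
  readings: Array.new(48, '0.123')  # 48 half-hourly values, kept as strings
}
```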

This data is not used directly by the system. It first goes through AMR Data Validation; the output of that step is what the application uses.

Errors and warnings from the loading process can be viewed by administrators.

CSV Parsing & Feed Configuration

CSVs are parsed using the default Ruby CSV parser. Invalid UTF-8 characters should already have been stripped out during the Email Processing. The column separator is defined in the configuration.
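For illustration, the core of this step might look like the following. `file_contents` and `config` are placeholders, but `CSV.parse` and its `col_sep:` option are the standard Ruby CSV API.

```ruby
require 'csv'

# Parse the file with the default Ruby CSV parser, using the separator
# from the feed configuration (falling back to a comma).
rows = CSV.parse(file_contents, col_sep: config.column_separator || ',')
```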

The amr_data_feed_configs table stores the configuration used to parse and interpret CSV files. The relevant columns are:

  • column_separator -- the column separator in the CSV file. Default is ","
  • description -- descriptive label for the configuration, e.g. energy-assets. Required
  • date_format -- Ruby date format string, e.g. %d/%m/%Y. Required. Note: if this doesn't work, or is incorrect, then we currently fall back to Date.parse (see the sketch below)
  • expected_units -- Specify expected units, e.g. kWh
  • handle_off_by_one -- default is false. Required
  • header_example -- Expected header to be included in the CSV file, e.g. "\"MPR\",\"Date\",\"hr0030\",\"hr0100\",\"hr0130\",\"hr0200\",\"...".
  • identifier -- prefix for files that match this format in S3. Files are stored as part of Email Processing and FTP Processing
  • import_warning_days -- default is 7
  • meter_description_field --
  • missing_readings_limit -- number of allowed missing readings per day; rows with more missing readings than this are rejected (see CSV Validation and Normalisation below)
  • mpan_mprn_field -- name of column that contains MPAN or MPRN identifier. E.g. \"MPR\". Required
  • msn_field --
  • number_of_header_rows -- Number of rows in CSV to consider to be header rows. Default is 0
  • postcode_field -- Only set for BANES currently.
  • process_type -- Required. Default is s3_folder
  • provider_id_field -- provider id? Only set for BANES currently.
  • reading_date_field -- name of column that contains date of meter reading. E.g. \"Date\"
  • reading_fields -- Required. Array of column names that have the 48 half-hourly meter readings. E.g. ["\"hr0030\"", "\"hr0100\"", "\"hr0130\"", "\"hr0200\"", "\"hr0230\"", "\"hr0300\"", "\"hr0330\"", "\"hr0400\"", "\"hr0430\"", "\"hr0500\"", "\"hr0530\"", "\"hr0600\"", "\"hr0630\"", "\"hr0700\"", "\"hr0730\"", "\"hr0800\"", "\"hr0830\"", "\"hr0900\"", "\"hr0930\"", "\"hr1000\"", "\"hr1030\"", "\"hr1100\"", "\"hr1130\"", "\"hr1200\"", "\"hr1230\"", "\"hr1300\"", "\"hr1330\"", "\"hr1400\"", "\"hr1430\"", "\"hr1500\"", "\"hr1530\"", "\"hr1600\"", "\"hr1630\"", "\"hr1700\"", "\"hr1730\"", "\"hr1800\"", "\"hr1830\"", "\"hr1900\"", "\"hr1930\"", "\"hr2000\"", "\"hr2030\"", "\"hr2100\"", "\"hr2130\"", "\"hr2200\"", "\"hr2230\"", "\"hr2300\"", "\"hr2330\"", "\"hr0000\""]
  • row_per_reading -- Required. Default is false
  • source_type -- Required. Default is email
  • total_field --
  • units_field --

Note that header field names must be quoted. There are also created_at and updated_at timestamp fields, which record when a configuration was created or last changed.
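As a small illustration of the date_format behaviour described above: try the configured format first and fall back to Date.parse if it fails. The method name here is hypothetical.

```ruby
require 'date'

# Try the configured date_format first; fall back to Date.parse if the
# format doesn't work or is incorrect.
def parse_reading_date(value, date_format)
  Date.strptime(value, date_format)
rescue ArgumentError
  Date.parse(value)
end

parse_reading_date('20/01/2021', '%d/%m/%Y') # => 2021-01-20
```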

Header rows are handled as follows:

  • if the first row in the file matches the header_example in the configuration, then that row is skipped. (Note: the comparison uses the row as parsed by the CSV parser and turned into an array, so the header example stored in the database can be comma-separated even if the original file uses tabs.)
  • if number_of_header_rows is set, then that many rows are skipped. In this case the actual header names are ignored.
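A minimal sketch of those two rules, assuming the rows have already been parsed into arrays. Method and accessor names are illustrative, not the application's actual code.

```ruby
require 'csv'

# Skip header rows according to the feed configuration.
def strip_header_rows(rows, config)
  if config.number_of_header_rows.to_i > 0
    # a fixed number of rows is skipped and their contents ignored
    rows.drop(config.number_of_header_rows)
  elsif config.header_example && rows.first == CSV.parse_line(config.header_example)
    # the stored example is parsed in the same way as the file, so the
    # comparison is between two arrays of field names
    rows.drop(1)
  else
    rows
  end
end
```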

CSV Validation and Normalisation

  • if handle_off_by_one is set, then the first reading listed for a day actually belongs to the previous day, so we move it to the right day (see the sketch after this list). Whether this is needed is assumed to be known from inspecting the data prior to setup. The code assumes the CSV is sorted by date.
  • empty rows are ignored, as are rows with only a single field. This isn't logged.
  • rows are rejected if the number of missing readings is more than missing_readings_limit. This isn't logged.
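A minimal sketch of the off-by-one adjustment and the missing-readings check, assuming readings are already grouped by day and sorted by date. All names are illustrative, not the application's actual code.

```ruby
# days: array of { date:, readings: } hashes, in date order
def shift_off_by_one(days)
  days.each_with_index.map do |day, i|
    next_day = days[i + 1]
    # this day's first reading belongs to the previous day, so drop it and
    # borrow the first reading of the following day as the final half-hour slot
    { date: day[:date],
      readings: day[:readings].drop(1) + [next_day && next_day[:readings].first] }
  end
end

# Reject a row when it has more blank readings than the configured limit.
def too_many_missing?(readings, missing_readings_limit)
  readings.count { |r| r.nil? || r.to_s.strip.empty? } > missing_readings_limit
end
```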

To extract data from columns, the code treats each row in the CSV file as an array. The names of the fields from the configuration are used to find the right index in the array.

The reading_fields are treated slightly differently. The configuration is treated as an array of column names, rather than specifying both the name and the order of the columns. So there may be additional columns provided (such as notes for individual reading columns, or alternative formats), but these will be ignored.
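A rough sketch of that lookup, assuming a `header` array of column names is available to map names to indices. Method names are illustrative.

```ruby
# Extract the fields named in the configuration from a single parsed row.
def extract_row(row, header, config)
  mpan_mprn    = row[header.index(config.mpan_mprn_field)]
  reading_date = row[header.index(config.reading_date_field)]
  # reading_fields lists the 48 half-hourly columns by name; any extra columns
  # in the file are simply never looked up, so they are ignored
  readings = config.reading_fields.map { |name| row[header.index(name)] }
  { mpan_mprn: mpan_mprn, reading_date: reading_date, readings: readings }
end
```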

If there is an expected_units configured, then this is added to each meter reading.

The end result is enough data to populate the amr_data_feed_readings table.

Additional validation and interpretation of the rest of the data is handled in AMR Data Validation.
