Releases: innosat-mats/rac-extract-payload
v1.4.0 More restrictive partitioning
v1.3.1
v1.3.0 Partition by hour
What's Changed
- Bump certifi from 2022.9.24 to 2022.12.7 in /raclambda by @dependabot in #164
- Refactor lambda to expect and handle a single file at a time by @e-larsson in #167
- Increase memory and storage by @skymandr in #172
- Partition by hour by @skymandr in #174
New Contributors
- @dependabot made their first contribution in #164
Full Changelog: v1.2.0...v1.3.0
v1.2.0: Parquet partition update
🌟 Features
This changes the output directory structure (the "partitioning scheme") when writing Parquet from `y/m/d/STREAM_filename.parquet` to `STREAM/y/m/d/filename.parquet`.
This structure is preferable, since it makes it very easy for e.g. a Lambda function listening for new files to know whether it should wake up. The downside is that reading data from different sources at the same time becomes slightly less convenient, but no more so than with different CSVs, so in a sense this brings the Parquet writing back in line with the CSV/PNG/JSON pipeline.
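A minimal sketch of the new path layout, assuming the partition components are zero-padded dates (the exact padding used by the real writer, and the stream and file names below, are illustrative assumptions):

```python
from datetime import datetime, timezone

def parquet_path(stream: str, filename: str, when: datetime) -> str:
    """Build a v1.2.0-style partition path: STREAM/y/m/d/filename.parquet."""
    return f"{stream}/{when:%Y/%m/%d}/{filename}.parquet"

t = datetime(2022, 12, 7, tzinfo=timezone.utc)
# Old scheme would interleave streams under the date directories;
# the new scheme groups by stream first, so a listener can match on the prefix:
path = parquet_path("CCD", "MATS_OPS_0001", t)  # "CCD/2022/12/07/MATS_OPS_0001.parquet"
```

With the stream as the top-level directory, a notification filter only needs a prefix match on `CCD/` to decide whether to wake up.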
🐛️ Bugs
No changes
📋 Documentation
No changes
🛠 System
No changes
👷 Chore / Maintenance
No changes
v1.1.2: Optional image data
🌟 Features
No changes
🐛️ Bugs
The Parquet schema now allows `ImageData` to be empty. This is necessary because we want the metadata associated with an image, even if the image itself is broken.
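The behaviour can be sketched as follows, with a stand-in parser in place of the real image decoding (the function and field names besides `ImageName`/`ImageData` are hypothetical):

```python
def parse_png(raw: bytes) -> bytes:
    """Stand-in parser: a real implementation would decode the PNG payload."""
    if not raw.startswith(b"\x89PNG"):
        raise ValueError("corrupt image")
    return raw

def image_row(name: str, raw: bytes) -> dict:
    """Keep the image's metadata even when the payload is broken."""
    try:
        data = parse_png(raw)
    except ValueError:
        data = None  # ImageData may now be empty in the Parquet schema
    return {"ImageName": name, "ImageData": data}

good = image_row("img_001.png", b"\x89PNG...")
bad = image_row("img_002.png", b"JUNK")  # metadata survives, payload is None
```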
📋 Documentation
No changes
🛠 System
No changes
👷 Chore / Maintenance
No changes
v1.1.1: Better error-handling when parsing JPEGs
🌟 Features
No changes
🐛️ Bugs
The default error handling in libjpeg is designed to exit the program on certain errors where we just want to skip that step and continue processing. This led to `.rac` files containing corrupt JPEG data breaking the entire processing run (see #153). This release fixes that by implementing a custom error handler.
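The actual fix is a custom libjpeg error handler in the Go code; the skip-and-continue pattern it enables can be sketched in Python with a stand-in decoder (all names here are illustrative):

```python
def decode_jpeg(raw: bytes) -> bytes:
    """Stand-in decoder: the real code calls libjpeg, whose default
    error handler would exit the whole process on corrupt data."""
    if raw == b"corrupt":
        raise ValueError("invalid JPEG data")
    return raw

def process(packets):
    """Skip packets whose JPEG data cannot be decoded, instead of
    aborting the whole run (the behaviour fixed in #153, sketched)."""
    images, skipped = [], 0
    for raw in packets:
        try:
            images.append(decode_jpeg(raw))
        except ValueError:
            skipped += 1  # note the failure and continue with the next packet
    return images, skipped

images, skipped = process([b"ok1", b"corrupt", b"ok2"])
```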
📋 Documentation
No changes
🛠 System
No changes
👷 Chore / Maintenance
No changes
v0.2.8: Better error-handling when parsing JPEGs
🌟 Features
No changes
🐛️ Bugs
The default error handling in libjpeg is designed to exit the program on certain errors where we just want to skip that step and continue processing. This led to `.rac` files containing corrupt JPEG data breaking the entire processing run (see #153). This release fixes that by implementing a custom error handler.
📋 Documentation
No changes
🛠 System
No changes
👷 Chore / Maintenance
No changes
v1.1.0 Day one patch: New file name and schema conventions
🌟 Features
This release changes the output so that Parquet data is written to separate files based on both packet type and original file, using a different Parquet schema for each packet type. This is preferable, since otherwise Go will helpfully add default values (e.g. 0) to columns that belong to another packet type, making the origin of rows hard to disambiguate and making it hard to filter.
The new file naming scheme also makes it easy to filter what files to read in PyArrow, which should be useful to save on resources.
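A sketch of the kind of file-level filtering this enables, using stdlib pattern matching before handing the selection to a reader such as PyArrow (the stream and file names are illustrative assumptions):

```python
from fnmatch import fnmatch

# Hypothetical directory listing under the new naming scheme, where the
# packet type (stream) is part of each file name:
files = [
    "CCD_MATS_OPS_0001.parquet",
    "HTR_MATS_OPS_0001.parquet",
    "CCD_MATS_OPS_0002.parquet",
]

# Select only one packet type's files before reading, so rows from other
# packet types (and their zero-filled default columns) are never loaded:
ccd_files = [f for f in files if fnmatch(f, "CCD_*.parquet")]
```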
🐛️ Bugs
Updated the time convention from microseconds to nanoseconds for `EXPDate` output.
📋 Documentation
`rac -help` has been updated with information about the updated outputs.
🛠 System
No changes
👷 Chore / Maintenance
No changes
v1.0.0 Major release: New output format
🌟 Features
This release introduces a new output format, Parquet. This is a compact binary format, similar to CSV, but made for high-throughput data processing. It has excellent support in Python through e.g. PyArrow. To write to Parquet, use the `-parquet` flag.
Also note that the option to write to AWS directly has been removed in this release; see System below.
Details on the outputs
The parquet files follow the same naming conventions used in the CSVs, but the header row is stored as meta-data instead. Parquet files support variable length rows, so instead of one file per packet type, one file per input file is produced.
In addition, the parquet files are written using a partitioning scheme so that data for each day is written to a file in a directory for that day. This means that files with the same name may occur in directories for subsequent days, if the original RAC-file covers two days. Partitioning is performed based on the CUC time of the source packet.
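The day-partitioning rule can be sketched as grouping packets by the day of their timestamp, where a `datetime` stands in for the decoded CUC time (the timestamps below are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

def day_dir(ts: datetime) -> str:
    """Directory for the day a packet's (stand-in) CUC time falls on."""
    return ts.strftime("%Y/%m/%d")

# Two packets from one RAC file that straddles midnight end up in
# different day directories, under the same output file name:
packets = [
    datetime(2022, 12, 7, 23, 59, tzinfo=timezone.utc),
    datetime(2022, 12, 8, 0, 1, tzinfo=timezone.utc),
]
partitions = defaultdict(list)
for ts in packets:
    partitions[day_dir(ts)].append(ts)
```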
When writing to Parquet, the PNG files are stored in the Parquet files themselves, rather than as separate files. This introduces two new columns:
- `ImageName`: the name of the PNG image, if it had been written to disk
- `ImageData`: the parsed PNG data
The capability to write CSV/PNG/JSON is retained as the default and produces the same files as before, but some headers etc. have been updated in the resulting files, so any scripts parsing those may have to be updated.
For further details, see `rac -help`!
🐛️ Bugs
No changes
📋 Documentation
`rac -help` has been updated with information about the updated and new output formats.
🛠 System
- AWS is no longer supported directly from the RAC binary; see #148 for the rationale. AWS sync will henceforth be handled by the calling outer layer, such as a Lambda function.
👷 Chore / Maintenance
- Minor fixes of typographical errors and the like.
v0.2.7 Continue interrupted multi packet
🌟 Features
Added a new command line argument, `-dregs <path>`. This lets the user specify a path where temporary "dregs" files [0] are read and written, used for saving and continuing multi-packet data between batch runs.
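The mechanism can be sketched as persisting the unfinished tail of a multi-packet group between runs; the file name and byte contents below are illustrative assumptions, not the tool's actual format:

```python
import tempfile
from pathlib import Path

def load_dregs(path: Path) -> bytes:
    """Read leftover bytes from the previous batch run, if any."""
    return path.read_bytes() if path.exists() else b""

def save_dregs(path: Path, remainder: bytes) -> None:
    """Persist the unfinished tail of a multi-packet group for the next run."""
    path.write_bytes(remainder)

# Illustrative batch boundary: the first run ends mid-group; the second
# run prepends the saved dregs before parsing continues.
dregs = Path(tempfile.mkdtemp()) / "dregs.bin"
save_dregs(dregs, b"\x01\x02")          # end of run 1: two bytes left over
data = load_dregs(dregs) + b"\x03\x04"  # start of run 2: continue the group
```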
🐛️ Bugs
No changes
📋 Documentation
No changes
🛠 System
No changes
👷 Chore / Maintenance
No changes
[0]: Data Remaining after Extracting Group of Source packets