Lazily-evaluated transformations on Dat archives. Inspired by Resilient Distributed Datasets (RDD).
```
npm i dat-transform
```
Word-count example:
```js
const {RDD, kv} = require('dat-transform')
const Hyperdrive = require('hyperdrive')
const ram = require('random-access-memory')

const archive = new Hyperdrive(ram, '<DAT-ARCHIVE-KEY>', {sparse: true})

// define transforms (no computation happens yet)
const wc = RDD(archive)
  .splitBy(/[\n\s]/)
  .filter(x => x !== '')
  .map(word => kv(word, 1))

// actual run (action)
wc.reduceByKey((x, y) => x + y)
  .toArray(res => {
    console.log(res) // [{bar: 2, baz: 1, foo: 1}]
  })
```
Transforms are lazily-evaluated functions on a Dat archive. Defining a transform on an RDD does not trigger computation immediately; instead, transforms are pipelined and computed only when the result is actually needed, which opens up opportunities for optimization. Transforms are applied to each file separately.
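For instance, in the minimal sketch below (reusing the `archive` from the example above), defining the pipeline reads nothing from the archive; data only flows once the `toArray` action runs:

```js
// Defining the pipeline is cheap: no data is read from the archive yet.
const lines = RDD(archive)
  .splitBy(/\n/)
  .filter(line => line !== '')

// The action triggers the actual computation, pulling data through
// the pipelined transforms.
lines.toArray(res => {
  console.log(res.length)
})
```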
The following transforms are included (see the sketch after this list):

- `map(f)`
- `filter(f)`
- `splitBy(f)`
- `sortBy(f)` (check `test/index.js` for a gotcha)
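A hedged sketch of chaining these transforms, assuming `sortBy(f)` takes a key function in the style of Spark's `RDD.sortBy` (see `test/index.js` for the exact semantics):

```js
// Sort non-empty words by length; nothing runs until toArray is called.
RDD(archive)
  .splitBy(/[\n\s]/)
  .filter(x => x !== '')
  .sortBy(word => word.length) // assumed key-function signature
  .toArray(res => console.log(res))
```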
Actions are operations that return a value to the application.
Examples of actions (a usage sketch follows this list):

- `collect()`
- `take(n)`
- `reduceByKey(f)`
- `count()`
- `sum()`
- `takeSortedBy()`
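For example, a total word count could look like the sketch below. Only `toArray`'s callback style is confirmed by the example above, so the assumption here is that `sum()` delivers its result the same way:

```js
// Count every non-empty word in the archive.
RDD(archive)
  .splitBy(/[\n\s]/)
  .filter(x => x !== '')
  .map(() => 1)
  .sum(total => console.log('total words:', total)) // assumed callback style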
`dat-transform` provides indexing via hyperdrive's list of entries. You can specify the entries you want to compute with, which can greatly reduce bandwidth usage (see the sketch below):

- `get(entryName)`
- `select(f)`
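A sketch of both, assuming `select(f)` takes a predicate over hyperdrive entry objects and `get(entryName)` narrows the RDD to a single entry:

```js
// Compute only over .txt entries instead of the whole archive.
RDD(archive)
  .select(entry => entry.name.endsWith('.txt')) // assumed entry shape
  .splitBy(/[\n\s]/)
  .toArray(res => console.log(res))

// Or over a single named entry ('words.txt' is a hypothetical name).
RDD(archive)
  .get('words.txt')
  .toArray(res => console.log(res))
```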
Partitions let you re-index and cache the computed result in another archive:

- `partition(outArchive)` (returns a promise)
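A minimal sketch, assuming `partition` is called at the end of a chain and the target can be any writable hyperdrive archive:

```js
// Cache the word-count result into a fresh, writable archive.
const outArchive = new Hyperdrive(ram)

wc.reduceByKey((x, y) => x + y)
  .partition(outArchive)
  .then(() => console.log('result cached to outArchive'))
```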
Transforms can be marshalled as JSON, which allows execution on a remote machine (a round-trip sketch follows):

- `RDD.marshal`
- `unmarshal`
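A hedged sketch of the round trip; the exact `RDD.marshal` and `unmarshal` signatures are assumptions based on the names above:

```js
// Serialize the transform chain on one machine...
const payload = JSON.stringify(RDD.marshal(wc)) // assumed signature

// ...and rebuild it against an archive on the remote machine.
const remote = unmarshal(JSON.parse(payload), archive) // assumed signature
remote.toArray(res => console.log(res))
```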
`dat-transform` uses streams from highland.js, which provide lazy evaluation and back-pressure.
The MIT License