by Anthony Cros (2021)
Execution happens in two phases, each traversing a dedicated execution DAG:
- An initial meta phase that ignores the data and ensures that transformation steps are consistent with one another (schema-wise)
- A subsequent data phase where the data is actually processed
This article is a quick tour that focuses on a few examples rather than a comprehensive guide. A more thorough discussion of design choices/limitations/direction will come in subsequent article(s). The project is very immature, and at this point I would simply like to gauge the level of interest in it (please keep any feedback at this level for now).
- Some links lead to documentation that has yet to be written.
- The examples use JSON - despite its flaws - because of its ubiquity as a notation.
The library is written in Scala (2.13).
It requires the following inclusion:
libraryDependencies += "org.gallia" %% "gallia-core" % "0.0.1"
IMPORTANT NOTE: no JAR has actually been published yet, license will need to be finalized first
The client code then requires the following import:
import gallia._
One can also optionally add the following import for general utilities:
import aptus._ // our utilities library
Note that there exist at this time two other modules besides core, each with additional dependencies of their own: gallia-mongodb and gallia-spark (discussed further down)
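If those modules are needed, their sbt inclusion would presumably look like the core one above (a sketch; the gallia-mongodb version shown here is an assumption):
libraryDependencies += "org.gallia" %% "gallia-mongodb" % "0.0.1"
libraryDependencies += "org.gallia" %% "gallia-spark"   % "0.0.1"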
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
.read() // will infer schema if none is provided
// uppercase string value for field "foo" ("hello" -> "HELLO")
.toUpperCase('foo)
// increment integer value for field "bar" (1 -> 2)
.increment('bar)
// remove field "qux" (irrespective of field type)
.remove('qux)
// nest (boolean) field "baz" under (new) field "parent"
.nest('baz).under('parent)
// flip boolean value of field "baz" (now nested under "parent"),
// and rename it "while-at-it" ("baz" -> "BAZ")
.flip('parent |> 'baz ~> 'BAZ)
.printJson()
// prints: {"foo": "HELLO", "bar": 2, "parent": { "BAZ": false }}
The schema is maintained throughout operations, so you get an error if you try for example to square a boolean:
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
.read()
.toUpperCase('foo)
.increment ('bar)
.remove ('qux)
.nest ('baz).under('parent)
.square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
.printJson()
// ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
- This error occurs prior to the actual data run, and no data is therefore processed
- The error mechanism works at any level of nesting/multiplicity
- Of course, some errors cannot be caught until the data is actually seen (e.g. "foo".apply(5), or is-distinct types of checks); see the sketch just below
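For instance, a hedged sketch of the first kind (the meta phase only sees a legal String transformation, so the out-of-range index can only surface once the data is processed):
"""{"foo": "hi"}"""
  .read()
  // schema-wise this is fine, but "hi" has no character at index 5,
  // so the failure only occurs during the data phase
  .transformString('foo).using(_.apply(5).toString)
  .printJson()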
// INPUT:
// {"first": "John", "last": "Johnson", "DOB": "1986-02-04", ...}\n
// {"first": "Kate", ...
"/data/protopeople.jsonl.gz"
.stream() // vs .read() for single object
.generate('username).from(_.string('first), _.string('last))
.using { (f, l) => s"${f.head}${l}".toLowerCase } // -> "jjohnson"
.toUpperCase('last)
.fuse('first, 'last).as('name).using(_ + " " + _)
.transformString('DOB ~> 'age).using(
_.toLocalDateFromIso.getYear.thn(2021 - _))
.write("/tmp/people.jsonl.gz")
// OUTPUT:
// {"username": "jjohnson", "name": "John JOHNSON", "age": 32, ...}\n
// {"username": ...
- JSONL = one JSON document per line
- This example makes use of .toLocalDateFromIso() and .thn() from our import aptus._ above (see docs)
"/data/some.tsv.gz".stream()
.retain('_id, 'age, 'gender)
.groupBy('age)
// ...
See more in inputs below.
Keys can be referenced as Scala's Symbol, String, Enumeration, and enumeratum.Enum:
"""{"foo": 1}"""
.read().rename("foo" ~> 'FOO)
// OUTPUT: {"FOO":1}
"""{"Very Poor Key Choice ":
"please_stop_using_spaces_and_unnecessary_uppercasing_in_keys"}"""
.read()
.rename("Very Poor Key Choice " ~> 'much_better)
.transformString('much_better).using(_ => "isn't it?")
// OUTPUT: {"much_better":"isn't it?"}
Paths can be referenced conveniently via the "pipe+greater-than" (|>) notation:
"""{"parent": {"foo": "value"}}"""
.read().toUpperCase('parent |> 'foo)
// OUTPUT: {"parent": {"foo": "VALUE"}}
Note that a key is just a trivial path.
Applicable for both .read() and .stream() (one vs multiple objects):
// INPUT: {"foo": "hello", "bar": 1, "baz": true, "qux": "world"}
data.retain(_.firstKey)           // {"foo": "hello"}
data.retain(_.allBut('qux))       // {"foo": "hello", "bar": 1, "baz": true}
data.retain(_.customKeys(_.tail)) // {"bar": 1, "baz": true, "qux": "world"}
// (overly) complex example:
"""{"k1": "v1", "K2": "v2", "k3": "V3", "K4": "V4", "k5": "v5"}""".read()
.removeIfValueFor(_.string(_.filterKeys(_.startsWith("k")))) // careful not to confuse key selection
.matches(_.startsWith("v"))                                  // with value selection
// OUTPUT: {"K2": "v2", "k3": "V3", "K4": "V4"}
// if a more custom selection is needed
"""{"parent": {"foo": "hello", "bar": 1}}""".read()
// "leaf" as opposed to "all" (so will exclude 'parent path itself)
.retain(_.customLeafPaths(_.init))
// OUTPUT: {"parent": {"foo": "hello"}} since it corresponds
// to the "init" of Seq('parent |> 'foo, 'parent |> 'bar)
Likewise applicable for both .read() and .stream():
val obj = """{"foo": "hi", "bar": 1, "baz": true, "qux": "you"}""".read()
// can't use "then" (scala) or "thn" (aptus)
obj.forKey ('foo) .zen(_ toUpperCase _) // { "foo": "HI", ...
obj.forEachKey('foo) .zen(_ toUpperCase _)
obj.forEachKey('foo, 'bar).zen(_ toUpperCase _)
obj.forAllKeys((x, k) => x.rename(k).using(_.toUpperCase)) //{"FOO":"hi",..
// ... likewise with forPath, forEachPath, forAllPaths, forLeafPaths, ...
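A hedged sketch of the path-based counterpart mentioned in the last comment above, assuming forPath mirrors forKey but takes a |> path:
val nestedObj = """{"parent": {"foo": "hi", "bar": 1}}""".read()

nestedObj.forPath('parent |> 'foo).zen(_ toUpperCase _) // {"parent":{"foo":"HI","bar":1}}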
"""{"parent": {"foo": "bar"}}""".read()
.toUpperCase('parent |> 'foo)
// OUTPUT: {"parent":{"foo":"BAR"}}
Renaming can be expressed conveniently via the "tilde+greater-than" (~>) notation:
"""{"foo": "bar"}""" .read().rename ('foo ~> 'FOO)
"""{"parent": {"foo": "bar"}}""".read().rename('parent |> 'foo ~> 'FOO)
// OUTPUT:
// {"FOO":"bar"}
// {"parent":{"FOO":"bar"}}
// (respectively)
A case could be made that rekey would be more appropriate than rename, but it feels too unnatural.
"""{"foo": 1}""".read()
.increment('foo ~> 'FOO)
// OUTPUT: {"FOO":2} - value is incremented and key is uppercased
Gallia does not necessarily expect its elements ("objects") to come in multiples; it is capable of processing them individually.
Example of going from one to the other, then back:
"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream()
.asArray1 // {"foo":["bar1","bar2"]}
.flattenBy('foo) // [{"foo": "bar1"}, {"foo": "bar2"}] (original array)
There are other ways to go back and forth between the two (e.g. reducing as shown below; see code for more).
Internally, all object-wise operations on streams are actually just an implicit mapping, so the following two expressions are equivalent:
"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream() .toUpperCase('foo)
"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream().map(_.toUpperCase('foo))
The Head type models a leaf in the DAG(s) that underlies the future execution plan.
Internally, heads come in three flavors, each offering a different, relevant subset of operations:
- HeadO: For single object manipulation
- HeadS: For multiple objects manipulation
- HeadV[T]: For "orphan" values manipulation (HeadV is typically not encountered in client code)
Notes:
- "Orphan" values are more conceptually relevant to nested subgraphs, not commonly manipulated by client code. It represents values that are not part of a structured object, e.g the string
"foo"
alone as opposed to the same string"foo
" within an object{"key1": 1, "key2": "foo", ...}
. - The DAGs/heads concepts will be discussed in more details in a future article dedicated to design.
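A rough mapping of those flavors onto the entry points seen so far (an assumption, for illustration only: that .read() yields a single-object head and .stream() a multi-object one):
val single   = """{"foo": 1}""".read()                  // single-object head (HeadO territory)
val multiple = """[{"foo": 1}, {"foo": 2}]""".stream()  // multi-object head (HeadS territory)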
people
// INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
/* 1. WHERE */ .filterBy('age).matches(_ < 25)
/* 2. SELECT */ .retain('name, 'age)
/* 3. GROUP BY + COUNT */ .countBy('age)
// OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...
- WHERE clause: alternatively as filterBy(_.int('age)).matches(_ < 25) if one needs more than the basic =, <, >, +, ... (see types)
- SELECT clause: this would actually be redundant here, since the subsequent GROUP BY step also retains those fields implicitly
- GROUP BY + COUNT: if unspecified, uses the default _count output field
people.reduceWithMean('age) // {"age":21.5}
people.reduce('age).wit(_.stdev) // {"age":1.118[...]}
people
.reduce(
'age .aggregates(_.mean, _.stdev),
'city.count_distinct)
// OUTPUT: {"age":{"_mean":21.5,"_stdev":1.118[...]},"city":3}
people.group('name).by('city)
// "GROUP all keys but the last key BY that last key"
people
.group(_.initKeys)
.by(_.lastKey)
.as('grouped) // would use '_group if unspecified
//OUTPUT: [
// [{"gender":"male","grouped":[{"name":"John","age":21,"city":"Toronto"},
// ... ]
// other count types available:
// distinct, present, missing and distinct+present
people.count('name).by('city)
people.sum ('age).by('city) // also mean, stdev, ...
people.stats('age).by('city) // descriptive statistics (minimal for now)
// OUTPUT: [ {"city":"Toronto","_stats":{"mean":21.0, ...
A more "custom" aggregation (nonsensical):
people
.groupBy('city)
.transformGroupObjectsUsing {
_.squash(_.string('name), _.int('age))
// random nonsensical aggregation for demonstration purpose only
.using(_.map { case (n, a) => n.size + a }.sum) }
.rename(_group ~> 'awesomeness)
// OUTPUT:
// [{"city":"Toronto" , "awesomeness":25},
// {"city":"Philadelphia", "awesomeness":24},
// {"city":"Lyon" , "awesomeness":53}, ... ]
people
.pivot(_.int('age)).usingMean
.rows ('city)
.column ('gender)
// having to provide those is an unfortunate consequence of
// maintaining a schema (these values are only known at runtime)
.asNewKeys('male, 'female)
// OUTPUT:
// [ {"city":"Toronto","male":21},
// {"city":"Toronto","female":20},
// {"city":"Lyon","male":22.5}, ...]
Note that unpivoting isn't available yet, but is scheduled.
Common prefixes can be leveraged for re-nesting, e.g. "contact_" below:
// INPUT: "name<TAB>contact_phone<TAB>contact_address<TAB>..."
// ^^^^^^^ ^^^^^^^
table
.renest(_.allKeys)
.usingSeparator("_")
// OUTPUT: "{"name":"John", "contact":{"phone": 1234567, "address":..
// ^^^^^^^
This mechanism is not limited to a single level; it can transform keys:
foo_bar_baz1<TAB>foo_bar_baz2<TAB>...
into
{"foo": {"bar": {"baz1": ..., "baz2": ...}}, ...}
In practice the renesting operation typically involves a lot more work: e.g. if a value is like "foo1,foo2,foo3", it may also need to be split and denormalized on a one-per-row basis (see the sketch after this paragraph).
It is also common to encounter values such as "John:32|Kate:33|Jean:34"
or combinations of values such as "John|Kate|Jean"
+ "32|33|34"
(the latter two actually sharing the same cardinality of elements pipe-wise).
This alone would deserve its own article, but in the meantime the DbNsfp example highlights a number of interesting such cases.
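For instance, the split/denormalize case mentioned above could look roughly like the following sketch (assumptions: that .using may return a sequence value, and that .flattenBy - shown earlier for exploding arrays - applies here; the field name is hypothetical):
table
  .transformString('aliases).using(_.split(',').toSeq) // "foo1,foo2,foo3" -> ["foo1", "foo2", "foo3"]
  .flattenBy('aliases)                                  // denormalize: one object per alias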
The opposite operation (flattening) is scheduled.
.read() (single object) and .stream() (multiple objects) guess as much about the input format as they can from the input String provided:
- JSON markers, e.g. {, [, ...
- extensions, e.g. .json, .tsv, .gz, ...
- URI schemes, e.g. file://, http://, jdbc://, ...
- ...
We will see later an example of how to override the default behavior for reading and writing.
Here are some examples of input consumption:
// will infer schema (costly timewise)
"/some/local/file.json" .read ()
"/some/local/file.jsonl".stream()
// providing schema
"/some/local/file.json" .read [MyCaseClass]
"/some/local/file.jsonl".stream[MyCaseClass]
// equivalently
"/some/local/file.json" .read ('foo.string, 'baz.int)
"/some/local/file.jsonl".stream('foo.string, 'baz.int)
"/some/local/file.jsonl".stream()
"file:///some/local/file.jsonl".stream()
"http://someserver/test.jsonl".stream()
"https://someserver/test.jsonl".stream()
"ftp://someserver/pub/foo/bar.tsv".stream()
// must make corresponding JDBC driver jar available
"jdbc:myfavdb://localhost:1234/test?user=root&password=root"
.stream(_.allFrom("TABLE1"))
"jdbc:myfavdb://localhost:1234/test?user=root&password=root"
.stream(_.query("SELECT * from TABLE1"))
(conn: java.sql.Connection) .stream(_.sql("SELECT * from TABLE1"))
(ps: java.sql.PreparedStatement).stream()
// requires gallia-mongodb module and import gallia.mongodb._
"mongodb://localhost:27017/test.coll1".stream()
"mongodb://localhost:27017/test" .stream(_.query("""{"find":"coll1"}"""))
Considering the following TSV file:
$ cat /data/some.tsv | column -nt
f1 f2 f3 f4 f5 f6 f7 f8
z 1 1.1 true 9,8,7 k d,e,f T
y 2 2.2 false 6,5,4
"/data/some.tsv".stream()
// or its explicit equivalent
"/data/some.tsv".stream(_.tsv.inferSchema)
The following schema and data will be inferred:
val schema =
cls(
'f1.string, 'f2.int , 'f3.double, 'f4.boolean, 'f5.ints,
'f6.string_, 'f7.strings_, 'f8.boolean_)
val data =
Seq(
obj('f1 -> "z", 'f2 -> 1, 'f3 -> 1.1, 'f4 -> true , 'f5 -> Seq(9, 8, 7),
'f6 -> "k", 'f7 -> Seq("d", "e", "f"), 'f8 -> true),
obj('f1 -> "y", 'f2 -> 2, 'f3 -> 2.2, 'f4 -> false, 'f5 -> Seq(6, 5, 4)))
Note that _ here stands for ?, meaning optional. For instance 'f7.strings_ would be represented as Option[Seq[String]] in Scala.
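If the inference pass is undesirable (it requires an extra loop over the data), the same schema could presumably be provided up front, reusing the explicit-field notation shown in the inputs section (a sketch under that assumption):
"/data/some.tsv".stream(
  'f1.string , 'f2.int     , 'f3.double, 'f4.boolean, 'f5.ints,
  'f6.string_, 'f7.strings_, 'f8.boolean_)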
Additional modules using a similar paradigm could be added in the future, e.g.:
// NEO4J
"neo4j+s://demo.neo4jlabs.com".stream(
_.query("""(:Person {name: string})
-[:ACTED_IN {roles: [string]}]
->(:Movie {title: string, released: number})"""))
// SPARQL
"http://www.disease-ontology.org?query=".stream(
_.query("""
SELECT DISTINCT *
WHERE {?s <http://www.w3.org/2000/01/rdf-schema#label> "common cold"}
LIMIT 3"""))
// GraphQL
"https://swapi.com/graphql".stream(
_.query(
"""{user (id: 1) { firstname } }"""))
// Parquet / Avro / ...
"/data/test.parquet".stream()
// Excel (if sheet contains a single table)
"/data/doc.xlsx".stream(_.allFrom("Some Sheet Name"))
// XML
"/data/doc.xml".stream() // Requires costly schema inferring first
Note: There are proofs of concept for the last two (XML and Excel).
Output works in a similar fashion, relying on extensions/URI schemes as much as possible:
modifiedPeople.write("/tmp/output/result.tsv")
modifiedPeople.write("/tmp/output/result.jsonl.bz2")
// these are not actually implemented yet (only reading is):
modifiedPeople.write("mongodb://localhost:27017/test.coll1")
modifiedPeople.write(
uri = "mongodb://localhost:27017/test",
container = "coll1")
modifiedPeople.write(
uri = "jdbc:myfavdb://localhost:1234/test?user=foo&password=bar",
container = "SOME_RESULT_TABLE")
See Apache Spark's RDD documentation.
This module requires
libraryDependencies += "org.gallia" %% "gallia-spark" % "0.0.1"
And the following import:
import gallia.spark._
Abstraction:
The main abstraction for top-level multiplicity is data.multiple.streamer.Streamer[T], which is then wrapped by the data.single.Obj-aware counterpart data.multiple.Objs (which wraps a Streamer[Obj]). It currently comes in three flavors, all also under data.multiple.streamer:
- ViewStreamer: default
- IteratorStreamer: enabled via .stream(_.iteratorMode)
- RddStreamer: enabled via .stream(_.rdd) if gallia.spark._ has been imported
logging.setToWarn()
"/data/file.jsonl"
.stream(_.rdd) // will run with `local[*]` as master by default
.rename('foo ~> 'FOO)
.printJsonl() // note: closes SparkContext upon completion by default
"/data/huge.tsv.bz2"
.stream(_.rdd)
.rename('gene).to('hugo_symbol)
.groupBy('mutation_id).as('genes)
.printJsonl()
"/data/huge1.tsv.gz" .stream(_.rdd("spark://localhost:7077"))
.leftJoin(
"hdfs://data/huge2.tsv.lzo".stream(_.rdd("spark://localhost:7077")),
on = 'some_common_key)
// ...
.write("s3://mybucket/huge12.jsonl.bz2")
- The same code without the _.rdd("...") part would use an in-memory join ("local" mode)
- Reading and writing to/from HDFS and S3 isn't actually ready yet, and neither is lzo compression
Modify underlying RDD (The Law of Leaky Abstractions):
"s3://mybucket/huge.tsv.bz2"
.stream(_.rdd("spark://..."))
.retain('mutation_id, 'gene_symbol, 'chromosome)
// one can bypass the abstraction when needed,
// though the schema is not allowed to change
// (which cannot be enforced)
.rdd { _.coalesce(1).cache }
.groupBy('mutation_id).as('genes)
// ...
One can use/reuse a pre-existing SparkContext instead:
(sc: SparkContext)
.stream("s3://mybucket/huge.tsv.bz2")
// ...
sc.stop()
The above Spark runs would be quite inefficient, since the schema would have to be determined via an additional loop over the full data (costly when big). Instead, one would want to provide the schema explicitly, as shown below.
This may be useful to your average scientist, who may have access to powerful machines (think qsub) but not to conveniently provisioned clusters.
Sadly this is a very common occurrence in research settings and the author cares deeply about this problem.
"/data/huge.tsv.bz2"
// uses a GNU sort-based approach to sorting/grouping/joining
.stream(_.iteratorMode)
.rename('gene).to('hugo_symbol)
.groupBy('mutation_id).as('genes)
// ...
- All wide transformations can be written in terms of an external sort such as GNU sort
- We can combine such operations and leverage pipes to ensure the execution tree is executed lazily (forking however would benefit from a form of checkpointing)
- GNU sort is favored for now because replacing it would constitute a significant endeavour, and even then it would be extremely hard to beat performance-wise
- Ideally this would be an alternative run mode for Spark itself
- The current implementation can be seen in action in the GeneMania processing sub-project
- This feature is only partially implemented. It is basically enabled via the _.stream(_.iteratorMode.[...]) call, and follows this type of invocation path: Streamer.groupByKey -> Iterator's -> utility -> GNU sort wrapper (a rough sketch of the general idea follows these notes)
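As a rough illustration of the general idea only (not Gallia's actual implementation), a group-by over data that does not fit in memory can be delegated to GNU sort, followed by a scan over adjacent runs:
import scala.sys.process._

// sort a TSV by its first column externally, then group adjacent rows sharing that key
def groupByFirstColumn(inputTsv: String): Iterator[(String, Seq[String])] = {
  val sorted = Seq("sort", "-t", "\t", "-k1,1", inputTsv).lazyLines.iterator.buffered
  new Iterator[(String, Seq[String])] {
    def hasNext: Boolean = sorted.hasNext
    def next(): (String, Seq[String]) = {
      val key   = sorted.head.takeWhile(_ != '\t')
      val group = collection.mutable.Buffer[String]()
      while (sorted.hasNext && sorted.head.takeWhile(_ != '\t') == key) { group += sorted.next() }
      key -> group.toSeq
    }
  }
}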
Let's revisit the SQL-like example. Note that the Whatever type placeholder is being used (basically an Any wrapper that accepts very basic operations such as +, <, etc.):
// the following two expressions are equivalent:
//
// omitting type implies the use of Whatever here and here
// v v v v
z.fuse( 'first , 'last ).as('name).using(_ + " " + _)
z.fuse(_.string('first), _.string('last)).as('name).using(_ + " " + _)
// ^ ^
// vs strings
A more disciplined and powerful approach than relying on Whatever is to be explicit about the type, which gives access to all of that type's operations:
z.fuse(_.string('first), _.string('last)).as('name)
// .head and .toUpperCase require knowledge of the type (String here)
.using { (f, n) => s"${f.head}${n.toUpperCase}" }
More types than the currently supported ones will be added in the future.
Gallia is "schema-aware", meaning it keeps track of schema changes for every step. This allows the library to detect many errors prior to seeing the actual data.
As we've seen before, there are multiple ways to explicitly provide the data's underlying schema. This saves the library the task of looping over the data first to "infer" said schema.
case class Foo(foo: String, bar: Int, baz: Boolean, qux: String)
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}""".read[Foo]
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
// underscore means optional (since can't conveniently use '?' in Scala)
.read('foo.string, 'bar.int, 'baz.boolean, 'qux.string, 'corge.string_)
By providing an external resource that contains a JSON-serialized version of the schema:
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
.read("/meta/myschema.json")
Where "/meta/myschema.json" contains: {"fields":[{"key":"foo","info":...
More interactions with case classes are available (e.g. in transformations); they will be detailed in a future article.
I am providing a link to one of the full-blown examples I've written using Gallia: turning the big dbNSFP tables into a corresponding nested structure more conducive to querying (mongodb, elasticsearch, ...). See the example input row and example output object.
It is in no way complete or 100% correct in its current form, as it is primarily designed to showcase Gallia. I only tested it on a small subset of the data, and I expect unfortunate surprises would arise from processing the entire dataset.
It showcases among other things how to turn a long String full of extractable information, e.g.:
"Loss of ubiquitination at K551 (P = 0.0092); Loss of methylation [...]"
into a structured equivalent:
[
{"type":"loss", "change_type":"ubiquitination",
"location":"K551", "p_value":0.0092 },
{"type":"loss", "change_type":"methylation",
... },
...
]
Via an intermediate Scala case class (which contains most of the transformation logic):
// ...
.transformString(top_5_features).using(MutPred.apply)
// ...
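As a hedged sketch only (the names, fields and regex below are illustrative, not the actual dbNSFP project's code), such an intermediate case class and its parser could look like:
// entries look like "Loss of ubiquitination at K551 (P = 0.0092); Loss of methylation [...]"
case class MutPred(`type`: String, change_type: String, location: String, p_value: Double)

object MutPred {
  private val Entry = """(\w+) of (\w+) at (\w+) \(P = ([\d.]+)\)""".r

  // String => Seq[MutPred], the shape expected by the .using(MutPred.apply) call above
  def apply(value: String): Seq[MutPred] =
    value.split(';').toSeq.map(_.trim).collect {
      case Entry(typ, change, loc, p) =>
        MutPred(typ.toLowerCase, change.toLowerCase, loc, p.toDouble)
    }
}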
Processing this kind of data is exactly why I designed the library in the first place. I believe a lot of useful knowledge can be unlocked by making this kind of resource more parseable (dbNSFP itself is an incredibly useful resource in terms of content). The field of bioinformatics in particular is laden with archaic technologies and practices, which in turn results in tons of lost opportunities for impactful medical discoveries. I have never dealt with it personally, but I imagine the likes of computational physics and other "computational-driven" disciplines probably suffer from similar problems.
- Trivial examples:
- Word Count example, the "hello world" of big data
- Count by word length example
- SQL-like:
- Reproducing random examples encountered in articles on data manipulation:
- TPC-DS Sales summary example query as discussed in Andrew Ray's Databricks post: "Reshaping Data with Pivot in Apache Spark" (February 2016)
- data manipulation task for the Cars93 dataset (R MASS package), as discussed in Darren Wilkinson's blog post: "Data frames and tables in Scala" (August 2015)
- Eurostat census data example queries as discussed in Mathijs Vogelzang's Medium article: "Doing cool data science in Java: how 3 DataFrame libraries stack up" (September 2018)
- Football premier league data manipulations as discussed in Chloe Connor's Towards Data Science article: "Stop using Pandas and start using Spark with Scala" (June 2020)
- (more coming soon)
- Bioinformatics examples
- re-processing clinvar VCF file
- re-processing SnpEff output
- re-processing dbNSFP table example from section just above
- re-processing GeneMania TSV files; uses the poor man's scaling approach (spilling)
- re-processing rare disease LOVD data (from EDS Variant Database)
- Physics examples
- ENSDF data (WIP)
- WIP (see forum question)
- Spark-powered:
- WIP: will provide an alternative version of the genemania example above that uses Spark instead of the poor man's scaling approach (spilling)
- (more coming soon)
Not even remotely. There are known bugs, blatantly missing features, a lot of missing validation, and most importantly it performs rather slowly at the moment. There is a lot planned in the way of addressing these issues, but it will require more resources than the author working alone. In particular, performance has a prominent place in the task list.
This is temporary until I determine what the right license and funding model are going to be for this project.
UPDATE on 2021-02-23: started process of creating a Business Source License (BSL) with specific terms still to be determined, but essentially free for anyone doing important work.
I'm already aware of many issues and have a long list of tasks meant to address them, as well as to add the features that are critically missing. As a result, the most useful thing one can do to help at the moment is simply to let me know whether this is an effort worth pursuing. Once a definitive license is chosen, code contributions will be more than welcome.
At this point, a given field can only be of a given type. Ironically this prevents Gallia from having its own metaschema specified in Gallia terms. See the problem in action in the code. A more thorough discussion of design choices and trade-offs/limitations will come in a future article.
Another potential gotcha is that there can be only one meaning for a missing value. For instance [{"foo": null}, {"foo": []}, {}] would all collapse to the same absence of a value: {}.
Note that overloading the various null/Nil mechanisms with alternative meanings is probably not great data modeling practice in the first place.
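Purely as an illustration of that collapse (whether inference accepts such an input as-is is an assumption):
"""[{"foo": null}, {"foo": []}, {}]""".stream().printJsonl()
// would print the same absence of value for all three objects: {}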
We aim to make the code as readable as possible (goal #2) whenever it doesn't affect practicality (goal #1). In particular we want to make it possible for domain experts - who may not be programmers - to understand at least superficially what is happening in each step. It is obviously not always feasible for the task at hand, but this is otherwise a major goal for the library.
The main use cases that come to mind at this point are batch ETL, querying, feature engineering, internal application logic, and data validation and evolution. On the batch ETL front, it would be interesting to see how alternative libraries/languages handle examples such as the dbNSFP one above - in particular, how the various thresholds (readability/practicality/scalability) would shift with a different choice.
What about features like streaming? EDA? visualization? linear algebra? graph queries? notebooks? metadata semantics? squaring the circle?
There are lots of features that could be added in the future, but they all require a pretty sturdy base first.
Note that the most important part of the library at this point is its client code interface. The internals could be entirely scrapped in the future, though it's more likely they would be replaced in phases, short of a major design flaw.
I prototyped a lot with macros and I think they will play an important role in the future of Gallia. They are also quite tricky to deal with, and since they are scheduled for a major overhaul, I am reluctant to invest a lot of time on that front at the moment. I see them helping a lot in particular with boilerplate and some compile-time validation (e.g. key validation). The very initial plan was to leverage whitebox macros for every step, but I gave up on the idea pretty early on. I'd like to re-investigate it for a subset of features/use cases at some point however, especially since there seem to be some interesting projects (e.g. quill) that already make interesting use of them.
I'm quite impressed with the likes of cats (-> great book) or shapeless, but while I find them intellectually fascinating, I do side with the "blue sky" perspective when it comes to prioritizing practicality.
Initially the idea was for this to be a language agnostic DSL for data manipulation, with a reference implementation in Scala basically acting as specification. It may still become a reality but I'd rather focus on maturing a Scala version first.
"Aptus" is latin for suitable, appropriate, fitting. It is our utility library to help smooth certain pain points of the Java/Scala ecosystem. The plan is to externalize it eventually (Apache 2 license). In fact, the Aptus code included in Gallia is a small subset of the full library, which was embedded for convenience.
They live in a different repo that will require some serious cleanup. I will release them in increments. They basically take the following form:
aobj( // the "a" in aobj stands for "Annotated"
cls('p .cls_('f.string , 'g.int ), 'z.boolean))(
obj('p -> obj ('f -> "foo", 'g -> 1), 'z -> true) )
.generate('h)
.from(_.obj('p))
.using {
_ .translate('f ~> 'F).using("foo" -> "oof")
.remove('g) }
.test {
aobj(
cls('p .cls_('f.string, 'g.int ), 'z.boolean, 'h .cls_ ('F.string)))(
obj('p -> obj ('f -> "foo", 'g -> 1), 'z -> true, 'h -> obj('F -> "oof")) ) }
Where test wraps an equality assertion (I have not settled on a definitive testing library yet).
I try to leverage the language constructs as much as possible, e.g. by naming variables and methods so that they convey semantics. I then add the occasional comment when I deem it necessary, but overall expect any contributor to be sufficiently familiar with Scala to understand what's going on. As the project matures, proper scaladoc-friendly comments can hopefully be added as well.
Why does the terminology sometimes sound funny or full-on neological?
Naming things is hard. Sometimes I give up and favor an alternative until a better idea comes along. Sometimes a temporary name just sticks around, by way of organic growth. More generally I'd like to create an OWL ontology to more formally define terms that may deserve it.
They're my quick-and-dirty mechanism for ID-ing elements, and are generated by combining the date command with xautomation, called via xbindkeys keyboard shortcuts.
When they represent a task, this allows me to ID the task temporarily; many small tasks will never see an actual issue-tracking-system ID assigned to them.
Note that the timestamp itself is never guaranteed to be meaningful, as I occasionally hack them around (for consolidation purposes for instance).
Gallia is the name of a Romano-Gallic goddess. It is also the Latin name for Gaul, the area the author is originally from.
Rumor has it that the goddess Gallia appeared in 16 BCE to a group of data engineers gathered at a local tavern in Lugdunum (now Lyon), and that she told them to keepeth their code (1) practical, (2) readable, and (3) scalable (if needed), in that exact order.
You may contact the author at:
See original announcement on the Scala Users list
For further announcements, follow me on Twitter at @AnthonyCros