Skip to content

wibom/fakelake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

50 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

PREDICT LAKE

A sandbox datalake to play with in order to figure out:

  1. Elasticsearch indexing
    1. What to index
    2. How to formate indices
    3. How to query indices
    4. Etc...
  2. datalake structure, data-flows, etc..

The fakelake folder contains mock data that mimic some data expected to be held within the actual lake.

Filestructure

.
β”œβ”€β”€ dev
β”‚Β Β  β”œβ”€β”€ data_templates       # mock templates used
β”‚Β Β  β”œβ”€β”€ es_mapping_templates # json templates for mapping of ES indices
β”‚Β Β  β”œβ”€β”€ meta_templates       # yaml settings to index
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ nshds
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ x1
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ x2
β”‚Β Β  β”‚Β Β  └── x3
β”‚Β Β  └── utilities
β”œβ”€β”€ elk                      # Docker-compose file to setup env.
β”œβ”€β”€ es_play
β”œβ”€β”€ fakelake                 # data, generated by `dev/spinnup_lake.R`
β”‚Β Β  β”œβ”€β”€ 1_bronze             # fake raw data
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ nshds
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ x1
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ data_1
β”‚Β Β  β”‚Β Β  β”‚Β Β  └── data_2
β”‚Β Β  β”‚Β Β  └── x2
β”‚Β Β  β”‚Β Β      └── data_1
β”‚Β Β  β”œβ”€β”€ 2_silver             # fake metadata to index
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ nshds
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ x1
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ data_1
β”‚Β Β  β”‚Β Β  β”‚Β Β  └── data_2
β”‚Β Β  β”‚Β Β  └── x2
β”‚Β Β  β”‚Β Β      └── data_1
β”‚Β Β  └── 3_gold               # fake ES indices (.ndjson)
β”‚Β Β      β”œβ”€β”€ nshds
β”‚Β Β      β”œβ”€β”€ x1
β”‚Β Β      β”‚Β Β  β”œβ”€β”€ data_1
β”‚Β Β      β”‚Β Β  └── data_2
β”‚Β Β      └── x2
β”‚Β Β          └── data_1
└── ref                      # reference used for createing mock-data

As a first attempt I've simply structured the data as:

  1. 1_bronze: raw data
  2. 2_silver: meta data (generated/curated by PREDICT stewards)
  3. 3_gold: Elasticsearch indices (e.g. metadata formated to .ndjson)

(Re-) Generating fakelake

The mock data was generated using:

  1. Run: dev/spinnup_lake.R
    • some lake settings are read from dev/lake_settings.yaml, as well as the yaml files under dev/meta_templates
    • other settings are hardcoded to dev/spinunup_lake.R and dev/utilities/utilities.R
      • initially all settings were hardcoded, and not all have been migrated out to yaml files...

Perhaps the same code, with appropriate changes, may be useful to re-create a new fakelake with more useful settings.

Iterate > update > find useful structure of both data and indices.

Docker image with: Elasticsearch / Kibana / Rstudio

Set up working environment including Elasticsearch, Kibana and Rstudio.

This can for example be done by:

  1. Modify the elk/docker-compose-predict.yaml & run compose up

    • modify the r / volumes: value to map local path
  2. Connect to Elasticsearch from R

    • this requiers allowing the default 'rstudio'-user read permissions on the Elasticsearch certificates files

Changing file permissions on the Elasticsearch certificates within the R-studio image can for example be achieved from the R-studio terminal, by:

# Navigate to the certificates-folder:  
`cd /usr/share/elasticsearch/`
# `chmod` the `certs` folder:
`rstudio@5556462f98de:/usr/share/elasticsearch$ sudo chmod -R 755 config/`

Test that the connection is working from R:

# Initialize ----
x <- elastic::connect(
  host = "es01", 
  transport_schema = "https", 
  user = "elastic", 
  pwd = "qqpp11--",
  cainfo = "/usr/share/elasticsearch/config/certs/ca/ca.crt"
)
x$ping()

Import indices to Elasticsearch

  • The suggested indice files (.ndjson) are under fakelake/3_gold/... (they are created by the dev/spinnup_lake.R script)

  • They can be bulk loaded into Elasticsearch (using dynamic mappings) by running the script: dev/import_indices_to_es.R

    • The script uses the Elasticsearch dynamic mapping templates listed in the dev/lake_settings.yaml (specified here: dev/es_mapping_templates)

Kibana dataviews

As an initial setting, I am playing the Kibana dataviews (a.k.a "index-patterns" in ES < 8.x), based on the index naming scheme used (see: dev/lake_settings.yaml)

  • *events
  • *--sample--events
  • *--questionnaire--events
  • datasets
  • *--analytes
  • *--vardict

Play around

Pretending to query the fakelake:

  1. es_play/search_play1.Rmd
    output: es_play/search_play1.html



Investigations

Completed

1. Reuse Kibana dashboard when underlaying indices are deleted and replaced


NOTE: I investigated this topic in the belief that Kibana dashboards broke when the underlying indices were deleted and reloaded. It turns out that is incorrect. Dashbords depend on Kibana dataviews (previously known as index-patterns), and as long as dataviews are persistent the dashboards will be too. The dataviews in turn will work as expected as long as the indices’ name patterns are persistent.


More on this topic: A Kibana dashboard is a separate object that has a parent-relationship with a Kibana dataview, see: [Kibana: Stack Management > Saved objects]. A dashboard object can be exported and imported as a .ndjson file. The .ndjson file may be modified outside Kibana.

Likewise, Kibana dataviews are separate objects (that may have no relationships, or may have a child-relationship with a given Kiabana dashboard). They represent Elasticsearch indices through a pattern match based on the indices’ given name.

The relationships (parent/child) between a Kibana dashboard object and a Kibana dataview object is defined through the dataview identifier. By default, when creating a new data-view in Kibana, it is automatically assigned a new uuid identifier. If the dataview is deleted it will essentially be impossible to replace it with the same uuid identifier. However, upon creating dataviews in Kibana GUI, one has the option to set the dataview identifier manually (i.e. without using the default uuid). It is thus possible to set a persistent dataview identifier, allowing the dataview to be deleted and replaced (with the same identifier). The parent-relationship with dashboards will then be kept intact.

Initial lookups to make dashboards persistent:

  1. Solution 1: It is possible to open each dashboard panel and selct Edit lens, and from there select which dataview to use

    • Must be done manually for each panel

  2. Solution 2: It is possible to delete all documents in an index, without deleting the actual index (uuid), through "delete by query":

  3. Solution 3: It is possible to export the dashbord object as .ndjson, modify the .ndjson file to apply it to new data-views, then import back into Kibana.

2. Use dynamic mapping template when creating new indices?

Is it worth while to set up a dynamic mapping template for each index?

  • Pros:
    • This way it is possible to map for example predict_id as a keyword, and use it as such in queries, without specifying predict_id.keyword (which would do the same if using the default dynamic mapping)
  • Cons:
    • There are likley to exist variables that we have not mapped manually, nor had in consideration when creating the dynamic template. These will be mapped dynamically by Elasticsearch's default settings. Whenever we wish to use one of these variables for "keyword-queries", we must append the .keyword suffix.
      Thus, perhaps it is better to always append the .keyword suffix, even for the variables that we know beforehand we only ever want to use as keywords (e.g. predict_id)???
  • OR..:
    • map every "string" as keyword (except for example abstract, and possibly some additional fields). When necessary to query a keyword in a non-exact way, one can use a fuzzy query. (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html)

      GET /_search
      {
        "query": {
          "fuzzy": {
            "PI.1.name.keyword": {
              "value": "Beatrice elin"
            }
          }
        }
      }
    • This is currently implemented through separate settings in dev/lake_settings.yaml which referes to these mapping templates: dev/es_mapping_templates

Old issues

Old issues based on misconseptions of how Elasticsearch queries work.

Quering for filepaths:

Quering for filepaths

Using the Kibana Dev Tools console:

This works:

GET /_search
{
  "size": 0, 
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "/fakelake/1_bronze/nshds/nshds_mock.tsv"
    }
  },
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

This works:

GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

This does not work:

GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

# Further tests:
# query: "/fakelake/1_bronze/x2/data_1"      -- works
# query: "\/fakelake\/1_bronze\/x2\/data_1*" -- works
# query: "/fakelake/1_bronze/x2/data_1*"     -- works
# query: "/fakelake/1_bronze/x2/data_1/*"    -- error

# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"          -- works
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# ==============================================================================
# More structured tests:
# query: "/fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error (original query)
# query: "fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# query: "/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# query: "/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# query: "/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# query: "/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# Conclusion: 
# - works with even number of '/'
# - errors with odd number of '/'

# query: "\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok  ==> need double escape of '/'

# original query with double escape:
# query: "\\/fakelake\\/1_bronze\\/x2\\/data_1\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- OK!

Solution

  • Forward slash ('/') must be double escaped when used in query
  • Also, query_string might not be the best option, probably better to use bool instead (see: [Filter and aggregate])
Filter and aggregate

Filter and aggregate

Issue

# --------------------------
# I was hoping this query would only return the individuals included in 
# the file "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
#
# Instead all included individuals in the lake are returned... 
# How do I reformulate the query to return only individuals in file?
GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "\\/fakelake\\/1_bronze\\/x1\\/data_2\\/mock_Olink_NPX_1791-02-02.csv"
    }
  },
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

Solution

Apparently query_string performes some sort of text-search that I don't understand.

Using boolean search does what I want:

# --------------------------
GET /_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dataset.filename.keyword": "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
        }
      }
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }    
  }
}

NOTE: this way it's not necessary to escape forward slash characters.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published