A sandbox datalake to play with in order to figure out:
- Elasticsearch indexing
- What to index
- How to format indices
- How to query indices
- Datalake structure, data flows, etc.
The `fakelake` folder contains mock data that mimics some of the data expected to be held within the actual lake.
.
├── dev
│   ├── data_templates         # mock templates used
│   ├── es_mapping_templates   # json templates for mapping of ES indices
│   ├── meta_templates         # yaml settings to index
│   │   ├── nshds
│   │   ├── x1
│   │   ├── x2
│   │   └── x3
│   └── utilities
├── elk                        # Docker-compose file to set up env.
├── es_play
├── fakelake                   # data, generated by `dev/spinnup_lake.R`
│   ├── 1_bronze               # fake raw data
│   │   ├── nshds
│   │   ├── x1
│   │   │   ├── data_1
│   │   │   └── data_2
│   │   └── x2
│   │       └── data_1
│   ├── 2_silver               # fake metadata to index
│   │   ├── nshds
│   │   ├── x1
│   │   │   ├── data_1
│   │   │   └── data_2
│   │   └── x2
│   │       └── data_1
│   └── 3_gold                 # fake ES indices (.ndjson)
│       ├── nshds
│       ├── x1
│       │   ├── data_1
│       │   └── data_2
│       └── x2
│           └── data_1
└── ref                        # reference used for creating mock-data
As a first attempt I've simply structured the data as:

1_bronze
: raw data

2_silver
: metadata (generated/curated by PREDICT stewards)

3_gold
: Elasticsearch indices (e.g. metadata formatted to .ndjson)
The mock data was generated by running `dev/spinnup_lake.R`:
- some lake settings are read from `dev/lake_settings.yaml`, as well as from the yaml files under `dev/meta_templates`
- other settings are hardcoded in `dev/spinnup_lake.R` and `dev/utilities/utilities.R`
- initially all settings were hardcoded, and not all have been migrated out to yaml files...
Perhaps the same code, with appropriate changes, may be useful to re-create a new `fakelake` with more useful settings.
Iterate > update > find useful structure of both data and indices.
Set up a working environment including Elasticsearch, Kibana and RStudio.
This can for example be done by:
- Modify `elk/docker-compose-predict.yaml` & run `docker compose up`
  - modify the `r` / `volumes:` value to map a local path
- Connect to Elasticsearch from R
  - this requires allowing the default 'rstudio' user read permissions on the Elasticsearch certificate files
Changing file permissions on the Elasticsearch certificates within the RStudio image can, for example, be achieved from the RStudio terminal by:
# Navigate to the certificates-folder:
`cd /usr/share/elasticsearch/`
# `chmod` the `certs` folder:
`rstudio@5556462f98de:/usr/share/elasticsearch$ sudo chmod -R 755 config/`
Test that the connection is working from R:
# Initialize ----
x <- elastic::connect(
host = "es01",
transport_schema = "https",
user = "elastic",
pwd = "qqpp11--",
cainfo = "/usr/share/elasticsearch/config/certs/ca/ca.crt"
)
x$ping()
- The suggested index files (.ndjson) are under `fakelake/3_gold/...` (they are created by the `dev/spinnup_lake.R` script)
- They can be bulk loaded into Elasticsearch (using dynamic mappings) by running the script `dev/import_indices_to_es.R`
  - The script uses the Elasticsearch dynamic mapping templates listed in `dev/lake_settings.yaml` (specified here: `dev/es_mapping_templates`)
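For quick experiments outside that script, here is a minimal sketch of bulk loading a single file from R. It assumes the .ndjson files are already in Elasticsearch bulk format (alternating action and document lines) and reuses the connection `x` from above; the file path is hypothetical:

```r
library(elastic)

# Hypothetical path to one of the generated index files
ndjson_file <- "fakelake/3_gold/x1/data_1/index.ndjson"

# docs_bulk() accepts a path to a bulk-formatted file;
# with no explicit mapping, Elasticsearch falls back to dynamic mapping
res <- elastic::docs_bulk(x, ndjson_file)

# Rough check that documents arrived
elastic::count(x)
```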
As an initial setting, I am playing with these Kibana dataviews (a.k.a. "index-patterns" in ES < 8.x), based on the index naming scheme used (see: `dev/lake_settings.yaml`):
- `*events`
- `*--sample--events`
- `*--questionnaire--events`
- `datasets`
- `*--analytes`
- `*--vardict`
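To sanity check which indices a given pattern actually matches, something like the following could be run from R (a sketch only; it assumes `cat_indices()` accepts an index pattern, and patterns with no matching index may error or come back empty):

```r
# Dataview patterns from the list above
patterns <- c("*events", "*--sample--events", "*--questionnaire--events",
              "datasets", "*--analytes", "*--vardict")

for (p in patterns) {
  cat("\npattern:", p, "\n")
  # cat_indices() wraps GET /_cat/indices/<pattern>
  print(elastic::cat_indices(x, index = p, parse = TRUE))
}
```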
Pretending to query the `fakelake`: `es_play/search_play1.Rmd` (output: `es_play/search_play1.html`)
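As a taste of what such a query looks like from R, here is a sketch that mirrors the `bool`/`term` example further down (the index pattern is illustrative; the filename and fields come from the fakelake examples below):

```r
# Which individuals occur in a given file? Filter on filename, aggregate on predict_id
res <- elastic::Search(
  x,
  index = "*events",   # illustrative index pattern
  size  = 0,
  body  = '{
    "query": {
      "bool": {
        "filter": {
          "term": {
            "dataset.filename.keyword": "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
          }
        }
      }
    },
    "aggs": {
      "individuals": { "terms": { "field": "predict_id.keyword", "size": 500 } }
    }
  }'
)

res$aggregations$individuals$buckets
```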
NOTE: I investigated this topic in the belief that Kibana dashboards broke when the underlying indices were deleted and reloaded. It turns out that is incorrect. Dashboards depend on Kibana dataviews (previously known as index-patterns), and as long as the dataviews are persistent the dashboards will be too. The dataviews in turn will work as expected as long as the indices' name patterns are persistent.
Likewise, Kibana dataviews are separate objects (that may have no relationships, or may have a child-relationship with a given Kibana dashboard). They represent Elasticsearch indices through a pattern match based on the indices' given names.
The relationship (parent/child) between a Kibana dashboard object and a Kibana dataview object is defined through the dataview identifier. By default, when creating a new dataview in Kibana, it is automatically assigned a new uuid identifier. If the dataview is deleted it is essentially impossible to replace it with the same uuid identifier. However, when creating dataviews in the Kibana GUI, one has the option to set the dataview identifier manually (i.e. without using the default uuid). It is thus possible to set a persistent dataview identifier, allowing the dataview to be deleted and replaced (with the same identifier). The parent-relationship with dashboards will then be kept intact.
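For completeness, a dataview with a fixed identifier can presumably also be created outside the GUI through the Kibana data views API. A rough sketch from R (the endpoint, field names and the ability to pass a custom `id` are assumptions based on the Kibana 8.x API; host, credentials and values are placeholders):

```r
library(httr)
library(jsonlite)

# Sketch: create a Kibana dataview with a persistent, human-chosen identifier
body <- list(
  data_view = list(
    id    = "events-dataview",   # fixed identifier instead of a random uuid
    title = "*events",           # index pattern the dataview should match
    name  = "events"
  )
)

resp <- httr::POST(
  url  = "http://localhost:5601/api/data_views/data_view",  # placeholder Kibana host
  httr::authenticate("elastic", "qqpp11--"),
  httr::add_headers(`kbn-xsrf` = "true"),
  httr::content_type_json(),
  body = jsonlite::toJSON(body, auto_unbox = TRUE)
)
httr::status_code(resp)
```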
- Solution 1: It is possible to open each dashboard panel and select `Edit lens`, and from there select which `dataview` to use
  - Must be done manually for each panel
- Solution 2: It is possible to delete all documents in an index, without deleting the actual index (uuid), through "delete by query" (see the R sketch after this list):
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
  - `POST my-index-000001/_delete_by_query?conflicts=proceed { "query": { "match_all": {} } }`
  - It is said to be an expensive process: https://discuss.elastic.co/t/delete-all-data-from-index-without-deleting-index/87661/2
- Solution 3: It is possible to export the dashboard object as .ndjson, modify the .ndjson file to apply it to new data-views, then import back into Kibana.
  - Here is a suggestion: https://kifarunix.com/update-change-kibana-visualization-index-pattern/
  - Requires a lot of manual steps. Can this be scripted??
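Referring back to Solution 2: a rough sketch of delete-by-query from R, assuming the elastic package's `docs_delete_by_query()` helper (including its `conflicts` argument) and reusing the connection `x`; the index name is the one from the Elasticsearch documentation example:

```r
# Sketch: remove all documents from an index while keeping the index itself (and its uuid)
res <- elastic::docs_delete_by_query(
  x,
  index     = "my-index-000001",                   # illustrative index name
  body      = '{ "query": { "match_all": {} } }',
  conflicts = "proceed"
)
res$deleted
```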
Is it worthwhile to set up a dynamic mapping template for each index?
- Pros:
  - This way it is possible to map, for example, `predict_id` as a keyword, and use it as such in queries, without specifying `predict_id.keyword` (which would do the same if using the default dynamic mapping)
- Cons:
  - There are likely to exist variables that we have not mapped manually, nor considered when creating the dynamic template. These will be mapped dynamically by Elasticsearch's default settings. Whenever we wish to use one of these variables for "keyword-queries", we must append the `.keyword` suffix. Thus, perhaps it is better to always append the `.keyword` suffix, even for the variables that we know beforehand we only ever want to use as keywords (e.g. `predict_id`)???
- OR:
  - Map every "string" as keyword (except for example `abstract`, and possibly some additional fields). When it is necessary to query a keyword in a non-exact way, one can use a `fuzzy` query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html), e.g.:
    `GET /_search { "query": { "fuzzy": { "PI.1.name.keyword": { "value": "Beatrice elin" } } } }`
  - This is currently implemented through separate settings in `dev/lake_settings.yaml`, which refer to these mapping templates: `dev/es_mapping_templates` (a sketch of such a strings-as-keyword dynamic template follows after this list)
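As a reference for the "map every string as keyword" option above, here is a minimal sketch of such a dynamic template, applied when creating a hypothetical index from R via `elastic::index_create()`. The index name and the `abstract` exception are illustrative; the actual templates used live under `dev/es_mapping_templates`:

```r
# Sketch: dynamic templates mapping all unmatched string fields as keyword,
# while keeping `abstract` as full-text
mapping_body <- '{
  "mappings": {
    "dynamic_templates": [
      {
        "abstract_as_text": {
          "match": "abstract",
          "mapping": { "type": "text" }
        }
      },
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}'

elastic::index_create(x, index = "x1--sample--events", body = mapping_body)
```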
Old issues based on misconceptions of how Elasticsearch queries work.
Querying for filepaths:
Using the Kibana Dev Tools console:
This works:
GET /_search
{
"size": 0,
"query": {
"query_string": {
"default_field": "dataset.filename",
"query": "/fakelake/1_bronze/nshds/nshds_mock.tsv"
}
},
"aggs": {
"individuals": {
"terms": {"field": "predict_id.keyword", "size": 500}
}
}
}
This works:
GET /_search
{
"size": 0,
"query": {
"query_string": {
"default_field": "dataset.filename",
"query": "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
}
},
"aggs": {
"individuals": {
"terms": {"field": "predict_id.keyword", "size": 500}
}
}
}
This does not work:
GET /_search
{
"size": 0,
"query": {
"query_string": {
"default_field": "dataset.filename",
"query": "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
}
},
"aggs": {
"individuals": {
"terms": {"field": "predict_id.keyword", "size": 500}
}
}
}
# Further tests:
# query: "/fakelake/1_bronze/x2/data_1" -- works
# query: "\/fakelake\/1_bronze\/x2\/data_1*" -- works
# query: "/fakelake/1_bronze/x2/data_1*" -- works
# query: "/fakelake/1_bronze/x2/data_1/*" -- error
# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- works
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# ==============================================================================
# More structured tests:
# query: "/fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error (original query)
# query: "fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# Conclusion:
# - works with even number of '/'
# - errors with odd number of '/'
# query: "\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok ==> need double escape of '/'
# original query with double escape:
# query: "\\/fakelake\\/1_bronze\\/x2\\/data_1\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- OK!
- Forward slash ('/') must be double escaped when used in a query
- Also, `query_string` might not be the best option; probably better to use `bool` instead (see: [Filter and aggregate])
Filter and aggregate
# --------------------------
# I was hoping this query would only return the individuals included in
# the file "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
#
# Instead all included individuals in the lake are returned...
# How do I reformulate the query to return only individuals in file?
GET /_search
{
"size": 0,
"query": {
"query_string": {
"default_field": "dataset.filename",
"query": "\\/fakelake\\/1_bronze\\/x1\\/data_2\\/mock_Olink_NPX_1791-02-02.csv"
}
},
"aggs": {
"individuals": {
"terms": {"field": "predict_id.keyword", "size": 500}
}
}
}
Apparently `query_string` performs some sort of text search that I don't understand.
Using a boolean search does what I want:
# --------------------------
GET /_search
{
"query": {
"bool": {
"filter": {
"term": {
"dataset.filename.keyword": "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
}
}
}
},
"aggs": {
"individuals": {
"terms": {"field": "predict_id.keyword", "size": 500}
}
}
}
NOTE: this way it's not necessary to escape forward slash characters.