
Add support for column indexes #582

Closed
pauldix opened this issue May 26, 2014 · 26 comments

@pauldix
Member

pauldix commented May 26, 2014

We should add support for column indexes. Everything will still be indexed by time, but you'll be able to additionally index by column values. For example, if you have a series:

{
  "name": "memory_used",
  "columns": ["host", "datacenter", "value"],
  "points": [["serverA", "us-east", 2343223]]
}

Later, when you run a query like select sum(value) from memory_used where host = 'serverA', it has to do a range scan to get at all of that data. Indexes would make it efficient to pull back the data.

Another possible win with indexes: if a column is indexed, we should convert the value into a uint64, which will be more efficient to store. Note that the indexes are intended for hash lookups (i.e. =) and not range indexes.
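As a rough illustration, that uint64 conversion could be a simple dictionary encoding. A hypothetical Go sketch -- valueDict and encode are made-up names, not anything in the codebase:

package main

import "fmt"

// valueDict maps each distinct string value to a stable uint64 id, so an
// indexed column can be stored and compared as integers.
type valueDict struct {
	ids  map[string]uint64
	next uint64
}

// encode returns the id for v, allocating a new one the first time v is seen.
func (d *valueDict) encode(v string) uint64 {
	if id, ok := d.ids[v]; ok {
		return id
	}
	d.next++
	d.ids[v] = d.next
	return d.next
}

func main() {
	d := &valueDict{ids: make(map[string]uint64)}
	fmt.Println(d.encode("serverA")) // 1
	fmt.Println(d.encode("serverB")) // 2
	fmt.Println(d.encode("serverA")) // 1 again: = lookups compare integers, not strings
}

The reverse (id-to-string) mapping would also have to be persisted somewhere so that query results can return the original values.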

Adding an index could look something like this:

create index memory_used_hosts
on memory_used (host)

-- or do it against a regex

create index hosts
on /.*\.h\..*/ (host)

The second example will index any series that has .h. in its name by its host column. You can then create conventions around series naming to have those values be indexed.

Because of the way I imagine this being implemented, you'd want to keep the total number of indexes in each DB fairly small (< 100). But only real testing will reveal what makes sense.

One other idea I heard floated for how to do this is to update the input format. Instead of specifying this as a configuration option, have a new section on ingestion like this:

{
  "name": "memory_used",
  "columns": ["value"],
  "tags": ["host", "datacenter"],
  "points": [[2343223, "serverA", "us-east"]]
}

The idea being that tags would be automatically indexed and would be the last values in the array of a point.

This would be more efficient because we wouldn't need to worry about looking up indexes every time we write data in. Would like to hear what other people think about the two approaches.

@chobie
Contributor

chobie commented May 26, 2014

+1

@fsauer65

+1

@m1keil

m1keil commented Jun 11, 2014

+1

@Dieterbe
Contributor

You lost me on:

This would be more efficient because we wouldn't need to worry about looking up indexes every time we write data in.

It's not clear to me how the tags would be stored, or why it would be more efficient. We'd still store them as uints, right, and keep a separate table with the value-to-uint mapping, just like in the first approach? And we'd still need a data structure (a B-tree or whatever) to track the locations of all values of all tags?

@freeformz

FWIW: I like the index idea.

@pauldix
Member Author

pauldix commented Jun 17, 2014

@Dieterbe what I meant was that with indexes, the normal write path would be:

  • See if this series/column is indexed
  • If so, do indexing stuff

Whereas with the approach of having it explicitly specified in the input format in the form of tags, there's no logic to look up the series/column tuple to see if it's indexed. You already know from the input. Probably not a big deal.
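To make the difference concrete, here's a rough Go sketch of the two write paths. All the names here are hypothetical; none of this is actual InfluxDB code:

package main

import "fmt"

// indexMeta records which (series, column) pairs have a declared index.
type indexMeta map[string]bool

func (m indexMeta) isIndexed(series, column string) bool {
	return m[series+"."+column]
}

// With declared indexes, every write has to consult the index metadata first.
func writeWithIndexes(meta indexMeta, series string, columns, point []string) {
	for i, col := range columns {
		if meta.isIndexed(series, col) { // extra lookup on every write
			fmt.Printf("index %s.%s = %s\n", series, col, point[i])
		}
	}
	// ...then store the point as usual.
}

// With tags in the input format, the writer already knows what to index.
func writeWithTags(series string, tags map[string]string) {
	for k, v := range tags { // no metadata lookup needed
		fmt.Printf("index %s.%s = %s\n", series, k, v)
	}
	// ...then store the point as usual.
}

func main() {
	meta := indexMeta{"memory_used.host": true}
	writeWithIndexes(meta, "memory_used", []string{"host", "value"}, []string{"serverA", "2343223"})
	writeWithTags("memory_used", map[string]string{"host": "serverA"})
}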

@Dieterbe
Contributor

Here's an alternative:
Let's say you want a series memory_used with columns seq, time, value, and host.
Instead of implementing this as one series with an index on the host column, we could, under the covers, implement it as a new series for every distinct value of host, i.e. a series memory_used.host=serverA, a series memory_used.host=serverB, etc. The user doesn't need to know this, and we can create the new series on the fly as records come in with a new host value.
The benefit is that there's no need to maintain indices on write, and reads for a specific value of host are fast. Reading for multiple (or all) hosts would mean reading several streams and merging them on the fly, which is the tradeoff.

You can extend this idea to multiple indexed columns by (behind the scenes) creating a series for every combination of values of the columns you want to index; un-indexed columns would be regular columns in each of these series.
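A minimal Go sketch of that hidden series-key encoding -- the name.tag=value format is purely illustrative, not a real storage format:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey derives the hidden, underlying series name from the user-visible
// name plus the values of the indexed columns, so every distinct combination
// of values gets its own series.
func seriesKey(name string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for map keys
	parts := []string{name}
	for _, k := range keys {
		parts = append(parts, k+"="+tags[k])
	}
	return strings.Join(parts, ".")
}

func main() {
	fmt.Println(seriesKey("memory_used", map[string]string{"host": "serverA"}))
	// memory_used.host=serverA
	fmt.Println(seriesKey("memory_used", map[string]string{"host": "serverB", "datacenter": "us-east"}))
	// memory_used.datacenter=us-east.host=serverB
}

Sorting the tag keys keeps the mapping deterministic, so the same combination of indexed values always lands in the same underlying series.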

thoughts?

@pauldix
Member Author

pauldix commented Jun 18, 2014

That's exactly the idea. Under the covers, this is how indexes will work.


@nickchappell

+1

@dongbin

dongbin commented Jun 30, 2014

+1

@otoolep
Contributor

otoolep commented Jul 1, 2014

Correct me if I am wrong, but this sounds like it would make InfluxDB much more useful for ingesting log data -- log data that has been parsed so that key fields like "severity" and "hostname" have been extracted. One could imagine custom parsers pulling metric information out of unstructured data and then sending it into InfluxDB.

@Dieterbe
Contributor

If it would work the way I described, then I'm not sure about using the "index" terminology for this feature.

I don't feel strongly about this, but:

Traditionally, database indexes have been data structures with pointers to records, and they come with certain behaviors people have come to expect: slower writes and extra disk space to maintain the extra index data structure, while reads without a where clause stay just as fast as before (in reality, if the I/O device spends more time doing writes, reads are impacted once it saturates).

I'm not too familiar with the implementation details, but it looks like in this case there's just a bit of metadata/glue: no index data structure of pointers, barely any extra disk space, and write speed should be barely impacted (unless there's high cardinality on the indexed column, maybe). On the other hand, reads without a where clause are now always slower by design (though CPU-bound instead of I/O-bound, so hopefully only by a small amount; that remains to be seen, and especially on SSDs the difference could be very noticeable).

The main thing this and indexes have in common is that reads with a where clause are faster; the rest seems different.

I like the idea of calling them tags. People will, however, try to find out "how do I use indexes with InfluxDB" because that's the familiar term, so we could have a doc page called "indexes/tags" where we explain the differences.
Or maybe "segregation". Or just "index" after all, and describe how it differs from a traditional index. ES takes this pretty far: http://www.elasticsearch.org/blog/what-is-an-elasticsearch-index/

@jgerschk

+1

@jordanrinke

+1 - this would make influx+grafana killer for metrics and log search.

@jclusso

jclusso commented Oct 27, 2014

+1 Any updates on when we might have this?

@JulienChampseix

+1

@pauldix
Member Author

pauldix commented Oct 27, 2014

This will probably get rolled into the API refactor. Please comment on that PR: #1059

@ghost

ghost commented Nov 19, 2014

+1 It's an important feature and I'm waiting for it.

@jeromegit

+1 for me too. I'm thinking about making the switch from MongoDB to InfluxDB, but performance without indexes is too poor for me to make it happen.

@toddboom toddboom added this to the 0.9.0 milestone Nov 25, 2014
@christoffbotha

+1 for this feature.
How flexible will your proposed solution be in terms of adding new tags/indexed columns at a later stage?

@plagtag

plagtag commented Dec 16, 2014

+1 also from me for this feature. That would be a massive improvement.

@pauldix pauldix removed the 1 - Ready label Jan 23, 2015
@dashesy

dashesy commented Feb 16, 2015

+1 Is the tags implementation ready now? I cannot find any documentation on how to use this great feature.

@naparuba

+1 too. What is the current status of this (great) feature? ^^ Thanks

@otoolep
Contributor

otoolep commented Feb 28, 2015

Seems like we have this now @pauldix ?

@pauldix
Member Author

pauldix commented Feb 28, 2015

Something like it. Tags in 0.9.0 should take care of this feature.

@pauldix pauldix closed this as completed Feb 28, 2015
@mageddo

mageddo commented Oct 30, 2017

Where can I find the documentation about how to use column indexes? All the InfluxDB pages point here.
