
Add support for column indexes #582

Closed
pauldix opened this issue May 26, 2014 · 26 comments

@pauldix
Member

pauldix commented May 26, 2014

We should add support for column indexes. Everything will still be indexed by time, but you'll be able to additionally index by column values. For example, if you have a series:

{
  "name": "memory_used",
  "columns": ["host", "datacenter", "value"],
  "points": [["serverA", "us-east", 2343223]]
}

Later, when you run a query like select sum(value) from memory_used where host = 'serverA', it has to do a range scan to get at all of that data. Indexes would make it efficient to pull back the data.

Another possible win with indexes: if a column is indexed, we should convert the value into a uint64, which will be more efficient to store. Note that the indexes are intended for hash lookups (i.e. =) and not range indexes.
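As a rough illustration, that uint64 conversion could be a simple dictionary encoding. A hypothetical Go sketch -- valueDict and encode are made-up names, not anything in the codebase:

package main

import "fmt"

// valueDict maps each distinct string value to a stable uint64 id, so an
// indexed column can be stored and compared as integers.
type valueDict struct {
	ids  map[string]uint64
	next uint64
}

// encode returns the id for v, allocating a new one the first time v is seen.
func (d *valueDict) encode(v string) uint64 {
	if id, ok := d.ids[v]; ok {
		return id
	}
	d.next++
	d.ids[v] = d.next
	return d.next
}

func main() {
	d := &valueDict{ids: make(map[string]uint64)}
	fmt.Println(d.encode("serverA")) // 1
	fmt.Println(d.encode("serverB")) // 2
	fmt.Println(d.encode("serverA")) // 1 again: = lookups compare integers, not strings
}

The reverse (id-to-string) mapping would also have to be persisted somewhere so that query results can return the original values.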

Adding an index could look something like this:

create index memory_used_hosts
on memory_used (host)

-- or do it against a regex

create index hosts
on /.*\.h\..*/ (host)

The second example will index any series that has .h. in its name by its host column. You can then create conventions around series naming to have those values be indexed.

Because of the way I imagine this being implemented, you'd want to keep the total number of indexes in each DB fairly small (< 100). But only real testing will reveal what makes sense.

One other idea I heard floated for how to do this is to update the input format. Instead of specifying this as a configuration option, have a new section on ingestion like this:

{
  "name": "memory_used",
  "columns": ["value"],
  "tags": ["host", "datacenter"],
  "points": [[2343223, "serverA", "us-east"]]
}

The idea being that tags would be automatically indexed and would be the last values in the array of a point.

This would be more efficient because we wouldn't need to worry about looking up indexes every time we write data in. Would like to hear what other people think about the two approaches.

@chobie
Contributor

chobie commented May 26, 2014

+1

@fsauer65

+1

@m1keil

m1keil commented Jun 11, 2014

+1

@Dieterbe
Contributor

You lost me on:

This would be more efficient because we wouldn't need to worry about looking up indexes every time we write data in.

It's not clear to me how the tags would be stored, or why it would be more efficient. We'd still store them as uints, right, and keep a separate table with the value-to-uint mapping, just like in the first approach? And we'd still need a data structure (a B-tree or whatever) to track the locations of all values of all tags?

@freeformz

FWIW: I like the index idea.

@pauldix
Member Author

pauldix commented Jun 17, 2014

@Dieterbe what I meant was that with indexes, the normal write path would be:

  • See if this series/column is indexed
  • If so, do indexing stuff

Whereas with the approach of having it explicitly specified in the input format in the form of tags, there's no logic to look up the series/column tuple to see if it's indexed. You already know from the input. Probably not a big deal.
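To make the difference concrete, here's a rough Go sketch of the two write paths. All the names here are hypothetical; none of this is actual InfluxDB code:

package main

import "fmt"

// indexMeta records which (series, column) pairs have a declared index.
type indexMeta map[string]bool

func (m indexMeta) isIndexed(series, column string) bool {
	return m[series+"."+column]
}

// With declared indexes, every write has to consult the index metadata first.
func writeWithIndexes(meta indexMeta, series string, columns, point []string) {
	for i, col := range columns {
		if meta.isIndexed(series, col) { // extra lookup on every write
			fmt.Printf("index %s.%s = %s\n", series, col, point[i])
		}
	}
	// ...then store the point as usual.
}

// With tags in the input format, the writer already knows what to index.
func writeWithTags(series string, tags map[string]string) {
	for k, v := range tags { // no metadata lookup needed
		fmt.Printf("index %s.%s = %s\n", series, k, v)
	}
	// ...then store the point as usual.
}

func main() {
	meta := indexMeta{"memory_used.host": true}
	writeWithIndexes(meta, "memory_used", []string{"host", "value"}, []string{"serverA", "2343223"})
	writeWithTags("memory_used", map[string]string{"host": "serverA"})
}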

@Dieterbe
Contributor

Here's an alternative:
Let's say you want a series memory_used with columns seq, time, value, and host.
Instead of implementing this as one series with an index on the host column, we could, under the covers, implement it as a new series for every distinct value of host, i.e. a series memory_used.host=serverA, a series memory_used.host=serverB, etc. The user doesn't need to know this, and we can create the new series on the fly as records come in with a new host value.
The benefit is that there's no need to maintain indices on write, and reads for a specific value of host are fast. Reading for multiple (or all) hosts would mean reading several streams and merging them on the fly, which is the tradeoff.

You can extend this idea to multiple indexed columns by (behind the scenes) creating a series for every combination of values of the columns you want to index; un-indexed columns would be regular columns in each of these series.
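A minimal Go sketch of that hidden series-key encoding -- the name.tag=value format is purely illustrative, not a real storage format:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey derives the hidden, underlying series name from the user-visible
// name plus the values of the indexed columns, so every distinct combination
// of values gets its own series.
func seriesKey(name string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for map keys
	parts := []string{name}
	for _, k := range keys {
		parts = append(parts, k+"="+tags[k])
	}
	return strings.Join(parts, ".")
}

func main() {
	fmt.Println(seriesKey("memory_used", map[string]string{"host": "serverA"}))
	// memory_used.host=serverA
	fmt.Println(seriesKey("memory_used", map[string]string{"host": "serverB", "datacenter": "us-east"}))
	// memory_used.datacenter=us-east.host=serverB
}

Sorting the tag keys keeps the mapping deterministic, so the same combination of indexed values always lands in the same underlying series.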

thoughts?

@pauldix
Member Author

pauldix commented Jun 18, 2014

That's exactly the idea. Under the covers, this is how indexes will work.


@nickchappell

+1

@dongbin

dongbin commented Jun 30, 2014

+1

@otoolep
Contributor

otoolep commented Jul 1, 2014

Correct me if I am wrong, but this sounds like it would make InfluxDB much more useful for ingesting log data -- log data that has been parsed so that key fields like "severity" and "hostname" have been extracted. One could imagine custom parsers pulling metric information out of unstructured data and then sending it into InfluxDB.

@Dieterbe
Contributor

If it would work the way I described, then I'm not sure about using the "index" terminology for this feature.

I don't feel strongly about this, but:

Traditionally, database indexes have been data structures with pointers to records, and they come with certain behaviors people have come to expect: slower writes and extra disk space to maintain the extra index data structure, while reads without a where clause stay just as fast as before (in reality, if the I/O device spends more time doing writes, reads are impacted once it saturates).

I'm not too familiar with the implementation details, but it looks like in this case there's just a bit of metadata/glue: no index data structure of pointers, barely any extra disk space, and write speed should be barely impacted (unless there's high cardinality on the indexed column, maybe). On the other hand, reads without a where clause are now always slower by design (though CPU-bound instead of I/O-bound, so hopefully only by a small amount; that remains to be seen, and especially on SSDs the difference could be very noticeable).

The main thing this and indexes have in common is that reads with a where clause are faster; the rest seems different.

I like the idea of calling them tags. People will, however, try to find out "how do I use indexes with InfluxDB" because that's the familiar term, so we could have a doc page called "indexes/tags" where we explain the differences.
Or maybe "segregation". Or just "index" after all, and describe how it differs from a traditional index. ES takes this pretty far: http://www.elasticsearch.org/blog/what-is-an-elasticsearch-index/

@jgerschk

+1

@jordanrinke

+1 - this would make influx+grafana killer for metrics and log search.

@jclusso

jclusso commented Oct 27, 2014

+1 Any updates on when we might have this?

@JulienChampseix

+1

@pauldix
Member Author

pauldix commented Oct 27, 2014

This will probably get rolled into the API refactor. Please comment on that PR: #1059

@ghost

ghost commented Nov 19, 2014

+1 It's an important feature and I'm waiting for it.

@jeromegit

+1 for me too. I'm thinking about making the switch from MongoDB to InfluxDB, but performance without indexes is too poor for me to make it happen.

@toddboom toddboom added this to the 0.9.0 milestone Nov 25, 2014
@christoffbotha

+1 for this feature.
How flexible will your proposed solution be in terms of adding new tags/indexed columns at a later stage?

@plagtag

plagtag commented Dec 16, 2014

+1 also from me for this feature. That would be a massive improvement.

@pauldix pauldix removed the 1 - Ready label Jan 23, 2015
@dashesy

dashesy commented Feb 16, 2015

+1 Is the tags implementation ready now? I cannot find any documentation on how to use this great feature.

@naparuba

+1 too. What is the current status of this (great) feature? ^^ Thanks

@otoolep
Contributor

otoolep commented Feb 28, 2015

Seems like we have this now @pauldix ?

@pauldix
Member Author

pauldix commented Feb 28, 2015

Something like it. Tags in 0.9.0 should take care of this feature.

@pauldix pauldix closed this as completed Feb 28, 2015
@mageddo

mageddo commented Oct 30, 2017

Where can I find the documentation about how to use column indexes? All the InfluxDB pages point here.
