How does Tick determine what is a "useful" time series? #2

Open
durple opened this issue Nov 4, 2014 · 18 comments

@durple
Contributor

durple commented Nov 4, 2014

We could go one of three ways here:

  1. Create a time series for everything: if a stream has attributes A, B and C, we create a time series for A, B, C, AB, AC, BC and ABC.
    • Pros: We have a time series for everything imaginable.
    • Cons: We have a time series for everything imaginable, including ones that are not useful, leading to a wasteland of time series data, e.g. a time series of unique identifiers that appear only once, or of identifiers that are constantly changing.
  2. Have the user pick and choose by some mechanism. So if I have users, locations and event_id, I can pick users over time and users over time broken down by location.
    • Pros: The user gets just the data he/she wants, and it can be made available.
    • Cons: It sort of defeats the purpose of having Tick; there is no experimentation in this approach.
  3. Lastly, have Tick determine what is "useful" using some measure of the volume and dimensions of the stream and the cardinality of the attributes themselves. The method of determination can be tweaked in various ways and experimented with. It may not always yield the right time series, but it can possibly be optimized over time to give better results.
    • Pros: Less waste. No user selection.
    • Cons: We don't know what the hell we are talking about here.
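
To make options 1 and 3 concrete, here is a rough Python sketch. It is not Tick's implementation: the function names and the `max_ratio` threshold are invented for illustration. It lists every attribute combination (2^n - 1 of them for n attributes) and then drops groupings whose keys almost never repeat, which is one possible reading of "cardinality of the attributes".

```python
from itertools import combinations


def attribute_groupings(attributes):
    """Every non-empty combination of attributes (option 1)."""
    attrs = sorted(attributes)
    return [
        combo
        for size in range(1, len(attrs) + 1)
        for combo in combinations(attrs, size)
    ]


def cardinality_ratio(events, grouping):
    """Distinct keys divided by number of events; 1.0 means no key repeats."""
    if not events:
        return 0.0
    keys = {tuple(event.get(attr) for attr in grouping) for event in events}
    return len(keys) / len(events)


def useful_groupings(events, attributes, max_ratio=0.9):
    """Option 3, crudely: keep groupings whose keys repeat often enough.

    max_ratio is a made-up threshold, not something Tick defines.
    """
    return [
        grouping
        for grouping in attribute_groupings(attributes)
        if cardinality_ratio(events, grouping) <= max_ratio
    ]


events = [
    {"user": "Deep", "location": "NYC", "event_id": 1},
    {"user": "Deep", "location": "NJ", "event_id": 2},
    {"user": "Nik", "location": "NYC", "event_id": 3},
    {"user": "Nik", "location": "SFO", "event_id": 4},
]

print(len(attribute_groupings(["A", "B", "C"])))  # 7
print(useful_groupings(events, ["user", "location", "event_id"]))
```

On this tiny sample every grouping that includes event_id has a ratio of 1.0 and is dropped, which matches the "unique identifiers that appear only once" case. A real stream would need far more events before the ratio means anything.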
@durple durple added the question label Nov 4, 2014
@mikedewar
Collaborator

I think this has different answers depending on the dimensionality of the series.

@durple durple changed the title How does Tick determine what is a "useful" time series How does Tick determine what is a "useful" time series? Nov 4, 2014
@mikedewar
Collaborator

In 1D (like views on a page) you could make a case that volume is a good indicator of usefulness, or maybe variance. In more than one dimension I bet covariance would be a good starting point.

A useless time series, then, is one that is always zero or, more generally, always the same.
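
As a sketch of that idea (the function names are invented here, and nothing below is part of Tick), population variance flags a series that never changes, and a hand-rolled covariance gives a first measure of whether two series move together:

```python
from statistics import fmean, pvariance


def variance_score(counts):
    """Population variance of a series of counts; 0 means it never changes."""
    return pvariance(counts) if counts else 0.0


def covariance(xs, ys):
    """Population covariance of two series sampled at the same timestamps."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and equally long")
    mean_x, mean_y = fmean(xs), fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


always_the_same = [5, 5, 5, 5]
page_views = [1, 9, 2, 8]

print(variance_score(always_the_same))  # zero: "useless" by this measure
print(variance_score(page_views))       # positive
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))
```

Variance alone would also rate pure noise as very useful, so it is probably only a starting point.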

@mikedewar
Collaborator

Hey, also, I bet there is a proper answer to this in terms of information content. Like, a useful time series is one that is hard to compress, i.e. has high entropy, etc. There's a lovely green book on my desk by MacKay that has opinions...
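
A quick way to play with the information-content idea, as a sketch only: Shannon entropy of the values in a series, plus how well zlib compresses it. The function names are hypothetical, and the byte packing assumes every count fits in 0..255.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of values."""
    if not values:
        return 0.0
    total = len(values)
    return sum(
        (n / total) * math.log2(total / n)
        for n in Counter(values).values()
    )


def compression_ratio(counts):
    """Compressed size over raw size; assumes non-empty counts in 0..255."""
    raw = bytes(counts)
    return len(zlib.compress(raw)) / len(raw)


flat = [0] * 1000                                   # always zero
rng = random.Random(0)                              # fixed seed, repeatable
noisy = [rng.randrange(256) for _ in range(1000)]   # stand-in for a busy series

print(shannon_entropy(flat), compression_ratio(flat))
print(shannon_entropy(noisy), compression_ratio(noisy))
```

The always-zero series has zero entropy and compresses to almost nothing. Random noise scores highest on both measures, so "hard to compress" would need some extra notion of structure before it could mean "useful".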

@nikhan

nikhan commented Nov 4, 2014

I am more curious as to how this makes sense for a user

Whatever determination you use, it means that there will be a result for some queries and no result for others. And neither of those results would necessarily mean "because there was no data"

which is confusing to me.

@durple
Contributor Author

durple commented Nov 4, 2014

That is a very fat book!

@durple
Contributor Author

durple commented Nov 4, 2014

Whatever determination you use, it means that there will be a result for some queries and no result for others.

This is quite implementation specific, I think. We could implement Tick such that a user knows which time series are being made available once it starts listening to the stream.

But you are right: if I have A, B & C, and Tick determines that A and A & B are the only useful time series, but the user was interested in B & C, I don't know how to handle that. It almost becomes a back-to-the-drawing-board problem.
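
One way this could look, purely as a sketch (the `SeriesCatalog` class, the `Status` values and the method names are all made up, not an existing Tick API): the store lists the groupings it tracks, and a query says explicitly whether an empty answer means "nothing observed" or "this series was never built".

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    NO_DATA = "no_data"          # grouping is tracked, nothing seen for this key
    NOT_TRACKED = "not_tracked"  # no series was ever built for this grouping


class SeriesCatalog:
    def __init__(self, tracked_groupings):
        # grouping (sorted tuple of attribute names) -> {key -> {ts -> count}}
        self._series = {tuple(sorted(g)): {} for g in tracked_groupings}

    def available(self):
        """Groupings a user can query, known as soon as listening starts."""
        return sorted(self._series)

    def add(self, attrs, ts, count=1):
        """Record an observation; raises KeyError for an untracked grouping."""
        grouping = tuple(sorted(attrs))
        key = tuple(attrs[name] for name in grouping)
        points = self._series[grouping].setdefault(key, {})
        points[ts] = points.get(ts, 0) + count

    def query(self, attrs):
        grouping = tuple(sorted(attrs))
        if grouping not in self._series:
            return Status.NOT_TRACKED, []
        key = tuple(attrs[name] for name in grouping)
        points = self._series[grouping].get(key)
        if not points:
            return Status.NO_DATA, []
        return Status.OK, sorted(points.items())


catalog = SeriesCatalog([("A",), ("A", "B")])
catalog.add({"A": "a1"}, ts=1)
catalog.add({"A": "a1", "B": "b1"}, ts=1)

print(catalog.available())
print(catalog.query({"A": "a1"}))
print(catalog.query({"A": "a2"}))
print(catalog.query({"B": "b1", "C": "c1"}))
```

This does not solve the B & C case, but it at least lets the user tell "no data" apart from "Tick decided not to keep this".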

@mikedewar
Collaborator

What about thinking of it more as a compression problem? If there is no
information in the series given the other time series then you should be
able to recreate the series using other series at query time...
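
A simple instance of this, sketched in Python under the assumption that every event carries both attributes (the `marginalize` helper and the data layout are invented for illustration): if the combined user,location series is stored, the per-user and per-location series add no information and can be summed out at query time.

```python
from collections import defaultdict


def marginalize(joint, keep_index):
    """Rebuild a single-attribute series from a combined one.

    joint maps (user, location) -> {ts: count}; keep_index picks which
    element of the key to keep (0 for user, 1 for location).
    """
    rebuilt = defaultdict(lambda: defaultdict(int))
    for key, points in joint.items():
        for ts, count in points.items():
            rebuilt[key[keep_index]][ts] += count
    return {name: dict(points) for name, points in rebuilt.items()}


joint = {
    ("Deep", "NYC"): {1: 2},
    ("Deep", "NJ"): {1: 1},
    ("Nik", "NYC"): {1: 1},
}

print(marginalize(joint, 0))  # per-user series
print(marginalize(joint, 1))  # per-location series
```

The reverse direction does not work in general: the single-attribute series cannot be combined to recover the joint one.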

Alternatively, a "no information" response to a query is an interesting
thing for a db to respond with...

M
On Nov 4, 2014 1:21 PM, "Deep Kapadia" wrote:

Whatever determination you use, it means that there will be a result for
some queries and no result for others.

This is quite implementation specific, I think. We could implement tick
such that a user knows what time series are being made available once it
starts listening to the stream.

But you are right, if I have A, B & C and Tick determined that A, A & B
are the only useful time series but the user was interested in B & C. I
don't know how to handle that. It almost becomes a back to the drawing
board problem to solve.


@durple
Contributor Author

durple commented Nov 4, 2014

Still wrapping my head around thinking of it as a compression problem...just grabbed the green book

Alternatively, a "no information" response to a query is an interesting
thing for a db to respond with...

But is it useful if I am looking for something very specific?

@nikhan

nikhan commented Nov 4, 2014

Alternatively, a "no information" response to a query is an interesting
thing for a db to respond with...

only if it can be explained simply

@nikhan

nikhan commented Nov 4, 2014

If you have timeseries for each key, couldn't you create what A&B would be? why do you need a time series for groups?

@nikhan

nikhan commented Nov 4, 2014

Oh right, intersection vs exclusive. oh well

@nikhan

nikhan commented Nov 4, 2014

Can I have table "key" with row "co occurrence" by time?

@durple
Contributor Author

durple commented Nov 4, 2014

If you have timeseries for each key, couldn't you create what A&B would be? why do you need a time series for groups?

No. Consider for example the following stream:

{user: Deep, location: NYC, ts: 1}
{user: Deep, location: NJ, ts: 1}
{user: Nik, location: NYC, ts: 1}
{user: Nik, location: SFO, ts: 1}
{user: Mike, location: NYC, ts: 1}
{user: Deep, location: NYC, ts: 1}

Time series:

user
Deep -> (ts: 1, count: 3)
Nik -> (ts: 1, count: 2)
Mike -> (ts: 1, count: 1)

location
NYC -> (ts: 1, count: 4)
NJ -> (ts: 1, count: 1)
SFO -> (ts: 1, count: 1)

And if my question is "give me all the times Deep was in NYC", I can't decipher it from the above time series: the per-user and per-location counts don't say which user was in which location. I can, however, decipher it from the combined series:

user,location
Deep,NYC -> (ts: 1, count: 2)
Deep,NJ -> (ts: 1, count: 1)
Nik,NYC -> (ts: 1, count: 1)
Nik,SFO -> (ts: 1, count: 1)
Mike,NYC -> (ts: 1, count: 1)
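
The same counts can be reproduced with a few lines of Python. This is only an illustration of the example above (the `count_by` helper and the second stream, `other`, are made up): `other` has exactly the same per-user and per-location series but a different user,location series, which is why the single-attribute series cannot answer the Deep-in-NYC question.

```python
from collections import Counter


def count_by(events, *attrs):
    """Count events per (attribute values..., ts) key."""
    return Counter(tuple(e[a] for a in attrs) + (e["ts"],) for e in events)


stream = [
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NJ", "ts": 1},
    {"user": "Nik", "location": "NYC", "ts": 1},
    {"user": "Nik", "location": "SFO", "ts": 1},
    {"user": "Mike", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
]

# Hypothetical second stream: same users and locations, paired differently.
other = [
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Deep", "location": "NYC", "ts": 1},
    {"user": "Nik", "location": "NJ", "ts": 1},
    {"user": "Nik", "location": "SFO", "ts": 1},
    {"user": "Mike", "location": "NYC", "ts": 1},
]

print(count_by(stream, "user") == count_by(other, "user"))          # True
print(count_by(stream, "location") == count_by(other, "location"))  # True
print(count_by(stream, "user", "location")[("Deep", "NYC", 1)])     # 2
print(count_by(other, "user", "location")[("Deep", "NYC", 1)])      # 3
```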

@durple
Contributor Author

durple commented Nov 4, 2014

Oh right, intersection vs exclusive. oh well

Great! I spent 5 minutes building time series by hand from a stream of imaginary JSON.

@nikhan

nikhan commented Nov 4, 2014

sorry 😧

@durple
Contributor Author

durple commented Nov 4, 2014

Can I have table "key" with row "co occurrence" by time?

Not sure if I understand. Isn't that the same as having more than one column as a primary key? If so, it becomes the same as what I mentioned in the example.

@nikhan

nikhan commented Nov 4, 2014

what is wrong with that?

@mikedewar
Collaborator

amen re: explaining no data


3 participants