Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest events backwards #6834

Closed
mariusandra opened this issue May 10, 2021 · 15 comments
Closed

Ingest events backwards #6834

mariusandra opened this issue May 10, 2021 · 15 comments
Assignees
Labels
P1 Urgent, non-breaking (no crash but low usability) stale

Comments

@mariusandra
Copy link
Collaborator

Whenever we ingest events, we end up with entries in 3 tables: events, persons and person_distinct_ids.

Excluding plugins for now, if we ingest the same events again in the same order, we will (most likely) end up with the same entries in those tables.

However, I think we should also get exactly the same entries in those tables if we ingest events... backwards. If we do, it would make ingestion, retries and exports a lot simpler. For example, we could export multiple months in parallel from one posthog instance to another and know that all persons were transferred as they are.

This task touches two things:

  • Store person property updated_at timestamps... and then compare them before making any changes (in denormalized properties?). We need this anyway for $set_once issues, but this could also be added to $set (if a property has been set by an event in the future, don't reset it to something if this event is in the past).
  • Store some metadata on when users were merged and use that to prevent past changes from overriding future changes (e.g. somehow prevent renaming the new user to the old user)

The details and feasability of the above points need to be specced out.

@yakkomajuri
Copy link
Contributor

@tiina303

@tiina303
Copy link
Contributor

A case for importance (being able to replay events later is already called out and we saw a need for that this week), but additionally network is unpredictable - sometimes a packet sent later will arrive earlier, sometimes an earlier packet fails and will be retried later.

For set we can just look at timestamps, however for set_once we can't as it shouldn't overwrite anything that was earlier, so imagine:

Timestamp | function | value
1         | set      |  1
2         | set_once |  2

Regardless of the order of arrival we want the value to be 1 in the end. If 2 is sent first, we need to know it was a set_once call that set it to know that we should overwrite it with the earlier value.

Furthermore this needs to also end up with val = 1 even if packet 1 is sent later.

Timestamp | function | value
1         | set_once |  1
2         | set_once |  2

I propose we keep an additional column set_once:bool that tells us if it was set by set_once, if it was and we see an earlier set or an earlier set_once, we need to overwrite it and update the timestamp to earlier.

Second tricky function is alias, currently we say we resolve conflicts by overwriting with the first distinct ID values if there's a conflict, however here if we get out of order events it gets tricky to resolve them, instead we could just rely on timestamps (latest one is used, well the same set/set_once applies as above). Then we can just look a the current values with timestamps & later coming values and we can just see this as one person and set/set_once already work in any order (I assume if we get a later set/set_once for someone who was aliased we can follow the redirect to the new person).

@mariusandra
Copy link
Collaborator Author

Hmm... how come we need to complicate this?

We give each property a timestamp. Then if $set comes in, we override if its timestamp is later than the property's timestamp. If so, we override. If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

Or what am I missing?

@tiina303
Copy link
Contributor

If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

Imagine someone uses set_once and set both for the same key, we expect a value written by a later timestamped set not to get overwritten by an earlier timestamped set_once. A real world example of this could be we have some places, where we have weak evidence for user, where we want to only write those, if there wasn't a value before. For example imagine an App, where a user can set location in their profile but the app also uses the location based on where the user is visiting the site as the weak evidence. So Alice visited a site from London (which sends a set_once) and updated her location to be Dublin (which sends a set request), due to network problems the set_once('Alice', {'location':'London'}) request arrives later, but has an earlier timestamp (compared to the stored timestamp that the set request had). We don't want that to override the value, i.e. we want in the end to have 'location':'Dublin' regardless of which order the set & set_once requests arrive.

@mariusandra
Copy link
Collaborator Author

I'm still not following. If requests arrive later because of network issues, they should still contain the original timestamp, no? That's why we have all these now, sent_at and timestamp fields in the system.

@tiina303
Copy link
Contributor

tiina303 commented Oct 1, 2021

That's why we have all these now, sent_at and timestamp fields in the system.

I'm not actually sure how & where these three are set, I was making the assumption that we only have timestamp (which is based on what we want to either overwrite the value or not).

I'm still not following.

Sorry this can be confusing & maybe I'm misunderstanding something. You can check the table below, maybe that's more helpful.

To elaborate on my earlier example ... let me provide the exact calls I was thinking & what we'd want to happen. Also adding numbering to my statements, so it's easier to call out where it doesn't make sense anymore.

set_once('Alice',  {'location':'Dublin'}, timestamp="16:00")
set('Alice', {'location':'London'}, timestamp="16:01")

(1) For this in the end we want the outcome to be Alice.location == London
(2) if the calls arrive in this order, it's simple we first set the location to be Dublin and later see the set call, which overrides it to London.
(3) The above works based on your proposal

if $set comes in, we override if its timestamp is later than the property's timestamp.

(4) Now imagine network was bad and the requests arrive in the reverse order

set('Alice', {'location':'London'}, timestamp="16:01")
set_once('Alice',  {'location':'Dublin'}, timestamp="16:00")

(5) After having processed the first set call, we have Alice.location == London with the timestamp "16:01".
(6) Now processing the set_once call. If we apply what you proposed ,

If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

(7) then we see that set_once came in
(8) we see that it does have an earlier timestamp compared to the stored one
(9) so you say we should override
(10) we override and end up with Alice.location == Dublin
(11) but that's not what we wanted, we should get Alice.location = London despite that the requests arrived in this order.


Here's a table to explain when the override should happen. The example I provided above corresponds to row number 7. Based on rows 3&4 (also 7&8) differing depending on what the previous call that wrote the value was I can't think of a way to override correctly without that information.

  | call     | value exists  | call TS is ___ existing TS | previous fn | write/override 
 1| set      | no            | N/A                        | N/A         | yes
 2| set_once | no            | N/A                        | N/A         | yes
 3| set      | yes           | before                     | set         | no
 4| set      | yes           | before                     | set_once    | yes
 5| set      | yes           | after                      | set         | yes
 6| set      | yes           | after                      | set_once    | yes 
 7| set_once | yes           | before                     | set         | no 
 8| set_once | yes           | before                     | set_once    | yes
 9| set_once | yes           | after                      | set         | no
10| set_once | yes           | after                      | set_once    | no

@mariusandra
Copy link
Collaborator Author

Thanks for being patient and taking time to break this down. Now I get the issue and adding this column does make a lot of sense!

@PostHog PostHog deleted a comment from posthog-contributions-bot bot Oct 5, 2021
@yakkomajuri
Copy link
Contributor

This discussion (+ @tiina303's table particularly) was so useful

#6259

@posthog-contributions-bot
Copy link
Contributor

This issue has 1133 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

@tiina303 tiina303 self-assigned this Oct 25, 2021
@tiina303 tiina303 transferred this issue from PostHog/plugin-server Nov 3, 2021
@tiina303 tiina303 added data-trust P0 Critical, breaking issue (page crash, missing functionality) labels Nov 4, 2021
@tiina303
Copy link
Contributor

tiina303 commented Dec 1, 2021

#7454 Person Created At needs to also handle ingestion in any order

@marcushyett-ph
Copy link
Contributor

Curious what the status of this is? Anything I can to do help at all (e.g. manual testing, etc)?

@tiina303
Copy link
Contributor

tiina303 commented Dec 6, 2021

Potentially ... here's how I'm thinking about doing the testing:
I have a DigitalOcean instance already (do-lon1-tiina-testing-2021-11-30-2054 with hostname/user/pass here: https://posthog.slack.com/archives/C01MM7VT7MG/p1638494244381100?thread_ts=1638494082.380800&cid=C01MM7VT7MG).
The plan is:

  1. migrate the old way in-order into a new project from hogflix demo app - user props data should match hogflix, but wanted to use this as the base to avoid confusion.
  2. migrate the old way backwards into a new project (haven't fully figured this out, but thinking of edit/fork migrator plugin) - user props data shouldn't match, verify seeing some differences
  3. update the instance to use an image on top of Merging people handling for any order ingestion #7109 (needs some changes) on top of Person and group created at for any order ingestion #7472 & set NEW_PERSON_PROPERTIES_UPDATE_ENABLED_TEAMS to be all the appropriate project IDs, my network was too bad for image push 🙄
  4. migrate new way in-order into a new project - user props data should probably match with (1), but could not
  5. migrate new way reverse order into a new project - user props data should probably match with (3)

I suspect we'll see some differences that we'll need to ignore, e.g. maybe geoIP or maybe turning that off before is better.

@tiina303 tiina303 added P1 Urgent, non-breaking (no crash but low usability) and removed P0 Critical, breaking issue (page crash, missing functionality) labels Feb 16, 2022
@tiina303
Copy link
Contributor

Relevant recent conversation (tied to person properties updates): https://posthog.slack.com/archives/C0374DA782U/p1660840149534569

We have properties_last_operation and properties_last_updated_at but they aren't always updated properly.
Given how our js lib sends browser etc every time we would increase the number of writes to postgres if we updated those fields (specifically timestamp for properties_last_updated_at) properly.

@posthog-bot
Copy link
Contributor

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

@posthog-bot
Copy link
Contributor

This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.

@posthog-bot posthog-bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Urgent, non-breaking (no crash but low usability) stale
Projects
None yet
Development

No branches or pull requests

6 participants