Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actions page event list does a "over all of time" query. #5459

Closed
sentry-io bot opened this issue Aug 5, 2021 · 14 comments · Fixed by #7855
Closed

Actions page event list does a "over all of time" query. #5459

sentry-io bot opened this issue Aug 5, 2021 · 14 comments · Fixed by #7855
Labels
bug Something isn't working right feature/actions Feature Tag: Actions feature/activity Feature Tag: Activity (fka Events) P1 Urgent, non-breaking (no crash but low usability) performance Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@sentry-io
Copy link

sentry-io bot commented Aug 5, 2021

Sentry Issue: POSTHOG-3C9

ServerException: Code: 159.
DB::Exception: Timeout exceeded: elapsed 180.020712036 seconds, maximum: 180: While executing Remote. Stack trace:

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. DB::ExecutionSpeedLimits::checkTimeLimit(unsigned long, DB::OverflowMode) const @ 0xf313e28 in /usr/bin/clickhouse
2. DB::SourceWithProgress::work() @ 0x1062ee11 in /usr/bin/clickhouse
3. ? @ 0x104aa3dd in /usr/bin/clickhouse
4. DB::PipelineExecutor::executeStepImpl(unsigned long, unsigned long, std::__1::atomic<bool>*) @ 0x104a6f71 in /usr/bin/clickhouse
5. ? @ 0x104ab9f6 in /usr/bin/clickhouse
6. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8bacedf in /usr/bin/clickhouse
7. ? @ 0x8bb0403 in /usr/bin/clickhouse
8. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
9. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

(6 additional frame(s) were not displayed)
...
  File "clickhouse_driver/client.py", line 453, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
  File "clickhouse_driver/client.py", line 117, in receive_result
    return result.get_result()
  File "clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "clickhouse_driver/client.py", line 133, in packet_generator
    packet = self.receive_packet()
  File "clickhouse_driver/client.py", line 150, in receive_packet
    raise packet.exception
@macobo macobo added bug Something isn't working right feature/activity Feature Tag: Activity (fka Events) labels Aug 5, 2021
@macobo
Copy link
Contributor

macobo commented Aug 5, 2021

Looking at the query, seems like the issue is filtering without a time range for a very specific action.

@macobo macobo added the feature/actions Feature Tag: Actions label Aug 5, 2021
@macobo
Copy link
Contributor

macobo commented Aug 5, 2021

How you might reproduce this:

  1. Have a big enough organization
  2. Create a very specific action matching no events
  3. See event table never load

@paolodamico @clarkus What do you think about solving this in the frontend? Currently we try to fetch last 100 times the action has occurred for the table without a time limit. Even limiting it to last 1 or 3 months would help a lot.

If we really wanted to we could also add a "load more" button but not sure it's really needed or if it'd get usage.

@clarkus
Copy link
Contributor

clarkus commented Aug 5, 2021

I think we could focus this on communicating how the user can recover from the issue. If no events are loading, a more meaningful empty state might be worth trying?

"There isn't a time range defined for this action. We've loaded the last N months of data. Apply a time range to your action to resolve this." Secondary to that, we can communicate the limit we applied in the absence of one provided by the user.

@paolodamico
Copy link
Contributor

@macobo can you clarify where this is occurring? Is this for the event matches list in the action edit page? If so, I definitely agree with having something like a 3-month limit with a load more, we can definitely communicate this in a clear way and I don't think you even need to go before that. This view is usually just used for debugging that you defined your action correctly, so I don't think you'll care for historic data on 99% of cases.

@macobo
Copy link
Contributor

macobo commented Aug 6, 2021

Good question, sorry for the confusing description.

It's the editing action page, the query is for the table underneath the fold:

image

(Also lol the double headers and wrong calculating time. Weird)

@macobo macobo added the performance Has to do with performance. For PRs, runs the clickhouse query performance suite label Aug 6, 2021
@clarkus
Copy link
Contributor

clarkus commented Aug 9, 2021

That makes more sense. 😄 A simple communication of the time range being displayed and some action to append more results seems like a good start to me.

@paolodamico
Copy link
Contributor

I think something like this could work. Changing the communication to make it clear that the important thing of the list is using it to troubleshoot your action rather than intended to be used for analytics or other purposes.

@macobo
Copy link
Contributor

macobo commented Aug 10, 2021

Thank to you both! Will take a quick stab at this soon :)

@macobo
Copy link
Contributor

macobo commented Oct 8, 2021

This started causing trouble for a key customer on their self-hosted instance - worth fixing soon!

@macobo macobo added the P1 Urgent, non-breaking (no crash but low usability) label Oct 8, 2021
@yakkomajuri
Copy link
Contributor

Adding context here as I reached this issue from Slack. It seems customer V was experiencing an issue with CH crashing under certain conditions related to actions. Customer N is also now having issues too.

What I found:

Like Customer V reported, when a lot of actions tabs were opened, ClickHouse crashed. I just saw this too. Opened a bunch of tabs and first I got some logs like this:

2021.11.01 11:29:13.533523 [ 537 ] {bc5c482e-3bf6-430a-806c-c86530c19a0c} <Debug> MergeTreeReadPool: Slow read, event №1: read 1048576 bytes in 1.074106822 sec., 953.35 KiB/s.
2021.11.01 11:29:13.552948 [ 528 ] {de4b8b77-6615-4dbf-94fe-1a9eb3c1ff93} <Debug> MergeTreeReadPool: Slow read, event №1: read 1048576 bytes in 1.02310175 sec., 1000.88 KiB/s.
2021.11.01 11:29:13.568935 [ 396 ] {d8e6da10-61be-4832-b363-39a49bed0193} <Debug> MergeTreeReadPool: Slow read, event №1: read 1048576 bytes in 1.130112392 sec., 906.10 KiB/s.
2021.11.01 11:29:13.578915 [ 224 ] {b83bb9d9-b573-4639-ad8f-574f7c9de38e} <Debug> MergeTreeReadPool: Slow read, event №2: read 1048576 bytes in 1.031102545 sec., 993.11 KiB/s.
2021.11.01 11:29:13.584407 [ 275 ] {8293123f-2294-4785-a582-74433d9df5ac} <Debug> MergeTreeReadPool: Will lower number of threads to 2
2021.11.01 11:29:13.584522 [ 331 ] {f816677e-bb41-48c8-80ab-5f2f1557a631} <Debug> MergeTreeReadPool: Will lower number of threads to 1
2021.11.01 11:29:13.584571 [ 338 ] {4c8567bb-e7ac-43c1-8602-df02e63cb985} <Debug> MergeTreeReadPool: Will lower number of threads to 2
2021.11.01 11:29:13.584428 [ 496 ] {915ed97f-18bc-410a-b157-fc4ff6de28b6} <Debug> MergeTreeReadPool: Will lower number of threads to 1
2021.11.01 11:29:13.584486 [ 365 ] {91853cff-140d-4755-8ffb-7ab4ae167831} <Debug> MergeTreeReadPool: Will lower number of threads to 1
2021.11.01 11:29:13.592537 [ 224 ] {b83bb9d9-b573-4639-ad8f-574f7c9de38e} <Debug> MergeTreeReadPool: Will lower number of threads to 1
2021.11.01 11:29:13.585568 [ 71 ] {24a4128e-0cd4-4222-ae73-56fef74dfad6} <Debug> MergeTreeReadPool: Will lower number of threads to 1

Then, suddenly, CH died:

2021.11.01 11:29:15.240020 [ 421 ] {915ed97f-18bc-410a-b157-fc4ff6de28b6} <Error> executeQuery: Code: 210, e.displayText() = DB::NetException: Connection reset by peer, while reading from socket (REDACTED) (version 21.6.5.37 (official build)) (from REDACTED) (in query: REDACTED),** Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. DB::ReadBufferFromPocoSocket::nextImpl() @ 0xf5d2b53 in /usr/bin/clickhouse
2. DB::TCPHandler::isQueryCancelled() @ 0x1043a7ba in /usr/bin/clickhouse
3. DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x104365ac in /usr/bin/clickhouse
4. DB::TCPHandler::runImpl() @ 0x1042f9be in /usr/bin/clickhouse
5. DB::TCPHandler::run() @ 0x10441839 in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x12a3fd4f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x12a417da in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x12b7ab39 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x12b76b2a in /usr/bin/clickhouse
10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
11. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

@yakkomajuri
Copy link
Contributor

Would like to bump this to P0 if possible.

@yakkomajuri
Copy link
Contributor

yakkomajuri commented Nov 1, 2021

Additionally, this appears to also be triggered via a dashboard with multiple panels that leverage actions

@macobo macobo mentioned this issue Nov 26, 2021
2 tasks
@macobo macobo changed the title Event list query timeout Actions page event list does a "over all of time" query. Dec 11, 2021
@macobo
Copy link
Contributor

macobo commented Dec 16, 2021

Additionally, this appears to also be triggered via a dashboard with multiple panels that leverage actions

@yakkomajuri are you sure it's the same root cause? The discussion here has been around events table doing an over all of time query, sounds like that's a separate bug. If you have a repro, mind filing it?

@yakkomajuri
Copy link
Contributor

I'm not sure of anything.

I believe me and you did some digging at the time and determined that their issues were something else. Lack of materialized columns etc.

I don't have context anymore so just ignore me until this comes back (if it does).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right feature/actions Feature Tag: Actions feature/activity Feature Tag: Activity (fka Events) P1 Urgent, non-breaking (no crash but low usability) performance Has to do with performance. For PRs, runs the clickhouse query performance suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants