
clickhouse-local build small enough to fit in a Lambda function #29378

Closed
occasionallydavid opened this issue Sep 26, 2021 · 21 comments

@occasionallydavid

occasionallydavid commented Sep 26, 2021

Use case

clickhouse-local is probably the most powerful logs analysis tool I've ever come across, and it seems to obviate most of the pain of running a full ClickHouse installation (ops work, ETL) without sacrificing much (if any?) performance for offline batch tasks.

For use cases where large amounts of logs are stored in object stores like GCS or S3, the ability to stream-read from the object store and process with clickhouse-local is very desirable: no large, expensive VMs running a permanent database with a copy of the original data.

This raises the possibility: can ClickHouse be made totally serverless? It is very cheap to run short S3/GCS scan jobs from Lambda or Cloud Functions. Many organizations already use this pattern, but they usually write custom code to run in Lambda, whereas ClickHouse already covers many of these use cases in a much nicer form. However, the current ClickHouse binary is far too large to fit in a Lambda ZIP file (current max: 50 MB, vs. 300 MB+ for the current official binaries).

Describe the solution you'd like

Potentially a custom build, or simply some documented steps (PGO build?), to slim down clickhouse-local so it reads only from S3/GCS/URLs, with enough functionality disabled that it fits within 50 MB. Is it possible? I see lots of template-heavy C++; perhaps this is why the binary is so large, and it won't change :)

Describe alternatives you've considered

The obvious alternative is spinning up a container or spot instance to run a job, but this requires inventing some ops framework for managing the VMs and containers. Lambda functions can be fed e.g. by SNS and auto-scaled according to the queue of user queries with zero management.

Additional context

None. Low priority, just an idea, but one I've already tried because it makes so much sense over here.

@alexey-milovidov
Member

alexey-milovidov commented Sep 26, 2021

Lowering the size to 50 MB should be doable.

  1. Disable unneeded libraries during build (LLVM, Hyperscan, HDFS, LDAP and Kerberos, Kafka, RabbitMQ, Cassandra, ODBC...)
  2. Disable unused functions, aggregate functions, table functions, storage engines, dictionary types by removing them.
  3. Disable unused tools: server, keeper, etc.
  4. Disable unused datatypes like 128 and 256 bit integers.
  5. Strip the binary.
  6. Additionally remove dynamic symbol table from the binary.
  7. Pack to self-extracting executable, let it extract to memory.

@alexey-milovidov
Member

The binary from master almost fits:

milovidov@milovidov-desktop:~/work/ClickHouse/programs/server$ wc -c clickhouse-master 
2212355520 clickhouse-master
milovidov@milovidov-desktop:~/work/ClickHouse/programs/server$ strip clickhouse-master
milovidov@milovidov-desktop:~/work/ClickHouse/programs/server$ wc -c clickhouse-master 
325325920 clickhouse-master
milovidov@milovidov-desktop:~/work/ClickHouse/programs/server$ zstd -9 -k clickhouse-master
clickhouse-master    : 21.69%   (325325920 => 70548111 bytes, clickhouse-master.zst) 
milovidov@milovidov-desktop:~/work/ClickHouse/programs/server$ zstd -22 -k clickhouse-master
Warning : compression level higher than max, reduced to 19 
zstd: clickhouse-master.zst already exists; overwrite (y/N) ? y
clickhouse-master    : 18.77%   (325325920 => 61071695 bytes, clickhouse-master.zst)

@occasionallydavid
Author

Packing the official build with UPX yields a 77 MB binary that takes ~1 second to start up. This doesn't use up much of /tmp or /dev/shm storage, and the overhead is fine for a 30-60 second job. I think putting the binary in its own S3 bucket and downloading it at startup will work. Will report back later.

@nvartolomei
Contributor

nvartolomei commented Sep 30, 2021

@occasionallydavid where does the 50 MB limit come from? I did some experiments with ClickHouse on AWS Lambda and was able to build and run even 2 GB custom debug Docker images successfully on Lambda.

https://docs.aws.amazon.com/lambda/latest/dg/images-create.html

@occasionallydavid
Author

occasionallydavid commented Oct 14, 2021

Just a little update on this: I got it working.

ClickHouse needs a small patch to allow PR_SET_NAME to fail. I used a custom build with most libraries disabled, with only -local bundled, then UPX'd the binary and copied it into an AWS Docker runtime image. Without UPX, cold start is much worse, perhaps due to random I/O to the binary over a high-latency network.

With a 7076 MiB allocation (4 vCPUs), the container boots in around 5500 ms and can COUNT(*) 33 million rows from a zstd-compressed tab-separated file in S3 at around 44 MiB/s compressed / 651 MiB/s decompressed. With a 10000 MiB allocation (6 vCPUs), 59 MiB/s / 873 MiB/s decompressed.

That works out to around $2.60/TB, compared to $5.00/TB for Redshift Spectrum, and that is still without trying an ARM build (which reduces cost by 20%).

I haven't yet tried many complex queries. Lambda only has 500 MB /tmp and no /dev/shm, so heap is the only real available storage, which tops out at 10 GiB.

The ideal goal for this is using around 220 functions to get 10 GiB/s throughput, then performing the query in 2 steps: the first produces results for each individual file / partition, then a final query combines them. Do you think there is any easy way to tease useful information out of the ClickHouse parser to automatically write the combining (or the partitioned) query?

@alexey-milovidov
Member

alexey-milovidov commented Oct 14, 2021

@occasionallydavid Good news!

Maybe you can send a PR with the applied changes (a "draft" PR to be used as an example).
Then we can structure it to be easily enabled and add it as a custom build to CI.

Do you think there is any easy way to tease useful information out of ClickHouse parser to automatically write a combining (or the partitioned) query?

Yes and it is already implemented, see the s3Cluster table function: #22012

@occasionallydavid
Author

occasionallydavid commented Oct 14, 2021

I will definitely send a PR for that PR_SET_NAME change.

Just thinking about how this could be 'done properly', and especially about addressing the cost of UPX (900 ms per invocation -- approx 10% of the entire runtime for my current input files), would it be crazy to think about a dedicated local-like mode just for Lambda?

The current stripped+UPX'd binary is 48 MiB, almost comfortably small enough to fit in a ZIP rather than a Docker image (faster cold start). But to stay under 50 MiB, the wrapper code can be no larger than 2 MiB. A little statically linked HTTP server would do it, but every invocation would still pay the UPX decompression cost.

So what about reusing the Poco HTTPServer etc. that ClickHouse already links, and implementing the Lambda runtime interface directly? UPX decompression would then run only once during cold start, and the new special mode would take care of cleaning up and reinitializing after each invocation.
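
For reference, the Lambda runtime interface such a mode would have to speak is just a small HTTP polling loop. A minimal Python sketch (endpoints per the AWS custom runtime interface docs; run_clickhouse_local is a hypothetical placeholder for actually executing the query):

import json
import os
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE = f"http://{RUNTIME_API}/2018-06-01/runtime"

def run_clickhouse_local(event: dict) -> dict:
    # Placeholder: exec the slimmed clickhouse-local binary and capture its output.
    return {"rows": []}

while True:
    # Block until the Lambda service hands us the next invocation.
    with urllib.request.urlopen(f"{BASE}/invocation/next") as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        event = json.loads(resp.read())

    result = run_clickhouse_local(event)

    # Post the result back; this completes the invocation.
    urllib.request.urlopen(urllib.request.Request(
        f"{BASE}/invocation/{request_id}/response",
        data=json.dumps(result).encode(),
        method="POST",
    ))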

It seems it would not be a huge amount of work, but maybe cleaning up state between runs is more involved than I imagine right now.

In any case it's still unclear how useful this whole setup is in general. Lambda's storage + RAM limits are quite severe, so this may always only be of limited use for e.g. simple log filtering / counting tasks.

@alexey-milovidov
Member

We can implement a self-extracting executable with zstd; it should be slightly better (I expect around 100..300 ms for decompression).

It's possible to create a Docker image with overhead of only 1.5 MB:
https://github.com/ClickHouse/ClickHouse/tree/master/docker/bare

Trying to implement Lambda API in the existing HTTP server is possible and fairly easy.

Lambda's storage + RAM limits are quite severe, so this may always only be of limited use for e.g. simple log filtering / counting tasks.

I think it's better to try Fargate.

@nvartolomei
Contributor

nvartolomei commented Oct 14, 2021

My experience:

I did implement the Lambda runtime interface and distributed query processing across multiple Lambdas. My conclusion is that it's unusable in the current Lambda design.

The major issue is the limitation on Lambda request/response size, 6 MB: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

That severely limits the queries you can run, especially when there are aggregations which need to transfer intermediate results that can be on the order of GBs. Also, there is no support for streaming.

An alternative would be to spill requests/results to S3 and use e.g. DynamoDB to implement communication between Lambdas, but things get complicated (hacky?) very quickly.

I've noticed that Google Cloud Run supports streaming gRPC. That could be a more appropriate pay-per-use target.

https://twitter.com/nvartolomei/status/1427409540790767616?t=yv5i60h0ljTYFXMvTpYlzQ&s=19

@alexey-milovidov
Member

alexey-milovidov commented Oct 14, 2021

6 MB limit is ridiculous 🤣

And for distributed queries we would have to somehow discover other Lambdas and connect to the exposed ports directly... which is most likely not possible.

@occasionallydavid
Author

occasionallydavid commented Oct 14, 2021

@nvartolomei for handling a mutating underlying set of files (e.g. growing logs), there is a lot of sense in writing individual results to S3, at a little extra cost in latency and price. This could be either a 1:1 mapping between an input object and some query result object, or a 1:1 mapping to an S3 multipart upload chunk. I was considering the former, as it means old results can be reused if a query is later re-run.

It would also be possible to stream the result out over some TCP connection, but that would again need some extra design/infrastructure.

Re: Fargate, it is a good idea, but its cold start time is far too high for interactive use.

edit: using memfd_create() it is possible to zstd-unpack ClickHouse once during container init, and the final artifact is <40 MiB, allowing ZIP deployment. Cold start is now 1100 ms (using the Python runtime for the wrapper); the final cold start with a statically compiled wrapper will probably be closer to 900-1000 ms. Per-invocation overhead is around 40 ms. Considering this 'solved'; it now just needs to be tidied up into a real solution.
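
For reference, the cold-start half of that trick can be sketched in a few lines of Python (a sketch only, assuming the third-party zstandard package and a bundled clickhouse.zst payload; the real wrapper discussed below is not this code):

import os
import zstandard  # assumed third-party package

def load_clickhouse(payload="clickhouse.zst"):
    # Decompress the packed binary once, into an anonymous in-memory file.
    memfd = os.memfd_create("clickhouse", 0)  # flags=0: no CLOEXEC, so child processes can exec it
    with open(payload, "rb") as src, os.fdopen(os.dup(memfd), "wb") as dst:
        zstandard.ZstdDecompressor().copy_stream(src, dst)
    return memfd  # keep open for the container's lifetime; exec via /proc/self/fd/<memfd>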

@alexey-milovidov
Member

alexey-milovidov commented Oct 14, 2021

Ok. Two more ideas:

  1. It is possible to emulate a distributed query by storing intermediate data in S3. To get the intermediate data you just need to specify the --stage=with_mergeable_state argument (somehow it does not work in clickhouse-local) or rewrite all aggregate functions by adding the -State suffix, for example: SELECT uniqState(number) FROM numbers(10). Then do the final aggregation with another query using the -Merge suffix on the aggregate functions; see the sketch after this list. The whole construction becomes surprisingly similar to map-reduce, with all its disadvantages.

Note: it won't work for queries that have more than one stage of coordination.

  2. Does Lambda allow "out of API" incoming connections? If yes, then we can use ZooKeeper for service discovery (every Lambda will register its address) and then connect directly. If not, it definitely should support outgoing connections. We can implement "network relay" servers, like TCP proxies. All clickhouse nodes will connect to all network relays and wrap the wire protocol, describing where they actually want to connect. The relay will forward all the traffic accordingly.
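
As a concrete illustration of idea 1, a rough Python sketch driving two clickhouse-local invocations (the local part_*.native files stand in for S3 objects; the column name and types are illustrative):

import subprocess

# Stage 1: each worker writes a partial aggregation state in Native format.
with open("part_0.native", "wb") as out:
    subprocess.run(
        ["clickhouse", "local", "-q",
         "SELECT uniqState(number) AS s FROM numbers(10) FORMAT Native"],
        stdout=out, check=True)

# Stage 2: a final query merges the partial states from all parts.
merged = subprocess.run(
    ["clickhouse", "local", "-q",
     "SELECT uniqMerge(s) FROM file('part_*.native', 'Native', "
     "'s AggregateFunction(uniq, UInt64)')"],
    capture_output=True, text=True, check=True)
print(merged.stdout)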

@occasionallydavid
Author

Does lambda allow "out of API" incoming connections

This sits far outside the Lambda model. It is definitely possible to start some background process to wire up networking, but having functions communicate while not serving an active request is not an intended pattern. Each function must have a request active for its VM to remain unfrozen and to prevent random destruction by the orchestrator.

It would definitely be nice to explore running clickhouse-server inside Lambda. I'm content focusing on -local, as I probably don't have enough experience for the -server route yet. Your pointer to AggregateFunction documentation was extremely useful, especially when combined with discovery of the ANTLR grammar.

I will look at higher-level execution problems later. For now, just getting a robust build of the wrapper + ClickHouse for Amazon Linux is enough of a pain :)

@alexey-milovidov
Member

ClickHouse built for Amazon Linux

What is special about the build for Amazon Linux?

@occasionallydavid
Author

occasionallydavid commented Oct 16, 2021

What is special about the build for Amazon Linux?

Not much, just an older libc, ancient compilers, and some build issues. There is a vendored readpassphrase() but somehow it does not get linked, and the libc does not provide one. I could not find any CMake statements that would cause it to be linked; eventually I got a working binary by manually editing the final link command line. This might be due to the old CMake in use, or CMake cluelessness on my part.

I got it building with an ugly Dockerfile that also builds clang from scratch, as there don't seem to be any RPMs with a modern enough version of clang targeting the RHEL 7-like environment of Amazon Linux (though I didn't spend much time looking). The custom clang build may well be pointless; I tried it after seeing the readpassphrase() errors, assuming they might be a problem with the ancient linker.

Current config is:

  • -DCMAKE_C_COMPILER=/usr/local/bin/clang-13 \
    -DCMAKE_CXX_COMPILER=/usr/local/bin/clang++ \
    -DCMAKE_BUILD_TYPE=Release \
    -DENABLE_CLICKHOUSE_ALL=OFF \
    -DENABLE_CLICKHOUSE_LOCAL=ON \
    -DENABLE_AVX=ON \
    -DENABLE_AVX2=ON \
    -DENABLE_LIBRARIES=OFF \
    -DUSE_UNWIND=ON \
    -DENABLE_UTILS=OFF \
    -DENABLE_TESTS=OFF \
    -DENABLE_SSL=ON

  • strip -s resulting binary

  • zstd --ultra --long -22

  • Final ZIP: https://im-clickhouse-lambda.s3.eu-west-1.amazonaws.com/clickhouse.zip

  • Lambda uses the Python 3 runtime with a fake startup script to avoid being charged for the cold start (roughly -10% cost in my use case)

  • The Lambda handler is written in Rust, implementing in-memory zstd decompression to a memfd and using execve() on /proc/self/fd/<memfd> to start. Ideally the Lambda handler would move into ClickHouse, and the bootstrap would just implement zstd + exec'ing ClickHouse over the top of itself, because the 40 ms overhead is pretty huge when doing smaller queries (e.g. using LIMIT 100000 etc.)

Lambda definition:

{
    "Configuration": {
        "FunctionName": "clickhouse",
        "Runtime": "python3.9",
        "Handler": "lambda_billing_hack.lambda_handler",
        "CodeSize": 40700003,
        "Description": "",
        "Timeout": 900,
        "MemorySize": 7076,
        "Version": "$LATEST",
        "TracingConfig": {
            "Mode": "PassThrough"
        },
        "PackageType": "Zip"
    },
    "Code": {
        "RepositoryType": "S3",
        "Location": "..."
    },
    "Concurrency": {
        "ReservedConcurrentExecutions": 500
    }
}

Example event:

{
    "args": ["CLICKHOUSE", "local", "-q", "SELECT 1"]
}
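
For illustration, a hedged Python sketch of what a handler for an event in this shape might look like, reusing the memfd from the load_clickhouse() sketch in an earlier comment (names are illustrative; the actual handler is the Rust bootstrap described above):

import subprocess

CLICKHOUSE_FD = load_clickhouse()  # memfd produced once at cold start (see the earlier sketch)

def lambda_handler(event, context):
    # Replace the "CLICKHOUSE" argv[0] placeholder with the memfd path.
    argv = [f"/proc/self/fd/{CLICKHOUSE_FD}"] + event["args"][1:]
    proc = subprocess.run(argv, capture_output=True, text=True,
                          pass_fds=(CLICKHOUSE_FD,),  # keep the memfd open across fork+exec
                          timeout=890)                # just below the 900 s function timeout
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}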

@mikeTWC1984

I got it building with an ugly Dockerfile that also builds clang from scratch, as there don't seem to be any RPMs with a modern enough version of clang targeting the RHEL 7-like environment of Amazon Linux (though I didn't spend much time looking). The custom clang build may well be pointless; I tried it after seeing the readpassphrase() errors, assuming they might be a problem with the ancient linker.

@occasionallydavid Can you share your Dockerfile?

@occasionallydavid
Author

occasionallydavid commented Nov 5, 2021

Hi Mike,

IIRC the attached Dockerfile build will fail: the final link command line is missing a reference to libreadpassphrase. You must edit the command line to include libreadpassphrase (which does otherwise get built) via docker run -it --rm [last build step ID].

Dockerfile.clickhouse-build.txt

pr-set-name.patch.txt

If you just want a binary to play with, there is one in the ZIP file at https://im-clickhouse-lambda.s3.eu-west-1.amazonaws.com/clickhouse.zip

@mikeTWC1984

@occasionallydavid Thanks, I had a hard time building ClickHouse until I used your patch.
One more question: I see you use a custom bootstrapper to launch ClickHouse. Is it faster than the UPX'd binary, or just smaller? I was looking for a slim version of clickhouse-local to run in an arbitrary Linux environment, but that bootstrapper doesn't seem to work outside Lambda. Is it possible to launch it within the same image generated by your Dockerfile?

@occasionallydavid
Author

Hi @mikeTWC1984,

The custom bootstrap allows decompressing (via zstd) once at function cold start. Subsequent invocations do not pay the cost of decompression again, saving ~1-2 seconds per execution. See the comments above about memfd_create().

You can just use UPX, but in that case there is a large, expensive decompression step on every invocation, which was a double-digit percentage of my overall runs.

@pkit
Contributor

pkit commented Jul 3, 2022

Ok. Two more ideas:

  1. It is possible to emulate a distributed query by storing intermediate data in S3. To get the intermediate data you just need to specify the --stage=with_mergeable_state argument (somehow it does not work in clickhouse-local) or rewrite all aggregate functions by adding the -State suffix, for example: SELECT uniqState(number) FROM numbers(10). Then do the final aggregation with another query using the -Merge suffix on the aggregate functions. The whole construction becomes surprisingly similar to map-reduce, with all its disadvantages.

Note: it won't work for queries that have more than one stage of coordination.

  2. Does Lambda allow "out of API" incoming connections? If yes, then we can use ZooKeeper for service discovery (every Lambda will register its address) and then connect directly. If not, it definitely should support outgoing connections. We can implement "network relay" servers, like TCP proxies. All clickhouse nodes will connect to all network relays and wrap the wire protocol, describing where they actually want to connect. The relay will forward all the traffic accordingly.

I tried all of these (and much more, including NAT piercing) approximately a year ago.
TL;DR: it doesn't work (highly unreliable).
The only reliable pattern is storing any intermediate data on S3 and then iterating on that in a map-reduce pattern.
Low-latency k/v stores (Redis, etc.) can also help, but they are so expensive that it's just not feasible.

@alexey-milovidov
Member

alexey-milovidov commented Aug 21, 2022

I've compiled ClickHouse with clang-16 (trunk) on my machine with the patch #40460 and the following options:

CC=clang CXX=clang++ cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_CLICKHOUSE_ALL=OFF \
-DENABLE_CLICKHOUSE_LOCAL=ON \
-DENABLE_AVX=ON \
-DENABLE_AVX2=ON \
-DENABLE_LIBRARIES=OFF \
-DUSE_UNWIND=ON \
-DENABLE_UTILS=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_EMBEDDED_COMPILER=0 \
-DENABLE_HDFS=0 \
-DENABLE_AZURE_BLOB_STORAGE=0 \
-DENABLE_CASSANDRA=0 \
-DENABLE_KRB5=0 \
-DENABLE_ODBC=0 \
-DENABLE_AMQPCPP=0 \
-DENABLE_LDAP=0 \
-DENABLE_SSL=ON

It builds successfully, and the size of the binary is

programs/self-extracting$ ls -l clickhouse 
-rwxrwxr-x 1 milovidov milovidov 52233594 aug 21 12:14 clickhouse

Which is:

SELECT formatReadableSize(52233594)

Query id: 11b47726-3816-42d0-b71a-162c427927ee

┌─formatReadableSize(52233594)─┐
│ 49.81 MiB                    │
└──────────────────────────────┘

lmangani added a commit to lmangani/chdb that referenced this issue Apr 15, 2023
AWS Lambdas (and other virtualized platforms) lack support for PR_SET_NAME, causing a blocking exception. Pending an upstream PR or fix in ClickHouse, this patch allows this function to fail unharmed. The resulting executable has been tested on various platforms without drawbacks and discussed in clickhouse issue [29378](ClickHouse/ClickHouse#29378)

$ sed -i '/Cannot set thread name/c\' /ClickHouse/src/Common/setThreadName.cpp
auxten pushed a commit to chdb-io/chdb that referenced this issue Apr 16, 2023
* PR_SET_NAME workaround

AWS Lambdas (and other virtualized platforms) lack support for PR_SET_NAME, causing a blocking exception. Pending an upstream PR or fix in ClickHouse, this patch allows this function to fail unharmed. The resulting executable has been tested on various platforms without drawbacks and discussed in clickhouse issue [29378](ClickHouse/ClickHouse#29378)

$ sed -i '/Cannot set thread name/c\' /ClickHouse/src/Common/setThreadName.cpp

* Disable AVX2 support