[object_store] Self-signed certificates used by OneLake (Azure) #5696

Closed
martroben opened this issue Apr 26, 2024 · 20 comments
Labels
question Further information is requested

Comments

@martroben

Which part is this question about

object_store features

Describe your question

Is there some setting for object_store that would enable accessing Azure storage if the storage is using a self-signed certificate?

Additional context

I would like to use the Python polars library on the Microsoft Fabric platform.

Polars uses delta-rs which in turn uses object_store to interact with OneLake. (OneLake is the Azure Data Lake Storage Gen2 flavor used for storage in Fabric).

Everything works fine when accessing OneLake from a local device. However, when I try to use Polars in a Fabric notebook, I get the following error: error trying to connect: invalid peer certificate: Other(CaUsedAsEndEntity).

I checked the certificate with openssl s_client -connect onelake.blob.fabric.microsoft.com:443 -showcerts | openssl x509 -text in the Fabric Notebook environment. It turns out that onelake.blob.fabric.microsoft.com indeed uses a self-signed Certificate Authority (CA) certificate for connections coming from Fabric Notebooks. If I run the same check from a local device, the server provides a normal End Entity certificate.

I suspect that the error is caused by an upstream certificate validation function in webpki, which is in turn used by rustls. I don't speak Rust very well, but I assume that object_store also uses rustls.

The issue has been raised before in the webpki and rustls repos, but the maintainers of both have made it clear that they are not going to provide an out-of-the-box option for accepting connections that present a CA certificate as an end-entity (EE) certificate.

rustls includes an option to implement a custom certificate verifier. However, since I don't speak the language, I'm not sure if it is already included in object_store - and if not, could it be done?

The only check that would need to be omitted is the one asserting that CA needs to be FALSE in the certificate:

Certificate:
    Data:
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:FALSE
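For anyone repeating this check, here is a minimal sketch of reading the Basic Constraints value out of the text that `openssl x509 -text` prints. The helper name `basic_constraints_ca` and the sample text are illustrative only; the parsing assumes the usual OpenSSL text layout shown above and is not a substitute for real certificate validation.

```python
import re
from typing import Optional


def basic_constraints_ca(x509_text: str) -> Optional[bool]:
    """Return True for CA:TRUE, False for CA:FALSE, None if the extension is absent.

    Assumes the layout printed by `openssl x509 -text`, where the CA flag
    appears on the line after the "X509v3 Basic Constraints" header.
    """
    match = re.search(
        r"X509v3 Basic Constraints:[^\n]*\n\s*CA:(TRUE|FALSE)", x509_text
    )
    if match is None:
        return None
    return match.group(1) == "TRUE"


# Sample text in the same shape as the excerpt above (hypothetical input).
sample = """\
Certificate:
    Data:
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:FALSE
"""

print(basic_constraints_ca(sample))
```

A result of True for the certificate the server presents as its own would be consistent with the CaUsedAsEndEntity error.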

I have also opened a support case with the Microsoft Fabric team, asking them to start using valid certificates (request #2404260050002071). However, if anybody's going to solve it, my money is on the open source community. As of now, rust-based libraries are basically unusable in Microsoft Fabric, which is unfortunate.

PS: I'm aware of the allow_invalid_certificates option and it does fix the problem. But my love for polars isn't big enough to start using insecure connections in production.
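For reference, a sketch of how that option is typically passed, kept behind an explicit opt-in so it cannot be enabled by accident. The helper `with_certificate_override` is hypothetical, and it is an assumption that the calling library forwards its storage options to object_store unchanged.

```python
from typing import Dict, Optional


def with_certificate_override(
    storage_options: Optional[Dict[str, str]] = None,
    *,
    insecure: bool = False,
) -> Dict[str, str]:
    """Return a copy of the storage options, optionally disabling certificate checks.

    `insecure=True` adds the allow_invalid_certificates option mentioned above.
    This turns off TLS certificate validation and is meant for debugging only.
    """
    options = dict(storage_options or {})
    if insecure:
        options["allow_invalid_certificates"] = "true"
    return options


options = with_certificate_override({}, insecure=True)
print(options)

# Assumed usage (not run here), e.g. with polars:
#   pl.read_delta(table_uri, storage_options=options)
```

Leaving `insecure` at its default keeps validation on, which is the behaviour wanted in production.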

@tustvold
Contributor

tustvold commented Apr 26, 2024

Following #5056 we by default use the system trust store to source the CA.

I am not very familiar with Microsoft Fabric, but if they are using self-signed certificates I would have expected them to populate this. If they aren't, you may be able to override SSL_CERT_DIR to wherever they are placing the certs (I presume they must be placing them somewhere).

#5517 may also be related.

However, if anybody's going to solve it, my money is on the open source community

This is depressingly likely true, although FWIW I grow increasingly frustrated at the amount of bodgery required to support Azure... They're up to 4 or possibly 5 object store like products that are all slightly different, expose slightly different functionality, and require their own hacks to make work correctly, it's getting kind of ridiculous...

@roeap I don't suppose you know anything about what might be going on here?

@martroben
Author

@tustvold, thank you for the background information.

Downloading the certificate and manually pointing the SSL_CERT_FILE environment variable to it is how we solved a similar problem in the same setup (polars in Fabric). It worked well until a few weeks ago, but not anymore.
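A minimal sketch of that earlier workaround, assuming the certificate has already been saved to a local file. The helper name `point_tls_at_ca_bundle` and the path in the usage note are hypothetical, and the variable has to be set before the library opens its first connection.

```python
import os
from pathlib import Path


def point_tls_at_ca_bundle(ca_bundle: str) -> bool:
    """Set SSL_CERT_FILE to the given CA bundle if the file exists.

    Returns True when the variable was set, False when the file is missing.
    """
    path = Path(ca_bundle)
    if not path.is_file():
        return False
    os.environ["SSL_CERT_FILE"] = str(path)
    return True
```

In a notebook this would be called first, for example `point_tls_at_ca_bundle("/tmp/onelake-ca.pem")`, before importing polars or deltalake.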

In the previous incarnation of the same problem, interacting with a delta table would fail with an error from openssl, complaining about the peer certificate being self-signed. Now the error is coming from webpki (or so I think), saying that the problem is using a CA cert as EE cert. So I suppose I named the question somewhat misleadingly.

I'm not sure whether Microsoft started using a CA cert as EE on top of using self-signed certificates or if it was something else that changed - but as of now, object_store can't interact with OneLake delta tables from Fabric, even with a good CA file.

@tustvold
Contributor

Other than perhaps trying a build with OpenSSL, which may be more permissive about certificates being used in non-spec-compliant ways, I don't really know what to suggest. rustls is quite right to refuse to accept this.

Ultimately the onus is on Microsoft to fix this. Especially given the security-critical nature of this area, Microsoft should really do better.

@hnasrullakhan

https://www.ibm.com/docs/en/informix-servers/14.10?topic=openssl-x509v3-certificate-extension-basic-constraints

If you want your self-signed certificate to be recognized as a trusted CA certificate by systems like CentOS, you need to include the basicConstraints = CA:TRUE field in the certificate. This field specifies that the certificate is a CA certificate, allowing it to be recognized as such by the trust store mechanisms.

Without this field set to CA:TRUE, your self-signed certificate won't be recognized as a CA certificate, and therefore it won't be added to the CA trust store when you run the update-ca-trust command. This is why you're observing that the CA trust store isn't updated when basicConstraints is omitted or set to CA:FALSE.

So, to ensure that your self-signed certificate is added to the CA trust store and recognized as a CA certificate, you need to include basicConstraints = CA:TRUE when generating the certificate.

Looks like this change introduced the issue above, where self-signed certs are failing.
https://github.com/apache/arrow-rs/pull/5056/files

@tustvold
Contributor

tustvold commented May 2, 2024

Yes, but the issue, as far as I understand it, is that Azure is then using a CA certificate as an endpoint certificate, which is not only terrible practice but also not really permitted by the specification. CAs should be used to issue endpoint certificates, not act as them.

Looks like this change introduced the issue above, where self-signed certs are failing.
https://github.com/apache/arrow-rs/pull/5056/files

Have you tested this, or are you just presuming? Prior to that PR we ignored the system roots and used bundled root certs instead, so I would have thought there would be even less chance of it working.

@hnasrullakhan

I may be assuming, because that may be the only related change between object_store versions.

@tustvold
Contributor

tustvold commented May 2, 2024

Is there a version that works currently? The information presented thus far would suggest Azure changed something on their end.

@hnasrullakhan

deltalake==0.16.2, which was released before object_store version 0.9.1, works fine.

@tustvold
Contributor

tustvold commented May 2, 2024

The TLS changes were in 0.9.0, not 0.9.1. Is it possible there was a change in delta-rs? I know they used to override this and use OpenSSL.

@hnasrullakhan

delta-io/delta-rs#2449
They claim they have made no changes.

CaUsedAsEndEntity and self-signed certificate issues have been reported before on webpki. Has there been a regression recently?

@tustvold
Contributor

tustvold commented May 2, 2024

Perhaps you could share the exact error you are getting, to make sure we're not dealing with two unrelated issues here?

@hnasrullakhan

OSError: Generic MicrosoftAzure error: Error after 10 retries in 6.067636775s, max_retries:10, retry_timeout:180s, source:error sending request for url (https://onelake.blob.fabric.microsoft.com/TEST1/test2.Lakehouse/Tables/Test/_delta_log/_last_checkpoint): error trying to connect: invalid peer certificate: Other(CaUsedAsEndEntity)

@tustvold
Contributor

tustvold commented May 2, 2024

I'm afraid I don't know why this worked on older versions; perhaps webpki got more strict. You could file an upstream ticket, but to be completely honest the error is correct: the problem is OneLake.

@hnasrullakhan

hnasrullakhan commented May 3, 2024

@martroben can you also confirm that it works with deltalake-0.16.1?
Apologies, I said earlier that 0.16.2 works, but it is 0.16.1 on which the issue is not reproducible.

# Notebook cell: install the last version on which the issue does not reproduce.
!pip install deltalake==0.16.1
import pandas as pd
from deltalake.writer import write_deltalake

@martroben
Author

@hnasrullakhan, yes same for me.

deltalake-0.16.1 works, but versions starting from 0.16.2 don't.
I also included this new information under deltalake issue 2449.
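A small sketch for checking which side of that boundary an environment is on. The helper `predates_reported_regression` is hypothetical; the 0.16.2 threshold comes only from the reports in this thread, and the parser assumes a plain `major.minor.patch` prefix.

```python
import re
from importlib import metadata
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def predates_reported_regression(version: str) -> bool:
    """True for deltalake versions below 0.16.2, the first one reported as failing."""
    return parse_version(version) < (0, 16, 2)


try:
    installed = metadata.version("deltalake")
    print(installed, predates_reported_regression(installed))
except metadata.PackageNotFoundError:
    print("deltalake is not installed")
```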

@roeap
Contributor

roeap commented May 8, 2024

FWIW I grow increasingly frustrated at the amount of bodgery required to support Azure

@tustvold @martroben - unfortunately I can only agree with this for now.

I'll investigate if I can find something in delta-rs that has changed, but why MSFT is putting us through all of this is hard to understand ...

@konjac
Contributor

konjac commented May 14, 2024

The OneLake endpoint inside Fabric Spark is actually resolved to a local proxy.

image

So the observed self-signed cert is the cert of that proxy.
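One way to see this from inside a notebook is to check whether the hostname resolves to a loopback, private, or link-local address. This is only a sketch: the helper names are illustrative, and a local address is a hint that a proxy is in the path, not proof of it.

```python
import ipaddress
import socket


def is_local_address(address: str) -> bool:
    """True if the IP address is loopback, private, or link-local."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private or ip.is_link_local


def resolves_locally(hostname: str, port: int = 443) -> bool:
    """True if any address the hostname resolves to is a local one."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return any(is_local_address(info[4][0]) for info in infos)
```

Calling `resolves_locally("onelake.blob.fabric.microsoft.com")` in a Fabric notebook and again on a local device would show whether the two environments reach different endpoints.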

@hnasrullakhan

hnasrullakhan commented May 30, 2024

@roeap @konjac any luck with findings? 0.16.2 is the version that causes the issue. Builds before that have no issues.

@hnasrullakhan

@roeap any findings?

@martroben
Author

It seems that upstream packages using object_store 0.10.1 are no longer experiencing this error. I'm not sure if it was object_store that fixed it or something else up or down the chain.
