Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling tabix on S3 files via jupyter notebook fails #1625

Closed
agilly opened this issue May 26, 2023 · 5 comments
Closed

Calling tabix on S3 files via jupyter notebook fails #1625

agilly opened this issue May 26, 2023 · 5 comments
Assignees

Comments

@agilly
Copy link

agilly commented May 26, 2023

I am experiencing a very peculiar issue. I am using jupyter notebook to test some R code that calls tabix to query some files stored in S3 buckets.

The following command is an example of what I would be running on the bash command line:

AWS_PROFILE=tabix tabix s3://path/to.tsv.gz 2:1234-1234

The above command works. Now wrapping it in R (4.1.2):

out=system("AWS_PROFILE=tabix tabix s3://path/to.tsv.gz 2:1234-1234", intern=T)

This also works.

But now, using the exact same code in a jupyter notebook running IRKernel crashes. The log of the kernel reports the following:

[E::bgzf_read_block] Invalid BGZF header at offset 356561
Reading "s3://path/to.tsv.gz" failed: Resource temporarily unavailable

I am querying 100s of files but this happens only for a handful of them, systematically, for seemingly any region. I tried reindexing the file, the bug still occurs. I tried with htslib 1.12 and the latest 1.17, same thing.

This is probably going to be difficult to debug because there are many moving parts: tabix, R, and jupyter. Just checking if you have an idea why specifically calling something from jupyter would cause the error above? I cannot send the tsv files as they are confidential (and very large) but I am happy to do more exploration/debugging if you tell me what to do.

may be related to #1037

TIA!

@daviesrob daviesrob self-assigned this May 30, 2023
@daviesrob
Copy link
Member

The "Resource temporarily unavailable" message comes from errno getting set to EAGAIN or EWOULDBLOCK, which indicates that a system call was interrupted. This may be harmless, if the call was restarted, or it may indicate that something isn't handling such interruptions correctly.

You could try running tabix --verbosity=8 to increase the amount of debugging output printed (be warned this could be quite a lot). If we're lucky, that might give some clues as to what's going wrong.

Otherwise, if I were debugging this I'd attempt to find out what's going on at the point where bgzf_read_block reports its failure. As you're running lots of tabix processes, this might involve adding something before this return statement to either print more information about what's going on, or allow a debugger to be attached (either by dumping a core file with abort() and inspecting the results, or making it sleep for long enough for a debugger to be manually attached using gdb --pid assuming this is possible).

@daviesrob
Copy link
Member

Did you manage to get any more information on this problem?

@agilly
Copy link
Author

agilly commented Jun 7, 2023

Hi, I tried increasing the verbosity, but I don't think there is much that's helpful there, the connection just seems to close (I replaced sensitive info and hashes by ... or xxx. Maybe you'll see something I missed?

If it's indeed not much help, I can go on adding tracers and recompiling tabix, but this is likely to take a long time on my side because I have to juggle between several projects at the moment.

[D::init_add_plugin] Loaded "mem"                 
[D::init_add_plugin] Loaded "crypt4gh-needed"      
[D::init_add_plugin] Loaded "libcurl"                
[D::init_add_plugin] Loaded "gcs"     
[D::init_add_plugin] Loaded "s3"                    
[D::init_add_plugin] Loaded "s3w"                                                                                                                            
*   Trying 3.5.29.212...                                 
* TCP_NODELAY set                        
* Connected to xxx.s3.amazonaws.com (x.x.x.x) port 443 (#0)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt                                                                                                                                                                                                                                                             
* found 411 certificates in /etc/ssl/certs                                                                                                                                                                                                                                                                                 
* ALPN, offering http/1.1                                                                                                                                                                                                                                                                                                  
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256                                                                                                                                                                                                                                                               
*        server certificate verification OK                                   
*        server certificate status verification SKIPPED                                                                                                                                                                                              
*        common name: *.s3.amazonaws.com (matched)                                               
*        server certificate expiration date OK                                                           
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: 
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL            
* ALPN, server accepted to use http/1.1             
> GET /path/to.tsv.gz HTTP/1.1
Host: rgc-tag-data.s3.amazonaws.com      
User-Agent: htslib/1.12 libcurl/7.58.0
Accept: */*                  
x-amz-security-token: ...
Authorization: AWS4-HMAC-SHA256 Credential=.../.../us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=...
x-amz-date: 20230530T173337Z                                  
x-amz-content-sha256: xxx
                         
< HTTP/1.1 200 OK                                           
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:38 GMT             
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes                 
< Content-Type: text/tab-separated-values         
< Server: AmazonS3                                 
< Content-Length: 25973005583                        
<                         
* stopped the pause stream!            
* Closing connection 0                                                                                                                                       
* Hostname rgc-tag-data.s3.amazonaws.com was found in DNS cache
*   Trying 3.5.29.212...
* TCP_NODELAY set                     
* Connected to rgc-tag-data.s3.amazonaws.com (3.5.29.212) port 443 (#0)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt
* found 411 certificates in /etc/ssl/certs
*        ALPN, offering http/1.1
*        SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK                                   
*        server certificate status verification SKIPPED
*        common name: *.s3.amazonaws.com (matched)
*        server certificate expiration date OK                                        
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3                                                                                          
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL            
* ALPN, server accepted to use http/1.1             
> GET /path/to.tsv.gz HTTP/1.1
Host: rgc-tag-data.s3.amazonaws.com                      
User-Agent: htslib/1.12 libcurl/7.58.0   
Accept: */*       
x-amz-security-token: xxx
Authorization: AWS4-HMAC-SHA256 Credential=xxx/xxx/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=xxx
x-amz-date: 20230530T173338Z                                                                     
x-amz-content-sha256: xxx  
< HTTP/1.1 200 OK
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:39 GMT
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes
< Content-Type: text/tab-separated-values
< Server: AmazonS3
< Content-Length: 25973005583
< 
[I::hts_idx_check_local] Using alignment file 'to.tsv.gz'
* Found bundle for host rgc-tag-data.s3.amazonaws.com: xxx [can pipeline]
*   Trying x.x.x.x...
* TCP_NODELAY set
* Connected to rgc-tag-data.s3.amazonaws.com (3.5.29.212) port 443 (#1)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt
* found 411 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: *.s3.amazonaws.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL
* ALPN, server accepted to use http/1.1
> GET /path/to.tsv.gz HTTP/1.1
Host: xxx.s3.amazonaws.com
Range: bytes=3565615677-
User-Agent: htslib/1.12 libcurl/7.58.0
Accept: */*
x-amz-security-token: xxx
Authorization: AWS4-HMAC-SHA256 Credential=xxx/xxx/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=xxx
x-amz-date: 20230530T173338Z
x-amz-content-sha256: xxx

< HTTP/1.1 206 Partial Content
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:39 GMT
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes
< Content-Range: bytes 3565615677-25973005582/25973005583
< Content-Type: text/tab-separated-values
< Server: AmazonS3
< Content-Length: 22407389906
< 
* stopped the pause stream!
* Closing connection 0
[E::bgzf_read_block] Invalid BGZF header at offset 3565615
Reading "s3://path/to.tsv.gz" failed: Resource temporarily unavailable

@daviesrob
Copy link
Member

Unfortunately there's not much to go on there. It does look like the connection got closed before the "Invalid BGZF header" message was printed, but it's not clear why. It looks like it hangs up very quickly - it that the case? One other mystery is that the request said:

Range: bytes=3565615677-

but the error shows offset 3565615. Where did the 677 go? Was it lost when copying the messages over to GitHub?

Is the version of libcurl you're using the same when running on the command line and in the Jupyter notebook? 7.58.0 is quite old now, it might be worth trying a more up-to-date version. Apart from that, I fear you may have to resort to using strace or gdb to see if you can get more information about what's happening.

@daviesrob
Copy link
Member

Hopefully this was fixed by #1676. Please add a comment if you're still having problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants