Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: lazy vcf overreading into next record #224

Merged
merged 3 commits into from
Jan 22, 2024

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Jan 22, 2024

Hi, using the lazy vcf reader, I think I found an issue w.r.t. how the fields are being parsed. In cases where not all fields are present, read_field will read into the next line causing record.buf to contain the record in question as it was parsed and the "raw" next record (because of the last read_line in read_lazy_record).

For example, if you modify test_read_lazy_record test to have two records, &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];, the test fails because record.buf is actually sq01.A..PASS.\nsq01\t.\tA\t.\t.\tPASS\t. (the samples field has the next record).

    #[tokio::test]
    async fn test_read_lazy_record() -> io::Result<()> {
        let mut src = &b"sq0\t1\t.\tA\t.\t.\tPASS\t.\nsq0\t1\t.\tA\t.\t.\tPASS\t."[..];

        let mut record = lazy::Record::default();
        read_lazy_record(&mut src, &mut record).await?;

        assert_eq!(record.buf, "sq01.A..PASS.");

        assert_eq!(record.bounds.chromosome_end, 3);
        assert_eq!(record.bounds.position_end, 4);
        assert_eq!(record.bounds.ids_end, 5);
        assert_eq!(record.bounds.reference_bases_end, 6);
        assert_eq!(record.bounds.alternate_bases_end, 7);
        assert_eq!(record.bounds.quality_score_end, 8);
        assert_eq!(record.bounds.filters_end, 12);
        assert_eq!(record.bounds.info_end, 13);

        Ok(())
    }

To attempt a fix, I took perhaps the dumbest and least idiomatic approach :), but I thought I'd open it up to see if you agree there's an issue and if so, thoughts on the best way to fix it.

Thanks,

@zaeleus
Copy link
Owner

zaeleus commented Jan 22, 2024

I've been reworking the alignment formats, and I actually very recently fixed this in SAM: 92c5119. The current reader fails when optional fields aren't present, but I haven't started reworking the variant formats yet.

The first 8 fields are required, so I don't think it's valid to return success if a line feed is reached.

@tshauck
Copy link
Contributor Author

tshauck commented Jan 22, 2024

Cool... do you have plans to work on the variant format soon / want me to hold off, or mind if I adapted your approach w/ SAM files to VCFs? It's causing those files to give incorrect results, so I'd like to get it resolved.

@zaeleus
Copy link
Owner

zaeleus commented Jan 22, 2024

Allow me to fix the blocking reader and reuse that in a buffered async read. I'll let you know when that's done.

The alignment format rework is close to a checkpoint, but I don't have a precise timeline. I will use my findings from that to improve the variant crates.

@tshauck
Copy link
Contributor Author

tshauck commented Jan 22, 2024

Sounds good to me, thanks.

This allows the blocking lazy record reader to be used.
@zaeleus zaeleus force-pushed the fix-lazy-record-overread branch from 941cbb1 to 1a4fe48 Compare January 22, 2024 22:02
@zaeleus zaeleus merged commit 4a59eac into zaeleus:master Jan 22, 2024
3 checks passed
@zaeleus
Copy link
Owner

zaeleus commented Jan 22, 2024

Thanks for using and reporting issues with the VCF lazy records!

@tshauck tshauck deleted the fix-lazy-record-overread branch January 22, 2024 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants