Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I write my own data using your lib? #73

Closed
stela2502 opened this issue Jan 17, 2025 · 2 comments
Closed

How do I write my own data using your lib? #73

stela2502 opened this issue Jan 17, 2025 · 2 comments

Comments

@stela2502
Copy link

Hi Jack,

I want to convert bam to bigwig as the tools we use are clocking our server.
https://github.com/stela2502/bam_tide
My main problem is to get the data out as biwigs. And here is where your library comes in.
But I do not understand how I could create a writer from your library and feed it data from mine.
I would greatly appreciate if you could give me the basics of how to make my my class fit for being exported by you lib.
And I get the feeling that if I just would know what the different functions of yours would return I would get it.

My class looks like this:

pub struct BedData {
genome_info: Vec<(String, usize, usize)>, // (chromosome name, length, bin offset)
coverage_data: Vec, // coverage data array for bins
bin_width: usize, // bin width for coverage
}

The genome_info contains the chr name, the chr length and the start id in the coverage array. The coverage array is just all the data for the whole genome in one array and bin_width is exactly that. The genome_info can easily be converted to the chrom_sizes you need to initialize your Writer https://docs.rs/bigtools/latest/bigtools/struct.BigWigWrite.html#method.create_file, but the Traits I need to implement to get my data into yours are to complicated for me. But I get the impression it should not be too complicated - or?

Please help me!

Thank you!

@jackh726
Copy link
Owner

What you want will look something like this:

let runtime = tokio::runtime::Builder::new_multi_thread()
	.worker_threads(6)
	.build()
	.expect("Unable to create runtime.");


struct DataIter<'a> {
  data: &'a BedData,
  current_bin: usize
}

impl Iterator for DataIter<'a> {
  impl Item = (String, bigtools::Value);
  fn next(&mut Self) -> Option<Self::Item> {
    let bin = data.coverage_data[self.current_bin];
	// Figure out the chromosome and data
	let chrom: String = ...;
	let val = bigtools::Value {
	  start: ...,
	  end: ...,
	  value: ...,
	};
	self.current_bin += 1;
	(chrom, val)
  }
}


let data: BedData = ...;

let chrom_map: HashMap<String, u32> = data.genome_info.iter().map(|(chrom, len, _)| (chrom.clone(), len as u32)).collect();

let iter = DataIter { data: &data, current_bin: 0 };
let data = BedParserStreamingIterator::wrap_infallible_iter(iter);

let outfile: Path = ...;
let outb = BigWigWrite::create_file(outfile, chrom_map).unwrap();
outb.write(data, runtime).unwrap();

Any place with a ... will be what you need to fill in.


I'm looking at https://github.com/stela2502/bam_tide/blob/main/src/bed_data.rs right now. First thing that pops out to me, is that it would probably be good if you stored your coverage data as a Vec<(String, Vec<u32>)>, keyed by chromosome, or at the very least maintained some separate state with the offsets for each chromosome into the coverage vec. Otherwise, you're going to end up calculating the current chromosome for each value (though there are various ways to precompute this, but you already have that info when you create the BedData, so I suggest you just keep it around). As a bonus, storing the String in the Vec also allows you to return a &'a str for each value instead of a String, which will heavily reduce your allocation costs.

I'll also note that if your input data in sorted within chromosomes by start, then you can likely even make the entire coverage generation + writing completely lazy, but I'll leave that as an exercise for the reader.

@stela2502
Copy link
Author

stela2502 commented Jan 18, 2025 via email

Repository owner locked and limited conversation to collaborators Jan 21, 2025
@jackh726 jackh726 converted this issue into discussion #74 Jan 21, 2025

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants