How do I write my own data using your lib? #74

stela2502 · 2025-01-17T10:16:27Z

stela2502
Jan 17, 2025

Hi Jack,

I want to convert BAM to bigWig, as the tools we use are clogging our server.
https://github.com/stela2502/bam_tide
My main problem is getting the data out as bigWigs, and this is where your library comes in.
But I do not understand how to create a writer from your library and feed it data from mine.
I would greatly appreciate it if you could give me the basics of how to make my class fit for being exported by your lib.
I get the feeling that if I just knew what your different functions return, I would get it.

My class looks like this:

pub struct BedData {
    genome_info: Vec<(String, usize, usize)>, // (chromosome name, length, bin offset)
    coverage_data: Vec<u32>, // coverage data array for bins
    bin_width: usize, // bin width for coverage
}

genome_info contains the chromosome name, the chromosome length, and the start index in the coverage array. The coverage array holds all the data for the whole genome in one array, and bin_width is exactly that. genome_info can easily be converted to the chrom_sizes you need to initialize your writer (https://docs.rs/bigtools/latest/bigtools/struct.BigWigWrite.html#method.create_file), but the traits I need to implement to get my data into yours are too complicated for me. I get the impression it should not be that complicated, though, or should it?

Please help me!

Thank you!

jackh726 · 2025-01-17T14:31:56Z

jackh726
Jan 17, 2025
Maintainer

What you want will look something like this:

use std::collections::HashMap;
use std::path::Path;
// BigWigWrite and BedParserStreamingIterator come from bigtools;
// the exact import paths depend on the version you use.

let runtime = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(6)
    .build()
    .expect("Unable to create runtime.");

struct DataIter<'a> {
    data: &'a BedData,
    current_bin: usize,
}

impl<'a> Iterator for DataIter<'a> {
    type Item = (String, bigtools::Value);

    fn next(&mut self) -> Option<Self::Item> {
        // Returns None (ending the iteration) once every bin has been emitted.
        let bin = *self.data.coverage_data.get(self.current_bin)?;
        // Figure out the chromosome and data
        let chrom: String = ...;
        let val = bigtools::Value {
            start: ...,
            end: ...,
            value: ...,
        };
        self.current_bin += 1;
        Some((chrom, val))
    }
}

let data: BedData = ...;

let chrom_map: HashMap<String, u32> = data
    .genome_info
    .iter()
    .map(|(chrom, len, _)| (chrom.clone(), *len as u32))
    .collect();

let iter = DataIter { data: &data, current_bin: 0 };
let data = BedParserStreamingIterator::wrap_infallible_iter(iter);

let outfile: &Path = ...;
let outb = BigWigWrite::create_file(outfile, chrom_map).unwrap();
outb.write(data, runtime).unwrap();

Any place with a ... will be what you need to fill in.

I'm looking at https://github.com/stela2502/bam_tide/blob/main/src/bed_data.rs right now. First thing that pops out to me, is that it would probably be good if you stored your coverage data as a Vec<(String, Vec<u32>)>, keyed by chromosome, or at the very least maintained some separate state with the offsets for each chromosome into the coverage vec. Otherwise, you're going to end up calculating the current chromosome for each value (though there are various ways to precompute this, but you already have that info when you create the BedData, so I suggest you just keep it around). As a bonus, storing the String in the Vec also allows you to return a &'a str for each value instead of a String, which will heavily reduce your allocation costs.
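To illustrate the per-chromosome layout suggested above, here is a minimal sketch. The names `ChromCoverage` and `intervals` are hypothetical, and the chromosome length is stored alongside the bins as an extra assumption so the last bin can be clipped. The iterator yields a borrowed `&str` for the chromosome name, so no `String` is allocated per value.

```rust
/// Hypothetical per-chromosome layout (not part of bigtools or bam_tide).
pub struct ChromCoverage {
    /// One entry per chromosome: (name, length in bp, per-bin coverage).
    pub chroms: Vec<(String, u32, Vec<u32>)>,
    /// Bin width in bp.
    pub bin_width: u32,
}

impl ChromCoverage {
    /// Yields (chromosome name, start, end, value) for every bin,
    /// chromosome by chromosome, borrowing the name instead of cloning it.
    pub fn intervals(&self) -> impl Iterator<Item = (&str, u32, u32, f32)> + '_ {
        let bin_width = self.bin_width;
        self.chroms.iter().flat_map(move |(name, len, bins)| {
            bins.iter().enumerate().map(move |(i, &cov)| {
                let start = i as u32 * bin_width;
                // Clip the last bin to the chromosome length.
                let end = (start + bin_width).min(*len);
                (name.as_str(), start, end, cov as f32)
            })
        })
    }
}
```

Because each chromosome owns its bins, the current chromosome never has to be looked up from a global bin index.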

I'll also note that if your input data is sorted within chromosomes by start, then you can likely even make the entire coverage generation + writing completely lazy, but I'll leave that as an exercise for the reader.
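One possible shape for such a lazy pipeline, as a rough sketch only: the hypothetical `LazyBins` adaptor below consumes start positions that are already sorted within one chromosome and emits one (start, end, count) bin at a time, without ever holding a whole-genome array. It is a simplification: it counts read starts rather than full read coverage, skips empty bins, and does not clip the last bin to the chromosome length.

```rust
use std::iter::Peekable;

/// Hypothetical adaptor: turns sorted positions on a single chromosome
/// into (bin start, bin end, count) tuples, one bin at a time.
pub struct LazyBins<I: Iterator<Item = u32>> {
    positions: Peekable<I>,
    bin_width: u32,
}

impl<I: Iterator<Item = u32>> LazyBins<I> {
    pub fn new(positions: I, bin_width: u32) -> Self {
        assert!(bin_width > 0, "bin_width must be positive");
        Self { positions: positions.peekable(), bin_width }
    }
}

impl<I: Iterator<Item = u32>> Iterator for LazyBins<I> {
    type Item = (u32, u32, u32);

    fn next(&mut self) -> Option<Self::Item> {
        // The first remaining position decides which bin is emitted next.
        let first = self.positions.next()?;
        let bin_width = self.bin_width;
        let bin = first / bin_width;
        let mut count = 1;
        // Consume every following position that falls into the same bin.
        while self.positions.next_if(|&p| p / bin_width == bin).is_some() {
            count += 1;
        }
        let start = bin * bin_width;
        Some((start, start + bin_width, count))
    }
}
```

Only the current bin is kept in memory, which is what makes the approach lazy; it relies entirely on the input being sorted.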


stela2502 · 2025-01-18T09:48:28Z

stela2502
Jan 18, 2025
Author

Thank you! That looks like the thing I was lacking! I'll test that first thing on Monday!


stela2502 · 2025-01-21T13:33:20Z

stela2502
Jan 21, 2025
Author

My problem could be fixed with the additional info you gave me here!
Thank you so much!


stela2502 · 2025-01-21T13:38:33Z

stela2502
Jan 21, 2025
Author

Just in case this might become useful for somebody else, too:

I have a struct that defines the data as

pub struct BedData {
    pub genome_info: Vec<(String, usize, usize)>, // (chromosome name, length, bin offset)
    pub search: HashMap<String, usize>, // get id for chr
    pub coverage_data: Vec<u32>, // coverage data array for bins
    pub bin_width: usize, // bin width for coverage
    pub threads: usize, // how many worker threads should we use here?
}

And I have implemented Value like this:

/// Represents a single value in a bigWig file
#[derive(Copy, Clone, Debug, PartialEq)]
#[cfg_attr(feature = "write", derive(Serialize, Deserialize))]
pub struct Value {
    pub start: u32,
    pub end: u32,
    pub value: f32,
}

impl Value {
    /// Returns the fields as a plain (start, end, value) tuple.
    pub fn flat(&self) -> (u32, u32, f32) {
        (self.start, self.end, self.value)
    }
}

I put the iterator into a new file:

use bigtools;
use crate::bed_data::BedData;



pub struct DataIter<'a> {
    /// all data
    pub data: &'a BedData,
    /// the index into BedData::coverage_data
    pub current_bin: usize,
    /// the current chromosome name, length, and bin offset
    pub current_chr: Option<(String, usize, usize)>,
}

impl<'a> DataIter<'a> {
  pub fn new ( data: &'a BedData) ->Self {
    let ret = Some( (
        data.genome_info[0].0.clone(), 
        data.genome_info[0].1,
        data.genome_info[0].2 
      )
    );
    Self{
      data,
      current_bin: 0,
      current_chr: ret,
    }
  }
}

impl<'a> Iterator for DataIter<'a> {
  type Item = (String, bigtools::Value);

  fn next(&mut self) -> Option<Self::Item> {
    match &self.current_chr {
      Some((chr, size, offset)) => {
        let rel_bin = self.current_bin - offset;

        // Create the return value based on current values
        let ret = (
          chr.to_string(),
          bigtools::Value {
            start: (rel_bin * self.data.bin_width).try_into().unwrap(),
            end: (rel_bin * self.data.bin_width + self.data.bin_width)
              .min(*size)
              .try_into()
              .unwrap(),
            value: self.data.coverage_data[self.current_bin] as f32,
          },
        );

        // Increment the bin index
        self.current_bin += 1;

        // Check if the current bin exceeds the size of the chromosome
        let rel_bin = self.current_bin - offset;
        if (rel_bin * self.data.bin_width) >= *size {
          // Move to the next chromosome
          self.current_chr = self.data.current_chr_for_id(self.current_bin);
        }
        Some(ret)
      }
      None => None, // No more chromosomes to process
    }
  }
}
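The iterator above calls `current_chr_for_id`, which is not shown in this thread. As a guess at what such a lookup could look like, here is a hypothetical free function, `chr_for_bin`, that finds the chromosome owning a global bin index with a binary search over the bin offsets. It assumes `genome_info` is sorted by offset in ascending order, starting at 0; the actual bam_tide implementation may differ.

```rust
/// Hypothetical stand-in for a `current_chr_for_id`-style lookup.
/// `genome_info` holds (chromosome name, length, bin offset) entries and is
/// assumed to be sorted by offset; `total_bins` is the length of the
/// coverage array.
pub fn chr_for_bin(
    genome_info: &[(String, usize, usize)],
    total_bins: usize,
    bin_id: usize,
) -> Option<(String, usize, usize)> {
    // Past the end of the coverage array: no chromosome left.
    if bin_id >= total_bins {
        return None;
    }
    // Number of chromosomes whose offset is <= bin_id; the last of them
    // is the one that contains this bin.
    let idx = genome_info.partition_point(|(_, _, offset)| *offset <= bin_id);
    idx.checked_sub(1).map(|i| genome_info[i].clone())
}
```

Returning `None` past the last bin is what lets the `None => None` arm of `next` end the iteration.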

And this is the function in BedData that saves the data:

    pub fn write_bigwig(&self, file: &str) -> Result<(), String> {
        let runtime = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(self.threads)
            .build()
            .expect("Unable to create runtime.");

        let iter = DataIter::new(self);
        let data = BedParserStreamingIterator::wrap_infallible_iter(iter, true);

        // Map each chromosome name to its length
        let chrom_map: HashMap<String, u32> = self
            .genome_info
            .iter()
            .map(|(chrom, len, _)| (chrom.clone(), *len as u32))
            .collect();

        let outfile = Path::new(file);

        // Create the BigWig writer
        let outb = BigWigWrite::create_file(outfile, chrom_map)
            .map_err(|e| format!("Failed to create BigWig file: {}", e))?;

        // Write data using the runtime
        outb.write(data, runtime)
            .map_err(|e| format!("Failed to write BigWig file: {}", e))?;

        Ok(())
    }

Hope that helps!
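A side note on the `*len as u32` cast used when building `chrom_map`: `as` silently truncates values that do not fit in a `u32`. The sketch below shows a checked alternative using only the standard library. The function name `chrom_map` and the error message are illustrative assumptions, not part of bigtools.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Builds the chromosome-size map from (name, length, bin offset) entries,
/// returning an error instead of truncating lengths that exceed u32::MAX.
pub fn chrom_map(
    genome_info: &[(String, usize, usize)],
) -> Result<HashMap<String, u32>, String> {
    genome_info
        .iter()
        .map(|(chrom, len, _)| {
            u32::try_from(*len)
                .map(|len| (chrom.clone(), len))
                .map_err(|_| format!("Chromosome {chrom} has length {len}, which does not fit in u32"))
        })
        // Collecting into a Result stops at the first error.
        .collect()
}
```

The resulting map could then be passed to `BigWigWrite::create_file` in place of the inline conversion, with the error propagated through `write_bigwig`'s existing `Result<(), String>`.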
