Extracting genomic coordinates of differential features #10

cgirardot · 2020-10-27T17:47:32Z

Dear chesser,

I am using the code in example_analysis.ipynb to learn where the differential regions are. I am not really a Python guy and would appreciate it if you could help me solve the following issues:

1. The script currently plots 3 matrices at the end. The first 2 are fine for visualizing TAD structures, but the scale is often not suited to loop visualisation. I'd like to replace the Hi-C signal with the observed/expected value in the first 2 matrices, but I don't know how to do this. The matrices I am loading are FAN-C Hi-C matrices.
2. The script has a commented-out margin that I believe would allow enlarging the visualized region. It is a great idea for the smaller regions and I would like to use it. Unfortunately, when I uncomment the code, the squares (corresponding to the extracted regions) are completely off. Could you help get a version of the script that works with the margin?

Thank you !

kaukrise · 2020-10-28T09:16:01Z

Hi,

please have a look at the documentation for FAN-C, where you can find detailed descriptions of the plotting API, which I think is going to be very useful in addressing your questions: https://fan-c.readthedocs.io/en/latest/api/plot.html

If you feel like this may be beyond your expertise, feel free to contact the Vaquerizas lab to enquire about a possible collaboration: https://www.vaquerizaslab.org/contact/

cgirardot · 2020-10-28T10:02:08Z

OK. I thought having the margin working in your demo script would be of general interest. I'll check the docs to find out how to get the OE from the matrix. I also assumed it was an easy one for you. Sorry for bothering. Thank you.

cgirardot · 2020-10-28T17:40:55Z

Sorry, I closed this too fast.
Reading the FAN-C docs, I managed to plot what I want regarding point 1 above. But I am still trying to understand how I can add a margin when the window size is small, to improve the display. I can't find information on the format of e.g. gained_features.tsv. Could you please give me a hint?

nickmachnik · 2020-10-28T19:24:52Z

Hi @cgirardot!
Do I understand correctly that you want to plot a large region (larger than a single window in your chess sim run) and display all the rectangles for the extracted features in that?
If yes, then I am sorry to say that we currently have no code for that. The right person to help here is @sgalan, but for now maybe I can provide some information that will help: in gained_features.csv each row corresponds to a single rectangle, i.e. a single feature extracted. The first columns are, in that order:

- the pair id (same as the first column in the chess sim output) of the region the feature is in
- an id for that specific feature
- x max
- x min
- y max
- y min

The last four values are the corners of the rectangle marking the feature in the region matrix.

In order to plot these rectangles at the correct positions in a larger matrix (e.g. with margins), the x and y coordinates have to be transformed. I don't know how to do this at the moment. If @sgalan has no easy solution I will mark this as requested enhancement and place it on our to-do list.
I hope that helps a bit, good luck!

nickmachnik · 2020-10-28T19:32:33Z

@cgirardot, thanks also for pointing out that the information about the columns is missing from the docs; I have added it.

cgirardot · 2020-11-05T17:13:33Z

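I am a bit confused by the formats. Let's take an example.

After cross-correlation, I have (in subregions_4_clusters_lost):

[image: Screenshot 2020-11-05 at 18 04 23] <https://user-images.githubusercontent.com/10990795/98272398-56af9a80-1f91-11eb-8dfb-5758c19d731c.png>

So I should look in "lost_features" for the feature IDs 20 to 23:

[image: Screenshot 2020-11-05 at 18 06 34] <https://user-images.githubusercontent.com/10990795/98272690-a4c49e00-1f91-11eb-9511-415fc98d35ab.png>

And here is a view of this region:

[image: Screenshot 2020-11-05 at 18 07 14] <https://user-images.githubusercontent.com/10990795/98272769-bc9c2200-1f91-11eb-8422-98961be4924e.png>

We are looking at the blue boxes in the middle plot.

1. Why do I have only 2 rectangles but 4 features in the lost_features file? My guess is that it is because the matrix is symmetrical?
2. Are the coordinates in lost_features in "bins"? The region on display is chr3L:80000002-10000001; what is the easy way to extract the regions involved in e.g. feature 20? I am sure it is too late in the day and I am missing the obvious here.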

sgalan · 2020-11-06T16:22:55Z

Dear User,

1. Your guess is correct. Both symmetrical features are kept in the file, which is why you only see half of them when you plot half of the contact matrix.
2. Yes, the coordinates retrieved are in bins. The first 2 columns of the features file contain:
   - The region ID, which is the same as in the pairs or sim file.
   - The feature ID: a number given to each feature.

   The next 4 columns contain xmin, xmax, ymin and ymax, in bins; these are the corners of the rectangles or squares that enclose the captured feature.

Hope it helps,
S

-- *Silvia Galan Martínez - PhD student * *Centre Nacional d'Anàlisi Genòmica-Centre de Regulació Genòmica (CNAG-CRG)* *Structural Genomics Dpt.* Parc Científic de Barcelona – Torre I Baldiri Reixac, 4 08028 Barcelona Tel +34 9340 20580 Email [email protected]
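The column layout described in the reply above can be read with a few lines of Python. This is only a minimal sketch: the `Feature` record, the `read_features` helper and the example rows are hypothetical, the delimiter is assumed to be a comma (pass `delimiter="\t"` if your file is tab-separated), the IDs are assumed to be integers, and the column order follows this reply (xmin, xmax, ymin, ymax). Any values after the sixth column are ignored.

```python
import csv
import io
from typing import NamedTuple


class Feature(NamedTuple):
    """One extracted feature; coordinates are in bins (hypothetical record)."""
    region_id: int
    feature_id: int
    xmin: int
    xmax: int
    ymin: int
    ymax: int


def read_features(handle, delimiter=","):
    """Parse rows of a features file into Feature records.

    Assumption: the first six columns are region ID, feature ID,
    xmin, xmax, ymin, ymax. Columns after the sixth are not interpreted.
    """
    features = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        # float() first so values written as e.g. "3.0" also parse
        features.append(Feature(*(int(float(v)) for v in row[:6])))
    return features


# Made-up rows purely for illustration, not real chess extract output.
example = io.StringIO("12,20,3,9,40,47,0.1,0.2\n12,21,3,9,60,67,0.3,0.4\n")
features = read_features(example)
print(features[0])
```

With a real file, replace the `io.StringIO` object with `open("lost_features.tsv", newline="")` and check the delimiter and column order against your own output first.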

cgirardot · 2020-11-06T17:03:09Z

Hi @sgalan
thanks for the confirmation. Could you also give a hint on the last question, which is the most important one for me to move on? If you look at the data and images above: how can I extract the coordinates of regions involved in e.g. feature 20?
thx

sgalan · 2020-11-09T15:52:19Z

Hi,

You have an ID for the region you are comparing (generated by the chess pairs command), and the same ID appears in the extraction output file. By matching the two you obtain the coordinates of the entire region, and the bin numbers that contain the differential structure.

When you run the chess extract command, two files are stored in the *features* folder: *gained_features.tsv* and *lost_features.tsv*, for the features gained and lost in the query matrix compared to the reference, respectively. These files contain the position and shape of the features, comma-separated. The first value is the ID of the region and the second is the ID of the feature. The following four values are the positions of the corners of the rectangle that contains the feature (xmin, xmax, ymin and ymax); they are used to highlight and localize the feature. After these values, the contact matrix of the rectangle is stored.

Imagine that the entire region CHESS has identified goes from 1 to 3 Mb at a resolution of 5 kb. The starting bin is then 1 Mb / 5 kb = 200. If chess extract reports that the structural feature in this region spans bins 1 to 2 of the contact matrix, its genomic position runs from (200 + 1) × 5 kb = 1,005,000 bp to (200 + 2) × 5 kb = 1,010,000 bp.

Hope it helps,
S
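The arithmetic in the example above can be written as a small function. This is a sketch of that calculation only, not a function from the chess package: `feature_span_bp` and its parameter names are made up for illustration, and it assumes the region start is a multiple of the resolution.

```python
def feature_span_bp(region_start_bp, resolution_bp, first_bin, last_bin):
    """Convert region-relative bin indices to base-pair positions.

    Follows the worked example: offset the feature bins by the bin index
    at which the region starts, then multiply by the resolution.
    """
    start_bin = region_start_bp // resolution_bp  # 1 Mb / 5 kb = 200
    return (
        (start_bin + first_bin) * resolution_bp,
        (start_bin + last_bin) * resolution_bp,
    )


# Region starting at 1 Mb, 5 kb resolution, feature spanning bins 1 to 2.
span = feature_span_bp(1_000_000, 5_000, 1, 2)
print(span)  # (1005000, 1010000)
```

The same function would be applied once to the x bins and once to the y bins of a rectangle.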

PerrineLacour · 2020-11-18T10:01:46Z

Hello @sgalan ,

I am a bit confused by the formula you provided. For example, with the loops found in @cgirardot's data, the x and y coordinates of the rectangle don't refer to the genomic regions that interact (x and y lie on the horizontal and vertical axes, while the interaction is between two regions found by following the diagonals).

How can we get the coordinates of the regions involved in the loop?

Best,

Perrine

PerrineLacour · 2020-11-18T11:51:49Z

I realised that my previous comment may be unclear.
To be more precise, I have read the code regarding extracting the features and it seems to me that you scale (zoom clipped) and rotate the matrix before looking for the features. Is that correct?
If that is the case, it would explain why the x coordinates of symmetric features are the same (while x and y coordinates should be swapped between symmetric features in the original matrix), and also why the feature areas are rectangles and not diamond-shaped, as is usually the case in Hi-C matrices (especially for TADs).

If I understood the code correctly, it also means that the x and y coordinates retrieved from the gained and lost files do not correspond to the regions involved in the loop, but rather to the coordinates of the feature on the picture.

Is it correct?

Best,

Perrine

cgirardot · 2020-11-20T12:15:47Z

@sgalan I was wondering if you could comment on @PerrineLacour questions?

sgalan · 2020-11-20T12:35:12Z

Hi,

First of all, sorry for the late reply. Yes, in the feature extraction method the matrices are clipped and rotated for the proper identification and extraction of the individual features. Then, to highlight the position of the features, chess uses the closing morphology module from scikit-image. In this example ( https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_label.html#sphx-glr-auto-examples-segmentation-plot-label-py ) the features are round, yet the patches that highlight them are rectangular. So the coordinates that chess retrieves are those of the highlighting rectangle; if you are interested, you can compute the center position from the rectangle.

Hope it helps,
S
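Computing the center of a highlighting rectangle, as suggested above, is a midpoint calculation. The sketch below uses a hypothetical helper name and made-up bin values; the result is still expressed in bins of the matrix that chess processed (clipped and rotated, as described above), not in genomic coordinates.

```python
def rectangle_center(xmin, xmax, ymin, ymax):
    """Return the (x, y) midpoint of a rectangle given its corner bins."""
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


# Made-up corner values for illustration only.
center = rectangle_center(3, 9, 40, 47)
print(center)  # (6.0, 43.5)
```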

cgirardot · 2020-11-20T15:20:46Z

Hi @sgalan thank you for your answer.

I expect this is going to be a general issue for a lot of people, i.e. it is not so trivial to understand. Don't you also have the same issue when analyzing data for your own papers?
If so, would you be so kind as to share in the chess package a script that retrieves the regions involved in a given extracted feature, such as a loop?

Thanks a million for your time and patience

sgalan · 2020-11-22T17:11:21Z

Hi,

this information was supposed to be added to the docs, and I am sorry it is not there yet. We did not run into this issue in the analysis we wanted to do for the paper. This version of CHESS retrieves the features using the scikit-image labeling tool I linked, and it gives the positions of a rectangle highlighting the feature.

S

jvaquerizas · 2020-11-22T18:57:59Z

Hi @BenxiaHu,

This is really not appropriate language to use to ask for someone to implement a software feature for you.

@ALL, as Silvia mentioned above, the current reported location of the feature corresponds to the actual position of the feature on the matrix, rather than the genomic coordinates of the feature. This is implemented as such in this version of the software since which coordinates to report can heavily depend on the type of feature being detected (eg, TAD, stripe, loop). This makes it difficult to come up with a default approach for which genomic coordinates to report. For the time being, you can use Silvia's explanation above to compute these coordinates yourself. We will review the documentation and include further guidance if necessary.

In addition, we always encourage potential contributors to write pull requests with features of interest, so we can review them and incorporate them in an update if appropriate.

I hope that helps.

Best,
Juanma Vaquerizas

your answer really does not mean anything.

what we want is just to retrieve the genome coordinates based on gained/lost features.

Do you understand our question?

nickmachnik · 2020-11-23T10:54:36Z

I marked this as 'good first issue'; if we find time for this we might implement this ourselves, but as Juanma said, pull requests with solutions are more than welcome.

cgirardot closed this as completed Oct 28, 2020

cgirardot reopened this Oct 28, 2020

nickmachnik added the enhancement New feature or request label Nov 2, 2020

nickmachnik added the good first issue Good for newcomers label Nov 22, 2020

vaquerizaslab locked and limited conversation to collaborators Nov 22, 2020

vaquerizaslab unlocked this conversation Nov 22, 2020

nickmachnik changed the title ~~example_analysis.ipynb~~ Extracting genomic coordinates of differential features Nov 22, 2020

nickmachnik mentioned this issue Jan 19, 2021

how to get the actual positions of the highly dissimilar regions highlighted by chess extract #29

Closed

cmdoret mentioned this issue Mar 8, 2021

Add coordinate conversion helpers #43

Merged
