.anchor file format #141

Closed
hyl317 opened this issue Jun 20, 2019 · 7 comments

hyl317 commented Jun 20, 2019

Hi, I understand that the first two columns are the homologs identified via LAST, but what does the third column (the integers) of the .anchor file mean?

GSVIVT01012028001 ppa011886m 515
GSVIVT01012027001 ppa026797m 297
GSVIVT01012026001 ppa006860m 609
GSVIVT01012023001 ppa000608m 2780
GSVIVT01012018001 ppa012865m 123
GSVIVT01012018001 ppa025457m 93
GSVIVT01012012001 ppa010496m 568
GSVIVT01012008001 ppa002064m 1180

@tanghaibao
Owner

tanghaibao commented Jun 20, 2019

@hyl317

These are alignment scores from LAST. They are used in steps before the .anchors file is written, such as filtering on C-score or prioritizing one match among a series of matches between tandem repeats.

When the .anchors file is the output of "liftover", which enriches the synteny signal (i.e. .lifted.anchors), certain pairs can have a third column ending in L, for example:

GSVIVT01012008001 ppa002064m 1180L

The L simply marks these as low-quality anchors that lie close to high-quality anchors.
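
For anyone reading these files in their own scripts, here is a minimal parsing sketch based only on the format described above. The `Anchor` type and `parse_anchor_line` function are hypothetical names, not part of jcvi. It assumes whitespace-separated columns, an integer score, and that lines starting with `#` (such as block separators) can be skipped.

```python
from typing import NamedTuple, Optional


class Anchor(NamedTuple):
    gene_a: str
    gene_b: str
    score: int
    lifted: bool  # True when the score column ended in "L"


def parse_anchor_line(line: str) -> Optional[Anchor]:
    """Parse one anchor line; return None for blank or '#' lines (assumed skippable)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    gene_a, gene_b, raw_score = line.split()[:3]
    lifted = raw_score.endswith("L")
    # Assumption: the score is an integer once any trailing "L" is removed.
    score = int(raw_score.rstrip("L"))
    return Anchor(gene_a, gene_b, score, lifted)


print(parse_anchor_line("GSVIVT01012008001 ppa002064m 1180"))
print(parse_anchor_line("GSVIVT01012008001 ppa002064m 1180L"))
```

The `lifted` flag makes it easy to separate the original high-quality anchors from those added by liftover.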

Haibao

@tanghaibao
Owner

Adding a link in the wiki in case someone else has the same question.
Closing issue.

@ohdongha

Dear Haibao ( @tanghaibao ),
First, thank you for providing this excellent toolkit - the clarity of graphics is amazing.

Regarding this thread, could you explain a bit more about C-score and the criteria to "lift" a pair with weaker alignment?

After running jcvi.compara.catalog, following the instructions for creating a microsynteny visualization, I saw a step that filters alignments based on "cscore>=0.70", and I wonder what this means. Is it similar to the chain score mentioned in the MCscan paper, or something entirely different, such as the coverage of the alignment over the length of the shorter of the two proteins?

I also wonder what the criteria are for "lifting" a weaker alignment. I guess a weaker alignment is allowed if the pair can be included in a co-linear block. Is the "weak" alignment based on a more relaxed e-value or bit-score cutoff? Does it have something to do with "dist=10"?

Basically, when we generate a nice microsynteny plot (like the one in the tutorial comparing a stretch of co-linear genes between grape, cacao, etc.), what can we say in the figure legend? It would be nice if we could say "Gray ribbons connect co-linear ortholog pairs identified based on [criterion]," where the criterion could be "C-score >= 0.7" (with an explanation of or reference for C-score in the Methods section), "e-value < 1E-5", etc.

Thanks again!

@tanghaibao tanghaibao reopened this Jan 29, 2021
@tanghaibao
Owner

@ohdongha

C-score(A, B) = score(A, B) / max(score(A,), score(,B)), where score(A,) is the best score of any pair involving A and score(,B) is the best score of any pair involving B. It ranges from 0 to 1.
In other words, it measures how the score of the current pair A-B compares with all gene pairs that touch either A or B.
C-score generalizes the idea of the reciprocal best match: a reciprocal best match has a C-score of 1, and anything weaker scores below 1. The default cutoff in jcvi.compara.catalog is 0.7, which is considered "strong" enough.
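
To make the formula concrete, here is a small sketch that computes C-scores from a list of hits. It is an illustration of the definition above, not the jcvi code; the `cscores` function is a hypothetical name, the third hit is made up for the example, and scores are assumed to be positive.

```python
from collections import defaultdict


def cscores(hits):
    """Map each (gene_a, gene_b) pair to score / max(best score of a, best score of b)."""
    hits = list(hits)
    best_a = defaultdict(float)  # best score of any hit involving each gene in genome A
    best_b = defaultdict(float)  # best score of any hit involving each gene in genome B
    for a, b, score in hits:
        best_a[a] = max(best_a[a], score)
        best_b[b] = max(best_b[b], score)
    # Assumption: all scores are positive, so the denominator is never zero.
    return {(a, b): score / max(best_a[a], best_b[b]) for a, b, score in hits}


hits = [
    ("GSVIVT01012018001", "ppa012865m", 123),
    ("GSVIVT01012018001", "ppa025457m", 93),
    ("GSVIVT01012018001", "hypothetical_gene", 60),  # made-up weak hit
]
cs = cscores(hits)
strong = [pair for pair, c in cs.items() if c >= 0.7]
for pair, c in cs.items():
    print(pair, round(c, 3))
```

With these toy numbers the best hit gets a C-score of 1, the second gets 93/123, and the made-up third hit falls below the 0.7 cutoff.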

So the initial synteny blocks are defined over "strong" pairs (C-score >= 0.7, as you saw). The "liftover" step then adds gene pairs that are weaker (C-score < 0.7) but sufficiently close to the high-quality synteny chain (within a distance of 10, by default). This second step aims to add more synteny signal.
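
The liftover idea can be sketched roughly as follows. This is only an illustration and not how jcvi implements it: the function name `lift_weak_pairs` is hypothetical, all genes are assumed to sit on a single pair of chromosomes, positions are assumed to be integer gene ranks, and "within a distance of 10" is interpreted here as within 10 ranks of some strong anchor on both genomes.

```python
def lift_weak_pairs(strong, weak, pos_a, pos_b, dist=10):
    """Return weak pairs within `dist` gene ranks of any strong anchor on both genomes."""
    lifted = []
    for a, b in weak:
        for sa, sb in strong:
            near_a = abs(pos_a[a] - pos_a[sa]) <= dist
            near_b = abs(pos_b[b] - pos_b[sb]) <= dist
            if near_a and near_b:
                lifted.append((a, b))
                break  # one nearby strong anchor is enough
    return lifted


# Hypothetical gene names and ranks, for illustration only.
pos_a = {"a1": 1, "a5": 5, "a40": 40}
pos_b = {"b2": 2, "b6": 6, "b90": 90}
strong = [("a1", "b2")]
weak = [("a5", "b6"), ("a40", "b90")]
print(lift_weak_pairs(strong, weak, pos_a, pos_b))
```

Here the pair near the strong anchor is lifted, while the distant one is left out.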

Finally, check out a genome paper for an example: https://www.nature.com/articles/ng.3435
Figure 3c is a microsynteny plot; consult the figure legend there. And yes, the C-score cutoff typically goes in the Methods section.

@ohdongha

@tanghaibao thanks a lot.

C-score sounds like a clever way to rank BLAST-type hits. Is there a reference I can cite when mentioning this concept (C-score filter + liftover) in the Methods? I searched for "c-score" but couldn't find a paper right away. Is it mentioned in the 2008 MCscan paper (and I missed it)?

@tanghaibao
Owner

tanghaibao commented Jan 29, 2021

@ohdongha

I did not invent the use of c-score, or c-value, although arguably I was among the earliest to use it in the context of synteny inference. Reference:
https://www.pnas.org/content/107/1/472

The initial credit goes to the Amphioxus genome paper:
https://pubmed.ncbi.nlm.nih.gov/18563158/

The lift-over approach is an implementation detail, and I tend to gloss over it in the various genome papers I have worked on over the years.

@ohdongha

@tanghaibao Cool! Thanks again for the quick replies,
Dong-Ha
