Duplicated Sequence for MJN #87

Open
Chatchamew opened this issue Mar 11, 2024 · 11 comments
Comments

@Chatchamew

Excuse me, Mr. Emmanuel. I am trying to use your function mjn(), but it always tells me “maybe there are duplicated sequences in your data”. My data is a set of 108 sequences, some of which are identical to each other. Does this mean that each haplotype must be represented by only one sequence? Or is there something I am missing? Thank you in advance.

@emmanuelparadis
Owner

Hello,
Try the function haplotype() (also in pegas) on your 108 sequences. Once you have done that, you can check that all sequences are distinct with:

h <- haplotype(<<your sequence data name>>)
all(dist.dna(h, "n") > 0)

If the result is TRUE, you should be able to use mjn(h).
Best,
E.

@Chatchamew
Author

It came out as “FALSE”. I have already used “strict = TRUE” in the haplotype() function. By the way, some haplotypes differ only by deletions. Is this also why mjn() didn’t work?

@emmanuelparadis
Owner

I suggest you try:

all(dist.dna(h, "n", pairwise.deletion = TRUE) > 0)

And also:

image(h)

It seems you have read the help page ?haplotype, so you understand that trailing/leading gaps are a problem when identifying haplotypes. The above commands should help you assess the situation with your data.
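For readers following along, here is a minimal sketch of why the two dist.dna() calls can disagree. The three-sequence alignment is made up for illustration and is not the data discussed in this thread. With the default (global deletion), any site with a gap in at least one sequence is dropped for every pair; with pairwise.deletion = TRUE, it is dropped only for the pairs that involve the gapped sequence.

```r
library(ape)

# Hypothetical toy alignment (not the data from this thread).
# "a" and "b" differ only at site 3; "c" has a gap at site 3
# and a substitution at site 8.
x <- as.DNAbin(rbind(
  a = strsplit("acgtacgt", "")[[1]],
  b = strsplit("acttacgt", "")[[1]],
  c = strsplit("ac-tacga", "")[[1]]
))

# Global deletion (default): site 3 is removed for all pairs,
# so "a" and "b" end up at distance 0.
d_global <- dist.dna(x, "N")

# Pairwise deletion: site 3 is removed only for pairs involving "c",
# so the a/b difference at site 3 is still counted.
d_pairwise <- dist.dna(x, "N", pairwise.deletion = TRUE)

all(d_global > 0)
all(d_pairwise > 0)
```

In this toy case the first check fails and the second passes, even though all three sequences are distinct as strings.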

@Chatchamew
Author

Unfortunately, I have tried all() and the result came up as “FALSE”. I have also tried image(), and there are several gaps and degenerate bases (namely N and Y). I also tried trailingGapsAsN = TRUE, and all() still came up as “FALSE”. Any advice? I’m so sorry for wasting your time.

@emmanuelparadis
Owner

It seems you have a very difficult data set, so your inferences will necessarily be limited.

@Chatchamew
Author

Roger that. I have dug into the data and found that all pairs of haplotypes with a dist.dna() distance of 0 differ only by a deletion or an insertion. I really wish you would consider giving us an option to use mjn() despite the deletions, because the haplotype() function can separate them nicely.

@emmanuelparadis
Owner

Have a look at this function in ape: DNAbin2indel. With it, you can then create a binary matrix indicating presence/absence of indels. pegas::mjn() can also analyse binary (0/1) data.
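A minimal sketch of that recoding, on a made-up alignment (not the data from this thread). DNAbin2indel() returns an integer matrix holding the gap length where a gap starts and 0 elsewhere; the conversion to presence/absence and the dropping of all-zero columns are illustrative choices, not something prescribed in this thread.

```r
library(ape)

# Hypothetical toy alignment: "b" carries a 2-base deletion at sites 3-4.
x <- as.DNAbin(rbind(
  a = strsplit("acgtacgt", "")[[1]],
  b = strsplit("ac--acgt", "")[[1]],
  c = strsplit("acgtacga", "")[[1]]
))

# Gap length at the site where each gap starts, 0 everywhere else
indel <- DNAbin2indel(x)

# Presence/absence (0/1) coding of indels
bin <- (indel > 0) * 1L
rownames(bin) <- rownames(x)

# Keep only the columns where at least one sequence has an indel
bin <- bin[, colSums(bin) > 0, drop = FALSE]
bin
```

Note that rows sharing the same indel pattern (here "a" and "c") are identical in this coding, so they would presumably need to be collapsed before the matrix is passed to pegas::mjn().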

@Chatchamew
Author

I have tried a subset of my data with only sixteen sequences and ran the all(dist.dna()) check on it. With pairwise.deletion = TRUE, all() came up as TRUE. With pairwise.deletion = FALSE, all() came up as FALSE. When I used mjn(), it said “duplicate” again. Is there anything I can do?

By the way, can mjn() consider both base differences and deletions/insertions at the same time? I have tried DNAbin2indel(), and the matrix I got only concerns deletions/insertions. Can I combine this matrix with a base difference matrix and run it through mjn()?

@emmanuelparadis
Owner

Two other functions from ape that could help you with your data: latag2n() and solveAmbiguousBases() (maybe you already found them in the meantime).

You can try rmst() (also in pegas): it requires distances (unlike mjn()). It's not the same algorithm of course but it can sometimes give the same network (see the last example in ?rmst).
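A minimal sketch of the distance-based route, using the woodmouse example alignment that ships with ape rather than the data from this thread. The choice of the "N" model (number of differing sites) is an assumption made for illustration; for data with many gaps, pairwise.deletion = TRUE may be worth trying as discussed above.

```r
library(ape)
library(pegas)

# Example alignment shipped with ape (not the data from this thread)
data(woodmouse)

# Pairwise distances as numbers of differing sites
d <- dist.dna(woodmouse, "N")

# Randomized minimum spanning tree built from the distance matrix
net <- rmst(d)
net
```

The result can be drawn with plot(net), as for other networks built by pegas.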

@Chatchamew
Author

Thank you so much for your response. I am now using rmst(). The only problem I have is that it doesn't generate median vectors. I will try to tackle more sequences in the future. By the way, I would love to try an mjn() that considers both base differences and base deletions/insertions at the same time. That would be revolutionary!

@emmanuelparadis
Owner

> By the way, I would love to try mjn() that considers both base difference and base deletion/insertion at the same time. That would be revolutionary!

One difficulty is that indels inside sequences and indels at the head/tail of sequences should be treated differently. Another is that indels may also be accompanied by substitutions (either before the deletion or after the insertion). This could affect the time-reversibility of the model, but I'm not sure how critical this is for the median vectors.
The revolution has to wait a bit!

2 participants