fix bug when seqIDs are numbers. #1

cduvallet · 2016-11-22T03:33:48Z

When sequence IDs are numbers, pandas reads the index as int64 dtype (even if you specify dtype in read_table(): index_col=0 overrides it and the index is read as int64 no matter what). FASTA sequence IDs are always read as strings, so membership checks between the two, such as "seq_id in seq_table.index", never match (e.g. line 129: missing_ids = [seq_id for seq_id in self.seq_abunds.index if seq_id not in self.records]).

Looks like this is still an open issue: pandas-dev/pandas#9435
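A minimal sketch of the mismatch, assuming a tab-separated table whose first column holds the sequence IDs. The column names and counts are made up for illustration, and read_csv with sep="\t" stands in for read_table():

```python
from io import StringIO

import pandas as pd

# Made-up two-sequence table; the first column holds numeric sequence IDs.
table_text = "OTU_ID\tsample1\tsample2\n1\t5\t0\n2\t3\t7\n"

seq_table = pd.read_csv(StringIO(table_text), sep="\t", index_col=0)

# IDs as they would come out of a FASTA file: always strings.
fasta_ids = ["1", "2"]

# The index is inferred as an integer dtype, not as strings.
print(seq_table.index.dtype)

# None of the string IDs is found in the integer index.
print([seq_id in seq_table.index for seq_id in fasta_ids])

# Casting the index to strings makes the two sides comparable again.
str_index = seq_table.index.astype(str)
print([seq_id in str_index for seq_id in fasta_ids])
```

The cast on the last lines is only one possible workaround; it is shown here to make the comparison concrete, not as the exact change in this pull request.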


cduvallet · 2016-11-22T03:53:26Z

Also imported python3 print function explicitly, for python2 compatibility.

swo · 2016-11-23T01:23:03Z

re: Python 2 compatibility. Sounds good. Confirm that this is the only problem with running in Python 2?

re: sequence IDs as integers. Also sounds good, but I'd like to see a unit test for it before merging the pull request.

cduvallet · 2016-11-24T15:27:30Z

re: python2 compatibility - confirmed! All tests pass, and running dbotu.py on real data seems to work fine.

swo · 2016-11-27T17:23:52Z

One last request: Don't include the table_test.counts and .fasta files, instead make them into strings that you "read" in via StringIO. (test_dbotu line 10 gives an example).
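For illustration, inlined fixtures read through StringIO might look roughly like this in Python 3. The fixture contents and the two helper functions are hypothetical stand-ins, not the code in test_dbotu:

```python
from io import StringIO

# Hypothetical fixture contents, inlined instead of stored as files on disk.
TABLE_TEXT = "OTU_ID\tsample1\tsample2\n1\t5\t0\n2\t3\t7\n"
FASTA_TEXT = ">1\nACGT\n>2\nACGA\n"


def read_table_ids(handle):
    """Return the first-column IDs of a tab-separated table, skipping the header."""
    lines = handle.read().splitlines()
    return [line.split("\t")[0] for line in lines[1:] if line]


def read_fasta_ids(handle):
    """Return the IDs from FASTA header lines (the text after '>')."""
    return [line[1:].strip() for line in handle if line.startswith(">")]


# StringIO wraps each string in a file-like object, so code that expects an
# open file handle can read it without touching the filesystem.
table_ids = read_table_ids(StringIO(TABLE_TEXT))
fasta_ids = read_fasta_ids(StringIO(FASTA_TEXT))
print(table_ids, fasta_ids)
```

Keeping the fixtures as strings means the test data lives next to the assertions that use it, and no extra files have to be tracked in the repository.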

The thing with pandas index columns is funny. I think your solution is good except for some weird edge cases (e.g., if the ID is "001" then the string is "001" but the int is 1 and the int-to-string conversion will give "1", which doesn't match the original "001"). Maybe write a test for that and then we can put it as a caveat/warning in the docs? I think it's reasonable to ask users to edit their IDs if they fall into one of these weird edge cases.
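The edge case can be checked with plain Python. This is a sketch only: survives_int_round_trip is a hypothetical helper name, and it tests each ID on its own, whereas pandas decides the dtype for the whole column at once:

```python
def survives_int_round_trip(seq_id):
    """True if str(int(seq_id)) gives back exactly the original ID string."""
    try:
        return str(int(seq_id)) == seq_id
    except ValueError:
        # Non-numeric IDs are never converted to int, so they are unaffected.
        return True


ids = ["001", "1", "42", "seq7", "010"]
problem_ids = [seq_id for seq_id in ids if not survives_int_round_trip(seq_id)]
print(problem_ids)  # ['001', '010']
```

A check like this could back a warning that tells users which IDs to edit before running.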

cduvallet · 2016-11-28T03:22:02Z

re: files - done.

re: edge case - I changed the way the sequence table is read in dbotu.py to handle this case as well. It's not very pretty, but it should work. I also changed the test I wrote to read in a sequence table with these kinds of sequence IDs.
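One way to keep IDs such as "001" intact is sketched below. It is an assumption about how this could be done, not necessarily the change made in dbotu.py, and the table contents are made up:

```python
from io import StringIO

import pandas as pd

# Made-up table whose IDs have leading zeros.
table_text = "OTU_ID\tsample1\tsample2\n001\t5\t0\n002\t3\t7\n"

# Read the ID column as an ordinary string column first, then promote it to
# the index, rather than passing index_col=0 to the reader.
seq_table = pd.read_csv(StringIO(table_text), sep="\t", dtype={"OTU_ID": str})
seq_table = seq_table.set_index("OTU_ID")

print(list(seq_table.index))  # ['001', '002']
print(seq_table.loc["001", "sample1"])  # 5
```

Because the ID column is not the index at read time, the dtype specification applies to it, while the count columns are still inferred as integers.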

cduvallet added 2 commits November 22, 2016 03:29

fix bug when seqIDs are numbers.

f6b0f79

When sequence IDs are numbers, pandas reads index as int64 dtype (even if you specify dtype - index_col=0 overrides specification and index is read as int64 no matter what). Open issue: pandas-dev/pandas#9435

explicit python3 print function, for python2 compatibility

b80095c

cduvallet added 2 commits November 24, 2016 15:22

made tests python 2 compatible

5fe8f5b

removed unneeded import

124a925

unit tests for int seqID fix

9e9fce5

cduvallet added 2 commits November 28, 2016 03:09

test_table files to StringIO

9708c28

fix for seqID edge case (oo1, etc)

8fd2048

swo merged commit c10f608 into swo:master Nov 28, 2016
