Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read_markdown_table #104

Closed
wants to merge 1 commit into from

Conversation

sjackman
Copy link

@sjackman sjackman commented Apr 1, 2015

Read Markdown tables. For example:

A|B|C|D
-|-|-|-
0|1|2|3
4|5|6|7
A B C D
0 1 2 3
4 5 6 7

To do:

  • Determine automatically whether the column names are present.
  • Remove leading and trailing white space around each cell.

@hadley
Copy link
Member

hadley commented Apr 2, 2015

This doesn't seem quite right to me - you're manually dropping the second line (i.e. the dashes) but the column guesser will still see it so I think every column will be imported as text. Instead I think you'll need to read the column names yourself (since they're always required in this format), then call read_delim() with those names and nskip = 2. Does that make sense?

Read Markdown tables. For example:
A|B|C|D
-|-|-|-
0|1|2|3
4|5|6|7

To do:
Determine automatically whether the column names are present.
Remove leading and trailing white space around each cell.
@sjackman sjackman force-pushed the read_markdown_table branch from 8199a9d to 96e6499 Compare April 2, 2015 16:51
@sjackman
Copy link
Author

sjackman commented Apr 2, 2015

Hi, Hadley. The column names are in fact optional (see below), at least for Pandoc. The row of hyphens is not optional. See Extension: pipe_tables. GitHub Markdown preview doesn't like a missing header line. Good observation about the column formats. I hadn't noticed that. I've updated the PR.

It duplicates a lot of code from read_delimited. Would you rather add an additional argument to read_delimited named skip_after_header?

-|-|-|-
0|1|2|3
4|5|6|7
0 1 2 3
4 5 6 7

@sjackman
Copy link
Author

sjackman commented Apr 3, 2015

@hadley I'd love for this function to make it into your CRAN submission. I'm happy to do what's needed to get it in. How are you feeling about this PR? Would you prefer adding the option skip_after_header to read_delimited to avoid repetition of code?

@hadley
Copy link
Member

hadley commented Sep 23, 2015

I think this looks ok, but I'd rather call it read_md_table() (or maybe read_table_md()?), and it needs a few unit tests.

@hadley
Copy link
Member

hadley commented Sep 28, 2015

If you can do by Friday, I can include in the next version of readr

@sjackman
Copy link
Author

Hi, Hadley. Yep, I think I can get this done by Friday. I'll need to add a skip_after_header parameter to both col_spec_standardise and read_delimited to skip the row of hyphens following the header line. Does that work for you?

@hadley
Copy link
Member

hadley commented Sep 30, 2015

Hmmm, I really want to avoid have skip before and after header parameters.

@sjackman
Copy link
Author

I can duplicate the code of read_delimited—I've already done so in this PR—but I'd rather not duplicate the code of col_spec_standardise. Any alternative suggestion?

@sjackman
Copy link
Author

I could see a skip_after_header option also having a use in loading the last million records of a billion-record TSV file.

@hadley
Copy link
Member

hadley commented Oct 12, 2015

Hmmm, I think that calls for a more generalised solution which I don't have time to think about at the moment. Lets try again for the next version.

@jimhester
Copy link
Collaborator

While this idea I think is interesting in practice the number of documents which contain just markdown tables and nothing else is very small, so it is hard to envision cases when this would be used.

Since it has been languishing for a number of years I think it should be closed.

@jimhester jimhester closed this Nov 13, 2018
@sjackman
Copy link
Author

No worries at all, Jim.

We discussed reading Markdown tables using rio over at gesistsa/rio#70
The final solution was to convert Markdown tables to HTML, and then read the HTML tables, which rio already supports.
As a bonus, rio supports reading multiple tables from a single document gesistsa/rio#126

@sjackman sjackman deleted the read_markdown_table branch November 13, 2018 22:12
@sjackman sjackman restored the read_markdown_table branch November 13, 2018 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants