CSV Parse breaks on comment characters that are also in rows #415

Closed
jlarmstrongiv opened this issue Feb 6, 2024 · 4 comments

@jlarmstrongiv

jlarmstrongiv commented Feb 6, 2024

Describe the bug

I am using [email protected] and it does not handle comments correctly. A line that starts with # should be treated as a comment, but rows further down that contain # then fail to parse.
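
To make the two possible readings of a comment character concrete, here is a minimal sketch. It is a rough model only, not csv-parse's code: `stripComment` is a hypothetical helper, and the sample rows are made up for illustration rather than copied from countryInfo.txt.

```typescript
// Hypothetical helper, not part of csv-parse: drops the comment part of one line.
// lineStartOnly = false: the comment character starts a comment anywhere in the line.
// lineStartOnly = true: it only counts when it is the first character of the line.
function stripComment(line: string, commentChar: string, lineStartOnly: boolean): string {
  const index = line.indexOf(commentChar);
  if (index === -1) return line; // no comment character at all
  if (lineStartOnly && index !== 0) return line; // infix "#" is ordinary data
  return line.slice(0, index);
}

// Illustrative tab-separated lines (assumed, not real file content).
const commentLine = "# ISO\tPostal Code Format\tCountry";
const dataRow = "US\t#####-####\tUnited States";

// A leading "#" removes the whole line under both readings.
const strippedComment = stripComment(commentLine, "#", true);

// Reading 1: "#" anywhere truncates the row, so it loses a column.
const anywhereCount = stripComment(dataRow, "#", false).split("\t").length; // 2 fields

// Reading 2: "#" only at line start leaves the row intact.
const lineStartCount = stripComment(dataRow, "#", true).split("\t").length; // 3 fields
```

Under the first reading a row with # in the Postal Code Format column is cut short, which is the kind of uneven column count described in this report.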

To Reproduce

Try to parse https://download.geonames.org/export/dump/countryInfo.txt, which begins with 50 lines of comments. The Postal Code Format column contains many # characters.

const defaultParseOptions: ParseOptions = {
  bom: true,
  cast: true,
  columns: false,
  // comment: "#",
  // comment_no_infix: true,
  delimiter: "\t",
  escape: null,
  groupColumnsByName: false,
  quote: null,
  record_delimiter: ["\n", "\r", "\r\n"],
  relax_quotes: true,
  skip_empty_lines: true,
};

Additional context

Workaround

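// Imports the snippet below relies on (not shown in the original report).
import * as fs from "node:fs";
import * as readline from "node:readline";
import { Readable, Transform, type TransformCallback } from "node:stream";
import { parse } from "csv-parse";

// Drops blank lines and lines starting with "#" before they reach the parser.
class RemoveCommentTransform extends Transform {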
  override _transform(
    chunk: any,
    _encoding: BufferEncoding,
    callback: TransformCallback,
  ): void {
    const line = String(chunk);
    if (line.trim() !== "" && !line.startsWith("#")) {
      callback(null, line + "\n");
    } else {
      callback(null);
    }
  }
}

  const readableStream = fs.createReadStream(filePath);
  const readlineIterator = readline.createInterface({
    crlfDelay: Number.POSITIVE_INFINITY,
    input: readableStream,
  });
  const readlineStream = Readable.from(readlineIterator);
  const removeCommentTransform = new RemoveCommentTransform();

  const parser = parse({
    ...defaultParseOptions,
    ...parseOptions,
  });

  readlineStream.pipe(removeCommentTransform).pipe(parser);
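
The same filtering can also be written without a Transform subclass. The following is only a sketch of that alternative, assuming the lines arrive from an async iterable such as the readline interface above; `dataLines` is a hypothetical name, not something from csv-parse or Node.js.

```typescript
// Hypothetical helper: yields only data lines from a stream of lines.
// Blank lines and lines that start with "#" are skipped; the newline that
// readline strips is added back so the parser still sees record delimiters.
async function* dataLines(
  lines: AsyncIterable<string> | Iterable<string>,
): AsyncGenerator<string> {
  for await (const line of lines) {
    if (line.trim() !== "" && !line.startsWith("#")) {
      yield line + "\n";
    }
  }
}
```

With this in place, `Readable.from(dataLines(readlineIterator)).pipe(parser)` would stand in for the last pipe line of the workaround. A # that appears later in a row is left untouched, since only the first character of each line is checked.
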
@wdavidw
Member

wdavidw commented Feb 8, 2024

I can parse the referenced file without much trouble, unless I missed something. See 28088bc#diff-d920cfdd1103bd1f5bf21e5a0dea02fb7fdef4b77ce5ef5bae5abbc84d4b544d

@jlarmstrongiv
Author

jlarmstrongiv commented Feb 10, 2024

relax_column_count should be false though, as all the rows have the same number of columns. Those rows end up with the wrong number of columns because of the incorrect comment detection.

@wdavidw
Member

wdavidw commented Feb 28, 2024

Hi @jlarmstrongiv, your last comment pointed me in the right direction. There was a bug with the comment_no_infix option. If you look at the sample now, relax_column_count has been removed and comment_no_infix is used instead. Version 5.5.5 of csv-parse ships the fix.
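
The updated sample is not reproduced in this thread, so the following is only a sketch of what the relevant options might look like once the fix is installed. The object name `fixedParseOptions` is hypothetical, and the values are assumptions based on the options discussed above.

```typescript
// Assumed option set, not copied from the maintainer's sample.
const fixedParseOptions = {
  comment: "#", // treat "#" as the comment character
  comment_no_infix: true, // but only when it is the first character of a line
  delimiter: "\t", // countryInfo.txt is tab-separated
  skip_empty_lines: true,
  // relax_column_count is not set, so uneven rows are still reported as errors
};
```

Passing an object like this to `parse` from csv-parse 5.5.5 or later should make the RemoveCommentTransform workaround unnecessary.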

@jlarmstrongiv
Author

@wdavidw thank you! That’s great to hear
