UTF-8-BOM string parsing - header first name incorrectly enclosed in a double quote #840

Closed
icaptnbob opened this issue Oct 27, 2020 · 6 comments · Fixed by #961

@icaptnbob

When a file is encoded as UTF-8 with a BOM, PapaParse's CSV-to-JSON parsing returns records whose first object key is shown enclosed in single quotes. You then cannot reference the field called name (example below): record.name doesn't exist. The field appears as record.'name', which is not accessible in JavaScript using record.name, record['name'], etc. You can only see it by printing the record to the console or by using a for-in loop.

The subsequent object keys are correct without quotes.

Change the file encoding to UTF-8 and the keys are normal, without a quote.

"PapaConfig": {
    "quotes": true,
    "quoteChar": "\"",
    "escapeChar": "\"",
    "delimiter": ",",
    "header": true,
    "skipEmptyLines": true,
    "columns": null
}

Papa.parse(csvData, PapaConfig)

csvData (subset):

name,phone
De Akker Guest House,0514442010

UTF-8-BOM encoding:

[
  {
    'name': 'De Akker Guest House',
    phone: '0514442010',

UTF-8 encoding:

[
  {
    name: 'De Akker Guest House',
    phone: '0514442010',

Excel exports csv files to UTF-8-BOM, possibly because that encoding is supposedly faster and more reliable.
Can PapaParse be changed to handle UTF-8-BOM correctly?

@icaptnbob
Author

This seems to be related to the fact that I read the input file using fs.readFile(filename, 'utf-8') and this apparently doesn't strip off the BOM markers.

I found #407 after posting.

It would be useful if PapaParse would handle this itself instead.
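
A minimal sketch of why the first key looks quoted, assuming the header line begins with the BOM character U+FEFF. The record below is built by hand for illustration and is not actual PapaParse output:

```javascript
// Hand-built record mimicking a parsed row whose header began with a BOM.
const BOM = '\uFEFF';
const record = { [BOM + 'name']: 'De Akker Guest House', phone: '0514442010' };

const firstKey = Object.keys(record)[0];

console.log(record.name);                        // undefined: the real key is BOM + 'name'
console.log(firstKey.length);                    // 5, one more than 'name'.length
console.log(firstKey.charCodeAt(0) === 0xfeff);  // true
console.log(record[BOM + 'name']);               // De Akker Guest House
```

The quotes in the console output appear to come from Node printing any key that is not a plain identifier in quoted form; the BOM itself is invisible in most terminals.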

@icaptnbob
Author

Solved by removing the first character of the readFile output:
if (data.charCodeAt(0) === 0xfeff) {
    data = data.slice(1);
}

It may be useful to include this in PapaParse, to save many people encountering and struggling with this repeatedly.

@pokoli
Collaborator

pokoli commented Oct 28, 2020

Could you please submit a pull request that adds your code to PapaParse and adds a test to ensure the behaviour?

We should read the first character before setting the encoding and, if it is the BOM, remove it and force the encoding to UTF-8.
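
A rough sketch of that idea at the byte level, outside PapaParse. `decodeUtf8WithoutBom` is a hypothetical helper name, not an existing PapaParse or Node function; it assumes the input is a Node Buffer and only handles the UTF-8 BOM (bytes EF BB BF):

```javascript
// Hypothetical helper: if the buffer starts with the UTF-8 BOM bytes,
// skip them, then decode the rest as UTF-8.
function decodeUtf8WithoutBom(buf) {
  const hasBom =
    buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf;
  return (hasBom ? buf.subarray(3) : buf).toString('utf8');
}

// Simulate a file saved as UTF-8 with BOM.
const withBom = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from('name,phone\n', 'utf8'),
]);

console.log(withBom.toString('utf8').length);       // 12: the BOM survives plain decoding
console.log(decodeUtf8WithoutBom(withBom).length);  // 11: BOM removed
```

The resulting string could then be passed to Papa.parse as usual.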

@MikoSh95

Hi.

Is there any update regarding this issue? I believe I've also encountered it. Here is my case:

csv file content:

Id;Number;Account Type;Description
1;105-347-266;ASST;name1606195953751
2;107-397-393;ASST;name1606001642584
3;109-380-871;ASST;name1606059520118

my code:

let csvFile = fs.readFileSync('file.csv', 'utf8');

let csvFileContent = papa.parse(csvFile, {
    dynamicTyping: true,
    skipEmptyLines: true
});

assert.isNotEmpty(csvFileContent.data);
assert.sameMembers(csvFileContent.data[0], ['Id', 'Number', 'Account Type', 'Description']);

Output:

      throw new AssertionError(msg, {
      ^
AssertionError: expected [ Array(4) ] to have the same members as [ Array(4) ]
    at Object.<anonymous> (D:\Robocze\js-test\index.js:15:8)
    at Module._compile (internal/modules/cjs/loader.js:1137:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10)
    at Module.load (internal/modules/cjs/loader.js:985:32)
    at Function.Module._load (internal/modules/cjs/loader.js:878:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47 {
  showDiff: true,
  actual: [ 'Id', 'Number', 'Account Type', 'Description' ],
  expected: [ 'Id', 'Number', 'Account Type', 'Description' ]
}

I've run an additional check:

let expectedResult = ['Id', 'Number', 'Account Type', 'Description'];

csvFileContent.data[0].forEach((element, index) => {
    console.log(`${element} ${expectedResult[index]} ${expectedResult[index] === element}`)
})

with the following output:
[image: screenshot of the console output]

In the output picture, a whitespace character can be seen before 'Id', but it gets lost when I copy the output.
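
One way to make such a hidden character visible in copied text is to escape everything outside printable ASCII before logging. This is only a sketch; `showInvisible` is a hypothetical helper name, and the sample string assumes the stray character is the BOM (U+FEFF):

```javascript
// Hypothetical helper: replace every character outside printable ASCII
// with a \u{...} escape so it survives copy and paste.
function showInvisible(str) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    const printableAscii = cp >= 0x20 && cp <= 0x7e;
    return printableAscii ? ch : '\\u{' + cp.toString(16) + '}';
  }).join('');
}

console.log(showInvisible('\uFEFFId'));  // \u{feff}Id
console.log('\uFEFFId' === 'Id');        // false, which is why sameMembers fails
```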

@duhmojo

duhmojo commented Jan 19, 2022

I'm having the same issue in 2022. I was given an external CSV file, probably edited or written on Windows. Processing it on Linux with papaparse, I was unable to access the first property defined by the header. When I console.log(row.data), I see the property key quoted:

{
  'CID': '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I edited the original CSV, simply retyped the first character in the header, then reran:

{
  CID: '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I'm using const csvFile = fs.createReadStream(csvFilename); and I tried switching to const csvFile = fs.readFileSync(csvFilename, { encoding: 'utf-8'}); without luck. I read that the BOM was supposed to be stripped by readFileSync, but it isn't, for me at least: nodejs/node-v0.x-archive#1918

@duhmojo

duhmojo commented Jan 20, 2022

I went with this approach, not the most efficient:

        const stripBom = function(str) {
                if (str.charCodeAt(0) === 0xfeff) {
                    return str.slice(1)
                }
                return str
        }

        papaparse.parse(csvFile, {
            step: function(row, parser) {
                ...
                const data = Object.fromEntries(
                    Object.entries(row.data).map(([k, v]) => [stripBom(k), v])
                )

Since csvFile is a read stream, not a pre-read file, I just tossed it in there for each step. I could do it only for the first step and skip it if it's anything but the first row.
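
An alternative sketch that avoids remapping keys on every step, assuming PapaParse's `transformHeader` option is available alongside `header: true` (it is documented for PapaParse 5; worth checking against the version in use). `stripBomFromHeader` and `parseConfig` are hypothetical names, and the parse call is left as a comment so the snippet runs without the library installed:

```javascript
// Hypothetical header transform: drop a leading BOM (U+FEFF) from a header name.
function stripBomFromHeader(header) {
  return header.charCodeAt(0) === 0xfeff ? header.slice(1) : header;
}

// Assumed config shape; transformHeader is applied to each header name.
const parseConfig = {
  header: true,
  skipEmptyLines: true,
  transformHeader: stripBomFromHeader,
};

// Usage (not executed here, requires papaparse and a csvFile stream):
// papaparse.parse(csvFile, { ...parseConfig, step: (row) => { /* use row.data */ } });

console.log(stripBomFromHeader('\uFEFFCID'));  // CID
console.log(stripBomFromHeader('Section'));    // Section
```

With this approach the header names are cleaned as they are read, so the row objects should already carry BOM-free keys when they reach step.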
