UTF-8-BOM string parsing - header first name incorrectly enclosed in a double quote #840

icaptnbob · 2020-10-27T23:31:43Z

When a file is encoded as UTF-8-BOM, PapaParse CSV-to-JSON parsing incorrectly returns records whose first object key is enclosed in single quotes. The field called name (example below) can then no longer be referenced: record.name doesn't exist. The field is record.'name', which is not accessible in JavaScript using record.name or record['name']. You can only see it by printing the record to the console or by using a for-in loop.

The subsequent object keys are correct without quotes.

Change the file encoding to UTF-8 and the keys are normal, without a quote.

"PapaConfig": {
    "quotes": true,
    "quoteChar": "\"",
    "escapeChar": "\"",
    "delimiter": ",",
    "header": true,
    "skipEmptyLines": true,
    "columns": null
}

Papa.parse(csvData, PapaConfig)

csvData (subset):

name,phone
De Akker Guest House,0514442010

UTF-8-BOM encoding:

[
  {
    'name': 'De Akker Guest House',
    phone: '0514442010',

UTF-8 encoding:

[
  {
    name: 'De Akker Guest House',
    phone: '0514442010',

Excel exports csv files to UTF-8-BOM, possibly because that encoding is supposedly faster and more reliable.
Can PapaParse be changed to handle UTF-8-BOM correctly?
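
A likely explanation for the quoted key, consistent with the follow-up comments in this thread: the first header cell still carries the byte order mark (U+FEFF), so the key is "\ufeffname" rather than "name", and the console quotes keys that are not plain identifiers. A minimal sketch, using a hand-built record as an assumed stand-in for the PapaParse output:

```javascript
// Hand-built stand-in for the first parsed record; the leading \ufeff is the BOM.
const record = { '\ufeffname': 'De Akker Guest House', phone: '0514442010' };

const firstKey = Object.keys(record)[0];

console.log(record.name);                       // undefined: no key is exactly "name"
console.log(firstKey.length);                   // 5, not 4: the BOM counts as one character
console.log(firstKey.charCodeAt(0) === 0xfeff); // true
console.log(record['\ufeffname']);              // De Akker Guest House
```

The value is still there; it is only reachable through the full key, including the invisible character.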


icaptnbob · 2020-10-27T23:44:28Z

This seems to be related to the fact that I read the input file using fs.readFile(filename, 'utf-8') and this apparently doesn't strip off the BOM markers.

I found #407 after posting.

It would be useful if PapaParse would handle this itself instead.
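
For context, a small sketch of the behaviour described here. Decoding a buffer as UTF-8 in Node keeps the BOM as a leading U+FEFF character, whereas TextDecoder drops it by default. The buffer is built by hand as an assumed stand-in for a file saved as UTF-8-BOM.

```javascript
// Bytes of a UTF-8-BOM file: EF BB BF followed by the CSV text.
const bytes = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from('name,phone\n', 'utf8'),
]);

// Plain UTF-8 decoding, as a 'utf-8' read does: the BOM survives as U+FEFF.
const viaToString = bytes.toString('utf8');
console.log(viaToString.charCodeAt(0) === 0xfeff); // true

// TextDecoder removes a leading BOM unless ignoreBOM is set to true.
const viaDecoder = new TextDecoder('utf-8').decode(bytes);
console.log(viaDecoder.startsWith('name')); // true
```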

icaptnbob · 2020-10-27T23:53:44Z

Solved by removing the first character of the readFile output:
// Drop the byte order mark (U+FEFF) if it is the first character.
if (data.charCodeAt(0) === 0xfeff) {
    data = data.slice(1);
}

It may be useful to include this in PapaParse, to save many people encountering and struggling with this repeatedly.

pokoli · 2020-10-28T09:16:45Z

Could you please submit a pull request that adds your code to papaparse and adds a test to ensure the behaviour?

We should read the first character before setting the encoding and, if it is the BOM, remove it and force the encoding to UTF-8.
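
One way the suggestion above could look at the byte level, as a rough sketch only. The function name decodeCsvBuffer and its fallbackEncoding parameter are hypothetical and not part of PapaParse; the change that eventually landed may work differently.

```javascript
// Hypothetical helper, not part of PapaParse.
// If the buffer starts with the UTF-8 BOM (EF BB BF), skip it and force UTF-8;
// otherwise decode with whatever encoding the caller asked for.
function decodeCsvBuffer(buffer, fallbackEncoding = 'utf8') {
  const hasUtf8Bom =
    buffer.length >= 3 &&
    buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf;
  if (hasUtf8Bom) {
    return buffer.subarray(3).toString('utf8');
  }
  return buffer.toString(fallbackEncoding);
}

const withBom = Buffer.from([0xef, 0xbb, 0xbf, 0x6e, 0x61, 0x6d, 0x65]); // BOM + "name"
console.log(decodeCsvBuffer(withBom, 'latin1')); // name

const latin1Only = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in Latin-1, no BOM
console.log(decodeCsvBuffer(latin1Only, 'latin1')); // café
```

Checking the raw bytes before decoding means the requested encoding is only used when no BOM is present.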

MikoSh95 · 2020-11-24T09:43:58Z

Hi.

Is there any update regarding this issue? I believe I've also encountered it. Here is my case:

csv file content:

Id;Number;Account Type;Description
1;105-347-266;ASST;name1606195953751
2;107-397-393;ASST;name1606001642584
3;109-380-871;ASST;name1606059520118

my code:

// readFileSync is synchronous and takes no callback.
let csvFile = fs.readFileSync('file.csv', 'utf8');

let csvFileContent = papa.parse(csvFile, {
    dynamicTyping: true,
    skipEmptyLines: true
});

assert.isNotEmpty(csvFileContent.data);
assert.sameMembers(csvFileContent.data[0], ['Id', 'Number', 'Account Type', 'Description']);

Output:

      throw new AssertionError(msg, {
      ^
AssertionError: expected [ Array(4) ] to have the same members as [ Array(4) ]
    at Object.<anonymous> (D:\Robocze\js-test\index.js:15:8)
    at Module._compile (internal/modules/cjs/loader.js:1137:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10)
    at Module.load (internal/modules/cjs/loader.js:985:32)
    at Function.Module._load (internal/modules/cjs/loader.js:878:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47 {
  showDiff: true,
  actual: [ 'Id', 'Number', 'Account Type', 'Description' ],
  expected: [ 'Id', 'Number', 'Account Type', 'Description' ]
}

I've run an additional check:

let expectedResult = ['Id', 'Number', 'Account Type', 'Description'];

csvFileContent.data[0].forEach((element, index) => {
    console.log(`${element} ${expectedResult[index]} ${expectedResult[index] === element}`)
})

with the following output:

In the output picture, a whitespace character can be seen before 'Id', but it gets lost when I copy the output.
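
Since the extra character disappears when the output is copied, printing code points can make it visible. A small debugging sketch; the codePoints helper is hypothetical, and the '\ufeffId' string is an assumed stand-in for csvFileContent.data[0][0]:

```javascript
// Hypothetical debugging helper: list each character as a U+XXXX code point.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const parsedHeader = '\ufeffId'; // assumed stand-in for csvFileContent.data[0][0]

console.log(parsedHeader === 'Id');    // false
console.log(codePoints(parsedHeader)); // [ 'U+FEFF', 'U+0049', 'U+0064' ]
console.log(codePoints('Id'));         // [ 'U+0049', 'U+0064' ]
```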

duhmojo · 2022-01-19T21:18:31Z

I'm having the same issue in 2022. I was given an external CSV file, probably edited/written on Windows, and when processing it on Linux with papaparse I was unable to access the first property defined by the header. When I console.log(row.data), I see the property key quoted:

{
  'CID': '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I edited the original CSV and simply retyped the first character in the header, then reran:

{
  CID: '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I'm using const csvFile = fs.createReadStream(csvFilename); and I tried switching to const csvFile = fs.readFileSync(csvFilename, { encoding: 'utf-8'}); without luck. I read that readFileSync was supposed to strip the BOM, but it doesn't, at least not for me: nodejs/node-v0.x-archive#1918

duhmojo · 2022-01-20T19:50:28Z

I went with this approach, not the most efficient:

        const stripBom = function(str) {
                if (str.charCodeAt(0) === 0xfeff) {
                    return str.slice(1)
                }
                return str
        }

        papaparse.parse(csvFile, {
            step: function(row, parser) {
                ...
                const data = Object.fromEntries(
                    Object.entries(row.data).map(([k, v]) => [stripBom(k), v])
                )

Since csvFile is a read stream, not a pre-read file, I just tossed it in there for each step. I could do it only for the 1st step and skip it for anything but the 1st row.
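
A possible alternative to remapping every row, sketched under the assumption that the installed PapaParse version supports the transformHeader option together with header: true. The helper name stripBomFromHeader is hypothetical, and the parse call is shown only as a comment because it needs papaparse and a real file.

```javascript
// Hypothetical helper meant to be passed as PapaParse's `transformHeader` option.
function stripBomFromHeader(header) {
  return header.charCodeAt(0) === 0xfeff ? header.slice(1) : header;
}

// Intended usage (not executed here):
//   papaparse.parse(csvFile, {
//     header: true,
//     transformHeader: stripBomFromHeader,
//     step: function (row, parser) { /* row.data keys should arrive cleaned */ },
//   });

console.log(stripBomFromHeader('\ufeffCID')); // CID
console.log(stripBomFromHeader('Section'));   // Section
```

This keeps the cleanup in the header names rather than in each step callback.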

pokoli added the enhancement, help wanted, file reader, and needs tests labels on Oct 28, 2020

Clonkex mentioned this issue Jul 6, 2022

skipEmptyLines unable to remove ZERO WIDTH SPACE #917

Closed

peteruithoven mentioned this issue Nov 23, 2022

Handle parsing utf-8 bom encoded files #961

Merged

pokoli closed this as completed in #961 Nov 25, 2022
