Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE #14

brunchboy · 2020-07-12T03:43:54Z

I heard from another person who is implementing parsing of PDB files, and he was working with some Russian text, and discovered we were wrong to think these strings are UTF-16LE. Here is what he said, and I validated this by creating a playlist containing the same string in its name:

I could have something off, but here is what I'm seeing about the strings. My PDB includes a Russian song called Покинула чат ("left the chat"). The first letter here is U+041F. All the Cyrillic letters start with 04, but the spacebar between the words is the same U+0020 as in English. Here's how the track name looks in the pdb in hex:

If I skip the 0 and read little endian, I get back the desired "Покинула чат"

If I don't skip and read big endian, I get back the incorrect " окинулаРGат" It gets a lot of the letters right because there is usually a 04 every other byte, but the first letter (which turns out as U+001F "Information separator one") and the characters around the space get messed up (because of the momentary switch from leading 04 to leading 00).
English titles come out right either way, because the leading 00s for each ASCII character in UTF-16 make it forgiving.

brunchboy self-assigned this Jul 12, 2020

brunchboy closed this as completed in 22db2cf Jul 12, 2020

0xsuk mentioned this issue May 26, 2024

Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE mixxxdj/mixxx#13291

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE #14

Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE #14

brunchboy commented Jul 12, 2020

Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE #14

Unicode DeviceSQL strings in PDB are UTF-16-BE, not UTF-16-LE #14

Comments

brunchboy commented Jul 12, 2020