Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COPY ... FROM STDIN writes incorrect data for BYTES/BYTEA columns #100299

Closed
tommilligan opened this issue Mar 31, 2023 · 4 comments
Closed

COPY ... FROM STDIN writes incorrect data for BYTES/BYTEA columns #100299

tommilligan opened this issue Mar 31, 2023 · 4 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner

Comments

@tommilligan
Copy link
Contributor

tommilligan commented Mar 31, 2023

Describe the problem

COPY ... FROM STDIN of BYTES/BYTEA columns displays different behaviour from postgres, and does not roundtrip correctly. The data written is a literal hex escaped form of the underlying bytes. For example, b'aaaa' is roundtripped as b'\\x61616161'

This sample is reproduced with the Python psycopg3 client driver, but I have confirmed that the same behaviour is seen with manual SQL usage.

To Reproduce

  • Install python dependencies on Python 3
pip install psycopg==3.1.8 psycopg-binary==3.1.8 psycopg-pool==3.1.6
  • Save the below python script as repro.py, which shows the behaviour of trying to write data with COPY vs INSERT
#!/usr/bin/env python
# repro.py

from typing import List, Tuple

from psycopg_pool import ConnectionPool

CONNECTION_STRING = "postgres://root:[email protected]:26258/defaultdb"
INPUT_DATA = [(b"aaaa", "all As"), (b"", "empty data")]

Row = Tuple[bytes, str]


class Demo:
    def __init__(self, pool: ConnectionPool) -> None:
        self._pool = pool

    def drop_table(self) -> None:
        with self._pool.connection() as connection:
            connection.execute("DROP TABLE IF EXISTS demo")

    def create_table(self) -> None:
        with self._pool.connection() as connection:
            connection.execute(
                """CREATE TABLE demo (
                    data BYTEA NOT NULL,
                    description VARCHAR NOT NULL
                )
                """
            )

    def select_all(self) -> List[Row]:
        with self._pool.connection() as connection:
            cursor = connection.execute("SELECT data, description FROM demo")
            return cursor.fetchall()

    def copy(self, rows: List[Row]) -> None:
        with self._pool.connection() as connection:
            with connection.cursor() as cursor:
                with cursor.copy(
                    "COPY demo (data, description) FROM STDIN"
                ) as copy:
                    # Note: same behaviour with or without .set_types
                    # copy.set_types(("bytea", "varchar"))
                    for row in rows:
                        copy.write_row(row)

    def insert(self, rows: List[Row]) -> None:
        with self._pool.connection() as connection:
            with connection.cursor() as cursor:
                cursor.executemany(
                    "INSERT INTO demo (data, description) VALUES (%s, %s)",
                    rows,
                )


def main() -> None:
    pool = ConnectionPool(CONNECTION_STRING)
    demo = Demo(pool)

    print("Original data:")
    print(INPUT_DATA)
    print()

    print("Write with: 'COPY ... FROM STDIN'")
    demo.drop_table()
    demo.create_table()
    demo.copy(INPUT_DATA)
    print(demo.select_all())
    print()

    print("Write with: 'INSERT INTO ... VALUES ...'")
    demo.drop_table()
    demo.create_table()
    demo.insert(INPUT_DATA)
    print(demo.select_all())
    print()


if __name__ == "__main__":
    main()

Postgres

Run database

$ docker run --rm -p 26258:39104 -e POSTGRES_USER=root -e POSTGRES_PASSWORD=supersecret -e POSTGRES_DB=defaultdb eu.gcr.io/reinfer-gcr/postgres -p 39104

Show expected behaviour - bytes roundtrip correctly.

$ python repro.py
Original data:
[(b'aaaa', 'all As'), (b'', 'empty data')]

Write with: 'COPY ... FROM STDIN'
[(b'aaaa', 'all As'), (b'', 'empty data')]

Write with: 'INSERT INTO ... VALUES ...'
[(b'aaaa', 'all As'), (b'', 'empty data')]

Cockroach

Run database

docker run --rm --env COCKROACH_DATABASE=defaultdb --env COCKROACH_USER=root --env COCKROACH_PASSWORD=supersecret -p 26258:26257 cockroachdb/cockroach:latest start-single-node --insecure

Show incorrect behaviour - bytes are escaped and do not roundtrip correctly.

$ python repro.py
Original data:
[(b'aaaa', 'all As'), (b'', 'empty data')]

Write with: 'COPY ... FROM STDIN'
[(b'\\x61616161', 'all As'), (b'\\x', 'empty data')]

Write with: 'INSERT INTO ... VALUES ...'
[(b'aaaa', 'all As'), (b'', 'empty data')]

Expected behavior

Identical behaviour between cockroach/postgres backends.

Bytes data should be roundtripped, and not returned in escaped format.

Environment:

Cockroach running in docker under Ubuntu 20.04.

$ cockroach version
Build Tag:        v22.2.7
Build Time:       2023/03/28 19:47:29
Distribution:     CCL
Platform:         linux amd64 (x86_64-pc-linux-gnu)
Go Version:       go1.19.6
C Compiler:       gcc 6.5.0
Build Commit ID:  cbe33971930ee972e639203a73f14240b92aeff4
Build Type:       release

Python 3.8.10 with the following dependencies:

psycopg==3.1.8
psycopg-binary==3.1.8
psycopg-pool==3.1.6

Additional context

Impact: cannot load data using COPY ... FROM STDIN as documented with BYTEA column type. This makes copying bulk data fail, which makes bulk inserts as documented here incorrect: https://www.cockroachlabs.com/docs/v22.2/copy-from

I do not believe this use case in unsupported (it does not appear in the unsupported syntax here: https://www.cockroachlabs.com/docs/v22.2/copy-from#unsupported-syntax)

Jira issue: CRDB-26346

@tommilligan tommilligan added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 31, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 31, 2023

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-sessions (found keywords: pg_)
  • @cockroachdb/sql-schema (found keywords: DROP TABLE)
  • @cockroachdb/sql-queries (found keywords: import)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Mar 31, 2023
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Mar 31, 2023
@rafiss
Copy link
Collaborator

rafiss commented Apr 4, 2023

This problem also came up before in #68804 (comment)

Right now, to use bytea values with COPY, they should be octal-escaped.

Other bytea escaping rules behave differently in CockroachDB versus Postgres. This is described in detail at #26128. I do think it would be nice for us to eliminate this divergence, but I'm not sure how to do so without breaking a lot of existing applications.

@tommilligan
Copy link
Contributor Author

Ah, thank you. Apologies for not finding the other issues.

I'm happy to see if I can workaround this by using the octal escape format as is suggested in the linked thread - I will see if I can port something upstream to the psycopg package. Happy for this to be closed as tracked in other issues.

@rafiss
Copy link
Collaborator

rafiss commented Apr 4, 2023

Thanks for reporting it nonetheless! Issues are always welcome.

@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-queries SQL Queries Team X-blathers-triaged blathers was able to find an owner
Projects
Archived in project
Development

No branches or pull requests

2 participants