Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat and hash row validations fail for values with trailing newline #1411

Open
ghhpl opened this issue Jan 28, 2025 · 2 comments · May be fixed by #1415
Open

concat and hash row validations fail for values with trailing newline #1411

ghhpl opened this issue Jan 28, 2025 · 2 comments · May be fixed by #1415
Assignees
Labels
priority: p1 High priority. Fix may be included in the next release.

Comments

@ghhpl
Copy link

ghhpl commented Jan 28, 2025

A hash comparison of rows from two tables between Oracle 19c and Postgres 15 failed with Data Validator 7.1
We have used the concat comparison to identify the root cause: a VARCHAR2(140) column called "rmtinf" contains ASCII characters followed by a single trailing newline both in Oracle and Postgres.

The concat value created by Data Validation for Oracle contains the trailing newline but the concat value for Postgres does not.

Oracle: rtrim(t2.ifnull__cast__rmtinf)
Postgres: rtrim(t2.ifnull__cast__rmtinf, :rtrim_1)

err.txt
out.txt
validate-concat-rmtinf.txt

@nj1973 nj1973 self-assigned this Jan 29, 2025
@nj1973 nj1973 added the priority: p1 High priority. Fix may be included in the next release. label Jan 29, 2025
@nj1973 nj1973 linked a pull request Jan 29, 2025 that will close this issue
@nj1973
Copy link
Contributor

nj1973 commented Jan 29, 2025

When we build the row validation comparison string we include an rstrip expression as described here: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/README.md#hash-concat-and-comparison-fields

As it appears we have inconsistencies in which characters are stripped I first need to survey all supported engines to see what characters are removed.

We should also discuss why we have rtrim in there at all.

@nj1973
Copy link
Contributor

nj1973 commented Jan 30, 2025

It appears to me recent Ibis versions attempt to trim all white space:

https://github.com/ibis-project/ibis/blob/decb96de6d6786afe4a23f6b14e3f3053860b9c8/ibis/backends/sql/compilers/base.py#L988

   def visit_RStrip(self, op, *, arg):
        return self.f.rtrim(arg, string.whitespace)

So a lot of our inconsistencies might be due to the version of Ibis we support.

I've surveyed our current support:

System      Characters removed List controllable
----------- ------------------ ---------------------------
BigQuery    All whitespace     Yes
DB2         Space              Yes, but docs are confusing
Hive        Space              No
Impala      Space              Yes
MySQL       Space              No
Oracle      Space              Yes
PostgreSQL  All whitespace     Yes (default is space only)
Snowflake   Space              Yes
Spanner     All whitespace     Yes
SQL Server  Space              Yes
Teradata    All whitespace     Yes (default is space only)

There are maybe two problems here:

  1. Our support should be consistent which probably means all RStrip expressions should attempt to remove all whitespace. Except for Hive and MySQL which might be uncontrollable.
  2. Is it correct to trim white space before hashing rows? When migrating data, trailing white space is just the sort of awkward data that might be lost/problematic. Perhaps RStrip should only be included under an option and not by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p1 High priority. Fix may be included in the next release.
Projects
None yet
2 participants