Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Timestamp type in xlang JDBC Read and Write #22561

Merged
merged 6 commits into from
Aug 29, 2022

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented Aug 2, 2022

Part of #19817

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@Abacn
Copy link
Contributor Author

Abacn commented Aug 2, 2022

Run Python 3.8 PostCommit

@Abacn Abacn force-pushed the portableinstantcoder branch 4 times, most recently from 661be51 to 77a42c6 Compare August 2, 2022 21:09
@Abacn
Copy link
Contributor Author

Abacn commented Aug 2, 2022

Run Python 3.8 PostCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

Run Python 3.8 PostCommit

@codecov
Copy link

codecov bot commented Aug 3, 2022

Codecov Report

Merging #22561 (b0484e7) into master (0c82583) will increase coverage by 0.04%.
The diff coverage is 91.30%.

@@            Coverage Diff             @@
##           master   #22561      +/-   ##
==========================================
+ Coverage   73.69%   73.73%   +0.04%     
==========================================
  Files         713      713              
  Lines       94821    94974     +153     
==========================================
+ Hits        69874    70030     +156     
+ Misses      23648    23645       -3     
  Partials     1299     1299              
Flag Coverage Δ
python 83.55% <91.30%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/python/apache_beam/typehints/schemas.py 93.84% <89.47%> (-0.31%) ⬇️
sdks/python/apache_beam/coders/row_coder.py 94.82% <100.00%> (+0.13%) ⬆️
sdks/python/apache_beam/portability/common_urns.py 100.00% <100.00%> (ø)
sdks/python/apache_beam/runners/direct/executor.py 96.46% <0.00%> (-0.55%) ⬇️
...pi/org/apache/beam/model/pipeline/v1/schema_pb2.py 100.00% <0.00%> (ø)
...g/apache/beam/model/pipeline/v1/schema_pb2_urns.py 100.00% <0.00%> (ø)
sdks/python/apache_beam/dataframe/io.py 89.26% <0.00%> (+<0.01%) ⬆️
...ks/python/apache_beam/runners/worker/sdk_worker.py 89.09% <0.00%> (+0.15%) ⬆️
...hon/apache_beam/runners/worker/bundle_processor.py 93.79% <0.00%> (+0.49%) ⬆️
sdks/python/apache_beam/io/fileio.py 96.67% <0.00%> (+0.68%) ⬆️
... and 4 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Abacn Abacn force-pushed the portableinstantcoder branch from dd7cb95 to aec0fd8 Compare August 3, 2022 03:04
@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

Run Python 3.8 PostCommit

@Abacn Abacn changed the title support beam:logical_type:datetime:v1 in python sdk support Timestamp type in xlang JDBC Read and Write Aug 3, 2022
@Abacn Abacn changed the title support Timestamp type in xlang JDBC Read and Write Support Timestamp type in xlang JDBC Read and Write Aug 3, 2022
@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

Run Python 3.8 PostCommit

@Abacn Abacn marked this pull request as ready for review August 3, 2022 18:19
@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2022

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@Abacn Abacn force-pushed the portableinstantcoder branch from c8647eb to 8d4575a Compare August 3, 2022 20:33
@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

Rebased in order to update CHANGES.md. ready for review now.

@Abacn
Copy link
Contributor Author

Abacn commented Aug 3, 2022

Run Python 3.8 PostCommit

@@ -647,9 +648,36 @@ def _from_typing(cls, typ):
('micros', np.int64)])


@LogicalType.register_logical_type
class DateTimeLogicalType(NoArgumentLogicalType[Timestamp, np.int64]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? I thought the solution here was to use the MicrosInstant logical type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. removed this class the test still pass and confirmed with database content.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah messed up local virtualenv. Remove @LogicalType.register_logical_type or the entire class here causes the following error:

Traceback (most recent call last):
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 408, in typing_from_runner_api
    field_py_type = self.typing_from_runner_api(field.type)
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 362, in typing_from_runner_api
    base = self.typing_from_runner_api(base_type)
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 447, in typing_from_runner_api
    return LogicalType.from_runner_api(
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 621, in from_runner_api
    raise ValueError(
ValueError: No logical type registered for URN 'beam:logical_type:datetime:v1'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
...
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 179, in typing_from_runner_api
    return SchemaTranslation(
  File "...virtualenv/py38beam/lib/python3.8/site-packages/apache_beam/typehints/schemas.py", line 412, in typing_from_runner_api
    raise ValueError(
ValueError: Failed to decode schema due to an issue with Field proto:

name: "f_timestamp"
type {
  nullable: true
  logical_type {
    urn: "beam:logical_type:datetime:v1"
    representation {
      atomic_type: INT64
    }
  }
}
id: 1
encoding_position: 1

Still need this to register the URN.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused though, how did we get a field with URN beam:logical_type:datetime:v1? Based on the change in JdbcUtil I thought the intention here was to map JDBC timestamp columns to the MicrosInstant logical type, with URN beam:logical_type:micros_instant:v1, in order to avoid the proiblematic InstantCoder in DATETIME.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in JdbcUtil and Python are for different purposes. Basically there are two things missing for java<->portable timestamp: (1) Java JDBCIO support of portable timestamp; (2) Python support for java timestamp.

The JdbcUtil change is for write to jdbc. It adds support in Java's JDBCIO of converting portable timestamp type (beam:logical_type:micros_instant:v1) to sql timestamp;

The Python coder and schema change is for read from jdbc. It adds supports in Python's RowCoder of identifying (by schema change) and decoding (by coder change) the java timestamp type (datetime) to language timestamp. The field 'beam:logical_type:datetime:v1' is passed from java here when the row contains datetime field type.

Copy link
Contributor Author

@Abacn Abacn Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to change coder for Instant in java side but it did not quite work (see comment here) and ended up adding support for InstantCoder coded timestamp in Python side. I understand this is less elegant but it does not touch java Instant coders which is used everywhere not only in schema.

Copy link
Member

@TheNeuralBit TheNeuralBit Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make portable timestamp and java timestamp the same type? I think it would be preferable to produce micros_instant for the timestamp type in the JDBC read. I'd really like to avoid having Java's DATETIME primitive leak into Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is preferable in terms of consistency. To do this we will need to make distinction of timestamp in schema based IOs (including JDBC) and windowing timestamp, making the former produce micros_instant instead of joda.Instant, which is similar to the approach here #11456. However given the scope of breaking changes this may leave as next major version change...

If I understood correctly it's not leaking Java's DATETIME primitive (Schema.FieldType DATETIME), the schema passed to python is the logical type beam:logical_type:datetime:v1 that not yet implemented in python. However the coder that used to decode this type is actually available in python which is TimestampCoder. This is still in line with Portable Beam Schema design doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a new logical type then for use in JDBC read and write that uses millisecond precision, and maps to joda.Instant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now renamed this usage to millis_instant logical type

// MicrosInstant uses native java.time.Instant instead of joda.Instant.
java.time.Instant value =
element.getLogicalTypeValue(fieldWithIndex.getIndex(), java.time.Instant.class);
ps.setTimestamp(i + 1, value == null ? null : new Timestamp(value.toEpochMilli()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had a more general solution for mapping these to java time vs. joda time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a mapping from java.time.Instant to java.sql.Timestamp, joda.time is not involved. The comment was a clarification because I see joda.time.Instant is usually used in the codebase (that's why I did not declare import Instant but write down the whole class name here)

Do you think we need a general Row.getJavaInstant() as Row.getDateTime() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, my mistake. Regardless there's no action needed here, I was just lamenting that this date/time type situation is messy.

@Abacn Abacn requested a review from TheNeuralBit August 8, 2022 18:59
# Special case for datetime logical type.
# DATETIME_URN explicitly uses TimestampCoder which deals with fix length
# 8-bytes big-endian-long instead of varint coder.
return TimestampCoder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do implement beam:logical_type:datetime in Python we should:

  • Add the logical type in schema.proto, and reference the URN from that file:
  • Consider if we can keep the logic contained in the logical type implementation, e.g. can we do the byte flipping in a to_language_type and to_representation_type implementation?
  • Add some test cases to standard_coders.yaml, like this one:
    coder:
    urn: "beam:coder:row:v1"
    # f_timestamp: logical(micros_instant), f_string: string, f_int: int64
    payload: "\n\x7f\n\x0bf_timestamp\x1ap:n\n#beam:logical_type:micros_instant:v1\x1aG2E\nC\n\r\n\x07seconds\x1a\x02\x10\x04\n\x0c\n\x06micros\x1a\x02\x10\x04\x12$4d3f6e8f-7412-4ad7-bfd9-b424a1664aef\n\x0e\n\x08f_string\x1a\x02\x10\x07\n\x0b\n\x05f_int\x1a\x02\x10\x04\x12$33dafd37-397c-4083-a84e-42177d122221"
    examples:
    "\x03\x00\x02\x00\xb6\x95\xd5\xf9\x05\xc0\xc4\x07\x1b2020-08-13T14:14:14.123456Z\xc0\xf7\x85\xda\xae\x98\xeb\x02": {f_timestamp: {seconds: 1597328054, micros: 123456}, f_string: "2020-08-13T14:14:14.123456Z", f_int: 1597328054123456}
    You might also need to skip this case in Go if the "logical" filter doesn't catch it:
    var filteredCases = []struct{ filter, reason string }{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the instructions! Will complete the proto change. beam:logical_type:datetime:v1 is the millisecond precision timestamp backed by a fixed length INT64 that encoded with BigEndianLong. Will complete the proto change.

can we do the byte flipping in a to_language_type and to_representation_type implementation?

Considered this before. Under the current framework the value is already decoded with VarInt coder before sent to to_language_type or to_representation_type which is incorrect. All I need is a 8-byte fixed long integer primitive which does not exist in portable primitives, and there is not yet a consensus on reference coder (context). Therefore I ended up this solution to make things work while not introducing new primitives or other fundamental changes.

Copy link
Contributor Author

@Abacn Abacn Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah... test case in standard_coders.yaml won't work either because Both micros_instant and millis_instant has Timestamp language type and the former will take over the priority of millis_instant (by design). When a timestamp is decoded to an integer, the test will call micros_instant's to_language_type.
i.e. MillisInstant->Timestamp conversion is one direction and Timestamp <-> MicrosInstant is bidirectional.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'd really like to have this tested at the standard_coders.yaml level as a unit test, it's not ideal if the only thing verifying compatibility is an integration test in Python.

If you have time tomorrow maybe we could do a quick video call to live debug what's going wrong here, and see if we can work around it. If it will be painful we can leave it as future work, but I'd like to understand the level of effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to get a test case using the following code snippet

def generate_millis():
    # Logical type that deals with millis_instant urn (MillisInstant)
    MillisLogicalType = LogicalType._known_logical_types.get_logical_type_by_urn('beam:logical_type:millis_instant:v1')
    # Original Logical type used to represent Timestamp (MicrosInstant)
    TimestampLogicalType = LogicalType._known_logical_types.get_logical_type_by_language_type(Timestamp)
    
    LogicalType._known_logical_types.by_language_type[Timestamp] = MillisLogicalType

    schema = beam.typehints.schemas.named_tuple_to_schema(TestTuple)
    coder = beam.coders.row_coder.RowCoder(schema)
    print("payload = %s" % schema.SerializeToString())
    examples = (TestTuple(
        f_timestamp=Timestamp.from_rfc3339("2020-08-13T14:14:14.123Z"),
        f_string="2020-08-13T14:14:14.123Z",
        f_int=1597328054123),)
    for example in examples:
        print("example = %s" % coder.encode(example))
    
    # recover original registration
    LogicalType._known_logical_types.by_language_type[Timestamp] = TimestampLogicalType

The workaround is temporarily change the mapping of Timestamp -> MillisInstant logical type. Without it Timestamp always maps to MicrosInstant logical type in Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh ok, so it looks like the issue was using Python to generate the test values. Another strategy might be to create the schema proto directly. But this way works too and is probably less code, thanks!

@github-actions github-actions bot added the model label Aug 23, 2022
@Abacn Abacn force-pushed the portableinstantcoder branch from d0498fa to df31d95 Compare August 23, 2022 19:11
sdks/python/apache_beam/typehints/schemas.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/typehints/schemas.py Outdated Show resolved Hide resolved
# Special case for datetime logical type.
# DATETIME_URN explicitly uses TimestampCoder which deals with fix length
# 8-bytes big-endian-long instead of varint coder.
return TimestampCoder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'd really like to have this tested at the standard_coders.yaml level as a unit test, it's not ideal if the only thing verifying compatibility is an integration test in Python.

If you have time tomorrow maybe we could do a quick video call to live debug what's going wrong here, and see if we can work around it. If it will be painful we can leave it as future work, but I'd like to understand the level of effort.

@Abacn
Copy link
Contributor Author

Abacn commented Aug 25, 2022

Run Python_PVR_Flink PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 25, 2022

Run Java PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 25, 2022

Run Python 3.8 PostCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 25, 2022

Run Java PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 26, 2022

postcommit test apache_beam.io.external.xlang_jdbcio_it_test.CrossLanguageJdbcIOTest.test_xlang_jdbc_read Failed

Apparently the branch itself works but now failing when merge onto master. Investigating.

Update: related to the change of #22679. After it tests fail.

Abacn and others added 6 commits August 26, 2022 16:35
* Support beam:logical_type:datetime:v1 in python sdk

* Support micro_instant logical type in java jdbc

* Fix test timestamp milli precision support in mysql
@Abacn Abacn force-pushed the portableinstantcoder branch from a6f94e8 to b0484e7 Compare August 26, 2022 20:36
@Abacn
Copy link
Contributor Author

Abacn commented Aug 26, 2022

Rebased onto latest master; the last commit is a fix of test

@Abacn
Copy link
Contributor Author

Abacn commented Aug 26, 2022

Run Python 3.8 PostCommit

@@ -587,6 +588,7 @@ def to_language_type(value):
def register_logical_type(cls, logical_type_cls):
"""Register an implementation of LogicalType."""
cls._known_logical_types.add(logical_type_cls.urn(), logical_type_cls)
return logical_type_cls
Copy link
Contributor Author

@Abacn Abacn Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These places were some minor bugs. If not return logical_type_cls, class decorated with @register_logical_type cannot be imported (will be None) and docstring is also missing from pydoc; signatures of to_representation_type and to_language_type should have a value parameter as its child classes. Fixed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@Abacn Abacn requested a review from TheNeuralBit August 26, 2022 20:42
@Abacn
Copy link
Contributor Author

Abacn commented Aug 27, 2022

Run Java PreCommit

@Abacn
Copy link
Contributor Author

Abacn commented Aug 27, 2022

Run Python PreCommit

1 similar comment
@Abacn
Copy link
Contributor Author

Abacn commented Aug 29, 2022

Run Python PreCommit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants