Reconcile is an automated tool designed to streamline the reconciliation process between source data and target data residing on Databricks. Currently, only Snowflake, Oracle and other Databricks tables are supported as data sources. The tool helps users efficiently identify discrepancies and variations when comparing the source with the Databricks target.
- Types of Report Supported
- Report Type-Flow Chart
- Supported Source System
- TABLE Config JSON filename
- TABLE Config Elements
- Key Considerations for Oracle JDBC Reader Options
- Reconciliation Example
- DataFlow Example
- Aggregates Reconcile
report type | sample visualisation | description | key outputs captured in the recon metrics tables |
---|---|---|---|
schema | schema | reconciles the schema of source and target - validates that the datatypes are the same or compatible | - schema_comparison - schema_difference |
row | row | reconciles the data only at row level (the hash value of the source row is matched with the hash value of the target row). Preferred when no join columns can be identified between source and target. | - missing_in_src (sample rows that are available in target but missing in source + sample rows in the target that don't match the source) - missing_in_tgt (sample rows that are available in source but missing in target + sample rows in the source that don't match the target) NOTE: this report type does not differentiate between mismatched and missing rows. |
data | data | reconciles the data at row and column level - join_columns help identify mismatches at each row and column level | - mismatch_data (sample data with mismatches captured at each column and row level) - missing_in_src (sample rows that are available in target but missing in source) - missing_in_tgt (sample rows that are available in source but missing in target) - threshold_mismatch (configured columns are reconciled based on a percentile, threshold boundary or date boundary) - mismatch_columns (consolidated list of columns that have mismatches) |
all | all | a combination of data + schema | - data + schema outputs |
flowchart TD
REPORT_TYPE --> DATA
REPORT_TYPE --> SCHEMA
REPORT_TYPE --> ROW
REPORT_TYPE --> ALL
flowchart TD
SCHEMA --> SCHEMA_VALIDATION
flowchart TD
ROW --> MISSING_IN_SRC
ROW --> MISSING_IN_TGT
flowchart TD
DATA --> MISMATCH_ROWS
DATA --> MISSING_IN_SRC
DATA --> MISSING_IN_TGT
flowchart TD
ALL --> MISMATCH_ROWS
ALL --> MISSING_IN_SRC
ALL --> MISSING_IN_TGT
ALL --> SCHEMA_VALIDATION
Source | Schema | Row | Data | All |
---|---|---|---|---|
Oracle | Yes | Yes | Yes | Yes |
Snowflake | Yes | Yes | Yes | Yes |
Databricks | Yes | Yes | Yes | Yes |
The config file must be named as `recon_config_<DATA_SOURCE>_<SOURCE_CATALOG_OR_SCHEMA>_<REPORT_TYPE>.json` and should be placed in the remorph root directory `.remorph` within the Databricks Workspace. The filename pattern remains the same for all data sources. Please find Table Recon filename examples below for the Snowflake, Oracle, and Databricks source systems.
Data Source | Reconcile Config | Table Recon filename |
---|---|---|
Snowflake | database_config: {source_catalog: sample_data, source_schema: default, ...}; metadata_config: {...}; data_source: snowflake; report_type: all | recon_config_snowflake_sample_data_all.json |
Oracle | database_config: {source_schema: orc, ...}; metadata_config: {...}; data_source: oracle; report_type: data | recon_config_oracle_orc_data.json |
Databricks (Hive MetaStore) | database_config: {source_schema: hms, ...}; metadata_config: {...}; data_source: databricks; report_type: schema | recon_config_databricks_hms_schema.json |
Note: the filename must be created with the same case as `<SOURCE_CATALOG_OR_SCHEMA>` is defined in. For example, if the source schema is defined as `ORC` in the reconcile config, the filename should be `recon_config_oracle_ORC_data.json`.
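As a quick illustration of the naming rule only (this helper is not part of the tool), the expected filename can be derived from the reconcile config values like this:

```python
# Illustrative helper (not part of Reconcile): build the expected Table Recon filename.
def recon_config_filename(data_source: str, source_catalog_or_schema: str, report_type: str) -> str:
    # <SOURCE_CATALOG_OR_SCHEMA> must keep the same case as in the reconcile config.
    return f"recon_config_{data_source}_{source_catalog_or_schema}_{report_type}.json"

print(recon_config_filename("snowflake", "sample_data", "all"))  # recon_config_snowflake_sample_data_all.json
print(recon_config_filename("oracle", "ORC", "data"))            # recon_config_oracle_ORC_data.json
```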
Python | JSON |
---|---|
@dataclass
class Table:
source_name: str
target_name: str
aggregates: list[Aggregate] | None = None
join_columns: list[str] | None = None
jdbc_reader_options: JdbcReaderOptions | None = None
select_columns: list[str] | None = None
drop_columns: list[str] | None = None
column_mapping: list[ColumnMapping] | None = None
transformations: list[Transformation] | None = None
column_thresholds: list[ColumnThresholds] | None = None
filters: Filters | None = None
table_thresholds: list[TableThresholds] | None = None |
{
"source_name": "<SOURCE_NAME>",
"target_name": "<TARGET_NAME>",
"aggregates": null,
"join_columns": ["<COLUMN_NAME_1>","<COLUMN_NAME_2>"],
"jdbc_reader_options": null,
"select_columns": null,
"drop_columns": null,
"column_mapping": null,
"transformation": null,
"column_thresholds": null,
"filters": null,
"table_thresholds": null
} |
config_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
source_name | string | name of the source table | required | product |
target_name | string | name of the target table | required | product |
aggregates | list[Aggregate] | list of aggregates, refer Aggregate for more information | optional(default=None) | "aggregates": [{"type": "MAX", "agg_columns": ["<COLUMN_NAME_4>"]}], |
join_columns | list[string] | list of column names which act as the primary key to the table | optional(default=None) | ["product_id"] or ["product_id", "order_id"] |
jdbc_reader_options | JdbcReaderOptions | jdbc reader options, which help to parallelise the data read from JDBC sources based on the given configuration. For more info, see jdbc_reader_options | optional(default=None) | "jdbc_reader_options": {"number_partitions": 10,"partition_column": "s_suppkey","upper_bound": "10000000","lower_bound": "10","fetch_size":"100"}
select_columns | list[string] | list of columns to be considered for the reconciliation process | optional(default=None) | ["id", "name", "address"] |
drop_columns | list[string] | list of columns to be eliminated from the reconciliation process | optional(default=None) | ["comment"] |
column_mapping | list[ColumnMapping] | list of column_mapping entries that help in resolving column name mismatches between src and tgt, e.g., "id" in src and "emp_id" in tgt. For more info, see column_mapping | optional(default=None) | "column_mapping": [{"source_name": "id","target_name": "emp_id"}]
transformations | list[Transformation] | list of user-defined transformations that can be applied to src and tgt columns in case of incompatible data types or when an explicit transformation was applied during migration. For more info, see transformations | optional(default=None) | "transformations": [{"column_name": "s_address","source": "trim(s_address)","target": "trim(s_address)"}]
column_thresholds | list[ColumnThresholds] | list of threshold conditions that can be applied on columns to absorb minor exceptions in the data. It supports percentile, absolute, and date fields. For more info, see column_thresholds | optional(default=None) | "column_thresholds": [{"column_name": "sal", "lower_bound": "-5%", "upper_bound": "5%", "type": "int"}]
table_thresholds | list[TableThresholds] | list of table threshold conditions that can be applied on tables to absorb minor exceptions in the mismatch count. It supports percentile and absolute values. For more info, see table_thresholds | optional(default=None) | "table_thresholds": [{"lower_bound": "0%", "upper_bound": "5%", "model": "mismatch"}]
filters | Filters | filter expressions that can be used to filter the data on src and tgt based on the respective expressions | optional(default=None) | "filters": {"source": "lower(dept_name)>'it'", "target": "lower(department_name)>'it'"}
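To show how these elements fit together, here is a hedged sketch of a single table entry; the table and column names are hypothetical and only a subset of the fields above is used:

```python
import json

# A hedged, end-to-end sketch of one table entry combining the elements above.
# Table and column names are hypothetical; the structure follows the JSON schema in this section.
product_table_config = {
    "source_name": "product",
    "target_name": "product",
    "join_columns": ["product_id"],                      # source column names acting as the key
    "drop_columns": ["comment"],                         # excluded from reconciliation
    "column_mapping": [                                  # resolve a source/target name mismatch
        {"source_name": "prod_desc", "target_name": "product_description"}
    ],
    "transformations": [                                 # dialect SQL applied verbatim on each side
        {"column_name": "prod_desc", "source": "trim(prod_desc)", "target": "trim(product_description)"}
    ],
    "column_thresholds": [                               # tolerate small numeric drift
        {"column_name": "product_discount", "lower_bound": "-5%", "upper_bound": "5%", "type": "int"}
    ],
    "filters": {"source": "lower(category)='electronics'", "target": "lower(category)='electronics'"},
}

print(json.dumps(product_table_config, indent=2))
```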
Python | JSON |
---|---|
@dataclass
class JdbcReaderOptions:
number_partitions: int
partition_column: str
lower_bound: str
upper_bound: str
fetch_size: int = 100 |
"jdbc_reader_options":{
"number_partitions": "<NUMBER_PARTITIONS>",
"partition_column": "<PARTITION_COLUMN>",
"lower_bound": "<LOWER_BOUND>",
"upper_bound": "<UPPER_BOUND>",
"fetch_size": "<FETCH_SIZE>"
} |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
number_partitions | string | the number of partitions for reading input data in parallel | required | "200" |
partition_column | string | an int/date/timestamp column used for partitioning, typically the primary key of the source table. Note that this parameter accepts only one column, which is especially crucial when dealing with a composite primary key; in such cases, provide the column with the higher cardinality. | required | "employee_id"
upper_bound | string | an integer, date, or timestamp (without time zone) value, as a string, that should be set appropriately (usually the maximum value in the case of non-skewed data) so that the data read from the source is approximately equally distributed | required | "100000"
lower_bound | string | an integer, date, or timestamp (without time zone) value, as a string, that should be set appropriately (usually the minimum value in the case of non-skewed data) so that the data read from the source is approximately equally distributed | required | "1"
fetch_size | string | This parameter influences the number of rows fetched per round-trip between Spark and the JDBC database, optimising data retrieval performance. Adjusting this option significantly impacts the efficiency of data extraction, controlling the volume of data retrieved in each fetch operation. More details on configuring fetch size can be found here | optional(default="100") | "10000" |
For Oracle source, the following options are automatically set:
- "oracle.jdbc.mapDateToTimestamp": "False",
- "sessionInitStatement": "BEGIN dbms_session.set_nls('nls_date_format', '''YYYY-MM-DD''');dbms_session.set_nls('nls_timestamp_format', '''YYYY-MM-DD HH24:MI:SS''');END;"
Take the above options into consideration while configuring Recon for an Oracle source.
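To illustrate what these options translate to, here is a hedged PySpark sketch of the kind of parallel JDBC read they correspond to for an Oracle source; the connection details and table are placeholders, the bounds mirror the jdbc_reader_options example above, and the option names are the standard Spark JDBC options plus the Oracle-specific settings just listed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: roughly the parallel JDBC read that jdbc_reader_options configures.
# Host, credentials, and table are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//<HOST>:1521/<SERVICE_NAME>")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "supplier")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    # Parallel read: Spark splits [lowerBound, upperBound] on partitionColumn into numPartitions ranges.
    .option("partitionColumn", "s_suppkey")
    .option("lowerBound", "10")
    .option("upperBound", "10000000")
    .option("numPartitions", 10)
    .option("fetchsize", 100)
    # Options that are set automatically for Oracle sources (see above).
    .option("oracle.jdbc.mapDateToTimestamp", "False")
    .option(
        "sessionInitStatement",
        "BEGIN dbms_session.set_nls('nls_date_format', '''YYYY-MM-DD''');"
        "dbms_session.set_nls('nls_timestamp_format', '''YYYY-MM-DD HH24:MI:SS''');END;",
    )
    .load()
)
```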
Python | JSON |
---|---|
@dataclass
class ColumnMapping:
source_name: str
target_name: str |
"column_mapping":[
{
"source_name": "<SOURCE_COLUMN_NAME>",
"target_name": "<TARGET_COLUMN_NAME>"
}
] |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
source_name | string | source column name | required | "dept_id" |
target_name | string | target column name | required | "department_id" |
Python | JSON |
---|---|
@dataclass
class Transformation:
column_name: str
source: str
target: str | None = None |
"transformations":[
{
"column_name": "<COLUMN_NAME>",
"source": "<TRANSFORMATION_EXPRESSION>",
"target": "<TRANSFORMATION_EXPRESSION>"
}
] |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
column_name | string | the column name on which the transformation is to be applied | required | "s_address"
source | string | the transformation SQL expression to be applied to the source column | required | "trim(s_address)" or "s_address"
target | string | the transformation SQL expression to be applied to the target column | required | "trim(s_address)" or "s_address"
Note: Reconciliation also accepts a UDF in the transformation expression. For example, if we have a UDF named sort_array_input() that takes an unsorted array as input and returns a sorted array, we can use it in a transformation as below:
transformations=[Transformation(column_name="array_col", source="sort_array_input(array_col)", target="sort_array_input(array_col)")]
Note: `NULL` values are defaulted to `_null_recon_` using the transformation expressions in these files: 1. expression_generator.py 2. sampling_query.py. If the user is looking for any specific behaviour, they can override these rules using transformations accordingly.
Transformation Expressions
filename | function / variable | transformation_rule | description |
---|---|---|---|
sampling_query.py | _get_join_clause | transform(coalesce, default="_null_recon_", is_string=True) | Applies the coalesce transformation function for String columns and defaults to `_null_recon_` if the column is NULL
expression_generator.py | DataType_transform_mapping | (coalesce, default='_null_recon_', is_string=True) | Default String column transformation rule for all dialects. Applies the coalesce transformation function and defaults to `_null_recon_` if the column is NULL
expression_generator.py | DataType_transform_mapping | "oracle": DataType...NCHAR: ..."NVL(TRIM(TO_CHAR..,'_null_recon_')" | Transformation rule for the Oracle dialect NCHAR datatype. Applies the TO_CHAR and TRIM transformation functions; if the column is NULL, defaults to `_null_recon_`
expression_generator.py | DataType_transform_mapping | "oracle": DataType...NVARCHAR: ..."NVL(TRIM(TO_CHAR..,'_null_recon_')" | Transformation rule for the Oracle dialect NVARCHAR datatype. Applies the TO_CHAR and TRIM transformation functions; if the column is NULL, defaults to `_null_recon_`
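If the default `_null_recon_` substitution does not fit a particular column, one hedged way to override it, per the note above, is to supply an explicit transformation for that column; the table and column names here are hypothetical:

```python
# Hedged sketch: override the default NULL handling for one column by supplying
# explicit coalesce expressions on both sides (table and column names are hypothetical).
table_config = {
    "source_name": "employee",
    "target_name": "employee",
    "join_columns": ["emp_id"],
    "transformations": [
        {
            "column_name": "middle_name",
            "source": "coalesce(trim(middle_name), 'n/a')",
            "target": "coalesce(trim(middle_name), 'n/a')",
        }
    ],
}
```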
Python | JSON |
---|---|
@dataclass
class ColumnThresholds:
column_name: str
lower_bound: str
upper_bound: str
type: str |
"column_thresholds":[
{
"column_name": "<COLUMN_NAME>",
"lower_bound": "<LOWER_BOUND>",
"upper_bound": "<UPPER_BOUND>",
"type": "<DATA_TYPE>"
}
] |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
column_name | string | the column that should be considered for column threshold reconciliation | required | "product_discount" |
lower_bound | string | the lower bound of the difference between the source value and the target value | required | -5% |
upper_bound | string | the upper bound of the difference between the source value and the target value | required | 5% |
type | string | The user must specify the column type. Supports SQLGLOT DataType.NUMERIC_TYPES and DataType.TEMPORAL_TYPES. | required | int |
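To make the bounds concrete, the check below is an illustrative reading of a percentage column threshold (not the reconciler's actual implementation): the relative difference between the source and target values must fall inside [lower_bound, upper_bound]:

```python
# Illustrative only: how a "-5%" / "5%" column threshold on a numeric column can be read.
def within_percentage_threshold(source_value: float, target_value: float,
                                lower_pct: float = -5.0, upper_pct: float = 5.0) -> bool:
    if source_value == 0:
        return source_value == target_value
    diff_pct = (target_value - source_value) / abs(source_value) * 100
    return lower_pct <= diff_pct <= upper_pct

print(within_percentage_threshold(100, 103))  # True: a 3% difference is within [-5%, 5%]
print(within_percentage_threshold(100, 110))  # False: a 10% difference exceeds the upper bound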
Python | JSON |
---|---|
@dataclass
class TableThresholds:
lower_bound: str
upper_bound: str
model: str |
"table_thresholds":[
{
"lower_bound": "<LOWER_BOUND>",
"upper_bound": "<UPPER_BOUND>",
"model": "<MODEL>"
}
] |
- The threshold bounds for the table must be non-negative, with the lower bound not exceeding the upper bound.
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
lower_bound | string | the lower bound of the difference between the source mismatch and the target mismatch count | required | 0% |
upper_bound | string | the upper bound of the difference between the source mismatch and the target mismatch count | required | 5% |
model | string | The user must specify on which table model it should be applied; for now, we support only "mismatch" | required | mismatch
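For reference, here are two hedged table_thresholds examples: the first expresses the bounds as percentages (as in the JSON sample above), the second as absolute counts, which the description above says are also supported; the absolute format (plain integer strings) is an assumption here:

```python
# Hedged examples of table-level thresholds on the mismatch count; model only accepts "mismatch" for now.
table_thresholds_pct = [{"lower_bound": "0%", "upper_bound": "5%", "model": "mismatch"}]
# Absolute bounds (assumed to be plain integer strings, since absolute values are supported):
table_thresholds_abs = [{"lower_bound": "0", "upper_bound": "100", "model": "mismatch"}]
```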
Python | JSON |
---|---|
@dataclass
class Filters:
source: str | None = None
target: str | None = None |
"filters":{
"source": "<FILTER_EXPRESSION>",
"target": "<FILTER_EXPRESSION>"
} |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
source | string | the sql expression to filter the data from source | optional(default=None) | "lower(dept_name)='finance'" |
target | string | the sql expression to filter the data from target | optional(default=None) | "lower(dept_name)='finance'" |
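Since filter expressions are passed through verbatim, each side typically states the same predicate in its own dialect. A hedged sketch with hypothetical column names, using a Snowflake-style expression on the source and Databricks SQL on the target:

```python
# Hedged sketch: equivalent filters written in each system's own SQL dialect.
# Column names are hypothetical; expressions are applied as-is on source and target.
filters = {
    # Snowflake source: TO_VARCHAR formats the date before comparing.
    "source": "lower(dept_name) = 'finance' and to_varchar(created_at, 'YYYY-MM-DD') >= '2024-01-01'",
    # Databricks target: date_format is the equivalent formatting function.
    "target": "lower(department_name) = 'finance' and date_format(created_at, 'yyyy-MM-dd') >= '2024-01-01'",
}
```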
- The column names are always converted to lowercase and considered for reconciliation.
- Currently, it doesn't support case insensitivity and doesn't have collation support.
- If a Transformation is defined for a column but its source or target expression is not provided, the raw column value is used as-is; no default transformation is applied on top. E.g. `Transformation(column_name="address", source=None, target="trim(s_address)")`: here the source transformation is None, so the raw source value is considered for reconciliation.
- If no user transformation is provided for a given column in the configuration, then, depending on the source data type, the reconciler applies a default transformation on both source and target to get matching hash values in source and target. Please find the detailed default transformations here.
- Column references should always be source column names in all the configs, except for Transformations and Filters, as these are dialect-specific SQL expressions that are applied directly in the SQL.
- Transformations and Filters should always be written as SQL expressions in their respective dialects; the reconciler will not apply any logic on top of them.
- Download `ojdbc8.jar` from Oracle: Visit the official Oracle website to acquire the `ojdbc8.jar` JAR file. This file is crucial for establishing connectivity between Databricks and Oracle databases.
- Install the JAR file on Databricks: Upon completing the download, install the JAR file onto your Databricks cluster. Refer to this page for comprehensive instructions on uploading a JAR file, Python egg, or Python wheel to your Databricks workspace.
- Install the ojdbc8 library from Maven: Follow this guide to install the Maven library on a cluster. Refer to this document for obtaining the Maven coordinates.

This installation is a necessary step to enable seamless comparison between Oracle and Databricks, ensuring that the required Oracle JDBC functionality is readily available within the Databricks environment.
source_type | data_type | source_transformation | target_transformation | source_value_example | target_value_example | comments |
---|---|---|---|---|---|---|
Oracle | number(10,5) | trim(to_char(coalesce(<col_name>,0.0), '99990.99999')) | cast(coalesce(<col_name>,0.0) as decimal(10,5)) | 1.00 | 1.00000 | this can be used for any precision and scale by adjusting the transformation accordingly |
Snowflake | array | array_to_string(array_compact(<col_name>),',') | concat_ws(',', <col_name>) | [1,undefined,2] | [1,2] | in case of removing "undefined" during migration (converts a sparse array to a dense array) |
Snowflake | array | array_to_string(array_sort(array_compact(<col_name>), true, true),',') | concat_ws(',', <col_name>) | [2,undefined,1] | [1,2] | in case of removing "undefined" during migration and sorting the array |
Snowflake | timestamp_ntz | date_part(epoch_second,<col_name>) | unix_timestamp(<col_name>) | 2020-01-01 00:00:00.000 | 2020-01-01 00:00:00.000 | convert timestamp_ntz to epoch to get a match between Snowflake and Databricks |
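For instance, the Snowflake sparse-array example above could be wired into a table config like this hedged sketch (the table name and the `tags` column are hypothetical):

```python
# Hedged sketch: applying the Snowflake sparse-array transformation from the table above.
order_table_config = {
    "source_name": "orders",
    "target_name": "orders",
    "join_columns": ["order_id"],
    "transformations": [
        {
            "column_name": "tags",
            # Snowflake side: drop 'undefined' entries, sort, and flatten to a string.
            "source": "array_to_string(array_sort(array_compact(tags), true, true), ',')",
            # Databricks side: the array is already dense, so just flatten it.
            "target": "concat_ws(',', tags)",
        }
    ],
}
```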
For more reconciliation config examples, please refer to the sample config.
Report Types Data Visualisation
Aggregates Reconcile is a utility that streamlines the reconciliation process by comparing specific aggregate metrics between the source data and the target data residing on Databricks.
operation_name | sample visualisation | description | key outputs captured in the recon metrics tables |
---|---|---|---|
aggregates-reconcile | data | reconciles the data for each aggregate metric - join_columns are used to identify mismatches at the aggregated metric level | - mismatch_data (sample data with mismatches captured at the aggregated metric level) - missing_in_src (sample rows that are available in target but missing in source) - missing_in_tgt (sample rows that are available in source but missing in target) |
Aggregate Functions |
---|
min |
max |
count |
sum |
avg |
mean |
mode |
stddev |
variance |
median |
[back to aggregates-reconciliation]
flowchart TD
Aggregates-Reconcile --> MISMATCH_ROWS
Aggregates-Reconcile --> MISSING_IN_SRC
Aggregates-Reconcile --> MISSING_IN_TGT
[back to aggregates-reconciliation]
Python | JSON |
---|---|
@dataclass
class Aggregate:
agg_columns: list[str]
type: str
group_by_columns: list[str] | None = None |
{
"type": "MIN",
"agg_columns": ["<COLUMN_NAME_3>"],
"group_by_columns": ["<GROUP_COLUMN_NAME>"]
} |
field_name | data_type | description | required/optional | example_value |
---|---|---|---|---|
type | string | Supported Aggregate Functions | required | MIN |
agg_columns | list[string] | list of column names on which the aggregate function needs to be applied | required | ["product_discount"]
group_by_columns | list[string] | list of column names on which grouping needs to be applied | optional(default=None) | ["product_id"] or None |
[back to aggregates-reconciliation]
Please refer TABLE Config Elements for Class and JSON configs.
Python | JSON |
---|---|
Table(
source_name= "<SOURCE_NAME>",
target_name= "<TARGET_NAME>",
join_columns= ["<COLUMN_NAME_1>", "<COLUMN_NAME_2>"]
aggregates= [
Aggregate(
agg_columns=["<COLUMN_NAME_3>"],
type= "MIN",
group_by_columns= ["<GROUP_COLUMN_NAME>"]
),
Aggregate(
agg_columns=["<COLUMN_NAME_4>"],
type= "max"
)
]
) |
{
"source_name": "<SOURCE_NAME>",
"target_name": "<TARGET_NAME>",
"join_columns": ["<COLUMN_NAME_1>","<COLUMN_NAME_2>"],
"aggregates": [{
"type": "MIN",
"agg_columns": ["<COLUMN_NAME_3>"],
"group_by_columns": ["<GROUP_COLUMN_NAME>"]
},
{
"type": "MAX",
"agg_columns": ["<COLUMN_NAME_4>"],
}],
} |
- The aggregate column names, group by columns and type are always converted to lowercase and considered for reconciliation.
- Currently, it doesn't support aggregates on window functions using the OVER clause.
- It doesn't support case insensitivity and does not have collation support.
- Queries with "group by" column(s) are compared based on the same group by columns (see the illustrative sketch after this list).
- Queries without "group by" column(s) are compared row-to-row.
- Existing features like `column_mapping`, `transformations`, `JdbcReaderOptions` and `filters` are leveraged for the aggregate metric reconciliation.
- Existing `select_columns` and `drop_columns` are not considered for the aggregate metric reconciliation; even if the user provides them, they are ignored.
- If transformations are defined, they are applied to both the "aggregate columns" and the "group by columns".
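Conceptually, each configured aggregate is compared by running the equivalent aggregate query on both sides and matching the results on the group-by columns. The helper below is an illustrative sketch of that idea, not the reconciler's actual query builder:

```python
# Illustrative only: the shape of the aggregate query run on each side for
# Aggregate(type="MIN", agg_columns=["discount"], group_by_columns=["product_id"]).
def build_aggregate_query(table: str, agg_type: str, agg_columns: list[str],
                          group_by_columns: list[str] | None = None) -> str:
    agg_exprs = ", ".join(f"{agg_type}({c}) AS {agg_type.lower()}_{c}" for c in agg_columns)
    if group_by_columns:
        group_cols = ", ".join(group_by_columns)
        return f"SELECT {group_cols}, {agg_exprs} FROM {table} GROUP BY {group_cols}"
    return f"SELECT {agg_exprs} FROM {table}"

# Both sides produce one row per group; rows are then matched on the group-by columns.
print(build_aggregate_query("orders", "MIN", ["discount"], ["product_id"]))
# SELECT product_id, MIN(discount) AS min_discount FROM orders GROUP BY product_id
```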
[back to aggregates-reconciliation]
Please refer to this sample config for a detailed example.
[back to aggregates-reconciliation]
Aggregates Reconcile Data Visualisation