
PIC S9(10)V USAGE COMP-3 is converted to long instead of Decimal(10,0) #678

Closed

pinakigit opened this issue May 17, 2024 · 7 comments

Labels: accepted (Accepted for implementation), enhancement (New feature or request), question (Further information is requested)

Comments

@pinakigit
I have a copybook where a field is defined as PIC S9(10)V USAGE COMP-3. When we read the file with Cobrix, this field is created as Long instead of Decimal(10,0). Is this expected behavior because of a Spark property, or can the field be created as Decimal(10,0)?
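
For context, a minimal reproduction might look like the sketch below. The copybook path, data path, and field name are hypothetical; the commented schema shows the behavior described above (Cobrix maps a scale-less 10-digit COMP-3 field to LongType by default):

// Hypothetical reproduction (paths and names are placeholders).
// Copybook:
//   01 RECORD.
//      05 MY-FIELD PIC S9(10)V USAGE COMP-3.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .load("/path/to/data")

df.printSchema()
// root
//  |-- MY_FIELD: long (nullable = true)   <-- LongType, not decimal(10,0)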

@yruslan (Collaborator) commented May 17, 2024

Yes, Cobrix prefers short, int, and long for numbers that fit those types and have no scale. You can convert them to decimal using a cast: .withColumn("my_field", col("my_field").cast("decimal(10,0)")).
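
Spelled out as a runnable snippet (a sketch; "my_field" is a placeholder column name):

import org.apache.spark.sql.functions.col

// Cast a single integral column to the desired decimal type.
// Replace "my_field" with the actual column name from your copybook.
val dfCast = df.withColumn("my_field", col("my_field").cast("decimal(10,0)"))

dfCast.printSchema()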

@pinakigit (Author)

Sure, we can do that, but we would have to maintain a mapping of which fields to convert for each file, hence the question.

@yruslan (Collaborator) commented May 30, 2024

One way to automate this is to use extended metadata, which adds extra information from the copybook to each field's Spark metadata:

.option("metadata", "extended")

You can view which metadata fields are available like this:

df.schema.fields.foreach { field =>
  val metadata = field.metadata
  println(s"Field: ${field.name}, metadata: ${metadata.json}")
}

// Prints, for example:
// {"originalName":"MY_FIELD","precision":4,"usage":"COMP-3","signed":true,"offset":17,"byte_size":2,"sign_separate":false,"pic":"S9(4)","level":2}

The precision value is the one we need.

You can then apply the casting logic like this:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, IntegerType, LongType, ShortType}

// Cast every integral column that carries copybook precision metadata
// to a decimal of that precision (scale 0); leave other columns as-is.
val columns = df.schema.fields.map { field =>
  val metadata = field.metadata
  if (metadata.contains("precision")) {
    val precision = metadata.getLong("precision").toInt
    if (field.dataType == LongType || field.dataType == IntegerType || field.dataType == ShortType) {
      println(s"Cast ${field.name} to decimal($precision,0)")
      col(field.name).cast(DecimalType(precision, 0)).as(field.name)
    } else {
      col(field.name)
    }
  } else {
    col(field.name)
  }
}

val dfNew = df.select(columns: _*)

dfNew.printSchema

@yruslan (Collaborator) commented Jun 4, 2024

An experimental method, SparkUtils.covertIntegralToDecimal(), was added in 2.7.1. You can use it to post-process the DataFrame returned by Cobrix and map the schema the way you want:

import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val df = spark.read.format("cobol")
  .option("metadata", "extended")
  .load("/path/to/file")

val df2 = SparkUtils.covertIntegralToDecimal(df)

@yruslan (Collaborator) commented Jul 11, 2024

This is something we are going to implement as a new feature.

Something like:

.option("strict_integral_precision", "true")

(not final)

@yruslan (Collaborator) commented Jul 15, 2024

The changes are available in the master branch. @pinakigit, please test them if you can.
The new behavior is enabled with this option:

.option("strict_integral_precision", "true")

@yruslan closed this as completed Jul 17, 2024