#251 Fix glob support and divisibility check for large amount of files. #253

yruslan · 2020-02-22T07:31:12Z

No description provided.

bart-at-qqdatafruits · 2020-02-22T16:40:16Z

It took me a while to work out where to find the repository: your snapshot is hosted on the Nexus instance operated by Sonatype.
publish.sbt gave me the lead to Nexus.

-- I could use the `--repository` option of the Toree AddDeps magic to choose a repository other than Maven Central

-- The basic example below, without any extras, fails, possibly because of missing dependencies
%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.4-SNAPSHOT --transitive --repository https://oss.sonatype.org/service/local/repositories/snapshots/content/

Marking za.co.absa.cobrix:spark-cobol_2.11:2.0.4-SNAPSHOT for download
Obtained 10 files

val sparkBuilder = SparkSession.builder().appName("Example")

sparkBuilder = org.apache.spark.sql.SparkSession$Builder@440d53e2

org.apache.spark.sql.SparkSession$Builder@440d53e2

val spark = sparkBuilder.getOrCreate()

spark = org.apache.spark.sql.SparkSession@685c4ab3

org.apache.spark.sql.SparkSession@685c4ab3

`
//import org.apache.spark.sql.functions._

//import org.apache.spark.sql.SparkSession

//spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/SOURCE/COPYBOOK.txt")
  .load("file:///home/jovyan/data/SOURCE/BRAND/initial_transformed/PREFIX*")
  //.withColumn("DPSource", callUDF("get_file_name", input_file_name()))
`

Name: java.lang.ClassNotFoundException
Message: Failed to find data source: za.co.absa.cobrix.spark.cobol.source. Please find packages at http://spark.apache.org/third-party-projects.html
StackTrace: at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 44 elided
Caused by: java.lang.ClassNotFoundException: za.co.absa.cobrix.spark.cobol.source.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
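
The "Caused by" line above names the class Spark could not load. A quick way to tell a classpath problem apart from a problem in the read options is to try loading that class directly before calling `.load(...)`. The following is only a minimal sketch: `DataSourceCheck` and `isOnClasspath` are hypothetical helper names, not part of Cobrix or Spark, and the class loader a notebook kernel uses for added dependencies may differ from the one checked here.

```scala
import scala.util.Try

// Hypothetical helper (not part of Cobrix or Spark).
object DataSourceCheck {
  // True if the named class can be loaded by the current class loader.
  // A ClassNotFoundException is caught by Try and reported as false.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  def main(args: Array[String]): Unit = {
    // Class name copied from the "Caused by" line of the stack trace.
    val cobrixSource = "za.co.absa.cobrix.spark.cobol.source.DefaultSource"
    println(s"$cobrixSource on classpath: ${isOnClasspath(cobrixSource)}")
  }
}
```

If the check reports `false`, the spark-cobol jar (or the jar that contains this class) was not among the downloaded files, which would be consistent with the error above.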

bart-at-qqdatafruits · 2020-02-22T16:42:42Z

It seems that some dependencies may be missing from the snapshot:
the current release (2.0.3) downloads 14 files instead of 10.

%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive --repository https://oss.sonatype.org/service/local/repositories/snapshots/content/

Marking za.co.absa.cobrix:spark-cobol_2.11:2.0.3 for download
Obtained 14 files

lastException: Throwable = null

yruslan · 2020-02-22T17:38:58Z

How do you build your Spark Application? If you use Maven, you can allow snapshot repositories by adding this profile to your pom.xml:

<profiles>
  <profile>
    <id>allow-snapshots</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <repositories>
      <repository>
        <id>snapshots-repo</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
      </repository>
    </repositories>
  </profile>
</profiles>

and then adding this dependency:

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol_2.11</artifactId>
            <version>2.0.4-SNAPSHOT</version>
        </dependency>
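
For an sbt build, an equivalent of the Maven configuration above might look like the sketch below. It assumes a Scala 2.11 project (the exact `scalaVersion` is an assumption), so that `%%` resolves to the `spark-cobol_2.11` artifact; the resolver name is arbitrary, and the URL is the snapshot repository from the Maven profile.

```scala
// build.sbt (sketch; the scalaVersion value is an assumption)
scalaVersion := "2.11.12"

// Same snapshot repository as in the Maven profile
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

// With Scala 2.11, %% expands the artifact name to spark-cobol_2.11
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.0.4-SNAPSHOT"
```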

bart-at-qqdatafruits · 2020-02-22T18:47:13Z

Hi @yruslan, @kriswijnants

In short:

I confirm the solution.

@yruslan I thank you for the solution

@kriswijnants please take note of this

Detailed information:

The repository URL you provided in the Maven configuration allowed me to add the correct dependency in the "Apache Toree - Scala" Jupyter kernel, using its AddDeps magic:

%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.4-SNAPSHOT --transitive --repository https://oss.sonatype.org/content/repositories/snapshots

I coded this in a Jupyter notebook running in a Jupyter Docker container.

This setup is a good simulation of the actual deployment platform, a Databricks cluster.

It allows me to analyse the data format, extraction method, library stability and issues without interfering with the infrastructure and data-platform setup maintained by people like @kriswijnants.

Regards,

Bart Debersaques

yruslan · 2020-02-22T18:59:07Z

Great! This will be released next week.

#251 Fix glob support and divisibility check for large amount of files.

e7803fb

yruslan mentioned this pull request Feb 22, 2020

file wildcards / file globbing unstable #251

Closed

yruslan merged commit eea7fb1 into master Feb 22, 2020

yruslan deleted the bugfix/251-ignore-hidden-files branch February 22, 2020 18:59
