#251 Fix glob support and divisibility check for a large number of files. #253

Merged: yruslan merged 1 commit into master from bugfix/251-ignore-hidden-files on Feb 22, 2020

yruslan (Collaborator) commented Feb 22, 2020

No description provided.

bart-at-qqdatafruits commented Feb 22, 2020

  • It took me a while to figure out where to find the repository: your snapshot is hosted on the Nexus instance operated by Sonatype.

  • publish.sbt pointed me to Nexus.

-- The Toree %AddDeps magic accepts a "--repository" option, which lets you choose a repository other than Maven Central.

-- The basic example below, without any extras, fails, possibly because of missing dependencies:
%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.4-SNAPSHOT --transitive --repository https://oss.sonatype.org/service/local/repositories/snapshots/content/

Marking za.co.absa.cobrix:spark-cobol_2.11:2.0.4-SNAPSHOT for download
Obtained 10 files

val sparkBuilder = SparkSession.builder().appName("Example")

sparkBuilder = org.apache.spark.sql.SparkSession$Builder@440d53e2

org.apache.spark.sql.SparkSession$Builder@440d53e2

val spark = sparkBuilder .getOrCreate()

spark = org.apache.spark.sql.SparkSession@685c4ab3

org.apache.spark.sql.SparkSession@685c4ab3

`
//import org.apache.spark.sql.functions._

//import org.apache.spark.sql.SparkSession

//spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/SOURCE/COPYBOOK.txt")
  .load("file:///home/jovyan/data/SOURCE/BRAND/initial_transformed/PREFIX*")
  //.withColumn("DPSource", callUDF("get_file_name", input_file_name()))
`

Name: java.lang.ClassNotFoundException
Message: Failed to find data source: za.co.absa.cobrix.spark.cobol.source. Please find packages at http://spark.apache.org/third-party-projects.html
StackTrace: at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 44 elided
Caused by: java.lang.ClassNotFoundException: za.co.absa.cobrix.spark.cobol.source.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
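
For anyone hitting the same ClassNotFoundException, one quick check is to try loading the data source class directly and see whether it is visible at all. This is a minimal sketch meant for a Scala REPL or notebook cell; `isClassAvailable` is a hypothetical helper and not part of Cobrix or Spark. A notebook kernel may resolve added dependencies through a different class loader, so a `false` result is a hint and not proof.

```scala
// Hypothetical helper (not part of Cobrix or Spark): reports whether a class
// can be loaded through the current thread's context class loader.
def isClassAvailable(className: String): Boolean = {
  // Fall back to the system class loader if no context class loader is set.
  val loader = Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(ClassLoader.getSystemClassLoader)
  try {
    // initialize = false: only locate and load the class, skip static initializers.
    Class.forName(className, false, loader)
    true
  } catch {
    // Missing class, or a class whose own dependencies are missing.
    case _: ClassNotFoundException | _: LinkageError => false
  }
}

// Class name taken from the "Caused by" line of the stack trace above.
val cobrixSource = "za.co.absa.cobrix.spark.cobol.source.DefaultSource"
println(s"$cobrixSource available: ${isClassAvailable(cobrixSource)}")
```

If the check reports that the class is unavailable right after the dependency step, the problem is more likely in dependency resolution than in the read options.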

bart-at-qqdatafruits commented

  • It seems that some dependencies may be missing from the snapshot.
  • The current release (2.0.3) downloads 14 files, whereas the snapshot (2.0.4-SNAPSHOT) downloads only 10.

%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive --repository https://oss.sonatype.org/service/local/repositories/snapshots/content/

Marking za.co.absa.cobrix:spark-cobol_2.11:2.0.3 for download
Obtained 14 files

lastException: Throwable = null

yruslan (Collaborator, Author) commented Feb 22, 2020

How do you build your Spark Application? If you use Maven, you can allow snapshot repositories by adding this profile to your pom.xml:

<profiles>
  <profile>
    <id>allow-snapshots</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <repositories>
      <repository>
        <id>snapshots-repo</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
      </repository>
    </repositories>
  </profile>
</profiles>

and using this dependency:

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol_2.11</artifactId>
            <version>2.0.4-SNAPSHOT</version>
        </dependency>
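
For an sbt build, the equivalent would look roughly like the fragment below. This is a sketch that assumes a `build.sbt` targeting Scala 2.11. The resolver name "Sonatype OSS Snapshots" is an arbitrary label chosen for illustration, while the URL and the artifact coordinates are the ones from the Maven snippets above.

```scala
// build.sbt fragment (assumes an sbt build; the resolver label is arbitrary)
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

// Same coordinates as the Maven dependency above, with the Scala suffix spelled out.
libraryDependencies += "za.co.absa.cobrix" % "spark-cobol_2.11" % "2.0.4-SNAPSHOT"
```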

bart-at-qqdatafruits commented Feb 22, 2020

Hi @yruslan , @kriswijnants

In short:

I confirm the solution.

@yruslan, thank you for the solution.

@kriswijnants please take note of this

Detailed information:

The repository URL you provided in the Maven configuration allowed me to add the correct dependency in the "Apache Toree - Scala" Jupyter kernel, using its %AddDeps magic:
%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.4-SNAPSHOT --transitive --repository https://oss.sonatype.org/content/repositories/snapshots

I wrote the code in a Jupyter notebook running in a Jupyter Docker container.

This setup is a good simulation of the actual deployment platform, a Databricks cluster.

It allows me to analyse the data format, extraction method, library stability and issues without interfering with the infrastructure and data-platform setup maintained by lads like @kriswijnants.

Regards,

Bart Debersaques

yruslan (Collaborator, Author) commented Feb 22, 2020

Great! This will be released next week.

yruslan merged commit eea7fb1 into master on Feb 22, 2020
yruslan deleted the bugfix/251-ignore-hidden-files branch on February 22, 2020 18:59