cobrix with RDD as input #576

Closed
saikumare-a opened this issue Jan 27, 2023 · 7 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@saikumare-a

spark.read.csv accepts either a file path or an RDD as input. Does Cobrix support the same?

We are getting some additional rows. Using an RDD, we are able to remove the extra rows before parsing. If RDDs could be used with Cobrix, the cleanup and the parsing could happen in a single step, without extra processing.

I would like to know whether Cobrix supports RDDs as input.
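
For reference, the CSV pattern mentioned above might look like the following in Scala. This is a minimal sketch, not code from the thread: the sample rows and the filter rule are made up for illustration, and it assumes Spark 2.2 or later, where the Scala `DataFrameReader.csv` accepts a `Dataset[String]` (the PySpark reader accepts an RDD of strings directly).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvFromRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-from-rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides the Encoder[String] used by createDataset

    // Hypothetical input: two valid rows plus one unwanted row.
    val raw: RDD[String] = spark.sparkContext.parallelize(Seq(
      "1,alice",
      "### trailer ###",
      "2,bob"
    ))

    // Drop the unwanted rows before parsing (this filter rule is an assumption).
    val cleaned: RDD[String] = raw.filter(line => !line.startsWith("#"))

    // The Scala reader takes a Dataset[String], so the RDD is converted first.
    val df: DataFrame = spark.read.csv(spark.createDataset(cleaned))
    df.show()

    spark.stop()
  }
}
```

The point of the pattern is that filtering and parsing happen in one pipeline, with no intermediate files written between the two steps.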

@saikumare-a saikumare-a added the question Further information is requested label Jan 27, 2023
@yruslan
Collaborator

yruslan commented Jan 27, 2023

I'm not sure. Cobrix just implements the data source interfaces. You can try it and let me know.
If it is not supported, it could be a good feature to implement in the future.

There is a longer path. Take a look at https://github.com/AbsaOSS/cobrix#reading-ascii-text-file
(working example 3)

@yruslan
Collaborator

yruslan commented Feb 2, 2023

Loading CSV files this way is a Spark feature specific to CSV, but it makes perfect sense to have similar functionality in Cobrix.

It could look like

val rdd: RDD[Array[Byte]] = ...
val df = Cobrix.fromRDD(rdd)
   .option("something", "something")
   .load()

I'm thinking of supporting RDD[String] and RDD[Array[Byte]], where each RDD entry is a record.

This feature could be a nice alternative to custom record extractors.

Does it make sense for your use case?
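
To illustrate the "each RDD entry is a record" idea, here is a minimal sketch of building a cleaned `RDD[Array[Byte]]` with plain Spark. It assumes fixed-length records; the 4-byte record length, the sample bytes, and the "all zeros means filler" rule are hypothetical. It stops short of calling Cobrix, since `Cobrix.fromRDD` above is only a proposal in this thread.

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RecordRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("record-rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical data: three 4-byte records; the middle one is filler (all zero bytes).
    val bytes = Array[Byte](1, 2, 3, 4, 0, 0, 0, 0, 5, 6, 7, 8)
    val path = Files.createTempFile("records", ".dat")
    Files.write(path, bytes)

    // SparkContext.binaryRecords splits a flat binary file into fixed-length records,
    // so each RDD entry is exactly one record.
    val records: RDD[Array[Byte]] =
      spark.sparkContext.binaryRecords(path.toUri.toString, 4)

    // Remove the unwanted records before any parsing (the rule is an assumption).
    val cleaned: RDD[Array[Byte]] = records.filter(rec => rec.exists(_ != 0))

    println(s"Records kept: ${cleaned.count()}")

    spark.stop()
  }
}
```

An RDD shaped like `cleaned` is the kind of input the proposed API would consume, which is why it could serve as an alternative to a custom record extractor.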

@saikumare-a
Author

Yes, this helps.

@yruslan
Collaborator

yruslan commented Feb 2, 2023

I was thinking... this feature is going to be available only from Scala (or in Python via JVM gateway). Still useful?

@saikumare-a
Author

We use only the Python API. If this is available only from Scala, we won't be able to use it.

@yruslan yruslan added the help wanted Extra attention is needed label Feb 2, 2023
@yruslan
Collaborator

yruslan commented Feb 2, 2023

That's unfortunate. Still, this is a very nice feature to have.
Once it is implemented, if you can find a way to expose the functionality from Python, the contribution is welcome. :)
I've marked this issue as 'help wanted' for this matter.

@saikumare-a
Author

Sure, and thank you.
