ROOT to Parquet conversion #6

Open
fedecolombina opened this issue Sep 23, 2021 · 0 comments

Hi,

I'm having trouble converting my ROOT files into Parquet files. I'm using k8s with https://swan.cern.ch and followed this notebook [1] as a guide; basically, I only changed the path to the ROOT files. The conversion seems to work properly, but there appears to be a problem when writing the Parquet file. I get this error:


Py4JJavaError: An error occurred while calling o141.parquet.
: org.apache.spark.SparkException: Job aborted
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
[...]
Caused by: java.lang.StackOverflowError
	at java.util.jar.JarFile.getEntry(JarFile.java:240)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
	at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
[...]

This is the full code:

  import os
  import glob
  
  baseDir_Run2017_UL = '/eos/user/f/fcolombi/root/Run2017_UL'
  fnamesMap = {
      'Z': {
          'Run2017_UL': {
              'Run2017B': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017B/tree*.root')) if 'hadd' not in f],
              'Run2017C': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017C/tree*.root')) if 'hadd' not in f],
              'Run2017D': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017D/tree*.root')) if 'hadd' not in f],
              'Run2017E': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017E/tree*.root')) if 'hadd' not in f],
              'Run2017F': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017F/tree*.root')) if 'hadd' not in f],
              'DY17': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'DY17/tree*.root')) if 'hadd' not in f],
          },
      },
      'JPsi': {
      },
  }
  
  def convert(resonance,era,subEra):
  
      fnames = ['root://eosuser'+f for f in fnamesMap.get(resonance,{}).get(era,{}).get(subEra,[])]
  
      outDir = os.path.join('parquet',resonance,era,subEra)
      outname = os.path.join(outDir,'tnp.parquet')
  
      treename = 'Events'
      
      # process 1000 files at a time
      # this is about the limit that can be handled when writing
      batchsize = 1000
      new = True
      while fnames:
          current = fnames[:batchsize]
          fnames = fnames[batchsize:]
          
          rootfiles = spark.read.format("root").option('tree', treename).load(current)
  
          if new:
              rootfiles.write.parquet(outname)
              new = False
          else:
              rootfiles.write.mode('append').parquet(outname)

  resonance = 'Z'
  era = 'Run2017_UL'
  subEra = 'DY17'
  convert(resonance, era, subEra)
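
For reference, the batching inside convert() boils down to the slicing pattern below, shown in isolation and without Spark. This is only a sketch: `split_into_batches` and the example file names are hypothetical and do not appear in the guide notebook.

```python
def split_into_batches(paths, batchsize=1000):
    """Yield consecutive slices of at most `batchsize` paths."""
    remaining = list(paths)
    while remaining:
        yield remaining[:batchsize]
        remaining = remaining[batchsize:]


# Hypothetical file names, only for illustration.
example = ['tree_%d.root' % i for i in range(2500)]
sizes = [len(batch) for batch in split_into_batches(example)]
print(sizes)  # [1000, 1000, 500]
```

In convert(), the first batch is written with Spark's default save mode, which raises an error if the output path already exists, and every later batch is appended to the same path.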

Before running the conversion, I also execute the following commands:

!wget -N https://repo1.maven.org/maven2/edu/vanderbilt/accre/laurelin/1.0.0/laurelin-1.0.0.jar &&
wget -N https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.13.0/log4j-api-2.13.0.jar &&
wget -N https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.13.0/log4j-core-2.13.0.jar
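
To rule out a failed download, a check along these lines can confirm that the three jars are present. It is a minimal sketch under assumptions: it presumes wget saved the files in the notebook's working directory, and `missing_jars` is a hypothetical helper, not something from the guide notebook. It does not verify that the jars are actually on Spark's classpath.

```python
import os

# Jar file names taken from the wget commands above.
JARS = [
    'laurelin-1.0.0.jar',
    'log4j-api-2.13.0.jar',
    'log4j-core-2.13.0.jar',
]


def missing_jars(directory='.'):
    """Return the jar names that are absent or empty in `directory`."""
    missing = []
    for name in JARS:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


print(missing_jars())  # an empty list means all three jars were found
```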

Does anyone have an idea where the problem might be? I really can't figure it out. Thanks a lot for the help, and let me know if you need more details.

[1] https://github.com/dntaylor/spark_tnp/blob/master/notebooks/RootToParquet.ipynb
