ROOT to Parquet conversion #6

fedecolombina · 2021-09-23T20:07:11Z

Hi,

I'm having trouble converting my ROOT files into Parquet files. I'm using k8s on https://swan.cern.ch and followed this notebook [1] as a guide; the only thing I changed is the path to the ROOT files. The conversion appears to start properly, but writing the Parquet file fails with this error:


Py4JJavaError: An error occurred while calling o141.parquet.
: org.apache.spark.SparkException: Job aborted
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
[...]
Caused by: java.lang.StackOverflowError
	at java.util.jar.JarFile.getEntry(JarFile.java:240)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1005)
	at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
[...]

This is the full code:

  import os
  import glob
  
  baseDir_Run2017_UL = '/eos/user/f/fcolombi/root/Run2017_UL'
  fnamesMap = {
      'Z': {
          'Run2017_UL': {
              'Run2017B': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017B/tree*.root')) if 'hadd' not in f],
              'Run2017C': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017C/tree*.root')) if 'hadd' not in f],
              'Run2017D': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017D/tree*.root')) if 'hadd' not in f],
              'Run2017E': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017E/tree*.root')) if 'hadd' not in f],
              'Run2017F': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'Run2017F/tree*.root')) if 'hadd' not in f],
              'DY17': [f for f in glob.glob(os.path.join(baseDir_Run2017_UL, 'DY17/tree*.root')) if 'hadd' not in f],
          },
      },
      'JPsi': {
      },
  }
  
  def convert(resonance,era,subEra):
  
      fnames = ['root://eosuser'+f for f in fnamesMap.get(resonance,{}).get(era,{}).get(subEra,[])]
  
      outDir = os.path.join('parquet',resonance,era,subEra)
      outname = os.path.join(outDir,'tnp.parquet')
  
      treename = 'Events'
      
      # process 1000 files at a time
      # this is about the limit that can be handled when writing
      batchsize = 1000
      new = True
      while fnames:
          current = fnames[:batchsize]
          fnames = fnames[batchsize:]
          
          rootfiles = spark.read.format("root").option('tree', treename).load(current)
  
          if new:
              rootfiles.write.parquet(outname)
              new = False
          else:
              rootfiles.write.mode('append').parquet(outname)

  resonance = 'Z'
  era = 'Run2017_UL'
  subEra = 'DY17'
  convert(resonance, era, subEra)
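
To rule out the batching loop as the source of the problem, the same logic can be exercised without Spark. This is only a minimal sketch: `iter_batches` is a hypothetical helper that is not part of the notebook, and the file names are made up for illustration. It assumes the first write uses Spark's default save mode (`errorifexists`) and that later batches use `append`, as in `convert` above.

```python
from typing import Iterator, List, Tuple


def iter_batches(fnames: List[str], batchsize: int = 1000) -> Iterator[Tuple[str, List[str]]]:
    """Yield (write_mode, batch) pairs in the order convert() would write them."""
    if batchsize <= 0:
        raise ValueError("batchsize must be positive")
    for start in range(0, len(fnames), batchsize):
        # The first batch creates the dataset; later batches append to it.
        mode = "errorifexists" if start == 0 else "append"
        yield mode, fnames[start:start + batchsize]


# Hypothetical file names, for illustration only.
example = [f"tree_{i}.root" for i in range(2500)]
plan = [(mode, len(batch)) for mode, batch in iter_batches(example)]
print(plan)
```

If the plan looks right for the real file list, the batching itself is unlikely to be what triggers the error.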

I also run the following commands before executing it:

!wget -N https://repo1.maven.org/maven2/edu/vanderbilt/accre/laurelin/1.0.0/laurelin-1.0.0.jar &&
wget -N https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.13.0/log4j-api-2.13.0.jar &&
wget -N https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.13.0/log4j-core-2.13.0.jar
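
A quick way to confirm that the three downloads landed where expected is sketched below. `missing_jars` is a hypothetical helper, and the assumption that the jars sit in the notebook's current working directory is mine; it says nothing about whether Spark actually picks them up.

```python
import os
from typing import Iterable, List

# Jar file names taken from the wget commands above.
REQUIRED_JARS = [
    "laurelin-1.0.0.jar",
    "log4j-api-2.13.0.jar",
    "log4j-core-2.13.0.jar",
]


def missing_jars(directory: str = ".", names: Iterable[str] = REQUIRED_JARS) -> List[str]:
    """Return the jar names that are absent or empty in `directory`."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        # Treat a zero-byte file as missing, since it suggests a failed download.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Assumption: the jars were downloaded into the current working directory.
print(missing_jars())
```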

Does anyone have an idea where the problem might be? I really can't figure it out. Thanks a lot for the help, and let me know if you need more details.

[1] https://github.com/dntaylor/spark_tnp/blob/master/notebooks/RootToParquet.ipynb
