2019-08-23 (GCS 2.0.0, BQ 1.0.0)

@medb medb released this 24 Aug 00:37
· 614 commits to master since this release

Changelog

Cloud Storage connector:

  1. Remove Hadoop 1.x support.

  2. Do not convert path to directory path for inferred implicit directories.

  3. Do not parallelize GCS list requests, because doing so leads to excessively high QPS.

  4. Fix a bug where the GCS connector listed all files in a directory instead of honoring the specified limit.

  5. Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.

  6. Add support for Hadoop Delegation Tokens (based on HADOOP-14556). Configurable via fs.gs.delegation.token.binding property.

  7. Remove obsolete fs.gs.file.size.limit.250gb property.

  8. Repair implicit directories during delete and rename operations instead of list and glob operations.

  9. Log HTTP 429 Too Many Requests responses from GCS at a rate of 1 per 10 seconds.

  10. Remove obsolete fs.gs.create.marker.files.enable property.

  11. Remove system bucket feature and related properties:

    fs.gs.system.bucket
    fs.gs.system.bucket.create
    
  12. Remove obsolete fs.gs.performance.cache.dir.metadata.prefetch.limit property.

  13. Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:

    fs.gs.status.parallel.enable (default: false)
    

    Setting this property to true causes the GCS connector to send more GCS requests, which decreases latency but also increases the cost of getFileStatus and listStatus method calls.

  14. Add a property to enable GCS direct upload:

    fs.gs.outputstream.direct.upload.enable (default: false)
    
  15. Update all dependencies to latest versions.

  16. Support Cooperative Locking for directory operations:

    fs.gs.cooperative.locking.enable (default: false)
    fs.gs.cooperative.locking.expiration.timeout.ms (default: 120,000)
    fs.gs.cooperative.locking.max.concurrent.operations (default: 20)
    
  17. Add FSCK tool for recovery of failed Cooperative Locking for directory operations:

    hadoop jar /usr/lib/hadoop/lib/gcs-connector.jar \
        com.google.cloud.hadoop.fs.gcs.CoopLockFsck \
        --{check,rollBack,rollForward} gs://<bucket_name> [all|<operation_id>]
    
  18. Implement Hadoop File System append method using GCS compose API.

  19. Disable support for reading GZIP encoded files (HTTP header Content-Encoding: gzip) by default, because processing of GZIP encoded files is inefficient and error-prone in Hadoop and Spark.

    This feature is configurable with the property:

    fs.gs.inputstream.support.gzip.encoding.enable (default: false)
    
  20. Remove parent directory timestamp update feature and related properties:

    fs.gs.parent.timestamp.update.enable
    fs.gs.parent.timestamp.update.substrings.excludes
    fs.gs.parent.timestamp.update.substrings.includes
    

    This feature was enabled by default only for job history files, but since MAPREDUCE-7101 it is no longer necessary for the Job History Server to work properly.
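
As an illustration of the opt-in properties above, the following dry-run sketch prints a directory rename command that enables Cooperative Locking through Hadoop's generic `-D` options. The function name and bucket paths are hypothetical, and the two tuning values only restate the defaults listed in item 16. It assumes per-command overrides are appropriate for your setup; a cluster-wide setting in core-site.xml may be the better fit.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a directory rename that opts in to
# Cooperative Locking. The paths are hypothetical; the timeout and concurrency
# values simply restate the defaults from the changelog.
coop_lock_mv_command() {
  # $1: source directory, $2: destination directory
  echo "hadoop fs" \
    "-D fs.gs.cooperative.locking.enable=true" \
    "-D fs.gs.cooperative.locking.expiration.timeout.ms=120000" \
    "-D fs.gs.cooperative.locking.max.concurrent.operations=20" \
    "-mv $1 $2"
}

coop_lock_mv_command "gs://example-bucket/staging/" "gs://example-bucket/final/"
```

Review the printed command before running it against a real bucket. If such an operation fails partway, the CoopLockFsck tool from item 17 is the documented way to check it and roll it back or forward.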

BigQuery connector:

  1. Remove Hadoop 1.x support.

  2. Remove deprecated features and associated properties:

    mapred.bq.input.query
    mapred.bq.query.results.table.delete
    mapred.bq.input.sharded.export.enable
    
  3. Remove obsolete mapred.bq.output.async.write.enabled property.

  4. Support nested record type in field schema in BigQuery connector.

  5. Remove dependency on GCS connector code.

  6. Add a property to specify the BigQuery table partitioning definition:

    mapred.bq.output.table.partitioning
    
  7. Add a new DirectBigQueryInputFormat for processing data through BigQuery Storage API.

    This input format is configurable via properties:

    mapred.bq.input.sql.filter
    mapred.bq.input.selected.fields
    mapred.bq.input.skew.limit
    
  8. Update all dependencies to latest versions.

  9. Add a property to control the maximum number of attempts when polling for the next file. By default the number of attempts is unlimited (value -1):

    mapred.bq.dynamic.file.list.record.reader.poll.max.attempts (default: -1)
    
  10. Add a property to specify output table create disposition:

    mapred.bq.output.table.createdisposition (default: CREATE_IF_NEEDED)
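
The new DirectBigQueryInputFormat properties from item 7 could be passed to a job in a similar way. The dry-run sketch below only prints the command. The jar name, the field names, and the filter expression are hypothetical, and the comma-separated form of the field list is an assumption, not something this changelog specifies. It also assumes the job's main class parses generic options (for example via ToolRunner) and selects DirectBigQueryInputFormat in its own code.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a job submission that sets the
# DirectBigQueryInputFormat properties. All values are hypothetical, and the
# comma-separated field list format is an assumption.
bq_direct_job_command() {
  # $1: job jar (hypothetical)
  echo "hadoop jar $1" \
    "-D mapred.bq.input.selected.fields=word,word_count" \
    "-D mapred.bq.input.sql.filter='word_count > 100'"
}

bq_direct_job_command "my-bq-job.jar"
```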