util to generate ingest template from library #1258

Merged
fvankrieken merged 4 commits into main from fvk-ingest-convert-template
Nov 21, 2024


fvankrieken
Contributor

@fvankrieken fvankrieken commented Nov 20, 2024

1. Generate scaffold for ingest template

Running

docker exec de python3 -m dcpy.cli lifecycle scripts validate_ingest convert dcp_ct2020

generates the following scaffold:

id: dcp_ct2020

acl: public-read

ingestion:
  source:
    one_of:
    - type: local_file
      path: ''
    - type: file_download
      url: ''
    - type: api
      endpoint: ''
      format: ''
    - type: s3
      bucket: ''
      key: ''
    - type: socrata
      org: ''
      uid: ''
      format: ''
    - type: edm_publishing_gis_dataset
      name: ''
  file_format:
    type: csv, json, xlsx, shapefile, geojson, geodatabase
    crs: EPSG:2263
  processing_steps: []

columns: []

library_dataset:
  name: dcp_ct2020
  acl: public-read
  source:
    url:
      path: https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/nyct2020_{{
        version }}.zip
      subpath: nyct2020_{{ version }}/nyct2020.shp
    geometry:
      SRS: EPSG:2263
      type: MULTIPOLYGON
    options:
    - AUTODETECT_TYPE=NO
    - EMPTY_STRING_AS_NULL=YES
  destination:
    geometry:
      SRS: EPSG:4326
      type: MULTIPOLYGON
    options:
    - OVERWRITE=YES
    - PRECISION=NO
    fields: []
    sql: null
  info:
    description: '### The Census Tracts for the 2020 US Census (Clipped to shoreline).

      '
    url: https://www1.nyc.gov/site/planning/data-maps/open-data/districts-download-metadata.page
    dependents: []

Not quite perfect, but a much nicer starting point
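For readers who want a feel for the conversion, here is a minimal sketch of how a parsed library template could be turned into a scaffold shaped like the output above. The function name `scaffold_ingest_template` and the constants are hypothetical, not the names used in `dcpy`; the sketch assumes the library template has already been loaded into a plain dict and that the source CRS can be carried over from `source.geometry.SRS`, as the example output suggests.

```python
from copy import deepcopy

# Placeholder source options, mirroring the "one_of" list in the scaffold above.
SOURCE_OPTIONS = [
    {"type": "local_file", "path": ""},
    {"type": "file_download", "url": ""},
    {"type": "api", "endpoint": "", "format": ""},
    {"type": "s3", "bucket": "", "key": ""},
    {"type": "socrata", "org": "", "uid": "", "format": ""},
    {"type": "edm_publishing_gis_dataset", "name": ""},
]
# Placeholder listing the formats to choose from, as in the scaffold above.
FILE_FORMATS = "csv, json, xlsx, shapefile, geojson, geodatabase"


def scaffold_ingest_template(library_dataset: dict) -> dict:
    """Build an ingest template skeleton from a parsed library template.

    Hypothetical helper: only the fields that map over directly (id, acl,
    source CRS) are filled in; everything else is left as a placeholder
    for a human to complete.
    """
    crs = library_dataset.get("source", {}).get("geometry", {}).get("SRS")
    file_format = {"type": FILE_FORMATS}
    if crs:
        file_format["crs"] = crs
    return {
        "id": library_dataset["name"],
        "acl": library_dataset.get("acl"),
        "ingestion": {
            "source": {"one_of": deepcopy(SOURCE_OPTIONS)},
            "file_format": file_format,
            "processing_steps": [],
        },
        "columns": [],
        # Keep the original alongside the scaffold for reference while editing.
        "library_dataset": deepcopy(library_dataset),
    }
```

The original template is kept under `library_dataset` so the person filling in the scaffold can pick the right source type and delete the rest without switching files.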

2. Parquet Metadata

Added a second commit for fun - we can get parquet metadata!

Unfortunately, pyarrow has its own S3FileSystem implementation, which doesn't seem to play nicely with moto, so this is tough to test. I tried using an s3fs.S3FileSystem instead, which seemed like it might play nicer, but that didn't work either. See here: aio-libs/aiobotocore#755
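As background on why Parquet metadata is cheap to fetch from object storage: a Parquet file ends with the Thrift-encoded footer metadata, a 4-byte little-endian footer length, and the magic bytes `PAR1`. The sketch below is not the code in this PR (pyarrow handles all of this internally); it only shows, with the standard library, how the footer's byte range can be located from the last 8 bytes of a file. The helper name `footer_byte_range` is made up for illustration.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_byte_range(tail: bytes, file_size: int) -> tuple[int, int]:
    """Return (start, end) offsets of the footer metadata within the file.

    `tail` must contain at least the last 8 bytes of the file: a 4-byte
    little-endian footer length followed by the trailing magic bytes.
    """
    if len(tail) < 8 or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    end = file_size - 8  # footer ends where the length field begins
    start = end - footer_len
    if start < 4:  # the first 4 bytes of the file are the leading magic
        raise ValueError("footer length exceeds file size")
    return start, end
```

With a range like this, a reader only needs two small ranged reads (the tail, then the footer) rather than downloading the whole object.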

3. Remove library_dataset from ingest templates

This was useful while we were developing ingest and putting the new templates together piecemeal, but I'd rather not have them committed.

4. Add columns to geom comparison for keyed report

This just seemed useful for diagnosing any geometry type differences.

________________________________________________________________________________
Data comparison
    Key columns
        sdlbl
    Left only: None
    Right only: None
    Columns with diffs
        wkb_geometry
    Differences by column
        Wkb geometry
            74 rows. First 20 shown
                   ordering_equal  spatially_equal   left_geom_type right_geom_type
            sdlbl                                                                  
            125th           False             True  ST_MultiPolygon      ST_Polygon
            BPC             False             True  ST_MultiPolygon      ST_Polygon
            BR              False             True  ST_MultiPolygon      ST_Polygon
            BSC             False             True  ST_MultiPolygon      ST_Polygon
            BNY             False             True  ST_MultiPolygon      ST_Polygon
            CD              False             True  ST_MultiPolygon      ST_Polygon
            CL              False             True  ST_MultiPolygon      ST_Polygon
            CR-1            False             True  ST_MultiPolygon      ST_Polygon
            CR-2            False             True  ST_MultiPolygon      ST_Polygon
            CR-4            False             True  ST_MultiPolygon      ST_Polygon
            CR-5            False             True  ST_MultiPolygon      ST_Polygon
            CP              False             True  ST_MultiPolygon      ST_Polygon
            CI              False             True  ST_MultiPolygon      ST_Polygon
            CO              False             True  ST_MultiPolygon      ST_Polygon
            DB              False             True  ST_MultiPolygon      ST_Polygon
            DFR             False             True  ST_MultiPolygon      ST_Polygon
            DJ              False             True  ST_MultiPolygon      ST_Polygon
            EC-3            False             True  ST_MultiPolygon      ST_Polygon
            EC-4            False             True  ST_MultiPolygon      ST_Polygon
            EC-6            False             True  ST_MultiPolygon      ST_Polygon
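A report like the one above could be produced by a keyed join in PostGIS. The following is a sketch under assumptions, not the query in `dcpy/data/compare.py`: the function name, table aliases, and the choice to filter on `ST_OrderingEquals` are all illustrative. `ST_OrderingEquals`, `ST_Equals`, and `ST_GeometryType` are standard PostGIS functions.

```python
def keyed_geom_comparison_sql(
    left: str, right: str, key: str, geom: str = "wkb_geometry"
) -> str:
    """Build a query listing keyed rows whose geometries differ, with types.

    Identifiers are interpolated directly, so this sketch is only suitable
    for trusted table and column names.
    """
    return f"""
    SELECT
        l.{key},
        ST_OrderingEquals(l.{geom}, r.{geom}) AS ordering_equal,
        ST_Equals(l.{geom}, r.{geom}) AS spatially_equal,
        ST_GeometryType(l.{geom}) AS left_geom_type,
        ST_GeometryType(r.{geom}) AS right_geom_type
    FROM {left} AS l
    INNER JOIN {right} AS r ON l.{key} = r.{key}
    WHERE NOT ST_OrderingEquals(l.{geom}, r.{geom})
    """
```

In the report above, every row is spatially equal but not ordering-equal, and the type columns make the likely cause visible at a glance: one side stores `ST_MultiPolygon` while the other stores `ST_Polygon`.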


codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 20.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 69.33%. Comparing base (99fab2e) to head (8a36e97).
Report is 4 commits behind head on main.

Files with missing lines                     Patch %   Lines
dcpy/lifecycle/scripts/validate_ingest.py    0.00%     18 Missing ⚠️
dcpy/connectors/edm/recipes.py               33.33%    4 Missing ⚠️
dcpy/data/compare.py                         0.00%     1 Missing ⚠️
dcpy/utils/s3.py                             66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1258      +/-   ##
==========================================
- Coverage   69.58%   69.33%   -0.26%     
==========================================
  Files         111      111              
  Lines        5865     5892      +27     
  Branches      654      654              
==========================================
+ Hits         4081     4085       +4     
- Misses       1655     1678      +23     
  Partials      129      129              


@fvankrieken fvankrieken force-pushed the fvk-ingest-convert-template branch from 176c85e to 4456323 Compare November 20, 2024 16:57
@fvankrieken fvankrieken marked this pull request as ready for review November 20, 2024 16:57
@fvankrieken fvankrieken force-pushed the fvk-ingest-convert-template branch from cfc448a to d430288 Compare November 20, 2024 18:29
@fvankrieken
Contributor Author

Sorry - fixed broken things, this is good to go

@fvankrieken fvankrieken merged commit f71bba1 into main Nov 21, 2024
18 of 20 checks passed
@fvankrieken fvankrieken deleted the fvk-ingest-convert-template branch November 21, 2024 16:17
Labels
None yet
Projects
Status: Done

4 participants