📚 Docs Refresh: Files Source (#6663)
* Docs Refresh: Files

* Update title for Getting Started.

* Local files disabled note.

* Update docs/integrations/sources/file.md

Co-authored-by: Sherif A. Nada <[email protected]>

* Update docs/integrations/sources/file.md

Co-authored-by: Sherif A. Nada <[email protected]>

Co-authored-by: Sherif A. Nada <[email protected]>
avaidyanatha and sherifnada authored Oct 4, 2021
1 parent 1a4d5ef commit 15df1c6
Showing 1 changed file with 17 additions and 41 deletions.
58 changes: 17 additions & 41 deletions docs/integrations/sources/file.md
@@ -1,14 +1,6 @@
# Files

## Overview

Files are often exchanged or published in various remote locations. This source aims to support an expanding range of file formats and storage providers. The File source supports Full Refresh syncs: every time a sync runs, Airbyte copies all of the rows and columns you set up for replication from the file into a new table in the destination.

### Output schema

At this time, this source produces only a single stream for the target file, as it replicates only one file at a time. We are considering improving this behavior in future iterations by globbing folders or using patterns to capture multiple files, as well as supporting more file formats and storage providers. Note that you should provide the `dataset_name`, which dictates how the table will be identified in the destination \(since the `URL` can be made of complex characters\).

### Features
## Features

| Feature | Supported? |
| :--- | :--- |
@@ -17,27 +9,11 @@ At this time, this source produces only a single stream for the target file as i
| Replicate Incremental Deletes | No |
| Replicate Folders \(multiple Files\) | No |
| Replicate Glob Patterns \(multiple Files\) | No |
| Namespaces | No |

How do we rate the functionalities below?

* Yes means we have verified it and have automated integration tests for it.
* Verified means we don't have automated tests, but we were able to manually test and use it with Airbyte successfully.
* Experimental means we tried to verify it but may have run into edge cases that still need to be addressed before it is usable; please use it with caution.
* Untested means the library we are using claims to support such configurations in theory, but we haven't tested or verified that it works in Airbyte yet.
* Hidden means that we haven't tested the options or even hooked them up to the UI yet.

Please don't hesitate to get in touch and provide feedback if you can report issues, verify or contribute some testing, or suggest an option that is not part of this list. Thanks!
This source produces a single table for the target file, as it replicates only one file at a time. Note that you should provide the `dataset_name`, which dictates how the table will be identified in the destination \(since the `URL` can be made of complex characters\).
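
For illustration, a minimal source configuration might look like the sketch below \(the field names mirror the settings described on this page, but treat the exact keys and values as hypothetical rather than a definitive spec\):

```python
# Hypothetical File source configuration (values are placeholders).
# `dataset_name` determines the destination table name, independent of
# the URL, which may contain characters unsuitable for an identifier.
config = {
    "dataset_name": "epidemiology",   # destination table identifier
    "format": "csv",                  # one of the supported file formats
    "url": "https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv",
    "provider": {"storage": "HTTPS"},
}
```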

### Storage Providers

Storage Providers are mostly enabled \(and further tested\) thanks to other open-source libraries that we use under the hood, such as:

* [smart\_open](https://pypi.org/project/smart-open/)
* [paramiko](http://docs.paramiko.org/en/stable/)
* [GCSFS](https://gcsfs.readthedocs.io/en/latest/)
* [S3FS](https://s3fs.readthedocs.io/en/latest/)

| Storage Providers | Supported? |
| :--- | :--- |
| HTTPS | Yes |
@@ -60,10 +36,6 @@ Storage Providers are mostly enabled \(and further tested\) thanks to other open

### File Formats

File Formats are mostly enabled \(and further tested\) thanks to other open-source libraries that we use under the hood, such as:

* [Pandas IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

| Format | Supported? |
| :--- | :--- |
| CSV | Yes |
@@ -76,26 +48,23 @@ File Formats are mostly enabled \(and further tested\) thanks to other open-sour
| Parquet | Yes |
| Pickle | No |

### Performance considerations
## Getting Started (Airbyte Cloud)

In order to read large files from a remote location, we leverage the capabilities of [smart\_open](https://pypi.org/project/smart-open/). However, it is possible to switch to either the [GCSFS](https://gcsfs.readthedocs.io/en/latest/) or [S3FS](https://s3fs.readthedocs.io/en/latest/) implementation, as both are natively supported by the `pandas` library. This choice is made possible through the optional `reader_impl` parameter.
Setup through Airbyte Cloud is exactly the same as the open-source setup, except that local files are disabled.

### Limitations / Experimentation notes

* Note that for the local filesystem, the file has to be stored somewhere in the `/tmp/airbyte_local` folder, with the same limitations as the [CSV Destination](../destinations/local-csv.md), so the `URL` should also start with `/local/`. This may not be ideal as a Source but will probably evolve later.
* The JSON implementation needs to be tweaked in order to produce a more complex catalog and is still in an experimental state: simple JSON schemas should work at this point, but files with multiple layers of nesting may not be handled well.
## Getting Started (Airbyte Open-Source)

## Getting started
1. Once the File Source is selected, you should define the storage provider, its URL, and the format of the file.
2. Depending on the provider you choose and the privacy of the data, you will have to configure more options.

* Once the File Source is selected, you should define the storage provider, its URL, and the format of the file.
* Depending on the choice made previously, more options may be necessary, especially when accessing private data.

#### Provider Specific Information

* In the case of GCS, it is necessary to provide the contents of the service account keyfile to access private buckets. See the settings of the [BigQuery Destination](../destinations/bigquery.md).
* In the case of AWS S3, the pair of `aws_access_key_id` and `aws_secret_access_key` is necessary to access private S3 buckets.
* In the case of AzBlob, it is necessary to provide the `storage_account` in which the blob you want to access resides. Either a `sas_token` [(info)](https://docs.microsoft.com/en-us/azure/storage/blobs/sas-service-create?tabs=dotnet) or a `shared_key` [(info)](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal) is necessary to access private blobs. Illustrative provider blocks are sketched after this list.
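
As a rough sketch of what these provider settings amount to \(the key names below are assumptions based on the descriptions above, not the connector's authoritative spec\):

```python
# Illustrative provider blocks; all values are placeholders, and the
# exact key names are assumptions, not the connector's official spec.
gcs_provider = {
    "storage": "GCS",
    "service_account_json": "{ ...contents of the keyfile... }",
}
s3_provider = {
    "storage": "S3",
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "...",
}
azblob_provider = {
    "storage": "AzBlob",
    "storage_account": "mystorageaccount",
    "sas_token": "?sv=2020-08-04&...",  # alternatively, provide "shared_key"
}
```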

### Reader Options

The Reader in charge of loading the file format is currently based on [Pandas IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). It is possible to customize how the file is loaded into a Pandas DataFrame as part of this Source Connector. This is done via the `reader_options`, which should be in JSON format and depend on the chosen file format. See panda's documentation, depending on the format:
The Reader in charge of loading the file format is currently based on [Pandas IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). It is possible to customize how the file is loaded into a Pandas DataFrame as part of this Source Connector. This is done via the `reader_options`, which should be in JSON format and depend on the chosen file format. See pandas' documentation, depending on the format:

For example, if the format `CSV` is selected, then options from the [read\_csv](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table) function are available.
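
Conceptually, the connector parses the `reader_options` JSON and forwards it as keyword arguments to the matching pandas reader. A minimal sketch of that idea \(not the connector's actual code; the URL is a placeholder\):

```python
import json

import pandas as pd

# reader_options arrive as a JSON string and map directly onto the
# keyword arguments of the pandas reader for the chosen format (CSV here).
reader_options = json.loads('{"sep": ";", "header": 0, "nrows": 100}')
df = pd.read_csv("https://example.com/data.csv", **reader_options)
```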

@@ -115,7 +84,7 @@ For example, you can use the `{"orient" : "records"}` to change how orientation

#### Changing data types of source columns

Normally, Airbyte tries to infer the data type from the source, but you can use `reader_options` to force specific data types. If you input `{"dtype":"string"}`, all columns will be forced to be parsed as strings. If you only want a specific column to be parsed as a string, simply use `{"dtype" : {"column name": "string"}}`.
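
To see why this matters, here is a small standalone pandas example \(independent of Airbyte\) showing the effect of forcing a column's dtype:

```python
from io import StringIO

import pandas as pd

csv_data = "zipcode,city\n02139,Cambridge\n10001,New York\n"

# Without a dtype override, pandas infers zipcode as an integer
# and silently drops the leading zero: 02139 -> 2139.
inferred = pd.read_csv(StringIO(csv_data))

# The equivalent of reader_options {"dtype": {"zipcode": "string"}}
# keeps the column as text and preserves "02139".
forced = pd.read_csv(StringIO(csv_data), dtype={"zipcode": "string"})

print(inferred["zipcode"].iloc[0])  # 2139
print(forced["zipcode"].iloc[0])    # 02139
```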

### Examples

@@ -143,6 +112,13 @@ Example for SFTP:

Please see \(or add\) more at `airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py` for further usage examples.

## Performance Considerations and Notes

In order to read large files from a remote location, this connector uses the [smart\_open](https://pypi.org/project/smart-open/) library. However, it is possible to switch to either the [GCSFS](https://gcsfs.readthedocs.io/en/latest/) or [S3FS](https://s3fs.readthedocs.io/en/latest/) implementation, as both are natively supported by the `pandas` library. This choice is made possible through the optional `reader_impl` parameter.
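
A loose sketch of what that switch amounts to under the hood \(assuming `smart_open` as one of the `reader_impl` values; the actual option values live in the connector spec\):

```python
import pandas as pd

def read_remote_csv(url: str, reader_impl: str = "smart_open") -> pd.DataFrame:
    """Illustrative only: two ways of handing a remote file to pandas."""
    if reader_impl == "smart_open":
        # smart_open returns a file-like object for http(s)://, s3://, gs://, etc.
        from smart_open import open as sopen
        with sopen(url) as f:
            return pd.read_csv(f)
    # With gcsfs/s3fs installed, pandas resolves gs:// and s3:// URLs
    # natively through fsspec, so the URL can be passed directly.
    return pd.read_csv(url)
```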

* Note that for the local filesystem, the file has to be stored somewhere in the `/tmp/airbyte_local` folder, with the same limitations as the [CSV Destination](../destinations/local-csv.md), so the `URL` should also start with `/local/` \(see the example after this list\).
* The JSON implementation needs to be tweaked in order to produce a more complex catalog and is still in an experimental state: simple JSON schemas should work at this point, but files with multiple layers of nesting may not be handled well.
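
For example, a file saved on the host at `/tmp/airbyte_local/data/sales.csv` would be referenced with a `/local/` URL; a hypothetical configuration \(key names and the `storage` value are assumptions, mirroring the sketches above\):

```python
# Hypothetical local-file configuration: the host path
# /tmp/airbyte_local/data/sales.csv is addressed via the /local/ prefix.
local_config = {
    "dataset_name": "sales",
    "format": "csv",
    "url": "/local/data/sales.csv",
    "provider": {"storage": "local"},
}
```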

## Changelog

| Version | Date | Pull Request | Subject |
