
feat(cmd/influx/write): enhance the ability to import CSV files #17599

Closed
wants to merge 34 commits into from

Conversation

sranka
Contributor

@sranka sranka commented Apr 3, 2020

Closes #17004 #17008

There are a number of enhancements to the way CSV data is processed by influx write, so that external CSV data can be written to InfluxDB. They are described in detail in #17004 and #17008.

Examples of new capabilities are provided in https://github.com/bonitoo-io/influxdb-csv-import/blob/master/README.md#step-2

New flags in influx write:

      --debug                Log CSV columns to stderr before reading data rows
      --encoding string      Character encoding of input files or stdin (default "UTF-8")
  -f, --file stringArray     The path to the file to import
      --header stringArray   One or more header lines to prepend to input data
      --skipHeader int[=1]   Skip the first <n> rows from input data
      --skipRowOnError       Log CSV data errors to stderr and continue with CSV processing
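
Put together, a typical invocation using these flags might look like the following sketch (bucket, org, and file names are placeholders; see the linked README for real examples):

```sh
# prepend a datatype header, skip the leading header row of the input data,
# and keep going when individual rows fail to convert
influx write --org my-org --bucket my-bucket --format csv \
  --header "#datatype measurement,tag,double,dateTime:RFC3339" \
  --skipHeader \
  --skipRowOnError \
  -f data1.csv -f data2.csv
```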

@sranka sranka requested review from kelwang and jsteenb2 April 3, 2020 14:13
@sranka sranka force-pushed the 17004/writeFromCsv branch 2 times, most recently from 55d7b3d to 9a71054 Compare April 9, 2020 16:27
@sranka sranka requested review from russorat and removed request for kelwang April 9, 2020 17:13
Contributor

@jsteenb2 jsteenb2 left a comment

we need a review from @influxdata/storage-team here as well

cmd/influx/write.go (outdated)
@@ -74,41 +90,102 @@ func cmdWrite(f *globalFlags, opt genericCLIOpts) *cobra.Command {
return cmd
}

func (writeFlags *writeFlagsType) dump(args []string) {
Contributor

what is the motivation behind this? can we remove it?

Contributor Author

@sranka sranka Apr 9, 2020

It helps with troubleshooting of the CSV -> LP conversion. It is easier to find problems in user input when you dump the actual input arguments together with their default values. This is printed only in --debug mode, together with metadata about the columns that are used to extract CSV data into protocol lines.

Sure, I can remove it, but it will then be harder to resolve user input errors. This is opinionated; I do not insist on having it here.

if len(args) > 0 && args[0][0] == '@' {
func (writeFlags *writeFlagsType) createLineReader(cmd *cobra.Command, args []string) (io.Reader, io.Closer, error) {
readers := make([]io.Reader, 0, 2*len(writeFlags.Headers)+2*len(writeFlags.Files)+1)
closers := make([]io.Closer, 0, len(writeFlags.Files))
Contributor

why are readers and closers separate instead of being ReadClosers?

Contributor Author

closers are a subset of readers here; only opened files will be closed, nothing more ... stdin or string readers cannot/must not be closed.

Contributor

I hear you, but it seems better to do it as ReadClosers with a noop closer on those that don't need one, a typical pattern in Go:

type noopCloser struct {
    r io.Reader
}

func (n *noopCloser) Close() error {
    return nil
}

My big concern here is that it makes it easy to lose track of closers/readers when they are disparate, uncoupled return types.

Last thing, please don't resolve conversations, I'll happily resolve them when I feel the conversation has come to fruition 👍

Contributor Author

@sranka sranka Apr 11, 2020

Thank you for the clarification around "conversation resolving"; as a newbie here, I need such advice 👍

I see two concerns here:

  1. Why are readers and closers separate instead of being ReadClosers?
    • files can be provided in a non-UTF-8 encoding; a file (ReadCloser) is then transformed to a UTF-8 Reader (not a ReadCloser)
      readers = append(readers, decode(f), strings.NewReader("\n"))
    • the last file or stdin can then be changed so that the first N lines are skipped when the --skipHeader option is present. This transformation also works with Readers, but not ReadClosers
      readers[i] = write.SkipHeaderLinesReader(writeFlags.SkipHeader, readers[i])
    • changing the above intermediate Readers to ReadClosers just to have a single ReadCloser array would introduce extra complexity here, IMO
  2. Why are io.Reader and io.Closer disparate return values?
    • I understand that a composite io.ReadCloser looks better. It would require wrapping multiple Readers and Closers (or ReadClosers) in a new type that only delegates calls. I wanted to prefer builtins whenever possible, but there is no multi-reader-closer to reuse.
    • At the same time, it is not a big deal (from my POV) for a non-public function to avoid this wrapping. "A returned Closer must be closed" is an expected contract, IMO.

// backward compatibility: @ in arg denotes a file
writeFlags.File = args[0][1:]
writeFlags.Files = append(writeFlags.Files, args[0][1:])
Contributor

I'm concerned the flags are being used as a bucket for a bunch of side effects here

Contributor Author

yes, for the sake of backward compatibility, @file means a file ... somebody introduced this and I decided to keep the functionality because I don't know where and how it is currently used. If starting from scratch, I would not allow specifying data in an argument at all.

Contributor

I don't think my earlier comment was clear. I'd rather see you maintain the files in a separate variable instead of making a destructive call on the flags. Flags should only be used (IMO) for the API contract between the CLI and the user. Makes things coupled in weird global ways atm.

Contributor Author

understood, repaired

cmd/influx/write.go
cmd/influx/write_test.go (outdated)
cmd/influx/write_test.go (outdated)
cmd/influx/write_test.go
@jsteenb2 jsteenb2 requested review from a team and jacobmarble and removed request for a team April 9, 2020 17:27
@jacobmarble jacobmarble requested review from sebito91 and removed request for jacobmarble April 9, 2020 17:29
@sranka sranka requested a review from jsteenb2 April 9, 2020 21:25
@sranka sranka linked an issue Apr 9, 2020 that may be closed by this pull request
)

// IsCharacterDevice returns true if the supplied reader is a character device (a terminal)
func IsCharacterDevice(reader io.Reader) bool {
Contributor

what is the motivation to move this to the internal pkg?

Contributor Author

My motivation: this fn is independent of the existing CLI commands, and I couldn't convince myself that attaching it to write.go or main.go of the main package is better. I will accept any suggestion here that follows the mindset of this repo.



if err != nil {
return false
}
if (info.Mode() & os.ModeCharDevice) == os.ModeCharDevice {
Contributor

just realized this can be simplified too:

return (info.Mode() & os.ModeCharDevice) == os.ModeCharDevice

the if isn't necessary here

Contributor Author

ok, simplified

require.Contains(t, fmt.Sprintf("%s", err), "bucket") // failed to retrieve buckets

// validation: no such bucket found
lineData = lineData[:0]
Contributor

are these sub tests? it's easier to read, but it feels like a lot of coupling between assertions. Can this be split up into sub tests?

t.Run("validates no bucket found", func(t *testing.T) {
	lineData = lineData[:0]
	command := cmdWrite(&globalFlags{}, genericCLIOpts{w: ioutil.Discard})
	// note: the my-empty-org parameter causes the test server to return no buckets
	command.SetArgs([]string{"--format", "csv", "--org", "my-empty-org", "--bucket", "my-bucket"})
	err := command.Execute()
	require.Contains(t, fmt.Sprintf("%s", err), "bucket") // no such bucket found
})

same goes for a lot of these assertions here. They look like sub tests to me but aren't being treated as such. Please correct me if that assumption is wrong 🤔

Contributor Author

@sranka sranka Apr 11, 2020

Yes, it looks better when changed to sub-tests

Contributor

@rbetts rbetts left a comment

I have some concerns about this PR. We like the user experience it creates, but we aren't comfortable accepting this PR as-is because of concerns around complexity and maintainability.

I'd like to see this PR broken into at least two pieces:

  1. A standalone CSV to line-protocol conversion library. This should be a reusable library that we can use uniformly in other areas where we want to convert CSV to line protocol.

  2. A PR that then uses that library's public API from the CLI command tool to accomplish this CLI workflow.

Separately - I'm curious if there are existing parsers that accomplish the parsing work done here. I don't really want to take on the long term maintenance costs of CSV parsing features. I'm not familiar with what's available in the go language space... But I'm curious if there are third-party CSV processors that would accomplish more with less custom code than is used here.

@sranka sranka force-pushed the 17004/writeFromCsv branch from ed05c38 to 00e31e8 Compare April 11, 2020 11:00
@sranka
Contributor Author

sranka commented Apr 11, 2020

@rbetts I've seen only a few Go CSV parser extensions that can do (un)marshaling into structs with a pre-defined layout. We have a different situation here:

  • the layout is driven by CSV data itself with the help of annotations and/or column headers
  • a CSV with variable row size (such as a result of a flux query) is supported and can be used on input
  • the code should be optimized to transform to line protocol and recognize CSV data annotations that are AFAIK used only in InfluxDB
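
To illustrate the data-driven layout, here is a small annotated CSV input of the kind the converter recognizes (a sketch modeled on the examples in the linked README; check there for the exact annotation syntax):

```csv
#datatype measurement,tag,double,dateTime:RFC3339
m,host,used_percent,time
mem,host1,64.23,2020-01-01T00:00:00Z
```

The #datatype annotation assigns a line-protocol role to each column, so the data row above would convert to roughly: mem,host=host1 used_percent=64.23 1577836800000000000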

@russorat Can you please comment on the requested creation of a CSV -> LP conversion library? I mean, there are no technical obstacles to doing so once you agree and provide a new influxdata git repo.

@sranka sranka requested a review from jsteenb2 April 14, 2020 06:21
@jsteenb2
Contributor

@sranka you should be able to put the parser under the pkg directory in the monorepo here in influxdb. If we need to pull it out, we can always do so.

@sranka sranka force-pushed the 17004/writeFromCsv branch from 7c1157d to 764552e Compare April 15, 2020 06:20
@sranka
Contributor Author

sranka commented Apr 15, 2020

@jsteenb2 ok, "csv to line protocol conversion" is now in pkg/csv2lp

@russorat russorat requested a review from rbetts April 15, 2020 16:32
@jsteenb2
Contributor

@sranka, we'd like to see this PR broken up and submitted incrementally. First PR for the csv converter and another for the CLI. We'll have different stakeholders reviewing for the csv PR.

@sranka
Contributor Author

sranka commented Apr 15, 2020

@jsteenb2 thanks for letting me know what is required.
