
feat(cmd/influx/write): enhance the ability to import CSV files #17599

Closed
wants to merge 34 commits into from

Conversation

sranka
Contributor

@sranka sranka commented Apr 3, 2020

Closes #17004 #17008

There are a number of enhancements to the way CSV data is processed by influx write, so that external CSV data can be written to InfluxDB. They are described in detail in #17004 and #17008.

Examples of new capabilities are provided in https://github.com/bonitoo-io/influxdb-csv-import/blob/master/README.md#step-2

New flags in influx write:

      --debug                Log CSV columns to stderr before reading data rows
      --encoding string      Character encoding of input files or stdin (default "UTF-8")
  -f, --file stringArray     The path to the file to import
      --header stringArray   One or more header lines to prepend to input data
      --skipHeader int[=1]   Skip the first <n> rows from input data
      --skipRowOnError       Log CSV data errors to stderr and continue with CSV processing
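
Put together, a typical invocation using these flags might look like the following sketch (bucket, org, and file names are placeholders; see the linked README for real examples):

```sh
# prepend a datatype header, skip the leading header row of the input data,
# and keep going when individual rows fail to convert
influx write --org my-org --bucket my-bucket --format csv \
  --header "#datatype measurement,tag,double,dateTime:RFC3339" \
  --skipHeader \
  --skipRowOnError \
  -f data1.csv -f data2.csv
```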

@sranka sranka requested review from kelwang and jsteenb2 April 3, 2020 14:13
@sranka sranka force-pushed the 17004/writeFromCsv branch 2 times, most recently from 55d7b3d to 9a71054 Compare April 9, 2020 16:27
@sranka sranka requested review from russorat and removed request for kelwang April 9, 2020 17:13
Contributor

@jsteenb2 jsteenb2 left a comment

we need a review from @influxdata/storage-team here as well

cmd/influx/write.go (outdated)
@@ -74,41 +90,102 @@ func cmdWrite(f *globalFlags, opt genericCLIOpts) *cobra.Command {
return cmd
}

func (writeFlags *writeFlagsType) dump(args []string) {
Contributor

what is the motivation behind this? can we remove it?

Contributor Author

@sranka sranka Apr 9, 2020

It helps with troubleshooting of the CSV -> LP conversion. It is easier to find problems in user input when you dump the actual input arguments together with their default values. This is printed only in --debug mode, together with metadata about the columns that are used to extract CSV data into protocol lines.

Sure, I can remove it, but it will then be harder to resolve user input errors. This is opinionated; I do not insist on having it here.

if len(args) > 0 && args[0][0] == '@' {
func (writeFlags *writeFlagsType) createLineReader(cmd *cobra.Command, args []string) (io.Reader, io.Closer, error) {
readers := make([]io.Reader, 0, 2*len(writeFlags.Headers)+2*len(writeFlags.Files)+1)
closers := make([]io.Closer, 0, len(writeFlags.Files))
Contributor

why are readers and closers separate instead of being ReadClosers?

Contributor Author

closers are a subset of readers here; only opened files will be closed, nothing more ... stdin or string readers cannot/must not be closed.

Contributor

I hear you, but it seems better to do it as ReadClosers with a noop closer on those that don't need one, a typical pattern in Go:

type noopCloser struct {
    r io.Reader
}

func (n *noopCloser) Close() error {
    return nil
}

My big concern here is that it makes it easy to lose track of closers/readers when they are disparate, uncoupled return types.

Last thing, please don't resolve conversations, I'll happily resolve them when I feel the conversation has come to fruition 👍

Contributor Author

@sranka sranka Apr 11, 2020

Thank you for the clarification around "conversation resolving"; as a newbie here, I need such advice 👍

I see two concerns here:

  1. Why are readers and closers separate instead of being ReadClosers?
    • files can be provided in a non-UTF-8 encoding; a file (ReadCloser) is then transformed to a UTF-8 Reader (not a ReadCloser)
      readers = append(readers, decode(f), strings.NewReader("\n"))
    • the last file or stdin can then be changed so that the first N lines are skipped when the --skipHeader option is present. This transformation also works with Readers, but not ReadClosers
      readers[i] = write.SkipHeaderLinesReader(writeFlags.SkipHeader, readers[i])
    • changing the above intermediate Readers to ReadClosers just to have a single ReadCloser array would introduce extra complexity here, IMO
  2. Why are io.Reader and io.Closer disparate return values?
    • I understand that a composite io.ReadCloser looks better. It would require wrapping multiple Readers and Closers (or ReadClosers) in a new type that only delegates calls. I wanted to prefer builtins whenever possible, but there is no multi-reader-closer to reuse.
    • At the same time, it is not a big deal (from my POV) for a non-public function to avoid this wrapping. "A returned Closer must be closed" is an expected contract, IMO.

// backward compatibility: @ in arg denotes a file
writeFlags.File = args[0][1:]
writeFlags.Files = append(writeFlags.Files, args[0][1:])
Contributor

I'm concerned the flags are being used as a bucket for a bunch of side effects here

Contributor Author

yes, for the sake of backward compatibility, @file means a file ... somebody introduced this and I decided to keep the functionality because I don't know where and how it is currently used. If starting from scratch, I would not allow specifying data in an argument at all.

Contributor

I don't think my earlier comment was clear. I'd rather see you maintain the files in a separate variable instead of making a destructive call on the flags. Flags should only be used (IMO) for the API contract between the CLI and the user. Makes things coupled in weird global ways atm.

Contributor Author

understood, repaired

cmd/influx/write.go
cmd/influx/write_test.go (outdated)
cmd/influx/write_test.go (outdated)
cmd/influx/write_test.go
@jsteenb2 jsteenb2 requested review from a team and jacobmarble and removed request for a team April 9, 2020 17:27
@jacobmarble jacobmarble requested review from sebito91 and removed request for jacobmarble April 9, 2020 17:29
@sranka sranka requested a review from jsteenb2 April 9, 2020 21:25
@sranka sranka linked an issue Apr 9, 2020 that may be closed by this pull request
)

// IsCharacterDevice returns true if the supplied reader is a character device (a terminal)
func IsCharacterDevice(reader io.Reader) bool {
Contributor

what is the motivation to move this to the internal pkg?

Contributor Author

My motivation: this fn is independent of the existing CLI commands, and I couldn't convince myself that attaching it to write.go or main.go of the main package is better. I will accept any suggestion here that follows the mindset of this repo.



if err != nil {
return false
}
if (info.Mode() & os.ModeCharDevice) == os.ModeCharDevice {
Contributor

just realized this can be simplified too:

return (info.Mode() & os.ModeCharDevice) == os.ModeCharDevice

the if isn't necessary here

Contributor Author

ok, simplified

require.Contains(t, fmt.Sprintf("%s", err), "bucket") // failed to retrieve buckets

// validation: no such bucket found
lineData = lineData[:0]
Contributor

are these sub tests? it's easier to read, but it feels like a lot of coupling between assertions. Can this be split up into sub tests?

t.Run("validates no bucket found", func(t *testing.T) {
	lineData = lineData[:0]
	command := cmdWrite(&globalFlags{}, genericCLIOpts{w: ioutil.Discard})
	// note: the my-empty-org parameter causes the test server to return no buckets
	command.SetArgs([]string{"--format", "csv", "--org", "my-empty-org", "--bucket", "my-bucket"})
	err := command.Execute()
	require.Contains(t, fmt.Sprintf("%s", err), "bucket") // no such bucket found
})

same goes for a lot of these assertions here. They look like sub tests to me but aren't being treated as such. Please correct me if that assumption is wrong 🤔

Contributor Author

@sranka sranka Apr 11, 2020

Yes, it looks better when changed to sub-tests

Contributor

@rbetts rbetts left a comment

I have some concerns about this PR. We like the user experience it creates, but we aren't comfortable accepting this PR as-is because of concerns around complexity and maintainability.

I'd like to see this PR broken into at least two pieces:

  1. A standalone CSV to line-protocol conversion library. This should be a reusable library that we can use uniformly in other areas where we want to convert CSV to line protocol.

  2. A PR that then uses that library's public API from the CLI command tool to accomplish this CLI workflow.

Separately - I'm curious if there are existing parsers that accomplish the parsing work done here. I don't really want to take on the long term maintenance costs of CSV parsing features. I'm not familiar with what's available in the go language space... But I'm curious if there are third-party CSV processors that would accomplish more with less custom code than is used here.

@sranka sranka force-pushed the 17004/writeFromCsv branch from ed05c38 to 00e31e8 Compare April 11, 2020 11:00
@sranka
Contributor Author

sranka commented Apr 11, 2020

@rbetts I've seen only a few Go CSV parser extensions that can do (un)marshaling into structs with a pre-defined layout. We have a different situation here:

  • the layout is driven by CSV data itself with the help of annotations and/or column headers
  • a CSV with variable row size (such as a result of a flux query) is supported and can be used on input
  • the code should be optimized to transform to line protocol and recognize CSV data annotations that are AFAIK used only in InfluxDB
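
To illustrate the data-driven layout, here is a small annotated CSV input of the kind the converter recognizes (a sketch modeled on the examples in the linked README; check there for the exact annotation syntax):

```csv
#datatype measurement,tag,double,dateTime:RFC3339
m,host,used_percent,time
mem,host1,64.23,2020-01-01T00:00:00Z
```

The #datatype annotation assigns a line-protocol role to each column, so the data row above would convert to roughly: mem,host=host1 used_percent=64.23 1577836800000000000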

@russorat Can you please comment on the requested creation of a CSV -> LP conversion library? I mean, there are no technical obstacles to doing so once you agree and provide a new influxdata git repo.

@sranka sranka requested a review from jsteenb2 April 14, 2020 06:21
@jsteenb2
Contributor

@sranka you should be able to put the parser under the pkg directory in the monorepo here in influxdb. If we need to pull it out, we can always do so.

@sranka sranka force-pushed the 17004/writeFromCsv branch from 7c1157d to 764552e Compare April 15, 2020 06:20
@sranka
Contributor Author

sranka commented Apr 15, 2020

@jsteenb2 ok, "csv to line protocol conversion" is now in pkg/csv2lp

@russorat russorat requested a review from rbetts April 15, 2020 16:32
@jsteenb2
Contributor

@sranka, we'd like to see this PR broken up and submitted incrementally. First PR for the csv converter and another for the CLI. We'll have different stakeholders reviewing for the csv PR.

@sranka
Contributor Author

sranka commented Apr 15, 2020

@jsteenb2 thanks for letting me know what is required.
