Support CSV external data #397

dcbriccetti · 2019-01-06T22:57:07Z

Closes #

Trying out the idea here: getodk/xforms-spec#88, with an optimization from @ggalmazor to skip the XML parsing altogether.

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

Are there any risks to merging this code? If so, what are they?

dcbriccetti · 2019-01-06T22:59:31Z

src/org/javarosa/core/model/instance/utils/CsvAsXmlInputStream.java

+
+        sb.append("</root>\n");
+
+        stream = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));


If we like this approach, we can change this class so it doesn’t read the whole file into memory.

My only concern is that this reads all bytes from the original input stream to produce an XML document, and then that XML document is parsed (effectively reading the contents twice).

Since each line represents an item, maybe we could manually produce one TreeElement for each row which could be added to a root element, and then return it. I think that this could be more memory and time efficient. It could be even faster than the XML loader.

This is what I was thinking about: https://github.com/ggalmazor/javarosa/commit/cd7c991f51e2cf5d56264fd004c30554a7b104d2
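
For readers following along, here is a minimal sketch of the row-to-element idea described above. It is not the code from the linked commit. `SketchNode` is a hypothetical stand-in for JavaRosa's `TreeElement`, the `root`/`item` element names are assumptions, and the parser assumes a header row and plain comma-separated fields with no quoting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tree element; kept minimal so the sketch is self-contained.
final class SketchNode {
    final String name;
    final String value; // null for non-leaf nodes
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

final class CsvToTreeSketch {
    // Reads the CSV one line at a time and builds root > item > column nodes directly,
    // without producing an intermediate XML string.
    // Assumptions: the first line is a header row; fields have no quotes or embedded commas.
    static SketchNode parse(BufferedReader reader) {
        SketchNode root = new SketchNode("root", null);
        try {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return root; // empty input: no items
            }
            String[] columns = headerLine.split(",", -1);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split(",", -1);
                SketchNode item = new SketchNode("item", null);
                for (int i = 0; i < columns.length; i++) {
                    // Missing trailing fields become empty strings.
                    String value = i < fields.length ? fields[i] : "";
                    item.children.add(new SketchNode(columns[i].trim(), value.trim()));
                }
                root.children.add(item);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return root;
    }

    public static void main(String[] args) {
        String csv = "label,name\nApple,a\nBanana,b\n";
        SketchNode root = parse(new BufferedReader(new StringReader(csv)));
        for (SketchNode item : root.children) {
            System.out.println(item.children.get(0).value + " -> " + item.children.get(1).value);
        }
    }
}
```

Only one row is held as text at any time, so memory use is dominated by the resulting tree rather than by a second copy of the file contents.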

You’re a clever one!

dcbriccetti · 2019-01-08T19:02:32Z

@ggalmazor, thanks for that excellent code. It’s here now.

We now have a conflict, given that file-csv seems to have special meaning in Collect, and that previously JavaRosa ignored it.

Now we have this test failing:
public void parsesPreloadForm() throws IOException

ggalmazor · 2019-01-14T08:34:24Z

The commits so far are looking good ;)

To solve the issue with that test, one would have to set up the ReferenceManager before parsing the form. I've prepared two commits on top of your PR to show how I'd go about it:

codecov-io · 2019-01-27T00:04:11Z

Codecov Report

Merging #397 into master will increase coverage by 0.03%.
The diff coverage is 84.61%.

@@             Coverage Diff             @@
##             master    #397      +/-   ##
===========================================
+ Coverage     48.66%   48.7%   +0.03%     
- Complexity     2896    2900       +4     
===========================================
  Files           239     241       +2     
  Lines         13569   13591      +22     
  Branches       2628    2632       +4     
===========================================
+ Hits           6603    6619      +16     
- Misses         6127    6130       +3     
- Partials        839     842       +3

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| src/org/javarosa/xform/parse/XFormParser.java | `64.55% <0%> (-0.1%)` | `229 <0> (-1)` | |
| ...rosa/core/model/instance/ExternalDataInstance.java | `82.35% <100%> (ø)` | `12 <2> (+1)` | ⬆️ |
| ...arosa/core/model/instance/XmlExternalInstance.java | `80% <80%> (ø)` | `1 <1> (?)` | |
| ...arosa/core/model/instance/CsvExternalInstance.java | `88.23% <88.23%> (ø)` | `3 <3> (?)` | |
| ...c/org/javarosa/core/services/PrototypeManager.java | `79.16% <0%> (-8.34%)` | `8% <0%> (-1%)` | |
| ...varosa/core/model/condition/EvaluationContext.java | `70.39% <0%> (-0.66%)` | `38% <0%> (-1%)` | |
| src/org/javarosa/core/model/utils/DateUtils.java | `56.03% <0%> (+0.28%)` | `73% <0%> (+2%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36c5fb7...dd13694.

dcbriccetti · 2019-01-27T00:12:30Z

@ggalmazor thanks for the coded solution. I cherry picked, and made a small correction (setUp instead of setup—phrasal verb, two words, to camel case) to your second commit. Was the first commit just formatting changes? I didn’t take it because it loses all my hand-beautifications.

What’s the next step for this project?

ggalmazor · 2019-01-28T07:36:36Z

Thanks, @dcbriccetti!

Sorry for removing the manual code formatting. It's not me. It's IntelliJ who doesn't like it :)

My first commit extracted ReferenceManagerTest.buildReferenceFactory to ReferenceManagerTestUtils and reused it elsewhere. I see that this is still pending on XFormParserTest and ReferenceManagerTest in this test.

> What’s the next step for this project?

Once that's done, I think this would have to be tested on Collect.

It's looking good :)

ggalmazor

Thanks, @dcbriccetti!

I'd just like to ask you to extract the ReferenceManagerTest.buildReferenceFactory to ReferenceManagerTestUtils and reuse it elsewhere.

Other than that, this is looking really good. I'm excited about doing some benchmarks to put the external XML and CSV itemsets face to face :D

Transform each line in the CSV file directly into a TreeElement.

ggalmazor

Thanks, @dcbriccetti!

And apologies for taking so much time to review the last change!

dcbriccetti · 2019-02-13T05:27:58Z

Thanks. Perhaps we’d like to meet again to discuss next steps.

ggalmazor · 2019-02-13T09:37:33Z

Sure, let's talk it over on Slack and touch base.

lognaturel · 2019-03-29T00:44:50Z

test/org/javarosa/xform/parse/XFormParserTest.java

-        FormDef formDef = parse(r("Sample-Preloading.xml"));
+        // The form on this test uses a jr://file-csv resource.
+        // We need to prime the ReferenceManager to deal with those
+        Path form = r("Sample-Preloading.xml");


@dcbriccetti @ggalmazor This particular form doesn't actually use the instance loaded by file-csv because it relies on the Collect-only database-backed implementation of the search() appearance/function: https://github.com/opendatakit/collect/blob/d52ce0bfc63a11fb7f3fc62e0c0fd7c58684527a/collect_app/src/main/java/org/odk/collect/android/external/handler/ExternalDataHandlerSearch.java. You'll see the form doesn't actually use the instance. It's not a big deal since this test only verifies that the form can be parsed but it would probably be worth changing to avoid confusion.

I would also recommend avoiding the word "preload" unless referring to https://opendatakit.github.io/xforms-spec/#preload-attributes

We should file an issue for this change

I propose removing that test in #452

lognaturel · 2019-03-29T01:00:34Z

It looks like this was never tried in Collect, right? It interferes with one of the existing client-only database-backed external CSV implementations as discovered in #415. It also looks like to get this to work, a Collect change would be needed, at the very least to add a translator for jr://file-csv.
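
To illustrate what "a translator for jr://file-csv" means in practice, here is a small self-contained sketch of prefix translation. It is not JavaRosa's or Collect's actual translator class. `PrefixTranslatorSketch` and the target path `jr://file/forms/myform-media/` are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of URI prefix translation; not the real JavaRosa API.
final class PrefixTranslatorSketch {
    // Insertion-ordered so the first registered matching prefix wins.
    private final Map<String, String> translations = new LinkedHashMap<>();

    void addTranslation(String prefix, String translatedPrefix) {
        translations.put(prefix, translatedPrefix);
    }

    // Rewrites the URI if it starts with a registered prefix; otherwise returns it unchanged.
    String translate(String uri) {
        for (Map.Entry<String, String> entry : translations.entrySet()) {
            String prefix = entry.getKey();
            if (uri.startsWith(prefix)) {
                return entry.getValue() + uri.substring(prefix.length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        PrefixTranslatorSketch translator = new PrefixTranslatorSketch();
        // Assumed mapping: the client decides where the form's media files live.
        translator.addTranslation("jr://file-csv/", "jr://file/forms/myform-media/");
        System.out.println(translator.translate("jr://file-csv/hhplotdetails.csv"));
    }
}
```

Without such a mapping registered by the client, a `jr://file-csv/...` reference has nothing to resolve against, which is why a client-side change would be needed.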

Since we definitely do not want to parse a CSV at the JR level if one of the other database-backed implementations is used (for performance), I think we should disable this support for now. I'm not really coming up with a great way to only parse the CSV if the instance is actually referred to in the form definition. That is, https://github.com/opendatakit/javarosa/blob/1547174f521c47335cc4d87dc5c6638ef93ea570/resources/Sample-Preloading.xml defines <instance id="hhplotdetails" src="jr://file-csv/hhplotdetails.csv"/> but actually never uses instance("hhplotdetails"). Instead, it uses pulldata and search. This forum post has more context.
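
One possible direction for parsing only when the instance is actually referenced is to defer loading until first access. The sketch below shows that pattern in isolation. It is an assumption about how this could be approached, not something this PR implements, and `LazyInstanceSketch` is a hypothetical name.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder: the loader (for example, a CSV parse) runs only on first access.
final class LazyInstanceSketch<T> {
    private final Supplier<T> loader;
    private T root;
    private boolean loaded;

    LazyInstanceSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader at most once, then returns the cached result.
    synchronized T getRoot() {
        if (!loaded) {
            root = loader.get();
            loaded = true;
        }
        return root;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        LazyInstanceSketch<String> instance =
            new LazyInstanceSketch<>(() -> "parsed contents of hhplotdetails.csv");
        System.out.println("loaded before access: " + instance.isLoaded());
        instance.getRoot();
        System.out.println("loaded after access: " + instance.isLoaded());
    }
}
```

With this shape, a form that declares the instance but only uses pulldata or search would never trigger the parse, because nothing calls `getRoot()`.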

ggalmazor · 2019-03-29T12:16:39Z

OK, some thoughts and questions that I have about this issue:

~~There's something I'm missing for sure, but what's the relation between jr://file-csv and the client-only database-backed external CSV implementations? I mean:~~

> Since we definitely do not want to parse a CSV at the JR level if one of the other database-backed implementations is used (for performance), I think we should disable this support for now

I believe this would be a mistake in the form's design. Using the pulldata/search solution in xlsforms doesn't create an xform that includes a secondary external instance using file-csv. I don't see what's the relation here (I'm probably missing something).

Well, it turns out that (when transforming xlsforms to xforms) the search appearance won't create a secondary external instance with file-csv, but pulldata() will.

After studying the code that Collect uses to implement the client-only database-backed external CSV implementations, it looks like:

1. It will try to create a db from any csv file present in the media folder, regardless of them being declared as external secondary instances or not.
2. It loads a couple of external function handlers for the pulldata function and search appearance that read data from the dbs created in step 1.

The fact that the search appearance will work without defining any external secondary instance with jr://file-csv means that we could make xlsform not append a secondary external instance that uses jr://file-csv when a form uses the pulldata function, so that the file-csv "hostname" is free for JR and there's no interference.

Then, we should decide whether to do something about Collect pre-loading every CSV in the media folder into client-side databases. That defeats the purpose of JR parsing them in the first place, and it doesn't comply with the requirements we discussed when talking about external secondary instances (for example, the data can't be used with instance() and XPath).

Regarding QA testing, we decided to merge this into master to release a SNAPSHOT and then eventually test it on Collect. We deemed that to be a safe move because the new "hostname" file-csv acts effectively as a feature toggle. Unfortunately, it looks like we didn't detect the bug our changes introduced in the existing "hostname" jr://file :(.

lognaturel · 2019-03-29T17:55:51Z

I think I addressed your questions/concerns in #417, @ggalmazor! We can continue the discussion there when it comes to figuring out what to actually do about it.

lognaturel · 2019-06-28T22:05:42Z

#452 has a proposed fix for #417 and other various little issues related to CSV data files.

dcbriccetti requested a review from ggalmazor January 6, 2019 22:57

ggalmazor mentioned this pull request Jan 8, 2019

Issue 390 external instances #394

Merged

dcbriccetti changed the title ~~WIP: Support CSV external data as jr://file-csv, via a CSV to XML adapter~~ WIP: Support CSV external data Jan 8, 2019

ggalmazor suggested changes Feb 4, 2019

Support CSV external data as jr://file-csv.

dd13694

Transform each line in the CSV file directly into a TreeElement.

ggalmazor approved these changes Feb 12, 2019

ggalmazor merged commit 0fa92f5 into getodk:master Feb 12, 2019

ggalmazor changed the title ~~WIP: Support CSV external data~~ Support CSV external data Feb 12, 2019

ggalmazor added this to the v2.14.0 milestone Feb 12, 2019

ggalmazor mentioned this pull request Mar 12, 2019

Support csv-external for defining an external instance not necessarily used for a select XLSForm/pyxform#271

Closed

lognaturel reviewed Mar 29, 2019

lognaturel mentioned this pull request Mar 29, 2019

Fixed the problem with parsing instances #416

Merged

lognaturel mentioned this pull request Mar 29, 2019

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

Closed

lognaturel mentioned this pull request Apr 11, 2019

In the test for CSV secondary instance support, use a form that refers to the instance #421

Closed

lognaturel mentioned this pull request Apr 5, 2021

Support quotes, escaped commas, and other CSV formatting in external CSV files #616

Closed

dcbriccetti deleted the external-csv branch April 5, 2021 22:25
