
Support CSV external data #397

Merged
merged 1 commit into from
Feb 12, 2019


dcbriccetti
Contributor

@dcbriccetti dcbriccetti commented Jan 6, 2019

Closes #

Trying out the idea here: getodk/xforms-spec#88, with an optimization from @ggalmazor to skip the XML parsing altogether.

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

Are there any risks to merging this code? If so, what are they?

@dcbriccetti dcbriccetti requested a review from ggalmazor January 6, 2019 22:57

sb.append("</root>\n");

stream = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));
Contributor Author

If we like this approach, we can change this class so it doesn’t read the whole file into memory.

Contributor

My only concern is that this reads all the bytes from the original input stream to produce an XML document, and then that XML document is parsed, so the contents are effectively read twice.

Since each line represents an item, maybe we could manually produce one TreeElement for each row which could be added to a root element, and then return it. I think that this could be more memory and time efficient. It could be even faster than the XML loader.
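The row-per-item idea can be sketched without any XML step. This is only an illustration under stated assumptions: `Node` is a hypothetical stand-in for a tree node (it is not JavaRosa's TreeElement), the element names `root` and `item` are assumed, and the comma split is naive and does not handle quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CsvToTree {
    /** Hypothetical stand-in for a tree node; not JavaRosa's TreeElement. */
    static final class Node {
        final String name;
        final String value; // null for container nodes
        final List<Node> children = new ArrayList<>();

        Node(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /**
     * Reads the CSV line by line and adds one "item" node per data row
     * to a "root" node, so the contents are read only once.
     * Naive comma split: quoted fields are not handled.
     */
    static Node parse(Reader csv) throws IOException {
        Node root = new Node("root", null);
        try (BufferedReader reader = new BufferedReader(csv)) {
            String header = reader.readLine();
            if (header == null) {
                return root; // empty input: no columns, no items
            }
            String[] columns = header.split(",", -1);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cells = line.split(",", -1);
                Node item = new Node("item", null);
                // One child per column, named after the header cell
                for (int i = 0; i < columns.length && i < cells.length; i++) {
                    item.children.add(new Node(columns[i], cells[i]));
                }
                root.children.add(item);
            }
        }
        return root;
    }

    public static void main(String[] args) throws IOException {
        Node root = parse(new StringReader("name,label\na,Apple\nb,Banana\n"));
        for (Node item : root.children) {
            System.out.println(item.children.get(0).value + " -> " + item.children.get(1).value);
        }
    }
}
```

Because each row becomes a node as soon as it is read, no intermediate XML string or second parse is needed.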

Contributor

Contributor Author

You’re a clever one!

@dcbriccetti
Contributor Author

@ggalmazor, thanks for that excellent code. It’s here now.

We now have a conflict, given that file-csv seems to have special meaning in Collect, and that previously JavaRosa ignored it.

Now we have this test failing:
public void parsesPreloadForm() throws IOException

@dcbriccetti dcbriccetti changed the title WIP: Support CSV external data as jr://file-csv, via a CSV to XML adapter WIP: Support CSV external data Jan 8, 2019
@ggalmazor
Contributor

The commits so far are looking good ;)

To solve the issue with that test, one would have to set up the ReferenceManager before parsing the form. I've prepared two commits on top of your PR to show how I'd go about it:
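The ordering problem (references must be resolvable before the form is parsed) can be illustrated with a small stand-in. This is a sketch under assumptions, not the commits mentioned above: `Registry`, `register`, and `resolve` are hypothetical names and do not reflect JavaRosa's ReferenceManager API, and the `resources` directory is only an example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReferenceSetupSketch {
    /** Hypothetical registry; JavaRosa's ReferenceManager has its own, different API. */
    static final class Registry {
        private final Map<String, Path> roots = new LinkedHashMap<>();

        /** Maps a URI prefix such as "jr://file-csv/" to a local directory. */
        void register(String prefix, Path localDir) {
            roots.put(prefix, localDir);
        }

        /** Resolves a URI to a local path, failing if no prefix was registered for it. */
        Path resolve(String uri) {
            for (Map.Entry<String, Path> entry : roots.entrySet()) {
                if (uri.startsWith(entry.getKey())) {
                    return entry.getValue().resolve(uri.substring(entry.getKey().length()));
                }
            }
            throw new IllegalStateException("No handler registered for " + uri);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // Set up the registry first...
        registry.register("jr://file-csv/", Paths.get("resources"));
        // ...so that resolving the instance's src during "parsing" succeeds.
        System.out.println(registry.resolve("jr://file-csv/hhplotdetails.csv"));
    }
}
```

Calling `resolve` before `register` throws, which mirrors why a test that parses a form with a jr://file-csv instance needs its setup to run first.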

@codecov-io

codecov-io commented Jan 27, 2019

Codecov Report

Merging #397 into master will increase coverage by 0.03%.
The diff coverage is 84.61%.


@@             Coverage Diff             @@
##             master    #397      +/-   ##
===========================================
+ Coverage     48.66%   48.7%   +0.03%     
- Complexity     2896    2900       +4     
===========================================
  Files           239     241       +2     
  Lines         13569   13591      +22     
  Branches       2628    2632       +4     
===========================================
+ Hits           6603    6619      +16     
- Misses         6127    6130       +3     
- Partials        839     842       +3
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| src/org/javarosa/xform/parse/XFormParser.java | 64.55% <0%> (-0.1%) | 229 <0> (-1) | |
| ...rosa/core/model/instance/ExternalDataInstance.java | 82.35% <100%> (ø) | 12 <2> (+1) | ⬆️ |
| ...arosa/core/model/instance/XmlExternalInstance.java | 80% <80%> (ø) | 1 <1> (?) | |
| ...arosa/core/model/instance/CsvExternalInstance.java | 88.23% <88.23%> (ø) | 3 <3> (?) | |
| ...c/org/javarosa/core/services/PrototypeManager.java | 79.16% <0%> (-8.34%) | 8% <0%> (-1%) | |
| ...varosa/core/model/condition/EvaluationContext.java | 70.39% <0%> (-0.66%) | 38% <0%> (-1%) | |
| src/org/javarosa/core/model/utils/DateUtils.java | 56.03% <0%> (+0.28%) | 73% <0%> (+2%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36c5fb7...dd13694.

@dcbriccetti
Contributor Author

@ggalmazor thanks for the coded solution. I cherry-picked your second commit and made a small correction to it (setUp instead of setup: it's a phrasal verb, two words, so camel case). Was the first commit just formatting changes? I didn’t take it because it loses all my hand-beautifications.

What’s the next step for this project?

@ggalmazor
Contributor

ggalmazor commented Jan 28, 2019

Thanks, @dcbriccetti!

Sorry for removing the manual code formatting. It's not me. It's IntelliJ who doesn't like it :)

My first commit extracted ReferenceManagerTest.buildReferenceFactory to ReferenceManagerTestUtils and reused it elsewhere. I see that this is still pending on XFormParserTest and ReferenceManagerTest in this test.

> What’s the next step for this project?

Once that's done, I think this would have to be tested on Collect.

It's looking good :)

Contributor

@ggalmazor ggalmazor left a comment

Thanks, @dcbriccetti!

I'd just like to ask you to extract the ReferenceManagerTest.buildReferenceFactory to ReferenceManagerTestUtils and reuse it elsewhere.

Other than that, this is looking really good. I'm excited about doing some benchmarks to put the external XML and CSV itemsets face to face :D

Transform each line in the CSV file directly into a TreeElement.
Contributor

@ggalmazor ggalmazor left a comment

Thanks, @dcbriccetti!

And apologies for taking so much time to review the last change!

@ggalmazor ggalmazor merged commit 0fa92f5 into getodk:master Feb 12, 2019
@ggalmazor ggalmazor changed the title WIP: Support CSV external data Support CSV external data Feb 12, 2019
@ggalmazor ggalmazor added this to the v2.14.0 milestone Feb 12, 2019
@dcbriccetti
Contributor Author

Thanks. Perhaps we’d like to meet again to discuss next steps.

@ggalmazor
Contributor

Sure, let's talk it over on Slack and touch base.

FormDef formDef = parse(r("Sample-Preloading.xml"));
// The form on this test uses a jr://file-csv resource.
// We need to prime the ReferenceManager to deal with those
Path form = r("Sample-Preloading.xml");
Member

@dcbriccetti @ggalmazor This particular form doesn't actually use the instance loaded by file-csv because it relies on the Collect-only database-backed implementation of the search() appearance/function: https://github.com/opendatakit/collect/blob/d52ce0bfc63a11fb7f3fc62e0c0fd7c58684527a/collect_app/src/main/java/org/odk/collect/android/external/handler/ExternalDataHandlerSearch.java. You'll see the form doesn't actually use the instance. It's not a big deal since this test only verifies that the form can be parsed but it would probably be worth changing to avoid confusion.

I would also recommend avoiding the word "preload" unless referring to https://opendatakit.github.io/xforms-spec/#preload-attributes

Contributor

We should file an issue for this change

Member

I propose removing that test in #452

@lognaturel
Member

It looks like this was never tried in Collect, right? It interferes with one of the existing client-only database-backed external CSV implementations as discovered in #415. It also looks like to get this to work, a Collect change would be needed, at the very least to add a translator for jr://file-csv.

Since we definitely do not want to parse a CSV at the JR level if one of the other database-backed implementations is used (for performance), I think we should disable this support for now. I'm not really coming up with a great way to only parse the CSV if the instance is actually referred to in the form definition. That is, https://github.com/opendatakit/javarosa/blob/1547174f521c47335cc4d87dc5c6638ef93ea570/resources/Sample-Preloading.xml defines <instance id="hhplotdetails" src="jr://file-csv/hhplotdetails.csv"/> but actually never uses instance("hhplotdetails"). Instead, it uses pulldata and search. This forum post has more context.
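The idea of parsing only when the instance is actually used can be illustrated with a memoizing wrapper. This is a generic sketch, not something proposed in this thread: `Lazy` and `LazyInstanceSketch` are hypothetical names, and it assumes the parser could hand the form a deferred instance and that every `instance("hhplotdetails")` evaluation would go through `get()`, which this thread does not establish for JavaRosa.

```java
import java.util.function.Supplier;

public class LazyInstanceSketch {
    /** Runs the loader on the first get() only and caches the result. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public synchronized T get() {
            if (!loaded) {
                value = loader.get(); // the expensive CSV parse would happen here
                loaded = true;
            }
            return value;
        }

        synchronized boolean isLoaded() {
            return loaded;
        }
    }

    public static void main(String[] args) {
        // Declaring the instance costs nothing until something reads it.
        Lazy<String> instance = new Lazy<>(() -> "parsed hhplotdetails.csv");
        System.out.println(instance.isLoaded()); // false
        instance.get();
        System.out.println(instance.isLoaded()); // true
    }
}
```

A form that declares the instance but only uses pulldata and search would never call `get()`, so under these assumptions the CSV would never be parsed.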

@ggalmazor
Contributor

ggalmazor commented Mar 29, 2019

OK, some thoughts and questions that I have about this issue:

There's something I'm missing for sure, but what's the relation between jr://file-csv and the client-only database-backed external CSV implementations? I mean:

> Since we definitely do not want to parse a CSV at the JR level if one of the other database-backed implementations is used (for performance), I think we should disable this support for now

I believe this would be a mistake in the form's design. Using the pulldata/search solution in xlsforms doesn't create an xform that includes a secondary external instance using file-csv. I don't see what the relation is here (I'm probably missing something).

Well, it turns out that (when transforming xlsforms to xforms) the search appearance won't create a secondary external instance with file-csv, but pulldata() will.

After studying the code that Collect uses to implement the client-only database-backed external CSV implementations, it looks like:

  1. It will try to create a db from any csv file present in the media folder, regardless of them being declared as external secondary instances or not.
  2. It loads a couple of external function handlers for the pulldata function and search appearance that read data from the dbs created in step 1.
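The two steps above can be sketched as follows. This is only a model of the behaviour described, not Collect's code: it uses in-memory maps instead of a database, `MediaFolderIndexSketch`, `indexAll`, and `lookup` are hypothetical names, and the comma split is naive (no quoted fields).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MediaFolderIndexSketch {
    // File name -> rows; each row maps a column name to its value.
    private final Map<String, List<Map<String, String>>> tables = new HashMap<>();

    /** Step 1: index every *.csv in the folder, whether or not a form declares it. */
    void indexAll(Path mediaFolder) throws IOException {
        try (DirectoryStream<Path> csvFiles = Files.newDirectoryStream(mediaFolder, "*.csv")) {
            for (Path file : csvFiles) {
                List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
                List<Map<String, String>> rows = new ArrayList<>();
                if (!lines.isEmpty()) {
                    String[] columns = lines.get(0).split(",", -1);
                    for (String line : lines.subList(1, lines.size())) {
                        if (line.isEmpty()) {
                            continue; // skip blank lines
                        }
                        String[] cells = line.split(",", -1);
                        Map<String, String> row = new HashMap<>();
                        for (int i = 0; i < columns.length && i < cells.length; i++) {
                            row.put(columns[i], cells[i]);
                        }
                        rows.add(row);
                    }
                }
                tables.put(file.getFileName().toString(), rows);
            }
        }
    }

    /** Step 2: handler-style lookup against the index; returns "" when nothing matches. */
    String lookup(String fileName, String wantedColumn, String keyColumn, String keyValue) {
        List<Map<String, String>> rows =
                tables.getOrDefault(fileName, Collections.<Map<String, String>>emptyList());
        for (Map<String, String> row : rows) {
            if (keyValue.equals(row.get(keyColumn))) {
                return row.getOrDefault(wantedColumn, "");
            }
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("media");
        Files.write(dir.resolve("plots.csv"), "id,owner\n1,Ana\n2,Ben\n".getBytes(StandardCharsets.UTF_8));
        MediaFolderIndexSketch index = new MediaFolderIndexSketch();
        index.indexAll(dir);
        System.out.println(index.lookup("plots.csv", "owner", "id", "2"));
    }
}
```

Note that `indexAll` never consults the form definition, which is the point made in step 1: the lookup works for any CSV in the folder, declared as an external secondary instance or not.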

The fact that the search appearance will work without defining any external secondary instance with jr://file-csv means that we could make xlsform not append a secondary external instance that uses jr://file-csv when a form uses the pulldata function, so that the file-csv "hostname" is free for JR and there's no interference.

Then we should decide whether to do something about Collect pre-loading any and all CSVs in the media folder into client-side databases. That defeats the purpose of JR parsing them in the first place, and it also doesn't comply with the requirements we discussed when talking about external secondary instances: for example, the data can't be used with instance() and XPath.


Regarding QA testing, we decided to merge this into master to release a SNAPSHOT and then eventually test it on Collect. We deemed that to be a safe move because the new "hostname" file-csv acts effectively as a feature toggle. Unfortunately, it looks like we didn't detect the bug our changes introduced in the existing "hostname" jr://file :(.

@lognaturel
Member

I think I addressed your questions/concerns in #417, @ggalmazor! We can continue the discussion there when it comes to figuring out what to actually do about it.

@lognaturel
Member

#452 has a proposed fix for #417 and other various little issues related to CSV data files.


4 participants