-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support CSV external data #397
Conversation
|
||
sb.append("</root>\n"); | ||
|
||
stream = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we like this approach, we can change this class so it doesn’t read the whole file into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is that this is reading all bytes from the original input stream to produce an XML document and then, that XML document is parsed (effectively reading the contents twice).
Since each line represents an item, maybe we could manually produce one TreeElement
for each row which could be added to a root element, and then return it. I think that this could be more memory and time efficient. It could be even faster than the XML loader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I was thinking about: https://github.com/ggalmazor/javarosa/commit/cd7c991f51e2cf5d56264fd004c30554a7b104d2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re a clever one!
@ggalmazor, thanks for that excellent code. It’s here now. We now have a conflict, given that file-csv seems to have special meaning in Collect, and that previously JavaRosa ignored it. Now we have this test failing: |
The commits so far are looking good ;) To solve the issue with that test, one would have to set up the ReferenceManager before parsing the form. I've prepared two commits on top of your PR to show how I'd go about it: |
Codecov Report
@@ Coverage Diff @@
## master #397 +/- ##
===========================================
+ Coverage 48.66% 48.7% +0.03%
- Complexity 2896 2900 +4
===========================================
Files 239 241 +2
Lines 13569 13591 +22
Branches 2628 2632 +4
===========================================
+ Hits 6603 6619 +16
- Misses 6127 6130 +3
- Partials 839 842 +3
Continue to review full report at Codecov.
|
@ggalmazor thanks for the coded solution. I cherry picked, and made a small correction (setUp instead of setup—phrasal verb, two words, to camel case) to your second commit. Was the first commit just formatting changes? I didn’t take it because it loses all my hand-beautifications. What’s the next step for this project? |
Thanks, @dcbriccetti! Sorry for removing the manual code formatting. It's not me. It's IntelliJ who doesn't like it :) My first commit extracted
Once that's done, I think this would have to be tested on Collect. It's looking good :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @dcbriccetti!
I'd just like to ask you to extract the ReferenceManagerTest.buildReferenceFactory to ReferenceManagerTestUtils and reuse it elsewhere.
Other than that, this is looking really good. I'm excited about doing some benchmarks to put the external XML and CSV itemsets face to face :D
Transform each line in the CSV file directly into a TreeElement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @dcbriccetti!
And apologies for taking so much time to review the last change!
Thanks. Perhaps we’d like to meet again to discuss next steps. |
Sure, let's talk it over Slack and touch base. |
FormDef formDef = parse(r("Sample-Preloading.xml")); | ||
// The form on this test uses a jr://file-csv resource. | ||
// We need to prime the ReferenceManager to deal with those | ||
Path form = r("Sample-Preloading.xml"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcbriccetti @ggalmazor This particular form doesn't actually use the instance loaded by file-csv
because it relies on the Collect-only database-backed implementation of the search()
appearance/function: https://github.com/opendatakit/collect/blob/d52ce0bfc63a11fb7f3fc62e0c0fd7c58684527a/collect_app/src/main/java/org/odk/collect/android/external/handler/ExternalDataHandlerSearch.java. You'll see the form doesn't actually use the instance. It's not a big deal since this test only verifies that the form can be parsed but it would probably be worth changing to avoid confusion.
I would also recommend avoiding the "preload' word unless referring to https://opendatakit.github.io/xforms-spec/#preload-attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should file an issue for this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose removing that test in #452
It looks like this was never tried in Collect, right? It interferes with one of the existing client-only database-backed external CSV implementations as discovered in #415. It also looks like to get this to work, a Collect change would be needed, at the very least to add a translator for Since we definitely do not want to parse a CSV at the JR level if one of the other database-backed implementations is used (for performance), I think we should disable this support for now. I'm not really coming up with a great way to only parse the CSV if the instance is actually referred to in the form definition. That is, https://github.com/opendatakit/javarosa/blob/1547174f521c47335cc4d87dc5c6638ef93ea570/resources/Sample-Preloading.xml defines |
OK, some thoughts and questions that I have about this issue:
Well, it turns out that (when transforming xlsforms to xforms) the After studying the code that Collect uses to implement the client-only database-backed external CSV implementations, it looks like:
The fact that the Then, we should decide if we want to do something about Collect pre-loading any and all csvs in the media folder into client-side databases, which defeats the purpose of JR parsing them in the first place, but doesn't comply with the requirements we discussed when talking about external secondary instances e.g. can't use them with Regarding QA testing, we decided to merge this into master to release a SNAPSHOT and then eventually test it on Collect. We deemed that to be a safe move because the new "hostname" |
I think I addressed your questions/concerns in #417, @ggalmazor! We can continue the discussion there when it comes to figuring out what to actually do about it. |
Closes #
Trying out the idea here: getodk/xforms-spec#88, with an optimization from @ggalmazor to skip the XML parsing altogether.
What has been done to verify that this works as intended?
Why is this the best possible solution? Were any other approaches considered?
Are there any risks to merging this code? If so, what are they?