Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

lognaturel · 2019-03-29T17:52:46Z

#397 added support for parsing a CSV as a secondary instance. It was then disabled in #416 because it broke search/pulldata in Collect.

We will need to figure out a way to add this support back in without interfering with the existing implementations. These may eventually be deprecated but since they solve a real problem and are in broad use, there needs to be an acceptable, suitably performant alternative in place first. Even if/when they are officially marked for deprecation, I imagine they will stay in for quite a while.

@ggalmazor rightfully pointed out here that the Collect implementation of search/pulldata doesn't use the instance declared in the form. pyxform adds that instance declaration when pulldata is used (XLSForm/pyxform#43). This is because clients that do things in a spec-compliant way (e.g. Enketo), need to know what instance to load. Collect just reads any CSV attached with the form into a database ("magic").

Since the pulldata function can be thought of as an abstraction on top of XPath queries (kind of like indexed-repeat), it's easy for clients that do load external secondary CSV instances to implement, hence the instance declaration added by pyxform so they can support it. For that reason, I think it will likely become an official part of the spec. search is strange all around, starting from it being an appearance.

Collect search/pulldata are very performant. It's important that the extra overhead of loading the CSV instance doesn't occur if the user intends to use those implementations. That is, if the form has the following instance declaration: <instance id="foo" src="jr://file-csv/foo.csv" />, the CSV should only actually be loaded if the form definition includes instance("foo"). I'm not immediately seeing a way to do that since secondary instances are parsed before the main instance. Maybe secondary instances could be "lazy loaded" the first time a reference to them is seen in the main instance instead of being parsed upfront?

There's an additional issue uncovered by #397: jr://file is a substring of jr://file-csv and the connector matching is done by looking for a prefix. I think this could be fixed by looking for the connector followed by a / or something like that.

The text was updated successfully, but these errors were encountered:

admbtlr · 2019-04-16T10:07:21Z

I'm starting to wonder whether it might not make sense to hold back on fixing this with a workaround until we can do it "properly", i.e. by making pulldata part of Javarosa, and working out how we can allow implementations to decide whether to use a database if they want to.

Or am I putting the cart before the horse? Is the requirement to get this working urgent enough to warrant an interim solution?

admbtlr · 2019-04-17T10:22:20Z

@lognaturel ☝️❓

lognaturel · 2019-04-17T16:18:05Z

I'd say it's not an urgent requirement. It's annoying to have a feature that's documented without being actually implemented -- in this case specifying a CSV secondary instance that can be XPath-queried -- but it's probably not something a ton of people are waiting for.

I think there's something to learn from exploring options here that may be applicable to a broader solution but there's certainly something to be said for first doing some high-level design work first.

lognaturel · 2019-04-18T17:32:08Z

In the mean time, here are some more details on reproduction and what should actually happen:

Check out 70e82b0
Build a jar and copy it to collect_app/libs/ in Collect
Comment out the import for javarosa in build.gradle
Push form and media from sample_preloading.zip
Opening the form should lead to Unexpected error when user opens 'Sample Form - Preloading' collect#2972
- This is the issue with the file-csv connector: there's a match with the jr://file connector. I think a quick fix would be to add file-csv to https://github.com/opendatakit/collect/blob/ab0e3206699b5d67e3a0675e54768ef76973222d/collect_app/src/main/java/org/odk/collect/android/tasks/FormLoaderTask.java#L138 before file

I haven't done this myself but once the issue with the connector is resolved on the Collect side of things, the form should work. However, the secondary CSV instance will be loaded in memory which should NOT be the case because there is no call on instance('hhplotdetails') in the form definition. The goal would be for this form to load without the secondary external instance being loaded to memory.

external_csv.zip is an example of a form that should use the new external secondary CSV instance feature and load the contents of all three csvs into memory because the form body refers to all three of the instances.

lognaturel · 2019-04-22T21:13:43Z

@musamusa, I think you were interested in the information above. If you're still looking into it, are you now able to reproduce the issue?

musamusa · 2019-04-23T08:27:23Z

@lognaturel Yes, I was able to reproduce the issue after the call. The steps above also added some more explanations.

admbtlr · 2019-05-13T10:07:31Z

I started looking into this again today, and i just want to clarify something that's occurred to me:

So far we've been talking the about the problem that Pulldata CSV files are unnecessarily loaded into memory because they're incorrectly treated as if they're an External Secondary Instance (ESI). But isn't there also second, parallel problem whereby genuine ESI CSVs will be automatically read into database tables because they're incorrectly treated as if they're a Pulldata file?

Creating an unnecessary database table is, I guess, less of a performance overhead than holding the nodes in memory, but it still seems suboptimal.

lognaturel · 2019-05-13T19:24:21Z

As you say, @adamvert, the performance hit for building the databases is much lower. Additionally, once JR can decide which instances to process, it should be straightforward for Collect to find out whether it needs to do anything.

joeflack4 · 2019-06-13T17:00:14Z

+1 in regards to native JavaRosa pulldata() support. This would allow me to remove any bespoke implementation that I would be adding to XFormTest.

lognaturel added the needs discussion label Mar 29, 2019

lognaturel changed the title ~~Re-enable support for external secondary CSV instances without breaking search()~~ Re-enable support for external secondary CSV instances without breaking pulldata() Mar 29, 2019

lognaturel changed the title ~~Re-enable support for external secondary CSV instances without breaking pulldata()~~ Re-enable support for external secondary CSV instances without breaking pulldata()/search() Mar 29, 2019

lognaturel mentioned this issue Mar 29, 2019

Support CSV external data #397

Merged

lognaturel mentioned this issue Apr 23, 2019

Add pulldata as a custom XPath function #427

Open

musamusa mentioned this issue Jun 10, 2019

Re-enable support for external secondary CSV instances eHealthAfrica/javarosa#8

Merged

lognaturel mentioned this issue Jun 22, 2019

remove dummy data generated to just sneak past validation XLSForm/pyxform#289

Merged

musamusa mentioned this issue Jun 26, 2019

Re-enable support for external secondary CSV instances #448

Closed

lognaturel mentioned this issue Jun 28, 2019

CSV secondary instance support #452

Merged

lognaturel closed this as completed in #452 Jul 8, 2019

joeflack4 mentioned this issue Oct 4, 2019

Add support for pulldata() and search() using JavaRosa update PMA-2020/xform-test#15

Open

lognaturel mentioned this issue Apr 5, 2021

Support quotes, escaped commas, and other CSV formatting in external CSV files #616

Closed

lognaturel mentioned this issue Oct 14, 2024

OutOfMemoryError while opening a form with choices loaded from an external csv file getodk/collect#6446

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

lognaturel commented Mar 29, 2019 •

edited

Loading

admbtlr commented Apr 16, 2019

admbtlr commented Apr 17, 2019

lognaturel commented Apr 17, 2019

lognaturel commented Apr 18, 2019

lognaturel commented Apr 22, 2019

musamusa commented Apr 23, 2019

admbtlr commented May 13, 2019

lognaturel commented May 13, 2019

joeflack4 commented Jun 13, 2019

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

Comments

lognaturel commented Mar 29, 2019 • edited Loading

admbtlr commented Apr 16, 2019

admbtlr commented Apr 17, 2019

lognaturel commented Apr 17, 2019

lognaturel commented Apr 18, 2019

lognaturel commented Apr 22, 2019

musamusa commented Apr 23, 2019

admbtlr commented May 13, 2019

lognaturel commented May 13, 2019

joeflack4 commented Jun 13, 2019

lognaturel commented Mar 29, 2019 •

edited

Loading