Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable support for external secondary CSV instances without breaking pulldata()/search() #417

Closed
lognaturel opened this issue Mar 29, 2019 · 9 comments · Fixed by #452

Comments

@lognaturel
Copy link
Member

lognaturel commented Mar 29, 2019

#397 added support for parsing a CSV as a secondary instance. It was then disabled in #416 because it broke search/pulldata in Collect.

We will need to figure out a way to add this support back in without interfering with the existing implementations. These may eventually be deprecated but since they solve a real problem and are in broad use, there needs to be an acceptable, suitably performant alternative in place first. Even if/when they are officially marked for deprecation, I imagine they will stay in for quite a while.

@ggalmazor rightfully pointed out here that the Collect implementation of search/pulldata doesn't use the instance declared in the form. pyxform adds that instance declaration when pulldata is used (XLSForm/pyxform#43). This is because clients that do things in a spec-compliant way (e.g. Enketo), need to know what instance to load. Collect just reads any CSV attached with the form into a database ("magic").

Since the pulldata function can be thought of as an abstraction on top of XPath queries (kind of like indexed-repeat), it's easy for clients that do load external secondary CSV instances to implement, hence the instance declaration added by pyxform so they can support it. For that reason, I think it will likely become an official part of the spec. search is strange all around, starting from it being an appearance.

Collect search/pulldata are very performant. It's important that the extra overhead of loading the CSV instance doesn't occur if the user intends to use those implementations. That is, if the form has the following instance declaration: <instance id="foo" src="jr://file-csv/foo.csv" />, the CSV should only actually be loaded if the form definition includes instance("foo"). I'm not immediately seeing a way to do that since secondary instances are parsed before the main instance. Maybe secondary instances could be "lazy loaded" the first time a reference to them is seen in the main instance instead of being parsed upfront?

There's an additional issue uncovered by #397: jr://file is a substring of jr://file-csv and the connector matching is done by looking for a prefix. I think this could be fixed by looking for the connector followed by a / or something like that.

@lognaturel lognaturel changed the title Re-enable support for external secondary CSV instances without breaking search() Re-enable support for external secondary CSV instances without breaking pulldata() Mar 29, 2019
@lognaturel lognaturel changed the title Re-enable support for external secondary CSV instances without breaking pulldata() Re-enable support for external secondary CSV instances without breaking pulldata()/search() Mar 29, 2019
@admbtlr
Copy link

admbtlr commented Apr 16, 2019

I'm starting to wonder whether it might not make sense to hold back on fixing this with a workaround until we can do it "properly", i.e. by making pulldata part of Javarosa, and working out how we can allow implementations to decide whether to use a database if they want to.

Or am I putting the cart before the horse? Is the requirement to get this working urgent enough to warrant an interim solution?

@admbtlr
Copy link

admbtlr commented Apr 17, 2019

@lognaturel ☝️❓

@lognaturel
Copy link
Member Author

I'd say it's not an urgent requirement. It's annoying to have a feature that's documented without being actually implemented -- in this case specifying a CSV secondary instance that can be XPath-queried -- but it's probably not something a ton of people are waiting for.

I think there's something to learn from exploring options here that may be applicable to a broader solution but there's certainly something to be said for first doing some high-level design work first.

@lognaturel
Copy link
Member Author

In the mean time, here are some more details on reproduction and what should actually happen:

I haven't done this myself but once the issue with the connector is resolved on the Collect side of things, the form should work. However, the secondary CSV instance will be loaded in memory which should NOT be the case because there is no call on instance('hhplotdetails') in the form definition. The goal would be for this form to load without the secondary external instance being loaded to memory.

external_csv.zip is an example of a form that should use the new external secondary CSV instance feature and load the contents of all three csvs into memory because the form body refers to all three of the instances.

@lognaturel
Copy link
Member Author

@musamusa, I think you were interested in the information above. If you're still looking into it, are you now able to reproduce the issue?

@musamusa
Copy link

@lognaturel Yes, I was able to reproduce the issue after the call. The steps above also added some more explanations.

@admbtlr
Copy link

admbtlr commented May 13, 2019

I started looking into this again today, and i just want to clarify something that's occurred to me:

So far we've been talking the about the problem that Pulldata CSV files are unnecessarily loaded into memory because they're incorrectly treated as if they're an External Secondary Instance (ESI). But isn't there also second, parallel problem whereby genuine ESI CSVs will be automatically read into database tables because they're incorrectly treated as if they're a Pulldata file?

Creating an unnecessary database table is, I guess, less of a performance overhead than holding the nodes in memory, but it still seems suboptimal.

@lognaturel
Copy link
Member Author

As you say, @adamvert, the performance hit for building the databases is much lower. Additionally, once JR can decide which instances to process, it should be straightforward for Collect to find out whether it needs to do anything.

@joeflack4
Copy link
Contributor

+1 in regards to native JavaRosa pulldata() support. This would allow me to remove any bespoke implementation that I would be adding to XFormTest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants