Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV secondary instance support #452

Merged
merged 17 commits into from
Jul 8, 2019
Merged

Conversation

lognaturel
Copy link
Member

@lognaturel lognaturel commented Jun 28, 2019

Closes #417
Closes #451

CC @musamusa

What has been done to verify that this works as intended?

Wrote and ran tests. Built a jar, added to Collect and ensured that I could fill a form with an external secondary instance declaration straight from a pyxform conversion without having to remove dummy nodes (e.g. nigeria_wards_external). Also tried the test CSV form and the search/pulldata form from #417.

Why is this the best possible solution? Were any other approaches considered?

cc77b2e, 138429e and bfc97c7 are the fix for #417. The approach is to maintain a set of instance IDs seen used as arguments to the instance XPath function. This set is cleared for every new form definition parsed. A secondary instance is only made part of the FormDef if it is referred to in this way. This is as performant as possible within the existing code structure because it takes advantage of an existing pass over the Element tree. The static call to populate the set is not ideal but I think it's much better than setting up a complex maze of listeners. There are a lot of static calls already so it doesn't establish a new anti-pattern. Note that bfc97c7 also means internal secondary instances are ignored if they're not explicitly referenced. That change is not part of the fix for #417 but I think it's an improvement where applicable.

I made various testing changes so I could understand more of the context. In particular, I split out all tests related to external secondary instances in
0029f80.

I ran into a couple of simple bugs with the CSV parsing implementation that I fixed in 1ddbbd5 and 203241e. There are still a number of issues like not handling escaped commas that I want to think more about. I think I'll probably suggest introducing a CSV parsing library.

4bd58fc is the fix for #451. It made sense to me to add while in the same brain space and hopefully that will help with review as well. I simply ignored child nodes when the instance declaration has a source. I think that's the simplest solution. I do think we'll want to review it (as I wrote in the TODO) and perhaps combine child nodes and external instance nodes while more narrowly ignoring just items with a name of _.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Intentional changes are:

  • re-establish support for CSV external instances
  • ignore any secondary instance definition that is not referenced in an instance() call
  • don't crash on CSV external instances that have dangling commas
  • ignore instance child nodes if a source is specified

This touches a lot of the instance parsing code and though I think the test suite give us a pretty high level of confidence that the changes don't have unintended consequences, I suggest we issue a snapshot and a Collect beta as soon as this is merged.

Do we need any specific form for testing your changes? If so, please attach one.

See test forms.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Not yet. We should do a pass at performance improvements first.

@lognaturel lognaturel requested a review from ggalmazor June 28, 2019 21:34
@codecov-io
Copy link

codecov-io commented Jun 28, 2019

Codecov Report

Merging #452 into master will decrease coverage by 0.11%.
The diff coverage is 75%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #452      +/-   ##
============================================
- Coverage     48.91%   48.79%   -0.12%     
+ Complexity     2930     2913      -17     
============================================
  Files           245      241       -4     
  Lines         13683    13634      -49     
  Branches       2645     2645              
============================================
- Hits           6693     6653      -40     
+ Misses         6146     6132      -14     
- Partials        844      849       +5
Impacted Files Coverage Δ Complexity Δ
...arosa/core/model/instance/CsvExternalInstance.java 89.47% <100%> (+89.47%) 4 <0> (+4) ⬆️
...in/java/org/javarosa/xpath/expr/XPathFuncExpr.java 55.72% <50%> (-0.02%) 167 <0> (+1)
...ain/java/org/javarosa/xform/parse/XFormParser.java 64.59% <73.07%> (-0.11%) 236 <1> (+1)
...n/java/org/javarosa/core/model/actions/Action.java 57.14% <0%> (-29.53%) 5% <0%> (-3%)
...a/org/javarosa/core/services/PrototypeManager.java 79.16% <0%> (-8.34%) 8% <0%> (-1%)
.../org/javarosa/xform/parse/ElementChildDeleter.java 93.54% <0%> (-3.23%) 13% <0%> (-1%)
.../org/javarosa/core/reference/ReferenceManager.java 60.63% <0%> (-1.07%) 20% <0%> (-1%)
...rg/javarosa/core/model/instance/TreeReference.java 66.53% <0%> (-0.39%) 87% <0%> (-1%)
src/main/java/org/javarosa/core/model/FormDef.java 60.79% <0%> (-0.21%) 121% <0%> (-1%)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6120647...425ad37. Read the comment docs.

@lognaturel
Copy link
Member Author

@dcbriccetti @ggalmazor I'll let you fight over who wants to review this with me. 😉

@ggalmazor
Copy link
Contributor

I'm up for that!

@lognaturel
Copy link
Member Author

@ggalmazor do you want to talk through it or do an async review?

@ggalmazor
Copy link
Contributor

ggalmazor commented Jul 5, 2019

Thanks! I'll have a fist go at it by myself today and write any idea that pops up :)

@ggalmazor
Copy link
Contributor

I'm going through each commit. I think I'll work at two levels of detail:

  • I'll write inline comments for low-level topics using the GH review tool
  • I'll write a new comment with high-level topics that might show up

Copy link
Contributor

@ggalmazor ggalmazor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass done!

@ggalmazor
Copy link
Contributor

ggalmazor commented Jul 5, 2019

Extra stuff about the tests

  • We could make a change to ReferenceManagerTestUtils.setUpSimpleReferenceManager:

    public static ReferenceManager setUpSimpleReferenceManager(Path path, String... schemes) {
        ReferenceManager refManager = ReferenceManager.instance();
        refManager.reset();
        for (String scheme : schemes)
            refManager.addReferenceFactory(buildReferenceFactory(
                scheme,
                path.toAbsolutePath().toString()
            ));
        return refManager;
    }

    This way, we could replace all the calls to FormParserHelper.mapFileToResourcePath which has a duplicated implementation of the same thing.

  • Now that we have proper benchmarks, I'd lose all the tests that use FormParserHelper.timeParsing.

  • I really dislike the EXTERNAL_SECONDARY_INSTANCE_XML static members. I'd prefer to see:

    private static Path formFile = r("external-secondary-instance.xml");

    instead of:

    private static Path EXTERNAL_SECONDARY_INSTANCE_XML;
    
    @BeforeClass
    public static void setUp() {
        EXTERNAL_SECONDARY_INSTANCE_XML = r("external-secondary-instance.xml");
    }
  • There are some tests to verify that some code won't throw. We're doing it indirectly (what other way would there be?) by evaluating an xpath expression and asserting that the result has the number of elements we expect.

    The test method says: formWithExternalSecondaryXMLInstance_ShouldParseWithoutError and then, there is a comment before the assertion with Confirm that items are made available to the XPath parser.

    I'd suggest we switch the method name and the comment so that the test method tells me what (hopefully behavior) we're testing and the comment gives me a piece of context I could be missing (not really needed in this case IMO):

    @Test
    public void items_from_an_xml_external_secondary_instance_are_available_to_the_xpath_parser() throws IOException, XPathSyntaxException {
        Path formName = r("external-select-xml.xml");
        setUpSimpleReferenceManager(formName.getParent(), "file");
        FormDef formDef = parse(formName);
    
        TreeReference treeReference = ((XPathPathExpr) parseXPath("instance('external-xml')/root/item")).getReference();
        EvaluationContext evaluationContext = formDef.getEvaluationContext();
        List<TreeReference> treeReferences = evaluationContext.expandReference(treeReference);
        assertThat(treeReferences.size(), is(12));
    }

    Since we're doing that kind of assertion a lot, we could do:

    @Test
    public void items_from_an_xml_external_secondary_instance_are_available_to_the_xpath_parser() throws IOException, XPathSyntaxException {
        Path formName = r("external-select-xml.xml");
        setUpSimpleReferenceManager(formName.getParent(), "file");
        FormDef formDef = parse(formName);
    
        assertThat(countChildrenAt(formDef, "instance('external-xml')/root/item"), is(12));
    }
    
    static int countChildrenAt(FormDef formDef, String xpath) throws XPathSyntaxException {
        TreeReference treeReference = ((XPathPathExpr) parseXPath(xpath)).getReference();
        EvaluationContext evaluationContext = formDef.getEvaluationContext();
        List<TreeReference> treeReferences = evaluationContext.expandReference(treeReference);
        return treeReferences.size();
    }

    and reuse countChildrenAt.

  • I always find all those ser/deser tests without assertions a bit weird. I'd expect something approximately like this:

    @Test
    public void formWithExternalSecondaryXMLInstance_ShouldSerializeAndDeserializeWithoutError() throws IOException, DeserializationException {
        // Arrange
        Path tempFile = Files.createTempFile("serialized-form", null);
        setUpSimpleReferenceManager(formFile.getParent(), "file", "file-csv");
        FormParserHelper.initSerialization();
        
        // Serialize the form
        FormDef serializedForm = parse(formFile);
        DataOutputStream dos = new DataOutputStream(Files.newOutputStream(tempFile));
        serializedForm.writeExternal(dos);
        dos.close();
        
        // Deserialize it into a blank object
        FormDef deserializedForm = new FormDef();
        DataInputStream dis = new DataInputStream(Files.newInputStream(tempFile));
        deserializedForm.readExternal(dis, defaultPrototypes());
        dis.close();
    
        // Both forms should be the same form
        assertThat(serializedForm, is(deserializedForm));
        
        // Clean up
        Files.delete(tempFile);
    }

    We should figure out how to assertThat(serializedForm, is(deserializedForm)), though.

Regarding the static call to XFormParser.recordInstanceFunctionCall

The only thing I could think of better than adding listeners here and there was to collect a manifest of things that might be of interest for the parser while parsing elements. We could do it super narrow to just collect referenced instance ids, or generic, to collect all used functions and arguments:

private void parseDoc(Map<String, String> namespacePrefixesByUri, String lastSavedSrc) {
  final StopWatch codeTimer = StopWatch.start();
  _f = new FormDef();

  initState();
  final String defaultNamespace = _xmldoc.getRootElement().getNamespaceUri(null);

  referencedInstanceIds.clear();
  // parseElement returns a set of instance ids references in instance('instance id') function calls while parsing the form
  referencedInstanceIds.addAll(parseElement(_xmldoc.getRootElement(), _f, topLevelHandlers));
  // ...
}

// parseElement will collect referenced instance ids recursively
private Set<String> parseElement(Element e, Object parent, HashMap<String, IElementHandler> handlers) {
    Set<String> referencedInstanceIds = new HashSet<>();
    String name = e.getName();

    IElementHandler eh = handlers.get(name);
    if (eh != null) {
        referencedInstanceIds.addAll(eh.handle(this, e, parent));
    } else {
        if (!validElementNames.contains(name)) {
            triggerWarning(
                "Unrecognized element [" + name + "]. Ignoring and processing children...",
                getVagueLocation(e));
        }
        for (int i = 0; i < e.getChildCount(); i++) {
            if (e.getType(i) == Element.ELEMENT) {
                referencedInstanceIds.addAll(parseElement(e.getElement(i), parent, handlers));
            }
        }
    }
    return referencedInstanceIds;
}

// Ultimately, it would come down to searching for function calls in the handlers that might use instance('....') calls, like this one
// I know it's super ugly!
private static void initProcessingRules() {
    groupLevelHandlers = new HashMap<String, IElementHandler>() {{
        // ...
        put(SELECTONE, new IElementHandler() {
            @Override
            public Set<String> handle(XFormParser p, Element e, Object parent) {
                QuestionDef q = p.parseControl((IFormElement) parent, e, CONTROL_SELECT_ONE);
                ItemsetBinding dynamicChoices = q.getDynamicChoices();
                if (dynamicChoices != null
                    && dynamicChoices.getNodesetExpr() instanceof XPathConditional
                    && ((XPathConditional) dynamicChoices.getNodesetExpr()).getExpr() instanceof XPathPathExpr) {
                    XPathPathExpr nodesetExpr = (XPathPathExpr) ((XPathConditional) dynamicChoices.getNodesetExpr()).getExpr();
                    XPathFilterExpr filtExpr = nodesetExpr.filtExpr;
                    XPathFuncExpr x = (XPathFuncExpr) filtExpr.x;
                    if (x.id.name.equals("instance")) {
                        Set<String> strings = new HashSet<>(1);
                        strings.add(((XPathStringLiteral) x.args[0]).s);
                        return strings;
                    }
                }
                return emptySet();
            }
        });
        // ...
}

@ggalmazor
Copy link
Contributor

I think this is neat, @lognaturel! I don't see anything important and, overall, I think that the end result is much more clear and easy to follow. We can argue about style and small details, but the tests are great too.

👍👍👍

@lognaturel lognaturel force-pushed the issue-417 branch 2 times, most recently from b5bdc55 to 6007a9f Compare July 5, 2019 23:40
To remove redundant mapFileToResourcePath
This doesn't guarantee that the two form definitions are identical but at least it makes it so the tests have assertions. Typically the failure mode will be a crash if serialization wasn't implemented properly for a class.
@lognaturel
Copy link
Member Author

Thanks for the thorough review, @ggalmazor! 🙏

I believe I have addressed all of your comments and suggestions. I suggest going through commit by commit one more time since I did force push to keep the history coherent.

@ggalmazor ggalmazor self-requested a review July 8, 2019 14:03
Copy link
Contributor

@ggalmazor ggalmazor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for considering my suggestions!

@lognaturel lognaturel merged commit a457a7d into getodk:master Jul 8, 2019
@lognaturel lognaturel deleted the issue-417 branch July 8, 2019 15:14
@lognaturel
Copy link
Member Author

@JohnTheBeloved I know this will cause some conflicts with what you've been working on. I really hope they won't be too painful to resolve. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants