Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenXML nodes to import/export data #11599

Merged
merged 6 commits into from
Apr 19, 2021
Merged

Conversation

mmisol
Copy link
Collaborator

@mmisol mmisol commented Apr 8, 2021

Purpose

Two new nodes are added that allow to import and export data from an
Excel spreadsheet. These nodes use the OpenXML SDK, so they do not
make use of the Excel application, thus can be used without Office
installed locally.

These nodes should behave pretty similarly to the existing Excel nodes.
In order to prove that, the tests for the existing nodes have been
copied and adapted for the new nodes.

At the function signature level, there are the following differences:

OpenXMLImportExcel

  • Provides inputs for 'startRow' and 'startColumn' (optional)

OpenXMLExportExcel

  • Inputs for 'startRow' and 'startColumn' are optional

Screenshots

Library search

Screen Shot 2021-04-16 at 3 02 58 PM

Nodes on canvas

Screen Shot 2021-04-16 at 3 03 34 PM

Declarations

Check these if you believe they are true

  • The codebase is in a better state after this PR
  • Is documented according to the standards
  • The level of testing this PR includes is appropriate
  • User facing strings, if any, are extracted into *.resx files
  • All tests pass using the self-service CI.
  • Snapshot of UI changes, if any.
  • Changes to the API follow Semantic Versioning and are documented in the API Changes document.
  • This PR modifies some build requirements and the readme is updated

FYIs

@Amoursol

mmisol added 2 commits April 8, 2021 12:43
Two new nodes are added that allow to import and export data from an
Excel spreadsheet. These nodes uses the OpenXML SDK, so they do not
make use of the Excel application, thus can be used without Office
installed locally.

These nodes should behave pretty similarly to the existing Excel nodes.
In order to prove that, the tests for the existing nodes have been
copied and adapted for the new nodes.

At the function signature level, there are the following differences:

ImportOpenXML
- Provides inputs for 'startRow' and 'startColumn' (optional)

ExportOpenXML
- Inputs for 'startRow' and 'startColumn' are optional
@mjkkirschner
Copy link
Member

mjkkirschner commented Apr 8, 2021

@mmisol I have not reviewed in depth yet, but I have some high level questions about approach that I think may help me review better.

  1. Are the files produced by these nodes openable by excel without user intervention or training?
  2. If 1 is true - why not replace the internal implementation of the existing nodes with open.xml functionality and remove the dependency on excel (of course we need to keep the unused binaries around for a while).
  3. It would also mean the tests would not need to be duplicated, just kept as is I believe.
  4. What about obsoleting the excel nodes and adding migrations to the new ones?
  5. why the need for the windows specific binaries?
  6. Have you reviewed adsk products for this new dependency, is it shipped in revit, civil etc?
  7. updating our license files, about box etc. What is the license terms of this dep?

IMO a big benefit of this work, in addition to performance gains will be the removal of a source of conflicts and non working excel interop for users. I think, given the familiarity with execl, if those nodes are still present, most users will choose them.

Are there sufficient differences in behavior that you thought adding the new nodes was the best option? What are those differences?

@mmisol
Copy link
Collaborator Author

mmisol commented Apr 8, 2021

Hi @mjkkirschner

Regarding compatibility, I don't think we can make the guarantee these nodes work exactly like the existing ones. True, I was able to make the tests pass, but given the differences between the two approaches I wouldn't be surprised that something we have overlooked works differently. That's also why I don't think we should be so quick to obsolete the existing nodes. Also, there is plenty of functionality the obsolete Excel nodes cover that I cannot, like opening the Excel application or dealing with workbooks, worksheets, etc. as objects.

The windows-specific binaries seem to appear when targeting .NET only. As I read, if we targeted .NET standard another dependency would be used instead.

I have not reviewed if the dependency on DocumentFormat.OpenXML is safe to be made private. To be honest I don't know the exact list of products/plugins we should check or how to check them.

License and about box, that is something I can do.

// While SpreadsheetDocument.Open handles this, the error message it throws is not localized.
if (!File.Exists(filePath))
{
throw new ArgumentException(string.Format(Properties.Resources.WorkbookNotFound, filePath));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

{
rowData.Add(null);
currentColumnIndex++;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what would happen if we do not do this step?

Copy link
Collaborator Author

@mmisol mmisol Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some tests that exercise this. If you have a sparse spreadsheet with lots of empty cells Dynamo is expected to still return empty values for those cells. The following would be an example, where null is the value to return but the cell is actually empty:

   1    null    null
null       2    null 
null    null       3

Notice that row.Elements<Cell> would not return empty cells, so in order to return 3 elements for the first two rows, that final loop adds the remaining values as null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mmisol for the detailed explanation!

var letters = new Stack<char>();
do
{
letters.Push((char)('A' + ((columnIndex - 1) % 26)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

}
else
{
var document = SpreadsheetDocument.Create(filePath, SpreadsheetDocumentType.Workbook);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if this function would be interrupted with exception that user only have read but not write access to the target path

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either of the code paths of this function should fail in that case, because the above path opens the file to write it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmisol OK. I am not sure if you have a local env to test this. I can make a note to QA this case if you want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give this a try but yes, it would be a good thing to try by part of the QA team.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message I get is Access to the path '<PATH>' is denied. which must come from Windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmisol Thanks, as long as Dynamo does not crash and we display this as a warning or so. Appreciate the checking!

}

/// <summary>
/// Adds default styles to the stylesheet so that Office does not identify the spreadsheet as corrupt.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reference to the changes in this function? Do you foresee MS change how they identify spreadsheet as corrupt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was something I noticed in practice when the styles were absent, even if they were not needed. This code is the result of reviewing the styles of a newly created spreadsheet in Excel and adding some of those elements until Excel would accept the file created by Dynamo. Unfortunately, there is no documentation for this AFAIK, so the process was pretty hard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmisol I can only imagine... Well, that is a rough start and I hope MS dont change these criteria often, otherwise we are in big trouble..

}

[Test]
public void CanWriteNullValuesToExcel1()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on this test the diff with the test above or use longer name to make it self explain? From reading the code, we are making sure not single value but null in the middle of array can be written?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can, let me take a look at it to make sure. The "undocumented" aspect is inherited from the existing tests which I copied, but let me see if I can improve that here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the tests a little bit to remark the difference. The previous test writes null as data, while this one includes null as a value in a list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmisol Great! I will leave some review period today for team and come back tomorrow to check more comments, if not expect our team to merge and bang on these this sprint!

Copy link
Contributor

@QilongTang QilongTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some reading of this PR, @mmisol really raised my confidence level and a wonderful piece of work! I cant wait to play with these two nodes. @reddyashish @zeusongit Feel free to take a look as well.

@QilongTang
Copy link
Contributor

@mmisol @mjkkirschner Thank you guys for the about box discussion, I will make a note to update internal dependency tracking docs.

@QilongTang
Copy link
Contributor

QilongTang commented Apr 15, 2021

Also @mmisol if I remember correctly, OpenXML can handle other types of docs other than excel, like word and ppt as well. If that is the case, I believe node names Data.ImportOpenXml and Data.ExportOpenXml might be still ambiguous? How about Data.ImportExcelInOpenXml and Data.ExportToExcelInOpenXml or something else makes more sense? The reason I ask is later this year, our team may introduce nodes for other file types.

@mmisol
Copy link
Collaborator Author

mmisol commented Apr 15, 2021

Thanks for the kind words @QilongTang ! I must admit it was more work than I expected initially but hopefully users will appreciate this. It should also ease processing Excel files on the server which can enable some new type of applications.

Regarding the node names, it's true that there may be some ambiguity, but since the function names are under Data. I think that should point that the underlying file is a spreadsheet. Anyway, if I had to choose a longer name I would maybe go for ImportOpenXmlSpreadsheet. I can change it but may be something to discuss with @Amoursol .

Also, I copied the icons, but it's maybe a good idea to use slightly different ones. What do you think?

@QilongTang
Copy link
Contributor

@mmisol Yes, I have discussed the icon thing with @Amoursol and he prefer to use a different color than Green to keep same visual but indicate slight difference. We will need help from UX on touching color, I will raise that at standup.

@Amoursol
Copy link
Contributor

Hello @mmisol - Jingyi, Elizabeth and I have discussed. Could you please name the nodes the following?

  • Data.OpenXMLImportExcel
  • Data.OpenXMLExportExcel

This honors the layout of the library with the data structure and is as clear as possible in as short a name as possible.

@mmisol
Copy link
Collaborator Author

mmisol commented Apr 16, 2021

@Amoursol @QilongTang I renamed the nodes and updated the icons. The PR description was also updated to reflect those changes.

Let me know if this looks good and if that's the case I'll merge after CI has passed.

@QilongTang
Copy link
Contributor

@mmisol This PR looks good, the failing test Dynamo.Tests.Configuration.PreferenceSettingsTests.TestSettingsSerialization is due to a different PR.. I am rerunning the self CI

@QilongTang
Copy link
Contributor

The latest self CI passed although not updating the status check here seems. This PR looks good to me and we would require some intensive testing this sprint.

@mmisol mmisol merged commit 44931b2 into DynamoDS:master Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants