Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax for multidimensional arrays #33697

Merged
merged 41 commits into from
May 20, 2021

Conversation

BioTurboNick
Copy link
Contributor

@BioTurboNick BioTurboNick commented Oct 28, 2019

Use multiple semicolons to denote >2 dimensions in an inline N-dimensional array specification.

Summary of all changes in PR and how it works: #33697 (comment)

Breaking changes: Extra semicolons in matrix expression currently ignored. However, use is likely to be rare and unintentional. #37168 adds deprecation warning for this. That PR was merged.

Related work:

  • Typed concatenation methods
  • Matching inline display (as used in print([1 2;3 4;;5 6;7 8;;9 0;2 3]); necessary for printing of vectors containing multidimensional arrays)
  • Performance checks / optimization
  • Documentation
  • Fix inline display for arrays too long to display
  • Array concatenation
  • General case (Any argument)
  • Handling for concatenating elements and arrays together
  • Implement ;; as second dimension as discussed
  • Extend terminal semicolons to vector expression e.g. [1,2,3;;;] Removed
  • Check print() methods e.g. print(rand(Int8, 2,2,2)) (temporary)
  • Move to proper, reversible AST

I did this largely as an exercise for myself, to partly address #30467 . Unsure how much I can do on the rest but if I get pointers I may be able to. If this approach is deemed worth continuing.

julia> [1 2;3 4;;5 6;7 8;;9 0;2 3]
2×2×3 Array{Int64,3}:
[:, :, 1] =
 1  2
 3  4

[:, :, 2] =
 5  6
 7  8

[:, :, 3] =
 9  0
 2  3

julia> [1;3;;5;7;;9;2]
2×1×3 Array{Int64,3}:
[:, :, 1] =
 1
 3

[:, :, 2] =
 5
 7

[:, :, 3] =
 9
 2

julia> [1;;5;;9]
1×1×3 Array{Int64,3}:
[:, :, 1] =
 1

[:, :, 2] =
 5

[:, :, 3] =
 9

@StefanKarpinski
Copy link
Member

You appear to have introduced whitespace changes on every line. Did you perhaps edit the files on Windows in an editor that uses \r\n line endings?

@StefanKarpinski StefanKarpinski added minor change Marginal behavior change acceptable for a minor release triage This should be discussed on a triage call labels Oct 28, 2019
@mbauman mbauman changed the title Specification of multidimensional arrays Syntax for multidimensional arrays Oct 28, 2019
@mbauman mbauman added the parser Language parsing and surface syntax label Oct 28, 2019
@vtjnash
Copy link
Member

vtjnash commented Oct 28, 2019

(downside of #32781 is that problem isn't caught locally anymore)

@BioTurboNick
Copy link
Contributor Author

Oops, somehow triggered Atom to switch from LF to CRLF in those files. Fixed.

@BioTurboNick
Copy link
Contributor Author

Saw there was a failing ParseError test because I forgot to restore a check and error message. Should be fixed.

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Oct 29, 2019

Just did some quick performance checks

Allocated 10,000 times in a for loop inside a function, minimum observed time over several trials:

[1;3;5;7]
# 1.2: 0.000332 seconds (10.00 k allocations: 1.068 MiB)
# new: 0.000311 seconds (10.00 k allocations: 1.068 MiB)

[1 1; 2 2; 3 3; 4 4]
# 1.2: 0.000442 seconds (10.00 k allocations: 1.373 MiB)
# new: 0.000480 seconds (10.00 k allocations: 1.373 MiB)

cat([1 1; 2 2], [3 3; 4 4], dims = 3) / [1 1; 2 2;; 3 3; 4 4]
# 1.2:                     0.047043 seconds (340.00 k allocations: 16.632 MiB)
# new (cat function call): 0.049554 seconds (390.00 k allocations: 18.158 MiB)
# new (syntax):            0.048661 seconds (390.00 k allocations: 18.158 MiB)

All times comparable. 1d and 2d arrays identical allocations.

Not sure why cat itself has more overhead in my build. It doesn't seem to replicate in 1.2.0, 1.3.0-rc4, or 1.4.0-DEV. Some difference between the Windows build of Julia (what I use normally) and the Linux build (what I'm building in)?

@BioTurboNick
Copy link
Contributor Author

Just added inline printing of arrays, to complete the fix for #30467. Wasn't as hard as I thought.

Thinking more about the 34-39x allocation overhead of the cat() approach, thought about reshape being more performant, and it appears so, only a 2x allocation overhead.

# 10,000 iterations using reshape: 0.000731 seconds (20.00 k allocations: 2.136 MiB)

Which leads me to wonder why cat() has such a high overhead? Using reshape() from the parser would potentially make parsing easier, but perhaps there are things that can be done to improve cat()?

@BioTurboNick
Copy link
Contributor Author

I fixed two bugs caught thanks to failing tests:

  1. OffsetArray indices weren't handled well.
  2. Two functions would unintentionally return a boolean instead of () due to conditional statement.

@JeffBezanson
Copy link
Member

Triage likes this, but we think there should be a release with a deprecation warning for using multiple semicolons. @Keno also has an alternate proposal to use ;; to mean concatenation in the (N+1)ist dimension.

@Keno
Copy link
Member

Keno commented Oct 31, 2019

An alternative I proposed on the triage call: Have
[a ;; b] be a generic concat in the n+1'st dimension operator. For 3d arrays, it'd largely be equivalent, but it's different in higher dimensions and there are some lower dimensional corner cases.
For 4d, my proposal would have

[[ [1 2;3 4] ;; [5 6;7 8] ;; [9 0;2 3] ] ;;
 [ [1 2;3 4] ;; [5 6;7 8] ;; [9 0;2 3] ]]

vs

[ 1 2;3 4 ;; 5 6;7 8 ;; 9 0;2 3 ;;;
  1 2;3 4 ;; 5 6;7 8 ;; 9 0;2 3 ]

in this PR. For lower dimensions, in my proposal [1; 2;;] would be 2x1, while in this PR it's 2x1x1 (we don't currently have a literal syntax for 2x1, I don't think). The primary thing I like about my proposal though is that it forces writing the [], so for large literals it should be easy to navigate to the dimension you want, by using the existing editor parenthesis matching feature. I think the primary drawback is that higher dimensions quickly become more verbose in my proposal when the elements being concatenated are of low dimension [1;;;5;;;9] in this PR vs (I think) [ [[[1;;];;];;] ;; [[[5;;];;];;] ;; [[[9;;];;];;] ]. Of course there is a correspondingly bad example for concatenating high dimensional inputs in high dimension with this PR:

julia> a = fill(1, (1 for i = 1:10)...)
1×1×1×1×1×1×1×1×1×1 Array{Int64,10}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1] =
 1

julia> [a ;;;;;;;;;;; a]

Anyway, food for thought.

@Keno
Copy link
Member

Keno commented Oct 31, 2019

Of course a variant on my proposal that is even more similar to this proposal is to still use the multiple semicolons to concat in the n+i-1'st dimension, where i is the number of semicolons.

@StefanKarpinski
Copy link
Member

That seems at odds with what the current single semicolon syntax does, i.e. always concatenate in the second dimension.

@Keno
Copy link
Member

Keno commented Oct 31, 2019

yes, that's why I said multiple semicolons (as in > 1). I don't think it's worth too much effort trying to make that consistent. As mentioned on the triage call, even in this proposal n semicolons increment the n == 1 ? n : n+1st dimension, (with being the dimension 2 concat operator).

@Keno
Copy link
Member

Keno commented Oct 31, 2019

In either case, we should do the change to reserve this syntax with a warning right away.

@BioTurboNick
Copy link
Contributor Author

Interesting idea @Keno. I appreciate the benefits of your proposal and for my own uses it'd be just as suitable. And perhaps they could be combined as you suggest.

FTR, I'd consider [1;2;;] mapping to 2x1x1 but [1;2;] not mapping to 2x1 to be a bug in my PR, so thanks for catching that. I hadn't actually intended trailing semicolons to have an effect, so resolving in either direction would work. I suppose it couldn't hurt to allow that, although I suppose you might also want the mirror syntax where e.g. [;;;1;2] would add empty dimensions to the front?

With respect to a ten-dimensional cat, I figure at that point it'd be more readable to fall back to cat(a, a, dims = 10), anyway, and that's probably okay?

@JeffBezanson
Copy link
Member

but [1;2;] not mapping to 2x1 to be a bug in my PR

I'm not sure we can change [1;2;] --- that seems much more breaking than changing multiple semicolons.

@Keno
Copy link
Member

Keno commented Oct 31, 2019

I also don't think we want to change it. It's consistent with this PR. The real problem is that is the concat operator in dimension 2, and we don't want to make a trailing whitespace be sensitive

@BioTurboNick BioTurboNick force-pushed the multidimensional-arrays2 branch from 27fdfd8 to 80ce4bf Compare April 13, 2020 20:33
@BioTurboNick
Copy link
Contributor Author

I tried to rebase onto the current master branch but something screwed up. Don't know if I did something or if the current master is iffy. Is there a way to tell where the error is from or advice on the best way to reset the branch?

@Liozou
Copy link
Member

Liozou commented Apr 13, 2020

I believe your rebase went wrong, since it includes many commits that are irrelevant to this PR.
I don't know if this will solve all your problems, but what you should probably do is:

  • First, in your fork of julia, update your master branch with git pull upstream master. If this fails because your upstream is not set, do it first (git remote add upstream https://github.com/JuliaLang/julia.git)
  • Then, in your multidimensional-arrays2 branch, do an interactive rebase up to the first included commit in the list of current commits for this PR. I see there are 49 commits, so you should do git rebase -i multidimensional-arrays2~50 multidimensional-arrays2.
  • In the prompt, keep all the commits you want to have as part of this PR by leaving them marked with the initial pick, and remove all those irrelevant by changing the pick to drop.
  • If the rebase does not immediately succeed, it means there are some conflicts, so fix them.
  • Once the rebase is complete, the branch should have the commits from julia followed by the list of commits you want to have in this PR. You can check whether it is the case with git log.
  • The next step is to do a proper rebase with the updated master branch: do so with git rebase master multidimensional-arrays2. Again, there might be some conflicts, so you have to resolve them.
  • Once this rebase is complete, check again with git log that you have all the julia commits up to now followed by your commits for this PR. If you are happy, force-push the change.

Hope it helps!

@BioTurboNick BioTurboNick force-pushed the multidimensional-arrays2 branch from a529408 to 26e2f26 Compare April 14, 2020 05:54
@BioTurboNick
Copy link
Contributor Author

Ah, perfect @Liozou, thank you!

@Liozou
Copy link
Member

Liozou commented Apr 14, 2020

Glad I could help!
There are still two oddities in your commits : the one named "More work" looks like it rewrote the entire julia-parser.scm file, and same issue with commit "Inline printing of multidimensional arrays" for file arrayshow.jl. I think the issue comes from your end-of-lines, since when I do git diff on your branch for the relevant commits I see that each line has been appended with a non-printable character... So you should probably remove those. Sorry I can't give you pointers on how to do that however, I can only guess it happened because of your text editor.

@StefanKarpinski
Copy link
Member

Windows editor turning new lines into \r\n perhaps? If so, there are git settings to prevent that from getting committed in the repo.

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Aug 22, 2020

@StefanKarpinski - Yeah, it's bitten me a few times. I think I've sorted it out.

Just added typed_ncat, so now this is also valid:

UInt16[1 2 3; 4 5 6;; 7 8 9; 10 11 12]
#=
2×3×2 Array{UInt16,3}:
[:, :, 1] =
 0x0001  0x0002  0x0003
 0x0004  0x0005  0x0006

[:, :, 2] =
 0x0007  0x0008  0x0009
 0x000a  0x000b  0x000c
=#

UInt16[1 2 3; 4 5 6;;; 7 8 9; 10 11 12]
#=
2×3×1×2 Array{UInt16,4}:
[:, :, 1, 1] =
 0x0001  0x0002  0x0003`
 0x0004  0x0005  0x0006

[:, :, 1, 2] =
 0x0007  0x0008  0x0009
 0x000a  0x000b  0x000c
=#

I'm looking at tests for the parser, and I'm not certain from looking what I should be testing for or where they should go in the syntax.jl file. They seem to be organized by issue rather than type? EDIT: Did just add documentation and doctests, if that covers it.

@BioTurboNick

This comment has been minimized.

@BioTurboNick BioTurboNick force-pushed the multidimensional-arrays2 branch from 4d7f3a7 to e38ba55 Compare May 19, 2021 18:22
@BioTurboNick
Copy link
Contributor Author

I think everything has been addressed, now just to get to green.

Copy link
Member

@simeonschaub simeonschaub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the doctest failure, but otherwise LGTM!

NEWS.md Show resolved Hide resolved
@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented May 20, 2021

Okay, looks like we got it! Getting the operations on the intermediate data structure just right is trickier than it seems like it should be.

One small addition: I realized that a line break followed by a semicolon is potentially a problem. e.g.

[1
;2]

In the master branch, this syntax is accepted but is ignored. I turned it into an error, which is also consistent with the fact that this is considered invalid in the master branch:

[;1

@BioTurboNick
Copy link
Contributor Author

Actually, I can easily make it so that it's one side of the linebreak or the other but not both. Then someone could, if they wanted, do something like:

[1 2 3
 4 5 6
;;;
 7 8 9
 0 1 2]

In fact, already done.

@simeonschaub simeonschaub merged commit 9117b4d into JuliaLang:master May 20, 2021
@simeonschaub
Copy link
Member

OK, let's try this!

simeonschaub added a commit to simeonschaub/DataEnvelopmentAnalysis.jl that referenced this pull request May 21, 2021
This will get a different meaning in 1.7 (ref JuliaLang/julia#33697)
simeonschaub added a commit to simeonschaub/NaiveNASlib.jl that referenced this pull request May 21, 2021
This will get a different meaning in 1.7. (ref JuliaLang/julia#33697)
@simeonschaub
Copy link
Member

Should we disallow spaces between semicolons as in [1; ; 2]? javierbarbero/DataEnvelopmentAnalysis.jl#6 and DrChainsaw/NaiveNASlib.jl#85 are two examples which were using that by accident, so it seems potentially worthwhile to me to throw an error here.

@mbauman
Copy link
Member

mbauman commented May 21, 2021

Oh interesting, yes, I agree we should do that. Because I had to check: the currently implemented mechanism treats [1 ; ; 2] like [1 ;; 2], but I've been thinking of ;; as a single indivisible token.

@BioTurboNick
Copy link
Contributor Author

That is a good idea. Implemented in #40903

@simeonschaub simeonschaub removed the forget me not PRs that one wants to make sure aren't forgotten label May 29, 2021
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
Co-authored-by: Matt Bauman <[email protected]>
Co-authored-by: Jeff Bezanson <[email protected]>
Co-authored-by: Simeon Schaub <[email protected]>
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
Co-authored-by: Matt Bauman <[email protected]>
Co-authored-by: Jeff Bezanson <[email protected]>
Co-authored-by: Simeon Schaub <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor change Marginal behavior change acceptable for a minor release parser Language parsing and surface syntax
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants