Path Tokenization

Before I begin, I want to lay out some definitions:

Character: A single character of the input string (a letter, digit, sign, or other symbol).

Token: This can correspond to a number of things, but in this context I am talking about a single explicit command. So M 1.1 1.2 1.3 1.4 would be tokenized into [M 1.1 1.2, L 1.3 1.4].

Current: In the array of characters, this is the index of the character we are currently looking at.

Consume: To do something with the current character, and then move on to the next character.

Tokenizer Documentation (15 Minute Read):

First, inside initialize, we create an array of characters from the input string. I did this because I want to walk along the string one position at a time, and working with positions directly on a string feels clumsy. This is a rather minor detail.
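Since the VI itself is graphical, here is a rough Python sketch of the same idea: the input becomes an array of characters plus a "current" index. The names `TokenizerState`, `current_char`, and `advance` are made up for illustration and are not the names used in the LabVIEW project.

```python
from dataclasses import dataclass


@dataclass
class TokenizerState:
    chars: list        # the input string split into single characters
    current: int = 0   # index of the character we are currently looking at

    def current_char(self):
        """Return the current character, or None once we are past the end."""
        if self.current < len(self.chars):
            return self.chars[self.current]
        return None

    def advance(self):
        """Make the next character the current one."""
        self.current += 1


state = TokenizerState(chars=list("M 1.1 1.2"))
state.advance()   # 'M' is no longer the current character
```

Keeping the index next to the array is what lets every later step be phrased as "look at the current character, then advance".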

Here is the actual tokenizer, which is a very simple state machine. We start out by consuming any leading whitespace. Consuming a whitespace character means either adding it to the current command, if one has been initialized (see "Add to Command" below), or discarding it by advancing (incrementing the current index). This is done because SVG path data technically allows zero or more whitespace characters at the beginning.

Looking inside "Consume Whitespace", you will notice that it is a recursive VI. This is because of the "zero or more" rule: we match one character at a time and then call the VI again, which simplifies the process quite a bit. "Add to Command" ensures that we only add to a command that has been initialized (i.e. no whitespace at the beginning of commands). Furthermore, note that "Consume Whitespace" won't consume anything if the current character is not whitespace.
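A minimal Python sketch of the recursive whitespace step described above. The function names and the exact whitespace set are assumptions for illustration; the VI may match a slightly different set of characters.

```python
WHITESPACE = " \t\r\n"  # assumption: the whitespace characters the tokenizer accepts


def add_to_command(command, ch):
    """Append ch only if a command has already been initialized."""
    return command + ch if command else command


def consume_whitespace(chars, current, command):
    """Recursively consume zero or more whitespace characters.

    Returns the new (current, command) pair.
    """
    if current >= len(chars) or chars[current] not in WHITESPACE:
        return current, command  # current character is not whitespace: do nothing
    command = add_to_command(command, chars[current])
    return consume_whitespace(chars, current + 1, command)
```

With an empty command the whitespace is simply discarded, so `consume_whitespace(list("  M 1"), 0, "")` only moves the index forward to the `M`.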

State: Get Command

Jumping back out to the tokenizer, let's go inside the while loop to the first case: "Get Command". You will see that the first thing we do is check whether the current character is a command. According to the SVG specification, nothing besides whitespace is allowed before the first command, so if there is something else then this isn't valid SVG path data. Taking a look inside "Is Command?", we can see that it checks whether the current character matches any of the different commands using match pattern.
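In text form, the "Is Command?" check could look something like the sketch below. The letter set is the full list of path command letters from the SVG path grammar; whether the VI's match pattern covers all of them is an assumption, and `is_command` is a hypothetical name.

```python
import re

# Command letters from the SVG path grammar, upper case (absolute) and
# lower case (relative). Assumption: the VI matches this same set.
COMMAND_PATTERN = re.compile(r"[MmLlHhVvCcSsQqTtAaZz]")


def is_command(ch):
    """Return True if ch is a single path command letter."""
    return ch is not None and COMMAND_PATTERN.fullmatch(ch) is not None
```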

So let's assume that this is indeed a command. We then set it as the command using "Set Command". This takes the current command and, if it is not empty, adds it to the list of tokens. If it is empty, nothing is added to the list of tokens. It then initializes a new command string starting with the command type (Moveto, Lineto, etc., so M, L, etc.).

The next VI that is called when we are at the beginning of a command sequence is "Get Expected Coords". This is fairly simple: it takes the command type and provides the associated number of coordinates. Finally, the last VI called in this sequence is "Advance", which makes the command type character no longer the active character.
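Here is a hedged Python sketch of "Set Command" and "Get Expected Coords". The counts in the table follow the SVG path grammar for the listed commands; the elliptical arc command is left out because the text above does not say how the tokenizer treats it. Function and variable names are hypothetical.

```python
# Number of values each command takes, keyed by the upper-case letter.
EXPECTED_COORDS = {
    "M": 2, "L": 2, "T": 2,   # a single coordinate pair
    "H": 1, "V": 1,           # a single value
    "S": 4, "Q": 4,           # two coordinate pairs
    "C": 6,                   # three coordinate pairs
    "Z": 0,                   # closepath takes nothing
}


def get_expected_coords(command_type):
    """Look up how many values follow this command type."""
    return EXPECTED_COORDS[command_type.upper()]


def set_command(tokens, command, command_type):
    """Store the finished command (if any) and start a new one.

    Returns the updated (tokens, command) pair.
    """
    if command:                      # an empty command is never added
        tokens = tokens + [command]
    return tokens, command_type      # new command starts with its type letter
```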

Before moving on to the next state, I want to call attention to the initialization of the shift register with the number zero. This is the coordinate count (I will add a label once I finish typing this out). Furthermore, if the character is not a command, we produce an error that exits the tokenizer (error 1 is improper input, which is appropriate here).

State: Get Coordinate

OK, so moving on to the next state, here we are going to parse the actual coordinates (well, the first one at least). This looks complicated at first, but it's actually relatively simple once it's explained. The first thing we do is read out the expected coordinates. There are a lot of different possibilities, but the only commands that take a single value are Horizontal Lineto and Vertical Lineto, and Closepath takes zero. They are special cases rather than the rule. All the other commands handled here take an even number of values (two or more), so that is the case we will start with.

The first thing we do is consume any whitespace between the command type and the number. Then we consume the sign, which is + or -. The insides of "Consume Sign" look very similar to those of "Consume Whitespace": if the character is a sign, it is added to the command, we advance, and then we consume any whitespace that may exist between the sign and the number, even though there shouldn't be any.
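The sign step can be sketched in Python roughly as follows. For brevity the whitespace helper here is a plain loop rather than the recursive VI, and all names are hypothetical.

```python
WHITESPACE = " \t\r\n"  # assumption: the whitespace characters the tokenizer accepts


def skip_whitespace(chars, current, command):
    """Consume zero or more whitespace characters into the command."""
    while current < len(chars) and chars[current] in WHITESPACE:
        command += chars[current]
        current += 1
    return current, command


def consume_sign(chars, current, command):
    """Consume an optional '+' or '-', then any whitespace after it."""
    if current < len(chars) and chars[current] in "+-":
        command += chars[current]
        current += 1
        current, command = skip_whitespace(chars, current, command)
    return current, command
```

If there is no sign, both the index and the command come back unchanged, which is what makes the sign optional.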

The next VI that is called is "Consume Coordinates". It consumes any characters that are allowed inside a number until it reaches one that is not. This is a recursive VI like "Consume Whitespace". Similar to the other VIs before it, it uses match pattern to check whether the current character is allowed. It is important to note that commas are not allowed inside numbers in SVG files: only numbers like 10123.456 are allowed, not 10,123.456. What happens at the first character that is not allowed depends on whether this is the top-level call:

(Top level = True): Nothing has been consumed yet, so it raises an error, because it expects there to be a valid number.

(Top level = False): This just means the number for the coordinate has ended, so the recursion stops.

As a side effect, this is also how an incorrect number of coordinates is detected.
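A rough Python sketch of the recursive number step with its top-level flag. The allowed character set (digits and a decimal point) and the exception class standing in for error 1 are assumptions for illustration.

```python
class ImproperInputError(ValueError):
    """Stands in for the tokenizer's error 1 (improper input)."""


NUMBER_CHARS = "0123456789."  # assumption: digits and a decimal point only


def consume_coordinate(chars, current, command, top_level=True):
    """Recursively consume the characters of one number.

    Returns the new (current, command) pair.
    """
    at_end = current >= len(chars)
    if at_end or chars[current] not in NUMBER_CHARS:
        if top_level:
            # Nothing consumed yet: a number was expected here.
            raise ImproperInputError(f"expected a number at index {current}")
        return current, command  # the number simply ended
    return consume_coordinate(
        chars, current + 1, command + chars[current], top_level=False
    )
```

Given `10,123.456`, this consumes `10` and stops at the comma, so the comma is treated as a delimiter and never as part of the number. If a command letter shows up where a number should start, the top-level call raises, which is the "incorrect number of coordinates" side effect mentioned above.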

After "Consume Coordinates", you can see that we have a "Consume Whitespace", a "Consume Comma", and then another "Consume Whitespace". These are the delimiters that are allowed between the numbers that make up the coordinate pair. You must have at least one delimiter, but you can have any amount of whitespace, and the comma is optional. Taking a look at "Consume Comma", at this point it should be rather self-explanatory: it looks for a comma and, if one exists, consumes it. If not, it just moves on.
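The whitespace, comma, whitespace sequence can be sketched in Python like this. Whether the comma is appended to the command or discarded is not stated in the text; appending it is an assumption here, and the names are hypothetical.

```python
WHITESPACE = " \t\r\n"  # assumption: the whitespace characters the tokenizer accepts


def skip_whitespace(chars, current, command):
    """Consume zero or more whitespace characters into the command."""
    while current < len(chars) and chars[current] in WHITESPACE:
        command += chars[current]
        current += 1
    return current, command


def consume_comma(chars, current, command):
    """Consume a single optional comma."""
    if current < len(chars) and chars[current] == ",":
        return current + 1, command + ","  # assumption: comma is kept in the token
    return current, command


def consume_delimiter(chars, current, command):
    """Whitespace, then an optional comma, then whitespace again."""
    current, command = skip_whitespace(chars, current, command)
    current, command = consume_comma(chars, current, command)
    return skip_whitespace(chars, current, command)
```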

Jumping back out into the state, we can then see that for two or more expected coordinates it repeats the consumption of the sign and the number for the second number.

Before we move on to the next state, let's take a look at the other two cases. Thankfully, they all use the same subVIs, so we won't have to do a deep dive into any more of them for this state. If there is only one expected coordinate, we just consume the whitespace, the sign, and then the coordinate. For zero expected coordinates, we don't need to consume anything.

State: Update Coordinate Count

This state is very simple. First, it increments the coordinate count by two, because two is the most coordinates that a single pass through "Get Coordinate" can consume. The check that follows is whether the count is greater than or equal to the number of expected coordinates, so over-counting for the commands that take one or zero coordinates is perfectly fine. If we have not reached the expected number of coordinates, we go back and repeat the "Get Coordinate" state as many times as needed until we have. Once we have the expected number of coordinates, we go to the "Check End" state.
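As a small Python sketch, the whole state boils down to an increment and a comparison. The state names are taken from the headings in this document; the function name is hypothetical.

```python
def update_coordinate_count(count, expected):
    """Add two to the count and pick the next state.

    Returns the new (count, next_state) pair.
    """
    count += 2  # one pass through "Get Coordinate" consumes at most two values
    if count >= expected:
        return count, "Check End"
    return count, "Get Coordinate"
```

For a command that expects six values, the machine loops through "Get Coordinate" three times before moving on. For commands that expect one or zero values, the first pass already satisfies the check.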

State: Check End

The first thing we do in this state is consume any whitespace that there may be after the coordinates; there can be zero or more whitespace characters. Oftentimes, you'll see a newline at the end of a command. Then we check whether the current index is equal to the length of the array. If it is, we are done parsing and can go ahead and exit. If it isn't, we want to check whether there is an implicit command following this one.
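A minimal Python sketch of this state, assuming the same whitespace set as before and using the state names from the headings. The function name is hypothetical.

```python
WHITESPACE = " \t\r\n"  # assumption: the whitespace characters the tokenizer accepts


def check_end(chars, current, command):
    """Consume trailing whitespace, then decide whether parsing is finished.

    Returns (current, command, next_state).
    """
    while current < len(chars) and chars[current] in WHITESPACE:
        command += chars[current]
        current += 1
    if current == len(chars):
        return current, command, "Exit"        # nothing left to parse
    return current, command, "Check Explicit"  # more input follows
```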

State: Check Explicit

Once again, we consume any whitespace just in case, and then we call "Check Explicit". Check Explicit is what really adds value to this program. It checks whether the next character is a command or part of a coordinate. If it is part of a coordinate, this is the beginning of an implicit command. All implicit commands have the same type as the command before them, with one exception: an implicit command that follows a Moveto (M or m) command is an implicit Lineto command, and whether it is absolute or relative is determined by the absoluteness of the Moveto command before it. Here is what happens inside the Check Explicit VI.

The first thing we do is check whether the next character is a command. By definition, that would mean there is no implicit command coming up, so if it is a command we don't do anything. However, if it isn't a command, then it is possible that this is an implicit command. We first check whether the character is part of a coordinate. If it is, we check for the Moveto/Lineto edge case; when that edge case does not apply, we take the command type from the beginning of the current command string. Then we build an array consisting of the implicit command type and a space, and insert it into the character array at the current index. This turns the implicit command into an explicit command.

I also wanted to consider the case where there is a character after the points that would invalidate the SVG. Such a character would fail the check for being part of a coordinate, and in that case we generate an error.
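Putting the Check Explicit logic into a hedged Python sketch: the set of characters that can start a coordinate (digits, sign, decimal point) and the exception class standing in for error 1 are assumptions, and the function names are hypothetical.

```python
class ImproperInputError(ValueError):
    """Stands in for the tokenizer's error 1 (improper input)."""


COMMANDS = "MmLlHhVvCcSsQqTtAaZz"
COORD_START = "0123456789.+-"  # assumption: what may begin a coordinate


def implicit_command_type(command):
    """Command letter to use for an implicit repeat of `command`."""
    letter = command[0]
    if letter == "M":
        return "L"   # absolute Moveto is followed by absolute Lineto
    if letter == "m":
        return "l"   # relative Moveto is followed by relative Lineto
    return letter    # every other command repeats itself


def check_explicit(chars, current, command):
    """Return chars with an explicit command letter inserted if one is implied."""
    ch = chars[current]
    if ch in COMMANDS:
        return chars                      # already explicit: nothing to do
    if ch not in COORD_START:
        raise ImproperInputError(f"unexpected {ch!r} at index {current}")
    insertion = [implicit_command_type(command), " "]
    return chars[:current] + insertion + chars[current:]
```

For example, with the characters of `M 1 2 3 4`, the current index on the `3`, and the current command `M 1 2 `, the result spells out `M 1 2 L 3 4`. The state machine can then go back to "Get Command" and treat the inserted `L` like any other explicit command.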

That concludes the state where we check explicitness. Alternatively, we could have gone to the exit state. Let's take a look at it now.

State: Exit

In the exit case, we are going to add the final command to the list of tokens. Recall that this is done automatically in the "Get Command" State by the "Set Command" VI, but since that doesn't get called at the end, we need to do it manually when we exit for the last command.
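In Python terms, the exit step is just the "push the pending command" half of "Set Command". This is a sketch with a hypothetical function name.

```python
def exit_state(tokens, command):
    """Add the last pending command to the token list, if there is one."""
    if command:
        return tokens + [command]
    return tokens
```

Without this step, the final command of every path would be dropped, because nothing after it triggers "Set Command".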

On Error:

On error this will exit, as the goal of an error is to prevent downstream errors from obstructing the path to the origin of the error.

Example Use

Now that we have gone through everything, let's take a look at the example use. Let's also try some really funky but technically correct mock inputs and see that they are tokenized correctly without issues.
