diff --git a/README.md b/README.md index df06c97..e9f2c59 100644 --- a/README.md +++ b/README.md @@ -29,8 +29,8 @@ It should go without saying that whether you choose to install the package with The code snippet below demonstrates obtaining of a _parse tree_ (in the `stylesheet` variable) by parsing the file `example.css`: ```python -from csspring.parsing import normalize_input, parse_stylesheet -stylesheet = parse_stylesheet(normalize_input(open('example.css', newline=''))))) # The `newline=''` argument prevents default re-writing of newline sequences in input — per the CSS Syntax spec., parsing does filtering of newline sequences so no rewriting by `open` is necessary or desirable +from csspring.parsing import parse_stylesheet +stylesheet = parse_stylesheet(open('example.css', newline='')) # The `newline=''` argument prevents default re-writing of newline sequences in input — per the CSS Syntax spec., parsing does filtering of newline sequences so no rewriting by `open` is necessary or desirable ``` ## Documentation @@ -71,7 +71,7 @@ Parsing is offered only in the form of Python modules — no "command-line" prog ### Why? -We wanted a "transparent" CSS parser — one that could be used in different configurations without it imposing limitations that would strictly speaking go beyond parsing. Put differently, we wanted a parser that does not assume any particular application, a software _library_ in the classical sense of the term, or a true _API_ if you will. +We wanted a "transparent" CSS parser — one that could be used in different configurations without it imposing limitations that would strictly speaking go beyond parsing. Put differently, we wanted a parser that does not assume any particular application — a software _library_ in the classical sense of the term, or a true _API_ if you will. 
For instance, the popular [Less](http://lesscss.org) software seems to rather effortlessly parse CSS [3] text, but it invariably re-arranges white-space in the output, without giving the user any control over the latter. Less is not _transparent_ like that — there is no way to use it with recovery of the originally parsed text from the parse tree — parsing with Less is a one-way street for at least _some_ applications (specifically those that "transform" CSS but need to preserve all of the original input as-is). @@ -79,7 +79,7 @@ In comparison, this library was written to preserve _all_ input, _as-is_. This b ### Why Python? -As touched upon in [the disclaimer above](#disclaimer), the parser was written "from the bottom up" - if it ever adopts a top layer exposing its features with a "command line" tool, said layer will invariably have to tap into the rest of it, the library, and so in the very least a library is offered. Without a command-line tool (implying switches and other facilities commonly associated with command-line tools) the utility of the parser is tightly bound to the capabilities of e.g. the programming language it was written in, since the language effectively functions as the interface to the library (you can hardly use a library offered in the form of a C code without a C compiler and/or a dynamic linker). A parser is seldom used in isolation, after all — its output, the parse tree, is normally fed to another component in a larger application. Python is currently ubiquitous and attractive looking at a set of metrics that are relevant here. The collective amount of Python code is currently growing steadily, which drives adoption, which makes the prospect of offering CSS parsing written in specifically Python ever more enticing. 
+As touched upon in [the disclaimer above](#disclaimer), the parser was written "from the bottom up" - if it ever adopts a top layer exposing its features with a "command line" tool, said layer will invariably have to tap into the rest of it, the library, and so at the very least a library is offered. Without a command-line tool (implying switches and other facilities commonly associated with command-line tools) the utility of the parser is tightly bound to the capabilities of e.g. the programming language it was written in, since the language effectively functions as the interface to the library (you can hardly use a library offered in the form of C code without a C compiler and/or a dynamic linker). A parser is seldom used in isolation, after all — its output, the parse tree, is normally fed to another component in a larger application. Python is ubiquitous and attractive on a number of metrics relevant to us. The collective amount of Python code is growing steadily, which drives adoption; both were factors in choosing to offer CSS parsing written specifically in Python. Another factor for choosing Python was the fact we couldn't find any _sufficiently capable_ CSS parsing libraries written specifically as [reusable] Python module(s). While there _are_ a few CSS parsing libraries available, none declared compliance with or de-facto support CSS 3 (including features like nested rules etc). In comparison, this library was written in close alignment with CSS 3 standard specification(s) (see [the compliance declaration](#compliance)). 
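Reviewer note on the README hunk above: the comment on the `open('example.css', newline='')` call says that `newline=''` prevents `open` from rewriting newline sequences. The sketch below illustrates that behaviour using only the standard library; it does not call `csspring` at all, and the temporary file and its contents are made up purely for illustration.

```python
import os
import tempfile

# Hypothetical sample input mixing three newline conventions (CRLF, CR, LF).
raw = b"a{}\r\nb{}\rc{}\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".css") as f:
    f.write(raw)
    path = f.name

try:
    # Default text mode: "\r\n" and "\r" are translated to "\n" while reading.
    with open(path, encoding="utf-8") as f:
        translated = f.read()
    # newline='': line endings are handed to the caller untranslated.
    with open(path, encoding="utf-8", newline="") as f:
        untouched = f.read()
finally:
    os.remove(path)

print(repr(translated))
print(repr(untouched))
```

With the default mode the original newline sequences are lost before any parser sees them, which is presumably why the README asks for `newline=''` when the goal is to preserve all input as-is.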
diff --git a/expand/csspring/syntax/tokenizing.py b/expand/csspring/syntax/tokenizing.py index 7f97f93..0c316ca 100644 --- a/expand/csspring/syntax/tokenizing.py +++ b/expand/csspring/syntax/tokenizing.py @@ -9,6 +9,7 @@ from ..utils import CP, BufferedPeekingReader, is_surrogate_code_point_ordinal, IteratorReader, join, parser_error, PeekingUnreadingReader from abc import ABC +import builtins from collections.abc import Callable, Iterable, Iterator from dataclasses import dataclass from decimal import Decimal @@ -46,7 +47,7 @@ def next(n: int) -> str: def consume(n: int) -> None: """Consume the next code point from the stream. - Consuming removes a [filtered] code point from the stream. If no code points are available for consumption (the stream is "exhausted"), an empty string signifying the so-called EOF ("end of file", see https://drafts.csswg.org/css-syntax/#eof-code-point) value, is consumed instead. + Consuming removes a [filtered] code point from the stream. If no code points are available for consumption (the stream is "exhausted"), an empty string signifying the so-called EOF ("end of file", see http://drafts.csswg.org/css-syntax/#eof-code-point) value, is consumed instead. 
""" nonlocal consumed # required for the `+=` to work for mutable non-locals like lists (despite the fact that the equivalent `extend` does _not_ require the statement) consumed += input.read(n) or [ FilteredCodePoint('', source='') ] @@ -502,3 +503,26 @@ def is_non_printable_code_point(cp: CP) -> bool: def is_whitespace(cp: CP) -> bool: """See http://drafts.csswg.org/css-syntax/#whitespace.""" return is_newline(cp) or cp in ('\t', ' ') + +# Map of values by token type, for types of tokens which do _not_ have the `value` attribute +token_values = { # For the `token_value` procedure to work as intended, subtypes should be listed _before_ their supertype(s) + OpenBraceToken: '{', + OpenBracketToken: '[', + OpenParenToken: '(', + CloseBraceToken: '}', + CloseBracketToken: ']', + CloseParenToken: ')', + ColonToken: ':', + CommaToken: ',', + SemicolonToken: ';', + CDCToken: '-->', + CDOToken: '<!--', +} + +def token_value(type: builtins.type[Token]) -> str: + """Get the value of a token by its type for types of tokens that do _not_ feature a `value` attribute. 
+ + :param type: Type of token to get the value of + :returns: The value common to the type of tokens + """ + return next(value for key, value in token_values.items() if issubclass(type, key)) diff --git a/setup.py b/setup.py index 83461b9..c34d504 100644 --- a/setup.py +++ b/setup.py @@ -22,6 +22,6 @@ def run(self, *args, **kwargs) -> None: subprocess.check_call(('make', '-C', self.build_lib, '-f', os.path.realpath('Makefile'))) class BuildCommand(setuptools.command.build.build): - sub_commands = [ ('build_make', None) ] + setuptools.command.build.build.sub_commands # Makes the `build_make` command a sub-command of the `build_command`, which has the effect of the former being invoked when the latter is invoked (which is invoked in turn when the wheel must be built, through the `bdist_wheel` command) + sub_commands = [ ('build_make', None) ] + setuptools.command.build.build.sub_commands # Makes the `build_make` command a sub-command of the `build` command, which has the effect of the former being invoked when the latter is invoked (which is invoked in turn when the wheel must be built, through the `bdist_wheel` command) setup(cmdclass={ 'build': BuildCommand, 'build_make': MakeCommand }) diff --git a/src/csspring/selectors.py b/src/csspring/selectors.py index 7c6cc1e..e8a3e79 100644 --- a/src/csspring/selectors.py +++ b/src/csspring/selectors.py @@ -11,7 +11,8 @@ from .syntax.tokenizing import Token, BadStringToken, BadURLToken, CloseBraceToken, CloseBracketToken, CloseParenToken, ColonToken, DelimToken, FunctionToken, HashToken, IdentToken, OpenBraceToken, OpenBracketToken, OpenParenToken, StringToken from .syntax.grammar import any_value -from .values import Production, AlternativesProduction, CommaSeparatedRepetitionProduction, ConcatenationProduction, NonEmptyProduction, OptionalProduction, ReferenceProduction, RepetitionProduction, TokenProduction +from .values import Production, AlternativesProduction, CommaSeparatedRepetitionProduction, 
ConcatenationProduction, NonEmptyProduction, OptionalProduction, ReferenceProduction, RepetitionProduction, TokenProduction, OWS +from .utils import intersperse from functools import singledispatch from typing import cast @@ -32,7 +33,7 @@ def parse(production: Production, input: TokenStream) -> Product | Token | None: @parse.register def _(production: AlternativesProduction, input: TokenStream) -> Product | Token | None: - """Variant of `parse` for productions of the `|` combinator variety (see https://drafts.csswg.org/css-values-4/#component-combinators).""" + """Variant of `parse` for productions of the `|` combinator variety (see http://drafts.csswg.org/css-values-4/#component-combinators).""" input.mark() for element in production.elements: result = parse(element, input) @@ -50,49 +51,24 @@ def parse_any_value(input: TokenStream) -> Product | None: result: list[Token] = [] count = { type: 0 for type in { OpenBraceToken, OpenBracketToken, OpenParenToken } } while True: - match token := input.consume_token(): + match token := input.next_token(): case BadStringToken() | BadURLToken(): break case OpenParenToken() | OpenBracketToken() | OpenBraceToken(): count[type(token)] += 1 case CloseParenToken() | CloseBracketToken() | CloseBraceToken(): - if count[token.mirror_type] <= 0: + if count[token.mirror_type] == 0: break count[token.mirror_type] -= 1 case None: break + input.consume_token() result.append(token) - if result: - return result - else: - return None - -@parse.register -def _(production: CommaSeparatedRepetitionProduction, input: TokenStream) -> Product | None: - """Variant of `parse` for productions of the `#` multiplier variety (see https://drafts.csswg.org/css-values-4/#mult-comma).""" - result: list[Product | Token] = [] - input.mark() - while True: - value: Product | Token | None - if result: - value = parse(production.delimiter, input) - if value is None: - break - result.append(value) - value = parse(production.element, input) - if value is None: - 
break - result.append(value) - if result: - input.discard_mark() - return result - else: - input.restore_mark() - return None + return result or None @parse.register def _(production: ConcatenationProduction, input: TokenStream) -> Product | None: - """Variant of `parse` for productions of the ` ` combinator variety (see "juxtaposing components" at https://drafts.csswg.org/css-values-4/#component-combinators).""" + """Variant of `parse` for productions of the ` ` combinator variety (see "juxtaposing components" at http://drafts.csswg.org/css-values-4/#component-combinators).""" result: list[Product | Token] = [] input.mark() for element in production.elements: @@ -105,12 +81,9 @@ def _(production: ConcatenationProduction, input: TokenStream) -> Product | None @parse.register def _(production: NonEmptyProduction, input: TokenStream) -> Product | None: - """Variant of `parse` for productions of the `!` multiplier variety (see https://drafts.csswg.org/css-values-4/#mult-req).""" + """Variant of `parse` for productions of the `!` multiplier variety (see http://drafts.csswg.org/css-values-4/#mult-req).""" result = cast(Product | None, parse(production.element, input)) # The element of a non-empty production is concatenation, and the `parse` overload for `ConcatenationProduction` never returns a `Token`, only `Product | None` - if result and any(tokens(result)): - return result - else: - return None + return result if result and any(tokens(result)) else None @parse.register def _(production: ReferenceProduction, input: TokenStream) -> Product | Token | None: @@ -126,9 +99,21 @@ def _(production: RepetitionProduction, input: TokenStream) -> Product | None: result: list[Product | Token] = [] input.mark() while True: + if result and production.separator: + input.mark() + separator = parse(production.separator, input) + if separator is None: + input.restore_mark() + break value = parse(production.element, input) if value is None: + if result and production.separator: + 
input.restore_mark() break + if result and production.separator: + assert separator is not None + result.append(separator) + input.discard_mark() result.append(value) if len(result) == production.max: break @@ -143,7 +128,7 @@ def _(production: RepetitionProduction, input: TokenStream) -> Product | None: def _(production: TokenProduction, input: TokenStream) -> Token | None: """Variant of `parse` for token productions. - A token production can be identified in the grammar at https://drafts.csswg.org/selectors-4/#grammar with the `<...-token>` text. + A token production can be identified in the grammar at http://drafts.csswg.org/selectors-4/#grammar with the `<...-token>` text. """ input.mark() if isinstance(token := input.consume_token(), production.type) and all((getattr(token, name) == value) for name, value in production.attributes.items()): @@ -157,13 +142,19 @@ def parse_selector_list(input: TokenStream) -> Product | None: Parsing of selector lists is the _reason d'etre_ for this module and this is the [convenience] procedure that exposes the feature. """ - return cast(Product | None, parse(grammar.selector_list, input)) + return cast(Product | None, parse(ConcatenationProduction(OWS, grammar.selector_list, OWS), input)) class Grammar: """The grammar defining the language of selector list expressions. Normally a grammar would be defined as a set of rules (for deriving productions), where each rule would feature a component to the left side of the `->` operator (the "rewriting" operator) and a component to the right side of the operator. Owing to relative simplicity of the Selectors grammar -- where the left-hand side component is always a production name _reference_ (an identifying factor of context free grammars), we leverage Python's meta-programming facilities and use class attribute assignment statements to define the rules instead, where the assigned value is the right side of the rule, an arbitrary production (which may be an opaque value). 
Each attribute of the grammar is assigned the corresponding name automatically, owing to the `__set_name__` dunder method of the common production (super)class (where appropriate). + NOTE: Some of the productions defined in the specification have been rewritten below to eliminate repetition. These rewritten productions are marked accordingly, for clarity. + + NOTE: `intersperse` is used to insert white-space productions as required by the specification, which otherwise doesn't include them explicitly, instead describing white-space handling "in prose". + + NOTE: The Values & Units spec. defines no notation for `RepetitionProduction` productions whose `separator` attribute value is anything other than `None` (the '[ ... ]*' variant) or that of `CommaSeparatedRepetitionProduction` (the '[ ... ]#' variant). Nevertheless, such productions are employed below to eliminate repetition as part of optimizing the grammar. + Implements http://drafts.csswg.org/selectors-4/#grammar. 
""" ns_prefix = ConcatenationProduction(OptionalProduction(AlternativesProduction(TokenProduction(IdentToken), TokenProduction(DelimToken, value='*'))), TokenProduction(DelimToken, value='|')) @@ -173,26 +164,17 @@ class Grammar: class_selector = ConcatenationProduction(TokenProduction(DelimToken, value='.'), TokenProduction(IdentToken)) attr_matcher = ConcatenationProduction(OptionalProduction(AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('~', '|', '^', '$', '*')))), TokenProduction(DelimToken, value='=')) attr_modifier = AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('i', 's'))) - attribute_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), TokenProduction(CloseBracketToken)), ConcatenationProduction(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), ReferenceProduction(attr_matcher), AlternativesProduction(TokenProduction(StringToken), TokenProduction(IdentToken)), OptionalProduction(ReferenceProduction(attr_modifier)), TokenProduction(CloseBracketToken))) + attribute_selector = ConcatenationProduction(*intersperse(TokenProduction(OpenBracketToken), ReferenceProduction(wq_name), OptionalProduction(ConcatenationProduction(*intersperse(ReferenceProduction(attr_matcher), AlternativesProduction(TokenProduction(StringToken), TokenProduction(IdentToken)), OptionalProduction(ReferenceProduction(attr_modifier)), separator=OWS))), TokenProduction(CloseBracketToken), separator=OWS)) # Rewritten legacy_pseudo_element_selector = ConcatenationProduction(TokenProduction(ColonToken), AlternativesProduction(*(TokenProduction(IdentToken, value=value) for value in ('before', 'after', 'first-line', 'first-letter')))) - pseudo_class_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(ColonToken), TokenProduction(IdentToken)), ConcatenationProduction(TokenProduction(ColonToken), TokenProduction(FunctionToken), 
ReferenceProduction(any_value), TokenProduction(CloseParenToken))) + pseudo_class_selector = ConcatenationProduction(TokenProduction(ColonToken), AlternativesProduction(TokenProduction(IdentToken), ConcatenationProduction(TokenProduction(FunctionToken), ReferenceProduction(any_value), TokenProduction(CloseParenToken)))) # Rewritten pseudo_element_selector = AlternativesProduction(ConcatenationProduction(TokenProduction(ColonToken), ReferenceProduction(pseudo_class_selector)), ReferenceProduction(legacy_pseudo_element_selector)) pseudo_compound_selector = ConcatenationProduction(ReferenceProduction(pseudo_element_selector), RepetitionProduction(ReferenceProduction(pseudo_class_selector))) subclass_selector = AlternativesProduction(ReferenceProduction(id_selector), ReferenceProduction(class_selector), ReferenceProduction(attribute_selector), ReferenceProduction(pseudo_class_selector)) compound_selector = NonEmptyProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(type_selector)), RepetitionProduction(ReferenceProduction(subclass_selector)))) complex_selector_unit = NonEmptyProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(compound_selector)), RepetitionProduction(ReferenceProduction(pseudo_compound_selector)))) combinator = AlternativesProduction(*(TokenProduction(DelimToken, value=value) for value in ('>', '+', '~')), ConcatenationProduction(*(TokenProduction(DelimToken, value=value) for value in ('|', '|')))) - complex_selector = ConcatenationProduction(ReferenceProduction(complex_selector_unit), RepetitionProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_selector_unit)))) + complex_selector = RepetitionProduction(ReferenceProduction(complex_selector_unit), min=1, separator=AlternativesProduction(ConcatenationProduction(OWS, ReferenceProduction(combinator), OWS), OWS)) # Rewritten complex_selector_list = 
CommaSeparatedRepetitionProduction(ReferenceProduction(complex_selector)) selector_list = ReferenceProduction(complex_selector_list) - complex_real_selector = ConcatenationProduction(ReferenceProduction(compound_selector), RepetitionProduction(ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(compound_selector)))) - complex_real_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(complex_real_selector)) - compound_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(compound_selector)) - simple_selector = AlternativesProduction(ReferenceProduction(type_selector), ReferenceProduction(subclass_selector)) - simple_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(simple_selector)) - relative_selector = ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_selector)) - relative_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(relative_selector)) - relative_real_selector = ConcatenationProduction(OptionalProduction(ReferenceProduction(combinator)), ReferenceProduction(complex_real_selector)) - relative_real_selector_list = CommaSeparatedRepetitionProduction(ReferenceProduction(relative_real_selector)) grammar = Grammar() diff --git a/src/csspring/syntax/grammar.py b/src/csspring/syntax/grammar.py index f60a631..8c9433b 100644 --- a/src/csspring/syntax/grammar.py +++ b/src/csspring/syntax/grammar.py @@ -3,6 +3,6 @@ from ..values import Production -# See https://drafts.csswg.org/css-syntax/#any-value +# See http://drafts.csswg.org/css-syntax/#any-value any_value = Production() any_value.name = 'any_value' diff --git a/src/csspring/utils.py b/src/csspring/utils.py index cf84427..09d8881 100644 --- a/src/csspring/utils.py +++ b/src/csspring/utils.py @@ -15,6 +15,22 @@ T_co = TypeVar('T_co', covariant=True) T_contra = TypeVar('T_contra', contravariant=True) +def intersperse(*items: T, separator: T) 
-> Iterable[T]: + """Yield items with a separator yielded between each item. + + E.g. `intersperse("foo", "bar", "baz", separator="-")` will yield "foo", "-", "bar", "-", then "baz". + + Owing to a mere design choice, this procedure requires, as far as the type checker is concerned, that the separator be of a type co-variant with the type of the items, with the latter assumed to be homogeneous (items are all of the same type). + + :param items: The items to yield (assumed to be of the same type or to share a super-type) + :param separator: A value to yield between each two adjacent items + """ + it = iter(items) + yield next(it) + for item in it: + yield separator + yield item + @runtime_checkable class Reader(Protocol[T_co]): """An interface to readable/readers, to assist type checking for the most part.""" diff --git a/src/csspring/values.py b/src/csspring/values.py index 33c5b46..153edef 100644 --- a/src/csspring/values.py +++ b/src/csspring/values.py @@ -1,9 +1,8 @@ -"""Implement the ["CSS Values and Units Module Level 4"](https://drafts.csswg.org/css-values-4) specification. +"""Implement the ["CSS Values and Units Module Level 4"](http://drafts.csswg.org/css-values-4) specification. Only parts currently in use by the rest of the `csspring` pcakge, are implemented. """ -from .syntax.tokenizing import Token, CommaToken, CommentToken, WhitespaceToken -from .syntax.tokenizing import ColonToken +from .syntax.tokenizing import Token, token_value, CommaToken, CommentToken, WhitespaceToken import builtins from collections.abc import Iterable, Mapping @@ -12,203 +11,199 @@ from typing import cast class Production: - """An [abstract] class of CSS grammar elements. + """An [abstract] class of CSS grammar elements. - A production is a broadly accepted term for objects that express a subset of valid linguistic permutations for a given formal language. 
+ A production is a broadly accepted term for objects that express a subset of valid linguistic permutations for a given formal language. - In this case the formal language is CSS, with tokens (see `Token`) for symbols. + In this case the formal language is CSS, with tokens (see `Token`) for symbols. - This class is effectively a "tag" type (given how it imposes no effective protocol), it serves to help distinguish productions from other kinds of objects, for at least type checking purposes. The feature that allows productions to automatically acquire a name, owing to Python's meta-programming features, is only added to this specific class for sheer convenience of it. + This class is effectively a "tag" type (given how it imposes no effective protocol), it serves to help distinguish productions from other kinds of objects, for at least type checking purposes. The feature that allows productions to automatically acquire a name, owing to Python's meta-programming features, is only added to this specific class for sheer convenience of it. - Note that a production is not the same as a [parse] product -- the former is an element of some language grammar, while the latter is a result of parsing a sequence of tokens in accordance with a grammar production, in effect expressing one single element from the set of all permutations a production would permit, corresponding to consumed part of the sequence. 
- """ - name: str - def __set_name__(self, _, name): - assert not hasattr(self, "name") or self.name == name # Given the usage context, we don't want to support re-setting the name, not with a different value at least - self.name = name + Note that a production is not the same as a [parse] product -- the former is an element of some language grammar, while the latter is a result of parsing a sequence of tokens in accordance with a grammar production, in effect expressing one single element from the set of all permutations a production would permit, corresponding to consumed part of the sequence. + """ + name: str + def __set_name__(self, _, name): + assert not hasattr(self, "name") or self.name == name # Given the usage context, we don't want to support re-setting the name, not with a different value at least + self.name = name # The fundamental grammar productions are defined below, see also http://drafts.csswg.org/css-values-4/#value-defs class ReferenceProduction(Production): - """Class of productions expressing a _reference_ to some production. + """Class of productions expressing a _reference_ to some production. - In the canonical notation, references use _names_ to identify the production they point to, but since we do not employ a grammar parser (e.g. one that parses BNF) but use the equivalent of an already parsed grammar, we permit ourselves the convenience of references pointing to productions directly, without using names. This is not to say production names aren't used, they are, just for different purposes (e.g. for serializing the parsed grammar back to the kind of notation defined by the "Values and Units" specification. - """ - element: Production - def __init__(self, element: Production): - self.element = element + In the canonical notation, references use _names_ to identify the production they point to, but since we do not employ a grammar parser (e.g. 
one that parses BNF) but use the equivalent of an already parsed grammar, we permit ourselves the convenience of references pointing to productions directly, without using names. This is not to say production names aren't used, they are, just for different purposes (e.g. for serializing the parsed grammar back to the kind of notation defined by the "Values and Units" specification). + """ + element: Production + def __init__(self, element: Production): + self.element = element class AlternativesProduction(Production): - """Class of productions expressing exactly one production from from an _ordered_ set of alternatives. + """Class of productions expressing exactly one production from an _ordered_ set of alternatives. - Note that the ordered set type is expressed with `Iterable`, as Python doesn't provide a true ordered set type (with _ordered_ insertion) and while `Iterable` will allow duplicate items, something that a set normally must not, it was deemed an acceptable compromise to retain simplicity and avoid importing exotic third-party type(s) that implement ordered sets. + Note that the ordered set type is expressed with `Iterable`, as Python doesn't provide a true ordered set type (with _ordered_ insertion) and while `Iterable` will allow duplicate items, something that a set normally must not, it was deemed an acceptable compromise to retain simplicity and avoid importing exotic third-party type(s) that implement ordered sets. - Implements the `|` combinator as defined at http://drafts.csswg.org/css-values-4/#component-combinators. - """ - elements: Iterable[Production] # The alternative(s) - def __init__(self, *elements: Production): - self.elements = elements + Implements the `|` combinator as defined at http://drafts.csswg.org/css-values-4/#component-combinators. 
+ """ + elements: Iterable[Production] # The alternative(s) + def __init__(self, *elements: Production): + self.elements = elements class ConcatenationProduction(Production): - """Class of productions equivalent to a concatenation (ordered sequence) of productions. + """Class of productions equivalent to a concatenation (ordered sequence) of productions. - Implements "juxtaposing components" as defined at http://drafts.csswg.org/css-values-4/#component-combinators. + Implements "juxtaposing components" as defined at http://drafts.csswg.org/css-values-4/#component-combinators. - The name of the class borrows the term "concatenation" from the more general parsing lingo. - """ - elements: Iterable[Production] - def __init__(self, *elements: Production): - """ - :param elements: A sequence of productions expressing the concatenation - """ - self.elements = elements + The name of the class borrows the term "concatenation" from the more general parsing lingo. + """ + elements: Iterable[Production] + def __init__(self, *elements: Production): + """ + :param elements: A sequence of productions expressing the concatenation + """ + self.elements = elements class NonEmptyProduction(Production): - """Class of productions that behave much like `ConcatenationProduction` but only permit a _non-empty_ concatenation. + """Class of productions that behave much like `ConcatenationProduction` but only permit a _non-empty_ concatenation. - Implements the `[...]!` notation as defined at http://drafts.csswg.org/css-values-4/#mult-req, see the "An exclamation point (!) after a group..." bullet point. - """ - element: ConcatenationProduction - def __init__(self, element: ConcatenationProduction): - """ - :param element: A [`ConcatenationiProduction`] production that will express the non-empty concatenation - """ - self.element = element + Implements the `[...]!` notation as defined at http://drafts.csswg.org/css-values-4/#mult-req, see the "An exclamation point (!) after a group..." 
bullet point. + """ + element: ConcatenationProduction + def __init__(self, element: ConcatenationProduction): + """ + :param element: A [`ConcatenationProduction`] production that will express the non-empty concatenation + """ + self.element = element class RepetitionProduction(Production): - """Class of productions that express repetition of an element (with optionally lower and upper bounds on the number of repetitions). - - Implements the `*` notation as defined at http://drafts.csswg.org/css-values-4/#mult-zero-plus. - """ - element: Production - min: int - max: int | None - def __init__(self, element: Production, min: int = 0, max: int | None = None): - """ - :param element: The production expressing the repeating part of this production - :param min: The minimum amount of times the parser must accept input, i.e. the minimum number of repetitions of token sequences accepted by the parser - :param max: The maximum amount of times the parser will be called, i.e. the maximum number of repetitions that may be consumed in the input; the value of `None` implies no maximum (i.e. no upper bound on repetition) - """ - assert min >= 0 - assert max is None or max > 0 - assert max is None or min <= max - self.min = min - self.max = max - self.element = element + """Class of productions that express repetition of an element (optionally with lower and upper bounds on the number of repetitions). + + Implements the `*` notation as defined at http://drafts.csswg.org/css-values-4/#mult-zero-plus. + """ + separator: Production | None = None + element: Production + min: int + max: int | None + def __init__(self, element: Production, min: int = 0, max: int | None = None, *, separator: Production | None = None): + """ + :param element: The production expressing the repeating part of this production + :param min: The minimum amount of times the parser must accept input, i.e. 
the minimum number of repetitions of token sequences accepted by the parser + :param max: The maximum amount of times the parser will be called, i.e. the maximum number of repetitions that may be consumed in the input; the value of `None` implies no maximum (i.e. no upper bound on repetition) + :param separator: A production expressing the "delimiting" part between any two repetitions of the `element` production; if omitted or `None`, there's _no_ delimiting part -- repetitions are _adjacent_ + """ + assert min >= 0 + assert max is None or max > 0 + assert max is None or min <= max + self.min = min + self.max = max + self.element = element + if separator: + self.separator = separator class OptionalProduction(RepetitionProduction): - """Class of productions equivalent to `RepetitionProduction` with no lower bound and accepting no repetition of the element, meaning the element is expressed at most once. + """Class of productions equivalent to `RepetitionProduction` with no lower bound and accepting no repetition of the element, meaning the element is expressed at most once. - Implements the `?` notation as defined at http://drafts.csswg.org/css-values-4/#mult-opt. - """ - def __init__(self, element: Production): - super().__init__(element, 0, 1) + Implements the `?` notation as defined at http://drafts.csswg.org/css-values-4/#mult-opt. + """ + def __init__(self, element: Production): + super().__init__(element, 0, 1) class TokenProduction(Production): - """Class of productions that express a token, optionally one with a matching set of attributes. 
- """ - type: type[Token] - attributes: Mapping - def __init__(self, type: builtins.type[Token], **attributes): - """ - :param type: The type of token this production will express - :param attributes: Mapping of presumably token attribute values by name, to use for expressing the set of attributes on the token this production will express - """ - self.type = type - self.attributes = attributes - + """Class of productions that express a token, optionally one with a matching set of attributes. + """ + type: type[Token] + attributes: Mapping + def __init__(self, type: builtins.type[Token], **attributes): + """ + :param type: The type of token this production will express + :param attributes: Mapping of presumably token attribute values by name, to use for expressing the set of attributes on the token this production will express + """ + self.type = type + self.attributes = attributes + +OWS = optional_whitespace = RepetitionProduction(TokenProduction(WhitespaceToken)) whitespace = RepetitionProduction(TokenProduction(WhitespaceToken), min=1) # The white-space production; presence of white-space expressed with this production, is _mandatory_ (`min=1`); the definition was "hoisted" here because a) it depends on `RepetitionProduction` and `TokenProduction` definitions, which must thus precede it, and b) because the `CommaSeparatedRepetitionParser` definition that follows, depends on it, in turn -class CommaSeparatedRepetitionProduction(Production): - """Class of productions that express a non-empty comma-separated repetition (CSR) of a production element. - - Unlike `RepetitionProduction` which permits arbitrary number of the production element, this class does not currently implement arbitrary repetition bounds. The delimiting part (a comma optionally surrounded by white-space) is mandatory, which implies at least one repetition (two expressions of the element). 
Disregarding the delimiting behaviour, productions of this class thus behave like those of `RepetitionProduction` with `2` for `min` and `None` for `max` property values. +class CommaSeparatedRepetitionProduction(RepetitionProduction): + """Class of productions that express a non-empty comma-separated repetition (CSR) of a production element. - Implements the `#` notation as defined at http://drafts.csswg.org/css-values-4/#mult-comma. - """ - delimiter = ConcatenationProduction(OptionalProduction(AlternativesProduction(whitespace, TokenProduction(CommentToken))), TokenProduction(CommaToken), OptionalProduction(AlternativesProduction(whitespace, TokenProduction(CommentToken)))) # The production expressing the delimiter to use with the repetition, a comma with [optional] white-space around it - element: Production - def __init__(self, element: Production): - """ - :param element: A production to use for expressing the repeating part in this production - """ - self.element = element + Implements the `#` notation as defined at http://drafts.csswg.org/css-values-4/#mult-comma. + """ + separator = ConcatenationProduction(OWS, TokenProduction(CommaToken), OWS) # A comma with [optional] white-space around it + def __init__(self, element: Production, min: int = 1, max: int | None = None): + assert min >= 1 # "one or more times" (ref. definition); the spec. 
does not define whether a minimum of zero is permitted, so we err on the safer side + super().__init__(element, min, max) class Formatter: - """Class of objects that offer procedures for serializing productions into streams of text formatted per the [value definition syntax](https://drafts.csswg.org/css-values-4/#value-defs).""" - grouping_strings = ('[ ', ' ]') # The kind of grouping symbol to use when a production expression must be surrounded with a pair of brace-like grouping symbols, in its serialized form - def grouping_mode(self, production: Production): - """Determine whether a given production shall require an explicit pair of grouping symbols when featured as an _operand_ (e.g. in binary/unary operation context). - :returns: `True` if the expression of `production` serialized with this formatter, should feature explicit grouping symbols wrapping it, `False` otherwise - """ - match production: - case AlternativesProduction() | CommaSeparatedRepetitionProduction() | ConcatenationProduction() | NonEmptyProduction() | RepetitionProduction(): return not hasattr(production, "name") - case ReferenceProduction() | TokenProduction(): return False - case _: - raise ValueError - def combined(self, productions: Iterable, combinator: str) -> Iterable[str]: - it = (self.format(production) for production in productions) - yield from next(it) - for item in it: - yield combinator - yield from item - @singledispatchmethod - def format(self, production: Production) -> Iterable[str]: - raise TypeError(f"No suitable `format` method for {production}") - @format.register - def _(self, production: AlternativesProduction) -> Iterable[str]: - return self.combined(production.elements, ' | ') - @format.register - def _(self, production: ConcatenationProduction) -> Iterable[str]: - return self.combined(production.elements, ' ') - @format.register - def _(self, production: NonEmptyProduction) -> Iterable[str]: - yield from self.operand(production.element) - yield '!' 
- @format.register - def _(self, production: ReferenceProduction) -> Iterable[str]: - yield '<' + self.name(production.element) + '>' - @format.register - def _(self, production: RepetitionProduction) -> Iterable[str]: - yield from self.operand(production.element) - yield self.multiplier(production) - @format.register - def _(self, production: TokenProduction) -> Iterable[str]: - if production.attributes: - if 'value' not in production.attributes or len(production.attributes) > 1: - raise NotImplementedError # the "Values and Units" specification doesn't feature token productions with matching of attributes other than `value` - yield repr(production.attributes['value']) - else: - if hasattr(production.type, 'value'): - yield '<' + re.sub(r'(^)?[A-Z]', lambda m: (('-' if m[1] is None else '') + m[0].lower()), production.type.__name__) + '>' # type: ignore # MyPy 1.11 complains with "error: Unsupported operand types for + ("str" and "bytes") [operator]", but the error appears to be a false positive: http://github.com/python/mypy/issues/12961 # TODO: Revisit the issue following MyPy updates - else: - if issubclass(production.type, ColonToken): - value = ',' - else: - raise TypeError - yield repr(value) - @format.register - def _(self, production: CommaSeparatedRepetitionProduction) -> Iterable[str]: - yield from self.operand(production.element) - yield '#' - def multiplier(self, production: RepetitionProduction) -> str: - match (production.min, production.max): - case (0, 1): - return '?' - case (0, None): - return '*' - case (1, None): - return '+' - case _: - return '{' + (str(production.min) if production.min == production.max else str(production.min) + ',' + (str(production.max) if production.max else '')) + '}' - def name(self, production: Production) -> str: - """Get the name of a production. 
- - :raises AttributeError: if the production does not have a name - """ - return production.name.replace('_', '-') - def operand(self, production) -> Iterable[str]: - group_start, group_end = self.grouping_strings if self.grouping_mode(production) else ('', '') - yield group_start - yield from self.format(production) - yield group_end + """Class of objects that offer procedures for serializing productions into streams of text formatted per the [value definition syntax](http://drafts.csswg.org/css-values-4/#value-defs).""" + grouping_strings = ('[ ', ' ]') # The kind of grouping symbol to use when a production expression must be surrounded with a pair of brace-like grouping symbols, in its serialized form + def grouping_mode(self, production: Production): + """Determine whether a given production shall require an explicit pair of grouping symbols when featured as an _operand_ (e.g. in binary/unary operation context). + :returns: `True` if the expression of `production` serialized with this formatter, should feature explicit grouping symbols wrapping it, `False` otherwise + """ + match production: + case AlternativesProduction() | CommaSeparatedRepetitionProduction() | ConcatenationProduction() | NonEmptyProduction() | RepetitionProduction(): return not hasattr(production, "name") + case ReferenceProduction() | TokenProduction(): return False + case _: + raise ValueError + def combined(self, productions: Iterable, combinator: str) -> Iterable[str]: + it = (self.format(production) for production in productions) + yield from next(it) + for item in it: + yield combinator + yield from item + @singledispatchmethod + def format(self, production: Production) -> Iterable[str]: + raise TypeError(f"No suitable `format` method for {production}") + @format.register + def _(self, production: AlternativesProduction) -> Iterable[str]: + return self.combined(production.elements, ' | ') + @format.register + def _(self, production: ConcatenationProduction) -> Iterable[str]: + return 
self.combined(production.elements, ' ') + @format.register + def _(self, production: NonEmptyProduction) -> Iterable[str]: + yield from self.operand(production.element) + yield '!' + @format.register + def _(self, production: ReferenceProduction) -> Iterable[str]: + yield '<' + self.name(production.element) + '>' + @format.register + def _(self, production: RepetitionProduction) -> Iterable[str]: + yield from self.operand(production.element) + yield self.multiplier(production) + @format.register + def _(self, production: TokenProduction) -> Iterable[str]: + if production.attributes: + if production.attributes.keys() != { 'value' }: + raise TypeError # the "Values and Units" specification doesn't feature token productions with matching of attributes other than `value` + yield repr(production.attributes['value']) + else: + if hasattr(production.type, 'value'): + yield '<' + re.sub(r'(^)?[A-Z]', lambda m: (('-' if m[1] is None else '') + m[0].lower()), production.type.__name__) + '>' # type: ignore # MyPy 1.11 complains with "error: Unsupported operand types for + ("str" and "bytes") [operator]", but the error appears to be a false positive: http://github.com/python/mypy/issues/12961 # TODO: Revisit the issue following MyPy updates + else: + yield repr(token_value(production.type)) + @format.register + def _(self, production: CommaSeparatedRepetitionProduction) -> Iterable[str]: + yield from self.operand(production.element) + yield '#' + def multiplier(self, production: RepetitionProduction) -> str: + match (production.min, production.max): + case (0, 1): + return '?' + case (0, None): + return '*' + case (1, None): + return '+' + case _: + return '{' + (str(production.min) if production.min == production.max else str(production.min) + ',' + (str(production.max) if production.max else '')) + '}' + def name(self, production: Production) -> str: + """Get the name of a production. 
+ + :raises AttributeError: if the production does not have a name + """ + return production.name.replace('_', '-') + def operand(self, production) -> Iterable[str]: + group_start, group_end = self.grouping_strings if self.grouping_mode(production) else ('', '') + yield group_start + yield from self.format(production) + yield group_end