Last Update |
---|
Oct 23, 2022 |
Note: All Rights Reserved - Unpublished research results
- Motivation
- Contributions
- Tools Used
- K-Forrelation Classification Problem
- Algorithms to generate the dataset
- Guidelines for Generation
This article is a brief summary of the project, click here for the full written report.
This project is motivated by the need for challenging and relevant datasets to benchmark:
- Hybrid classical-quantum classifiers
- Algorithms to optimize quantum circuit design for classification tasks
A few studies have suggested datasets for quantum machine learning, but they are better suited for training models on quantum control and tomography (Perrier et al., 2021), or are simply based on heuristics about the difficulty of simulating entanglement (Schatzki et al., 2021).
A classification dataset based on the k-fold Forrelation (k-Forrelation) problem is interesting because:

- k-Forrelation is formally proved to maximally separate the query complexity between classical and quantum computation in the black-box model (Aaronson, 2014), which makes it more directly relevant to quantum classifiers than generic datasets (Note: this research does not attempt to compare quantum against classical classifiers)
- The k-Forrelation decision problem is PromiseBQP-complete and has been proved to be within the expressiveness of the Variational Quantum Classifier and the Quantum Support Vector Machine (Jager & Krems, 2022)
- k-Forrelation datasets can be generated with different parameterizations that allow evaluation of model performance and computational cost at scale
This research addresses the following challenges regarding the k-Forrelation dataset:

- The positive class is exponentially rare at larger problem sizes, making it prohibitively difficult to sample a balanced dataset
- Random sampling in the function space is incapable of generating examples with high positive thresholds, which requires the development of a novel sampling algorithm
- The k-Forrelation decision problem is theoretically formulated, but the relative classification difficulty at different parameterizations has not been studied
The generated k-Forrelation datasets could also enable future research on performance criteria suitable for benchmarking quantum classifiers beyond accuracy (e.g., scalability), or to statistically confirm hypotheses of algorithm improvement.
Note: Benchmarking of (classical) machine learning algorithms concerns criteria such as model complexity, scalability, sample complexity, interpretability, ability to learn from data streams, and performance, among others. It remains an open question what the criteria should be to benchmark quantum algorithms and confirm any hypothesis of performance improvement. This is because the matters of concern for quantum computing are different and include the depth of the circuit (trainability), the coherence time required if storing quantum information, and optimizing entanglement, among others. This research suggests some potential uses for the k-Forrelation datasets but leaves further discussion on the proper benchmarking of quantum algorithms to future work.
- An algorithm to generate k-Forrelation datasets with high positive class threshold based on approximated Fourier Transform
- An analysis of the properties of k-Forrelation datasets in classification task
- Guidelines for the generation of k-Forrelation datasets for benchmarking
- Suggestions and demonstration for potential uses of k-Forrelation datasets (in progress)
- MATLAB (for development of sampling algorithm and dataset generation)
- Python (for training composite-kernel SVM and quantum classifiers on the datasets)
- Advanced Research Computing (ARC) Cluster to run all codes
In 2014, Aaronson and Ambainis proved the maximal separation in query complexity between quantum and classical computation in the black-box model. The study involved a property-testing problem called Forrelation (originally introduced by Aaronson, 2009). In Forrelation, two Boolean functions are given and the task is to decide whether one function is highly correlated with the Fourier transform of the other. k-fold Forrelation (or k-Forrelation) is the heuristic generalization of Forrelation to k > 2 Boolean functions.
A review of Fourier analysis on the Boolean Cube can be found here (de Wolf, 2008). This section summarizes the mathematical concepts of discrete Fourier transform on the Boolean cube for the generation of a classification dataset.
Consider a function $f: \{0,1\}^n \mapsto \mathbb{R}$. The value table of $f$ can be viewed as a vector in $\mathbb{R}^{2^n}$ (we will later focus on functions that map only to Boolean values $\{+1,-1\}$).

Define an inner product on this space of functions:

$\langle f, g \rangle = \frac{1}{2^n} \sum_{x \in \{0,1\}^n} f(x)\, g(x)$

which defines the norm $\|f\| = \sqrt{\langle f, f \rangle}$.

Also, define the parity functions

$\chi_S(x) = (-1)^{S \cdot x}, \quad S \in \{0,1\}^n$

where, $S \cdot x = \sum_i S_i x_i$ is the inner product of the bitstrings $S$ and $x$ (only its parity matters). It can be shown that the $2^n$ parity functions $\{\chi_S\}$ form an orthonormal basis for this space.

Then, the Fourier Transform of any function $f$ is given by its coefficients in this basis:

$\hat{f}(S) = \langle f, \chi_S \rangle = \frac{1}{2^n} \sum_{x} f(x)\, (-1)^{S \cdot x}$

Given two functions $f_1, f_2$, the Forrelation (how much $f_1$ correlates with the Fourier transform of $f_2$) is:

$\Phi_{f_1,f_2} = \frac{1}{2^{3n/2}} \sum_{x_1, x_2} f_1(x_1)\, (-1)^{x_1 \cdot x_2} f_2(x_2)$

The form of $\Phi$ generalizes heuristically to $k$ functions:

$\Phi_{f_1,\dots,f_k} = \frac{1}{2^{(k+1)n/2}} \sum_{x_1,\dots,x_k} f_1(x_1)\, (-1)^{x_1 \cdot x_2} f_2(x_2) \cdots (-1)^{x_{k-1} \cdot x_k} f_k(x_k)$

If we restrict the range of each function to the Boolean values $\{+1,-1\}$, it holds that $|\Phi| \le 1$.
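As an aside, the Fourier coefficients $\hat{f}(S)$ for all $S$ can be computed at once with a fast Walsh-Hadamard transform. The sketch below is illustrative Python (the project itself uses MATLAB), and the function names are my own:

```python
def wht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^n vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def fourier_coeffs(f):
    """All coefficients f_hat(S) = (1/2^n) * sum_x f(x) * (-1)^(S.x)."""
    return [c / len(f) for c in wht(f)]

# Example: the parity function chi_T(x) = (-1)^(x1 + x2) on n = 2 bits
# has a single Fourier coefficient equal to 1, at S = T (index 3 = bits 11).
f = [1, -1, -1, 1]
print(fourier_coeffs(f))  # [0.0, 0.0, 0.0, 1.0]
```

For any $\{+1,-1\}$-valued $f$, the squared coefficients sum to 1 (Parseval), which is a convenient sanity check on the transform.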
Figure 1. A quantum circuit that can be taken to define the k-fold Forrelation problem. (Aaronson, 2014)
Where each $U_{f_i}$ denotes the phase oracle $U_{f_i}|x\rangle = f_i(x)|x\rangle$ and $H^{\otimes n}$ is a layer of Hadamard gates.

Since the amplitude of the $|0\rangle^{\otimes n}$ state at the end of this circuit equals $\Phi_{f_1,\dots,f_k}$, the quantity $\Phi$ can be estimated from the probability of measuring the all-zeros outcome.

The decision problem: Decide whether $|\Phi| \le 0.01$ (negative class) or $\Phi \ge \mu$ (positive class), promised that one of the two is the case.
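Classically, $\Phi$ can be evaluated in time $O(k\, n\, 2^n)$ by alternating pointwise multiplication with Walsh-Hadamard transforms. A minimal Python sketch (my own naming, not the project's MATLAB code):

```python
def wht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^n vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def phi(funcs, n):
    """k-fold Forrelation of value tables funcs (each of length 2^n, entries +/-1)."""
    w = list(funcs[0])
    for f in funcs[1:]:
        # linearized form: transform the accumulated vector, then multiply pointwise
        w = [fy * wy for fy, wy in zip(f, wht(w))]
    return sum(w) / 2 ** ((len(funcs) + 1) * n / 2)

# The ensemble of k constant functions f_i(x) = 1 achieves the maximum Phi = 1.
ones = [1, 1]                      # n = 1, so value tables have 2 entries
print(phi([ones, ones, ones], 1))  # 1.0
```

The constant-functions example is a useful check: every summand in the definition of $\Phi$ contributes positively, so the normalized sum saturates at 1.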
This decision problem was used to prove the maximal separation in query complexity between quantum and classical computation (Aaronson, 2014). Recently, Jager and Krems (2022) showed that variational quantum classifiers and quantum kernels have the theoretical expressiveness to solve the k-Forrelation problem (which is known to be PromiseBQP-complete). However, their proof used the exact quantum circuit that defines the k-Forrelation problem as the feature map for both methods.
The question remains to find a quantum classifier architecture capable of learning the k-Forrelation problem without being given the characteristic feature map (the circuit in Fig. 1). This effort requires the generation of k-Forrelation datasets at various scales (length of input bitstring $n$ and number of functions $k$) and at different positive-class thresholds.
Conditions on the functions to be sampled:
- Each function is Boolean in the general function space $f(x): \{0,1\}^n \mapsto \{+1,-1\}$
- Each function is of the form $f(x) = (-1)^{C_i(x)}$, where $C_i(x)$ is a product of at most 3 input bits, or chosen to be constant $f(x)=1$
- The number of functions in each ensemble is at least three ($k \ge 3$)
The space of all Boolean functions is called the general space. Boolean functions that satisfy the above conditions are said to be in the policy space.
Structure of the datasets:

- Each problem instance is characterized by three parameters:
  - Length of binary input $n$
  - Number of Boolean functions $k$
  - Threshold of positive class $\mu$ (the value of $\Phi$ above which an example is attributed the positive class)
- Each example is one-hot encoded with length $n \times k$
- Condition on the number of functions: $k \le \text{poly}(n)$
With functions sampled uniformly at random, $\mathbb{E}[\Phi] = 0$. It can be proved that the standard deviation of $\Phi$ decays exponentially in $n$, so examples with $\Phi$ far from 0 (the positive class) become exponentially rare as $n$ grows.

Figure 2. Distributions of $\Phi$ under random sampling in the general space.

A similar trend can be seen for the policy space.

Figure 3. Distributions of $\Phi$ under random sampling in the policy space.
Algorithm:

- Randomly sample $k$ Boolean functions from the space to make a k-Forrelation example
- Calculate the value of $\Phi$
- Keep or discard the example depending on the desired class to be generated

Pros:

- Creates the most general datasets, since the sampling algorithm enforces no correlation between the $k$ sampled functions

Cons:

- Cannot generate the positive class at higher $n$ or higher positive thresholds
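The steps above amount to rejection sampling. A Python sketch for the general space (illustrative only; the project's MATLAB implementation is described later, and all names and parameters here are mine):

```python
import random

def wht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^n vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def phi(funcs, n):
    """k-fold Forrelation of value tables funcs (each of length 2^n, entries +/-1)."""
    w = list(funcs[0])
    for f in funcs[1:]:
        w = [fy * wy for fy, wy in zip(f, wht(w))]
    return sum(w) / 2 ** ((len(funcs) + 1) * n / 2)

def random_sample_example(n, k, mu, desired=+1, max_tries=100000):
    """Rejection-sample one k-Forrelation example of the desired class."""
    for _ in range(max_tries):
        # random +/-1 value tables: uniform sampling in the general space
        funcs = [[random.choice((1, -1)) for _ in range(2 ** n)] for _ in range(k)]
        value = phi(funcs, n)
        if desired == +1 and value >= mu:
            return funcs, value
        if desired == -1 and abs(value) <= 0.01:
            return funcs, value
    raise RuntimeError("no example found; the positive class gets rare as n grows")
```

At small $n$ the loop returns quickly; the exponential rarity of the positive class is exactly why this approach breaks down at larger $n$.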
Aaronson (2014) introduces an algorithm to generate the positive class for the case of 2-Forrelation. To generate two Boolean functions that are highly likely to be forrelated:

- Generate a random vector $\boldsymbol{v} \in \mathbb{R}^{2^n}$ with each entry sampled from the normal distribution $\mathcal{N}(0,1)$
- Set $f_1 = \text{sign}(\boldsymbol{v})$ and $f_2 = \text{sign}(\hat{\boldsymbol{v}})$, where $\hat{\boldsymbol{v}}$ is the discrete Fourier transform of $\boldsymbol{v}$
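This two-step construction fits in a few lines. The sketch below is illustrative Python (not the project's MATLAB code); the Walsh-Hadamard transform plays the role of the discrete Fourier transform on the Boolean cube, and its normalization is irrelevant here because only the signs of the entries are used:

```python
import random

def wht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^n vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def sign(v):
    """Elementwise sign, mapping each entry to +1 or -1."""
    return [1 if x >= 0 else -1 for x in v]

def forrelated_pair(n):
    """Sample (f1, f2) that are highly likely to be forrelated."""
    v = [random.gauss(0, 1) for _ in range(2 ** n)]
    return sign(v), sign(wht(v))
```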
Since 2-Forrelation is not a sub-problem of k-Forrelation, it is not trivial to find a generation algorithm that would work for k-Forrelation ensembles with an arbitrary number of functions ($k \ge 3$).
Here I propose a generalization of the sampling strategy to any arbitrary number of functions. This is done by leveraging the recursive, linearized form of $\Phi$: the idea is to view the k-fold Forrelation as a 2-fold Forrelation between the newest function $f_k$ and a vector accumulated from the previously sampled functions.
FOURIER GENERATOR (k):
if k == 2:
    choose the function space (general or policy)
    generate random vector $\boldsymbol{v} \in \mathbb{R}^{2^n}$ with entries sampled from $\mathcal{N}(0,1)$
    assign sign of $\boldsymbol{v}$ to $f_1$ and sign of $\hat{\boldsymbol{v}}$ to $f_2$
    return [$f_1$, $f_2$]
else:
    previous functions, $f_1, \dots, f_{k-1}$ ← FOURIER GENERATOR (k - 1)
    calculate $\boldsymbol{w}$, the vector accumulated from the previous functions under the linearized form of $\Phi$
    generate random vector $\boldsymbol{u}$ with entries sampled from $\mathcal{N}(0,1)$ as a perturbation
    assign sign of $\hat{\boldsymbol{w}} + \boldsymbol{u}$ to $f_k$
    return [previous functions, $f_k$]
end if
Figure 4. Distributions of $\Phi$ under Fourier sampling in the general space.

Figure 5. Distributions of $\Phi$ under Fourier sampling in the policy space.
In the general space, Fourier sampling increases $\mathbb{E}[\Phi]$ to around 0.5. In the policy space, Fourier sampling increases $\mathbb{E}[\Phi]$ to around 0.25, which will still approach 0 if $n$ grows further. That said, for intermediate values of $n$, Fourier sampling allows a significant improvement over random sampling in generating the positive class when the threshold is far from 0.
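The generalized Fourier sampling can be sketched in Python under my reading of the recursion above; the accumulated vector, the perturbation scale, and all names are assumptions for illustration, not the project's MATLAB implementation:

```python
import random

def wht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^n vector."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def sign(v):
    return [1 if x >= 0 else -1 for x in v]

def phi(funcs, n):
    """k-fold Forrelation of value tables funcs (each of length 2^n, entries +/-1)."""
    w = list(funcs[0])
    for f in funcs[1:]:
        w = [fy * wy for fy, wy in zip(f, wht(w))]
    return sum(w) / 2 ** ((len(funcs) + 1) * n / 2)

def fourier_generator(n, k, noise=0.5):
    """Sample k functions intended to have a large k-fold Forrelation."""
    v = [random.gauss(0, 1) for _ in range(2 ** n)]
    funcs = [sign(v)]            # f1 as in the 2-fold base construction
    w = list(funcs[0])           # accumulated vector of the linearized form
    for _ in range(k - 1):
        w_hat = wht(w)
        u = [random.gauss(0, noise) for _ in range(2 ** n)]  # assumed perturbation
        f = sign([a + b for a, b in zip(w_hat, u)])
        funcs.append(f)
        w = [fi * wi for fi, wi in zip(f, w_hat)]  # fold f into the accumulation
    return funcs
```

Averaging `phi(fourier_generator(4, 3), 4)` over many draws gives values well above those from uniform random sampling, which is the point of the construction.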
A characterization of the generated datasets using classical SVM with Bayesian-optimized kernels can be found in the Full Written Report.

Here is a summary of the guidelines for generating k-Forrelation datasets:
- Choose $n$: Go as high as desirable, or as computationally feasible, depending on the purpose. If $N$ is the number of qubits used in a quantum classifier, then it is suggested that $n \le N$
- Choose $k$: Go as high as possible according to the chosen $n$, with the restriction $k = \text{poly}(n)$. From our experiments, $k \approx n^2$ is a good heuristic
- Choose the positive threshold: Near 0, but not overlapping with the negative class (defined to be instances with $|\Phi| \le 0.01$), e.g., around $0.02$
- Choose the generative algorithm: Random Sampling for $n < 10$. Larger $n$ requires Fourier Sampling, but the resulting dataset is expected to be relatively easier than a randomly sampled one
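The thresholding in the third guideline can be stated as a tiny labeling rule. A sketch in Python (the default `mu` and the 0.01 negative cut-off follow the guidelines above; the function name is mine):

```python
def label(phi_value, mu=0.02):
    """Class label under the k-Forrelation promise; None = outside the promise."""
    if phi_value >= mu:
        return +1        # positive class: Phi at or above the positive threshold
    if abs(phi_value) <= 0.01:
        return -1        # negative class: |Phi| within the negative band
    return None          # discard: the example violates the promise gap
```

Examples falling in the gap between the two thresholds are discarded, which is what keeps the two classes well separated.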
GENERATING WITH MATLAB
The k-Forrelation datasets can be generated using the MATLAB class KForrClass.m.

To generate a dataset using Random Sampling, use the getDatasetRandomSamp function. E.g., to randomly sample from the general space a k-Forrelation set with:

- $n=5$
- $k=21$
- Negative Threshold = 0.01
- Positive Threshold = 0.50
- Class: Positive
- Number of samples: 10000
- Weights of functions: Uniform

Use getDatasetRandomSamp("general", 5, 21, 0.5, 0.01, +1, 10000, [])
To generate a dataset using Fourier Sampling, use the getDatasetFourier function. E.g., to Fourier sample from the policy space a k-Forrelation set with:

- $n=7$
- $k=30$
- Negative Threshold = 0.01
- Positive Threshold = 0.30
- Class: Negative
- Number of samples: 1000

Use getDatasetFourier("fourier", 7, 30, 0.01, 0.30, -1, 1000)
The function will return the encoded dataset, the raw dataset (with functions as value tables), and the frequency of sampling the class according to the specified positive and negative class policy.