From 4208fcf2934a7adc3827b9349d72c6560056c57a Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 10 Nov 2019 15:25:58 -0500 Subject: [PATCH 01/82] Initial cleanup of README docs & instructions --- README.md | 92 +++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 59 insertions(+), 33 deletions(-) diff --git a/README.md b/README.md index 124c0c6309..733d75b82f 100644 --- a/README.md +++ b/README.md @@ -1,71 +1,97 @@ -## EVA (Exploratory Video Analytics) +# EVA (Exploratory Video Analytics) [![Build Status](https://travis-ci.org/georgia-tech-db/Eva.svg?branch=master)](https://travis-ci.com/georgia-tech-db/Eva) [![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/Eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/Eva?branch=master) -### Table of Contents -* Installation -* Demos -* Eva core -* Eva storage -* Dataset - - -### Installation -* Clone the repo -* Create a virtual environment with conda (explained in detail in the next subsection) -* Run following command to configure git hooks + +EVA is an end-to-end video analytics engine that allows users to query a database of videos and return results based on machine learning analysis. + +## Table of Contents +* [Installation](#installation) +* [Demos](#demos) +* [Unit Tests](#unit-tests) +* [Eva Core](#eva-core) +* [Eva Storage](#eva-storage) +* [Dataset](#dataset) + + +## Installation +Installation of EVA involves setting a virtual environment using conda and configuring git hooks. + +1. Clone the repo ```shell -git config core.hooksPath .githooks +git clone https://github.com/georgia-tech-db/Eva.git +``` + +2. Install [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) and update path. +```shell +export PATH=~/anaconda3/bin:$PATH ``` +3. Install dependencies in a virtual environment. Dependencies should install with no errors on Ubuntu 16.04 but there are known installation issues with MacOS. +```shell +cd Eva/ +conda env create -f environment.yml +``` -##### How to create the virtual environment -* Install conda - we have prepared a yaml file that you can directly use with conda to install a virtual environment -* Navigate to the eva repository in your local computer -* conda env create -f environment.yml -* Note, this yaml file should install and all code should run with no errors in Ubuntu 16.04. - However, there are know installation issues with MacOS. +4. Run following command to configure git hooks. +```shell +git config core.hooksPath .githooks +``` -### Demos -We have demos for the following components: -1. Eva analytics (pipeline for loading the dataset, training the filters, and outputting the optimal plan) +## Demos +The following components have demos: + +1. EVA Analytics: A pipeline for loading a dataset, training filters, and outputting the optimal plan. ```commandline cd + source activate eva_35 python pipeline.py ``` -2. Eva Query Optimizer (Will show converted queries for the original queries) +2. EVA Query Optimizer: The optimizer shows converted queries + (Will show converted queries for the original queries) ```commandline cd + source activate eva_35 python query_optimizer/query_optimizer.py ``` 3. Eva Loader (Loads UA-DETRAC dataset) ```commandline cd + source activate eva_35 python loaders/load.py ``` -NEW!!! There are new versions of the loaders and filters. +4. NEW!!! There are new versions of the loaders and filters. 
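+   All demo commands assume the repository root as the working directory and
+   the `eva_35` environment from `environment.yml`. A quick sanity check
+   before running them (a sketch, assuming a standard conda install):
+```shell
+conda env list | grep eva_35   # the environment should be listed
+source activate eva_35
+```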
```commandline cd + source activate eva_35 python loaders/uadetrac_loader.py python filters/minimum_filter.py ``` -2. EVA storage-system (Video compression and indexing system - *currently in progress*) +5. EVA storage-system (Video compression and indexing system - *currently in progress*) + +## Unit Tests +To run unit tests on the system, the following commands can be run: + +```shell + pycodestyle --select E test src/loaders + pytest test/ --cov-report= --cov=./ -s -v +``` -### Eva Core +## Eva Core Eva core is consisted of * Query Optimizer * Filters * UDFs * Loaders -##### Query Optimizer +#### Query Optimizer The query optimizer converts a given query to the optimal form. All code related to this module is in */query_optimizer* -##### Filters +#### Filters The filters does preliminary filtering to video frames using cheap machine learning models. The filters module also outputs statistics such as reduction rate and cost that is used by Query Optimizer module. @@ -80,22 +106,22 @@ The filters below are running: All code related to this module is in */filters* -##### UDFs +#### UDFs This module contains all imported deep learning models. Currently, there is no code that performs this task. It is a work in progress. Information of current work is explained in detail [here](src/udfs/README.md). All related code should be inside */udfs* -##### Loaders +#### Loaders The loaders load the dataset with the attributes specified in the *Accelerating Machine Learning Inference with Probabilistic Predicates* by Yao et al. All code related to this module is in */loaders* -### Eva storage +## Eva Storage Currently a work in progress. Come check back later! -### Dataset +## Dataset __[Dataset info](data/README.md)__ explains detailed information about the datasets From bb9834a2cf1d37a27509079c810e7bdacb49022f Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 10 Nov 2019 17:07:56 -0500 Subject: [PATCH 02/82] Updated environment to make unit tests pass --- README.md | 9 +++++---- environment.yml | 5 ++++- src/query_optimizer/query_optimizer.py | 4 ---- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 733d75b82f..bd7cac3813 100644 --- a/README.md +++ b/README.md @@ -44,27 +44,27 @@ The following components have demos: 1. EVA Analytics: A pipeline for loading a dataset, training filters, and outputting the optimal plan. ```commandline cd - source activate eva_35 + conda activate eva_35 python pipeline.py ``` 2. EVA Query Optimizer: The optimizer shows converted queries (Will show converted queries for the original queries) ```commandline cd - source activate eva_35 + conda activate eva_35 python query_optimizer/query_optimizer.py ``` 3. Eva Loader (Loads UA-DETRAC dataset) ```commandline cd - source activate eva_35 + conda activate eva_35 python loaders/load.py ``` 4. NEW!!! There are new versions of the loaders and filters. 
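+   The demo commands below switch to `conda activate`, which requires
+   conda >= 4.4 and a configured shell. If activation fails with a
+   shell-setup error, one possible fix (a sketch, assuming a bash shell):
+```shell
+conda init bash        # one-time shell configuration, conda >= 4.6
+conda activate eva_35  # or fall back to the older: source activate eva_35
+```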
```commandline cd - source activate eva_35 + conda activate eva_35 python loaders/uadetrac_loader.py python filters/minimum_filter.py ``` @@ -75,6 +75,7 @@ The following components have demos: To run unit tests on the system, the following commands can be run: ```shell + conda activate eva_35 pycodestyle --select E test src/loaders pytest test/ --cov-report= --cov=./ -s -v ``` diff --git a/environment.yml b/environment.yml index 1768328e19..0a4fd05781 100644 --- a/environment.yml +++ b/environment.yml @@ -3,6 +3,7 @@ channels: - conda-forge - anaconda - defaults + - pytorch dependencies: - _tflow_1100_select=0.0.1=gpu - easydict=1.9=py_0 @@ -88,6 +89,7 @@ dependencies: - python=3.6.8=h0371630_0 - python-dateutil=2.8.0=py36_0 - pytorch=1.0.1=cuda92py36h65efead_0 + - pytorch-cpu - pytz=2018.9=py36_0 - pyyaml=5.1=py36h7b6447c_0 - qt=5.9.7=h5867ecd_1 @@ -113,7 +115,8 @@ dependencies: - zlib=1.2.11=h7b6447c_3 - zstd=1.3.7=h0b5b093_0 - pycodestyle + - pytest-cov - pip: - torch==1.0.1 -prefix: /nethome/jbang36/anaconda3/envs/eva_35 +prefix: eva_35 diff --git a/src/query_optimizer/query_optimizer.py b/src/query_optimizer/query_optimizer.py index 814d843a2f..ac92ad60b8 100644 --- a/src/query_optimizer/query_optimizer.py +++ b/src/query_optimizer/query_optimizer.py @@ -21,12 +21,8 @@ from itertools import product import numpy as np - from src import constants -eva_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) -sys.path.append(eva_dir) - class QueryOptimizer: """ From af88467225888a72a040f0c9b04c568bce4813c2 Mon Sep 17 00:00:00 2001 From: Asra Yousuf Date: Sun, 24 Nov 2019 22:09:31 -0500 Subject: [PATCH 03/82] Added aggregation --- src/expression/abstract_expression.py | 11 ++++- src/expression/aggregation_expression.py | 28 +++++++++++ test/expression/test_aggregation.py | 63 ++++++++++++++++++++++++ test/expression/test_logical.py | 2 +- 4 files changed, 101 insertions(+), 3 deletions(-) create mode 100644 src/expression/aggregation_expression.py create mode 100644 test/expression/test_aggregation.py diff --git a/src/expression/abstract_expression.py b/src/expression/abstract_expression.py index 0057e5981d..a7fec8a42f 100644 --- a/src/expression/abstract_expression.py +++ b/src/expression/abstract_expression.py @@ -24,9 +24,16 @@ class ExpressionType(IntEnum): ARITHMETIC_ADD = 12, ARITHMETIC_SUBTRACT = 13, ARITHMETIC_MULTIPLY = 14, - ARITHMETIC_DIVIDE = 15 + ARITHMETIC_DIVIDE = 15, + + FUNCTION_EXPRESSION = 16, + + AGGREGATION_COUNT = 17, + AGGREGATION_SUM = 18, + AGGREGATION_MIN = 19, + AGGREGATION_MAX = 20, + AGGREGATION_AVG = 21, - FUNCTION_EXPRESSION = 16 # add other types diff --git a/src/expression/aggregation_expression.py b/src/expression/aggregation_expression.py new file mode 100644 index 0000000000..9d2f2dff46 --- /dev/null +++ b/src/expression/aggregation_expression.py @@ -0,0 +1,28 @@ +from src.expression.abstract_expression import AbstractExpression, \ + ExpressionType, \ + ExpressionReturnType +import statistics + +class AggregationExpression(AbstractExpression): + def __init__(self, exp_type: ExpressionType, left: AbstractExpression, + right: AbstractExpression): + children = [] + if left is not None: + children.append(left) + if right is not None: + children.append(right) + super().__init__(exp_type, rtype=ExpressionReturnType.INTEGER, ## can also be a float + children=children) + + def evaluate(self, *args): + values = self.get_child(0).evaluate(*args) + if self.etype == ExpressionType.AGGREGATION_SUM: + return sum(values) + elif self.etype == 
ExpressionType.AGGREGATION_COUNT: + return len(values) + elif self.etype == ExpressionType.AGGREGATION_AVG: + return statistics.mean(values) + elif self.etype == ExpressionType.AGGREGATION_MIN: + return min(values) + elif self.etype == ExpressionType.AGGREGATION_MAX: + return max(values) diff --git a/test/expression/test_aggregation.py b/test/expression/test_aggregation.py new file mode 100644 index 0000000000..20c696f1fc --- /dev/null +++ b/test/expression/test_aggregation.py @@ -0,0 +1,63 @@ +import unittest + +from src.expression.abstract_expression import ExpressionType +from src.expression.comparison_expression import ComparisonExpression +from src.expression.aggregation_expression import AggregationExpression +from src.expression.constant_value_expression import ConstantValueExpression +from src.expression.tuple_value_expression import TupleValueExpression + + +class LogicalExpressionsTest(unittest.TestCase): + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def test_aggregation_sum(self): + columnName = TupleValueExpression(0) + aggr_expr = AggregationExpression( + ExpressionType.AGGREGATION_SUM, + None, + columnName + ) + tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]] + self.assertEqual(6, aggr_expr.evaluate(tuple1, None)) + + def test_aggregation_count(self): + columnName = TupleValueExpression(0) + aggr_expr = AggregationExpression( + ExpressionType.AGGREGATION_COUNT, + None, + columnName + ) + tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]] + self.assertEqual(3, aggr_expr.evaluate(tuple1, None)) + + def test_aggregation_avg(self): + columnName = TupleValueExpression(0) + aggr_expr = AggregationExpression( + ExpressionType.AGGREGATION_AVG, + None, + columnName + ) + tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]] + self.assertEqual(2, aggr_expr.evaluate(tuple1, None)) + + def test_aggregation_min(self): + columnName = TupleValueExpression(0) + aggr_expr = AggregationExpression( + ExpressionType.AGGREGATION_MIN, + None, + columnName + ) + tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]] + self.assertEqual(1, aggr_expr.evaluate(tuple1, None)) + + def test_aggregation_max(self): + columnName = TupleValueExpression(0) + aggr_expr = AggregationExpression( + ExpressionType.AGGREGATION_MAX, + None, + columnName + ) + tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]] + self.assertEqual(3, aggr_expr.evaluate(tuple1, None)) \ No newline at end of file diff --git a/test/expression/test_logical.py b/test/expression/test_logical.py index 199795155b..780155c441 100644 --- a/test/expression/test_logical.py +++ b/test/expression/test_logical.py @@ -36,7 +36,7 @@ def test_logical_and(self): tuple1 = [[1], [2], 3] self.assertEqual([True], logical_expr.evaluate(tuple1, None)) - def test_comparison_compare_greater(self): + def test_logical_or(self): tpl_exp = TupleValueExpression(0) const_exp = ConstantValueExpression(1) From 8cc03c6ceba1f8ab621c7651490cbf7b74ad53e6 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 1 Dec 2019 14:09:10 -0500 Subject: [PATCH 04/82] Initial commit of new UDF and its unit test --- src/udfs/video_action_classification.py | 160 ++++++++++++++++++++++ test/udfs/vid_to_frame_classifier_test.py | 8 ++ 2 files changed, 168 insertions(+) create mode 100644 src/udfs/video_action_classification.py create mode 100644 test/udfs/vid_to_frame_classifier_test.py diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py new file mode 100644 index 0000000000..129f414b94 --- /dev/null +++ b/src/udfs/video_action_classification.py @@ -0,0 
+1,160 @@ +from src.models import FrameBatch, Prediction, FrameInfo, Point, BoundingBox, ColorSpace, VideoMetaInfo, VideoFormat +from src.loaders.video_loader import SimpleVideoLoader +from src.udfs.abstract_udfs import AbstractClassifierUDF + +#from keras.models import Sequential +#from keras.layers import Dense, Conv2D, Flatten + +from tensorflow.python.keras.models import Sequential +from tensorflow.python.keras.layers import Dense, Conv2D, Flatten + +from typing import List, Tuple +from glob import glob +import numpy as np +import random +import os + +class VideoToFrameClassifier(AbstractClassifierUDF): + + def __init__(self): + # Build the model + self.model = self.buildModel() + + # Get dataset directory and stored data + dataset = "./data/hmdb/" + videoMetaList, labelList, self.inverseLabelMap = self.findDataNames(dataset) + + # Train the model using shuffled data + self.trainModel(self.model, videoMetaList, labelList, 10) + + def findDataNames(self, searchDir): + """ + findDataNames enumerates all training data for the model and + returns a list of tuples where the first element is a EVA VideoMetaInfo + object and the second is a string label of the correct video classification + + Inputs: + - searchDir = path to the directory containing the video data + + Outputs: + - videoFileNameList = list of tuples where each tuple corresponds to a video + in the data set. The tuple contains the path to the video, + its label, and a nest tuple containing the shape + - labelList = a list of labels that correspond to the labels in labelMap + - inverseLabelMap = an inverse mapping between the string representation of the label + name and an integer representation of that label + + """ + + # Find all video files and corresponding labels in search directory + videoFileNameList = glob(searchDir+"**/*.avi", recursive=True) + random.shuffle(videoFileNameList) + + labels = [os.path.split(os.path.dirname(a))[1] for a in videoFileNameList] + + videoMetaList = [VideoMetaInfo(f,30,VideoFormat.AVI) for f in videoFileNameList] + inverseLabelMap = {k:v for (k,v) in enumerate(list(set(labels)))} + + labelMap = {v:k for (k,v) in enumerate(list(set(labels)))} + labelList = [labelMap[l] for l in labels] + + return (videoMetaList, labelList, inverseLabelMap) + + def trainModel(self, model, videoMetaList, labelList, n = 10): + """ + trainModel trains the built model using chunks of data of size n videos + + Inputs: + - model = model object to be trained + - videoMetaList = list of tuples where the first element is a EVA VideoMetaInfo + object and the second is a string label of the + correct video classification + - labelList = list of labels derived from the labelMap + - n = integer value for how many videos to act on at a time + """ + + labelArray = np.array(labelList) + + for i,videoInfo in enumerate(videoMetaList): + # Load the video from disk into memory + videoLoader = SimpleVideoLoader(videoInfo) + batches = videoLoader.load() + + for b in batches: + # Get the frames as a numpy array + frames = b.frames_as_numpy_array() + + # Skip unsupported frame sizes + if frames.shape[1:] != (240, 320, 3): + break + + labels = np.zeros((frames.shape[0],51)) + labels[:,labelList[i]] = 1 + + print(frames.shape) + print(labels.shape) + + # Split x and y into training and validation sets + xTrain = frames[0:int(0.8*frames.shape[0])] + yTrain = labels[0:int(0.8*labels.shape[0])] + + xTest = frames[int(0.8*frames.shape[0]):] + yTest = labels[int(0.8*labels.shape[0]):] + + print(xTrain) + print(yTrain) + + # Train the model 
using cross-validation (so we don't need to explicitly do CV outside of training) + model.fit(xTrain, yTrain, validation_data = (xTest, yTest), epochs = 2) + + def buildModel(self): + """ + buildModel sets up a convolutional 2D network using a reLu activation function + + Outputs: + - model = model object to be used later for training and classification + """ + # We need to incrementally train the model so we'll set it up before preparing the data + model = Sequential() + + # Add layers to the model + model.add(Conv2D(64, kernel_size = 3, activation = "relu", input_shape=(240, 320, 3))) + model.add(Conv2D(32, kernel_size = 3, activation = "relu")) + model.add(Flatten()) + model.add(Dense(51, activation = "softmax")) + + # Compile model and use accuracy to measure performance + model.compile(optimizer = "adam", loss = "categorical_crossentropy", metrics = ["accuracy"]) + + return model + + def input_format(self) -> FrameInfo: + return FrameInfo(240, 320, 3, ColorSpace.RGB) + + @property + def name(self) -> str: + return "Paula_Test_Funk" + + def labels(self) -> List[str]: + return [ + 'brush_hair', 'clap', 'draw_sword', 'fall_floor', 'handstand', 'kick', 'pick', 'push', 'run', + 'shoot_gun', 'smoke', 'sword', 'turn', 'cartwheel', 'climb', 'dribble', 'fencing', 'hit', + 'kick_ball', 'pour', 'pushup', 'shake_hands', 'sit', 'somersault', 'sword_exercise', 'walk', 'catch', + 'climb_stairs', 'drink', 'flic_flac', 'hug', 'kiss', 'pullup', 'ride_bike', 'shoot_ball', 'situp', + 'stand', 'talk', 'wave', 'chew', 'dive', 'eat', 'golf', 'jump', 'laugh', 'punch', 'ride_horse', + 'shoot_bow', 'smile', 'swing_baseball', 'throw', + ] + + def classify(self, batch: FrameBatch) -> List[Prediction]: + """ + Takes as input a batch of frames and returns the predictions by applying the classification model. 
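+        Note: assumes the Keras model built and trained in __init__ is used
+        for inference; `model` below refers to that classifier (self.model).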
+ + Arguments: + batch (FrameBatch): Input batch of frames on which prediction needs to be made + + Returns: + List[Prediction]: The predictions made by the classifier + """ + + pred = model.predict(batch.frames_as_numpy_array()) + return [self.inverseLabelMap[l] for l in pred] diff --git a/test/udfs/vid_to_frame_classifier_test.py b/test/udfs/vid_to_frame_classifier_test.py new file mode 100644 index 0000000000..6733ebff77 --- /dev/null +++ b/test/udfs/vid_to_frame_classifier_test.py @@ -0,0 +1,8 @@ +from src.udfs import video_action_classification + +def test_VidToFrameClassifier(): + model = video_action_classification.VideoToFrameClassifier() + assert model != None + + + From 9c2ee755b04bcea0629bbeb819ae53aeebc695a8 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 1 Dec 2019 20:45:19 -0500 Subject: [PATCH 05/82] Added unit test for research filter and fixed paths --- src/filters/models/ml_dnn.py | 2 +- src/filters/models/ml_randomforest.py | 2 +- src/filters/models/ml_svm.py | 2 +- src/filters/research_filter.py | 39 +++------------------------ test/filters/__init__.py | 0 test/filters/test_research_filter.py | 25 +++++++++++++++++ 6 files changed, 32 insertions(+), 38 deletions(-) create mode 100644 test/filters/__init__.py create mode 100644 test/filters/test_research_filter.py diff --git a/src/filters/models/ml_dnn.py b/src/filters/models/ml_dnn.py index 11af3f1376..0d1eb41458 100644 --- a/src/filters/models/ml_dnn.py +++ b/src/filters/models/ml_dnn.py @@ -7,7 +7,7 @@ import numpy as np import time from sklearn.neural_network import MLPClassifier -from filters.models.ml_base import MLBase +from src.filters.models.ml_base import MLBase class MLMLP(MLBase): diff --git a/src/filters/models/ml_randomforest.py b/src/filters/models/ml_randomforest.py index fa19c4e62e..3dc7e0456e 100644 --- a/src/filters/models/ml_randomforest.py +++ b/src/filters/models/ml_randomforest.py @@ -7,7 +7,7 @@ import numpy as np import time from sklearn.ensemble import RandomForestClassifier -from filters.models.ml_base import MLBase +from src.filters.models.ml_base import MLBase class MLRandomForest(MLBase): def __init__(self, **kwargs): diff --git a/src/filters/models/ml_svm.py b/src/filters/models/ml_svm.py index 44da91e9a2..5a5df79317 100644 --- a/src/filters/models/ml_svm.py +++ b/src/filters/models/ml_svm.py @@ -8,7 +8,7 @@ import numpy as np import time from sklearn.svm import LinearSVC -from filters.models.ml_base import MLBase +from src.filters.models.ml_base import MLBase class MLSVM(MLBase): diff --git a/src/filters/research_filter.py b/src/filters/research_filter.py index f5c7ade29c..ee8e68977c 100644 --- a/src/filters/research_filter.py +++ b/src/filters/research_filter.py @@ -9,10 +9,10 @@ import pandas as pd from copy import deepcopy -from src.filters import FilterTemplate +from src.filters.abstract_filter import FilterTemplate from src.filters.models.ml_randomforest import MLRandomForest -from src.filters import MLSVM -from src.filters import MLMLP +from src.filters.models.ml_svm import MLSVM +from src.filters.models.ml_dnn import MLMLP # Meant to be a black box for trying all models available and returning statistics and model for @@ -206,35 +206,4 @@ def getAllStats(self): # Create DataFrame df = pd.DataFrame(data) - return df - - -if __name__ == "__main__": - filter = FilterResearch() - - X = np.random.random([100, 30, 30, 3]) - y = np.random.random([100]) - y *= 10 - y = y.astype(np.int32) - - division = int(X.shape[0] * 0.8) - X_train = X[:division] - X_test = X[division:] - 
y_iscar_train = y[:division] - y_iscar_test = y[division:] - - filter.train(X_train, y_iscar_train) - print("filter finished training!") - y_iscar_hat = filter.predict(X_test, post_model_name='rf') - print("filter finished prediction!") - stats = filter.getAllStats() - print(stats) - print("filter got all stats") - - - - - - - - + return df \ No newline at end of file diff --git a/test/filters/__init__.py b/test/filters/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/test/filters/test_research_filter.py b/test/filters/test_research_filter.py new file mode 100644 index 0000000000..5d0d486964 --- /dev/null +++ b/test/filters/test_research_filter.py @@ -0,0 +1,25 @@ +from src.filters.research_filter import FilterResearch +import numpy as np + +def test_FilterResearch(): + # Construct the filter research and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + filter = FilterResearch() + + # Set up the randomized input for testing + X = np.random.random([100, 30, 30, 3]) + y = np.random.random([100]) + y *= 10 + y = y.astype(np.int32) + + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] + + filter.train(X_train, y_iscar_train) + y_iscar_hat = filter.predict(X_test, post_model_name='rf') + stats = filter.getAllStats() \ No newline at end of file From 23c8eaa2a8575a8a2a29efbc539e182d04d7a603 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 1 Dec 2019 21:19:06 -0500 Subject: [PATCH 06/82] Added unit test for minimum filter, adjusted some paths --- src/filters/minimum_filter.py | 27 +-------------------------- src/filters/research_filter.py | 1 + test/filters/test_minimum_filter.py | 25 +++++++++++++++++++++++++ 3 files changed, 27 insertions(+), 26 deletions(-) create mode 100644 test/filters/test_minimum_filter.py diff --git a/src/filters/minimum_filter.py b/src/filters/minimum_filter.py index 27187d89b9..8a24092fa7 100644 --- a/src/filters/minimum_filter.py +++ b/src/filters/minimum_filter.py @@ -8,7 +8,7 @@ import pandas as pd from copy import deepcopy -from src.filters import FilterTemplate +from src.filters.abstract_filter import FilterTemplate from src.filters.models.ml_randomforest import MLRandomForest @@ -216,31 +216,6 @@ def getAllStats(self): return df -if __name__ == "__main__": - - - filter = FilterMinimum() - - X = np.random.random([100,30,30,3]) - y = np.random.random([100]) - y *= 10 - y = y.astype(np.int32) - - division = int(X.shape[0] * 0.8) - X_train = X[:division] - X_test = X[division:] - y_iscar_train = y[:division] - y_iscar_test = y[division:] - - filter.train(X_train, y_iscar_train) - print("filter finished training!") - y_iscar_hat = filter.predict(X_test, post_model_name='rf') - print("filter finished prediction!") - stats = filter.getAllStats() - print(stats) - print("filter got all stats") - - """ from loaders.loader_uadetrac import LoaderUADetrac diff --git a/src/filters/research_filter.py b/src/filters/research_filter.py index ee8e68977c..02a217de4d 100644 --- a/src/filters/research_filter.py +++ b/src/filters/research_filter.py @@ -15,6 +15,7 @@ from src.filters.models.ml_dnn import MLMLP + # Meant to be a black box for trying all models available and returning statistics and model for # the query optimizer to choose for a given query diff --git a/test/filters/test_minimum_filter.py 
b/test/filters/test_minimum_filter.py new file mode 100644 index 0000000000..e549f4f6ad --- /dev/null +++ b/test/filters/test_minimum_filter.py @@ -0,0 +1,25 @@ +from src.filters.minimum_filter import FilterMinimum +import numpy as np + +def test_FilterMinimum(): + # Construct the filter minimum and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + filter = FilterMinimum() + + # Set up the randomized input for testing + X = np.random.random([100,30,30,3]) + y = np.random.random([100]) + y *= 10 + y = y.astype(np.int32) + + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] + + filter.train(X_train, y_iscar_train) + y_iscar_hat = filter.predict(X_test, post_model_name='rf') + stats = filter.getAllStats() \ No newline at end of file From e6c71e4f11d0d41fe315b372193ab02619adf0a4 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 1 Dec 2019 21:32:50 -0500 Subject: [PATCH 07/82] Consolidated tests into single folder, removed old file structure --- src/filters/tests/__init__.py | 0 {src/filters/tests => test/filters}/filter_test_pytest.py | 0 2 files changed, 0 insertions(+), 0 deletions(-) delete mode 100644 src/filters/tests/__init__.py rename {src/filters/tests => test/filters}/filter_test_pytest.py (100%) diff --git a/src/filters/tests/__init__.py b/src/filters/tests/__init__.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/src/filters/tests/filter_test_pytest.py b/test/filters/filter_test_pytest.py similarity index 100% rename from src/filters/tests/filter_test_pytest.py rename to test/filters/filter_test_pytest.py From db8b6d90060d3f90eca8b1e45b7927ae0e55c966 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 1 Dec 2019 22:02:37 -0500 Subject: [PATCH 08/82] Added unit test for kernel density wrapper --- src/filters/kdewrapper.py | 2 +- test/filters/test_kdewrapper.py | 24 ++++++++++++++++++++++++ 2 files changed, 25 insertions(+), 1 deletion(-) create mode 100644 test/filters/test_kdewrapper.py diff --git a/src/filters/kdewrapper.py b/src/filters/kdewrapper.py index 42892d875c..1648f7a4ab 100644 --- a/src/filters/kdewrapper.py +++ b/src/filters/kdewrapper.py @@ -12,7 +12,7 @@ class KernelDensityWrapper: #need .fit function #need .predict function - def __init__(self, kernel='guassian', bandwidth=0.2): + def __init__(self, kernel='gaussian', bandwidth=0.2): self.kernels = [] #assume everything is one shot self.kernel = kernel self.bandwidth = bandwidth diff --git a/test/filters/test_kdewrapper.py b/test/filters/test_kdewrapper.py new file mode 100644 index 0000000000..a59f491109 --- /dev/null +++ b/test/filters/test_kdewrapper.py @@ -0,0 +1,24 @@ +from src.filters.kdewrapper import KernelDensityWrapper +import numpy as np + +def test_KD_Wrapper(): + # Construct the filter research and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + wrapper = KernelDensityWrapper() + + # Set up the randomized input for testing + X = np.random.random([100, 30]) + y = np.random.randint(2, size = 100) + y = y.astype(np.int32) + + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] + + wrapper.fit(X_train, 
y_iscar_train) + y_iscar_hat = wrapper.predict(X_test) + #scores = wrapper.getAllStats() \ No newline at end of file From f055a1e99f2feae64715194a009518cfeaa5a30c Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Mon, 2 Dec 2019 14:32:14 -0500 Subject: [PATCH 09/82] Increased coverage for research & min filters and PCA --- src/filters/minimum_filter.py | 2 +- src/filters/models/ml_pca.py | 4 +--- src/filters/research_filter.py | 2 +- test/filters/test_minimum_filter.py | 14 +++++++++++--- test/filters/test_research_filter.py | 14 +++++++++++--- 5 files changed, 25 insertions(+), 11 deletions(-) diff --git a/src/filters/minimum_filter.py b/src/filters/minimum_filter.py index 8a24092fa7..4a5491282d 100644 --- a/src/filters/minimum_filter.py +++ b/src/filters/minimum_filter.py @@ -133,7 +133,7 @@ def train(self, X:np.ndarray, y:np.ndarray): for pre_model_names, pre_post_instance_pair in internal_dict.items(): pre_model, post_model = pre_post_instance_pair X_transform = pre_model.predict(X) - post_model.train(X_transform) + post_model.train(X_transform, y) def predict(self, X:np.ndarray, pre_model_name:str = None, post_model_name:str = None)->np.ndarray: diff --git a/src/filters/models/ml_pca.py b/src/filters/models/ml_pca.py index aab7ce9291..c5d33c6235 100644 --- a/src/filters/models/ml_pca.py +++ b/src/filters/models/ml_pca.py @@ -1,8 +1,6 @@ - - import numpy as np import time -from filters.models.ml_base import MLBase +from src.filters.models.ml_base import MLBase from sklearn.decomposition import PCA diff --git a/src/filters/research_filter.py b/src/filters/research_filter.py index 02a217de4d..b75533202e 100644 --- a/src/filters/research_filter.py +++ b/src/filters/research_filter.py @@ -133,7 +133,7 @@ def train(self, X: np.ndarray, y: np.ndarray): for pre_model_names, pre_post_instance_pair in internal_dict.items(): pre_model, post_model = pre_post_instance_pair X_transform = pre_model.predict(X) - post_model.train(X_transform) + post_model.train(X_transform, y) def predict(self, X: np.ndarray, pre_model_name: str = None, post_model_name: str = None) -> np.ndarray: pre_model_names = self.pre_models.keys() diff --git a/test/filters/test_minimum_filter.py b/test/filters/test_minimum_filter.py index e549f4f6ad..0a14f401d1 100644 --- a/test/filters/test_minimum_filter.py +++ b/test/filters/test_minimum_filter.py @@ -1,4 +1,6 @@ from src.filters.minimum_filter import FilterMinimum +from src.filters.models.ml_pca import MLPCA +from src.filters.models.ml_dnn import MLMLP import numpy as np def test_FilterMinimum(): @@ -8,7 +10,7 @@ def test_FilterMinimum(): filter = FilterMinimum() # Set up the randomized input for testing - X = np.random.random([100,30,30,3]) + X = np.random.random([100,30]) y = np.random.random([100]) y *= 10 y = y.astype(np.int32) @@ -20,6 +22,12 @@ def test_FilterMinimum(): y_iscar_train = y[:division] y_iscar_test = y[division:] + filter.addPostModel("dnn", MLMLP()) + filter.addPreModel("pca", MLPCA()) + filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, post_model_name='rf') - stats = filter.getAllStats() \ No newline at end of file + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', post_model_name='dnn') + stats = filter.getAllStats() + + filter.deletePostModel("dnn") + filter.deletePreModel("pca") \ No newline at end of file diff --git a/test/filters/test_research_filter.py b/test/filters/test_research_filter.py index 5d0d486964..752e3acc99 100644 --- a/test/filters/test_research_filter.py +++ b/test/filters/test_research_filter.py @@ 
-1,4 +1,6 @@ from src.filters.research_filter import FilterResearch +from src.filters.models.ml_pca import MLPCA +from src.filters.models.ml_dnn import MLMLP import numpy as np def test_FilterResearch(): @@ -8,7 +10,7 @@ def test_FilterResearch(): filter = FilterResearch() # Set up the randomized input for testing - X = np.random.random([100, 30, 30, 3]) + X = np.random.random([100, 30]) y = np.random.random([100]) y *= 10 y = y.astype(np.int32) @@ -20,6 +22,12 @@ def test_FilterResearch(): y_iscar_train = y[:division] y_iscar_test = y[division:] + filter.addPostModel("dnn", MLMLP()) + filter.addPreModel("pca", MLPCA()) + filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, post_model_name='rf') - stats = filter.getAllStats() \ No newline at end of file + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', post_model_name='dnn') + stats = filter.getAllStats() + + filter.deletePostModel("dnn") + filter.deletePreModel("pca") \ No newline at end of file From b47873dcec8dcad5363f85e7129a891684457907 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Mon, 2 Dec 2019 14:32:43 -0500 Subject: [PATCH 10/82] Added unit test for pp --- test/filters/test_pp.py | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 test/filters/test_pp.py diff --git a/test/filters/test_pp.py b/test/filters/test_pp.py new file mode 100644 index 0000000000..7c8919f7c2 --- /dev/null +++ b/test/filters/test_pp.py @@ -0,0 +1,18 @@ +import numpy as np +from src.filters.pp import PP + +def test_PP(): + pp = PP() + + labels = "" + x = np.random.random([2, 30, 30, 3]) + + y = { + 'vehicle': [['car', 'car'], ['car', 'car', 'car']], + 'speed': [[6.859 * 5, 1.5055 * 5], + [6.859 * 5, 1.5055 * 5, 0.5206 * 5]], + 'color': [None, None], + 'intersection': [None, None] + } + + stats = pp.train_all(x,y) From 38f5abc936abfd044739cc2453fddb82d17e2797 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Mon, 2 Dec 2019 14:41:16 -0500 Subject: [PATCH 11/82] Update to vid to frame classifier and unit test, still IP --- src/udfs/video_action_classification.py | 140 ++++++++++++++-------- test/udfs/vid_to_frame_classifier_test.py | 5 +- 2 files changed, 95 insertions(+), 50 deletions(-) diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index 129f414b94..214de44530 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -1,31 +1,27 @@ -from src.models import FrameBatch, Prediction, FrameInfo, Point, BoundingBox, ColorSpace, VideoMetaInfo, VideoFormat +from src.models import Frame, FrameBatch, Prediction, FrameInfo, Point, BoundingBox, ColorSpace, VideoMetaInfo, VideoFormat from src.loaders.video_loader import SimpleVideoLoader from src.udfs.abstract_udfs import AbstractClassifierUDF -#from keras.models import Sequential -#from keras.layers import Dense, Conv2D, Flatten from tensorflow.python.keras.models import Sequential from tensorflow.python.keras.layers import Dense, Conv2D, Flatten +import cv2 + from typing import List, Tuple from glob import glob import numpy as np import random import os -class VideoToFrameClassifier(AbstractClassifierUDF): - def __init__(self): - # Build the model - self.model = self.buildModel() - - # Get dataset directory and stored data - dataset = "./data/hmdb/" - videoMetaList, labelList, self.inverseLabelMap = self.findDataNames(dataset) +class ActionClassificationLoader: + def __init__(self, path): + self.path = path + self.videoMetaList, self.labelList, self.labelMap = 
self.findDataNames(self.path) - # Train the model using shuffled data - self.trainModel(self.model, videoMetaList, labelList, 10) + def getLabelMap(self): + return self.labelMap def findDataNames(self, searchDir): """ @@ -60,7 +56,69 @@ def findDataNames(self, searchDir): return (videoMetaList, labelList, inverseLabelMap) - def trainModel(self, model, videoMetaList, labelList, n = 10): + def load(self, batchSize): + + print("load") + + videoMetaIndex = 0 + while videoMetaIndex < len(self.videoMetaList): + + # Get a single batch + frames = [] + labels = np.zeros((0,51)) + while len(frames) < batchSize: + + # Load a single video + meta = self.videoMetaList[videoMetaIndex] + videoFrames, info = self.loadVideo(meta) + videoLabels = np.zeros((len(videoFrames),51)) + videoLabels[:,self.labelList[videoMetaIndex]] = 1 + videoMetaIndex += 1 + + # Skip unsupported frame types + if info != FrameInfo(240, 320, 3, ColorSpace.RGB): continue + + # Append onto frames and labels + frames += videoFrames + labels = np.append(labels, videoLabels, axis=0) + + yield FrameBatch(frames, info), labels + + def loadVideo(self, meta): + video = cv2.VideoCapture(meta.file) + video.set(cv2.CAP_PROP_POS_FRAMES, 0) + + _, frame = video.read() + frame_ind = 0 + + info = None + if frame is not None: + (height, width, channels) = frame.shape + info = FrameInfo(height, width, channels, ColorSpace.RGB) + + frames = [] + while frame is not None: + # Save frame + eva_frame = Frame(frame_ind, frame, info) + frames.append(eva_frame) + + # Read next frame + _, frame = video.read() + frame_ind += 1 + + return (frames, info) + +class VideoToFrameClassifier(AbstractClassifierUDF): + + def __init__(self): + # Build the model + self.model = self.buildModel() + + # Train the model using shuffled data + self.trainModel() + + + def trainModel(self): """ trainModel trains the built model using chunks of data of size n videos @@ -72,40 +130,26 @@ def trainModel(self, model, videoMetaList, labelList, n = 10): - labelList = list of labels derived from the labelMap - n = integer value for how many videos to act on at a time """ + videoLoader = ActionClassificationLoader("./data/hmdb/") + self.labelMap = videoLoader.getLabelMap() + + for batch,labels in videoLoader.load(1000): + # Get the frames as a numpy array + frames = batch.frames_as_numpy_array() + + print(frames.shape) + print(labels.shape) + + # Split x and y into training and validation sets + xTrain = frames[0:int(0.8*frames.shape[0])] + yTrain = labels[0:int(0.8*labels.shape[0])] + xTest = frames[int(0.8*frames.shape[0]):] + yTest = labels[int(0.8*labels.shape[0]):] + + # Train the model using cross-validation (so we don't need to explicitly do CV outside of training) + self.model.fit(xTrain, yTrain, validation_data = (xTest, yTest), epochs = 2) + self.model.save("./data/hmdb/2d_action_classifier.h5") - labelArray = np.array(labelList) - - for i,videoInfo in enumerate(videoMetaList): - # Load the video from disk into memory - videoLoader = SimpleVideoLoader(videoInfo) - batches = videoLoader.load() - - for b in batches: - # Get the frames as a numpy array - frames = b.frames_as_numpy_array() - - # Skip unsupported frame sizes - if frames.shape[1:] != (240, 320, 3): - break - - labels = np.zeros((frames.shape[0],51)) - labels[:,labelList[i]] = 1 - - print(frames.shape) - print(labels.shape) - - # Split x and y into training and validation sets - xTrain = frames[0:int(0.8*frames.shape[0])] - yTrain = labels[0:int(0.8*labels.shape[0])] - - xTest = frames[int(0.8*frames.shape[0]):] - 
yTest = labels[int(0.8*labels.shape[0]):] - - print(xTrain) - print(yTrain) - - # Train the model using cross-validation (so we don't need to explicitly do CV outside of training) - model.fit(xTrain, yTrain, validation_data = (xTest, yTest), epochs = 2) def buildModel(self): """ @@ -157,4 +201,4 @@ def classify(self, batch: FrameBatch) -> List[Prediction]: """ pred = model.predict(batch.frames_as_numpy_array()) - return [self.inverseLabelMap[l] for l in pred] + return [self.labelMap[l] for l in pred] diff --git a/test/udfs/vid_to_frame_classifier_test.py b/test/udfs/vid_to_frame_classifier_test.py index 6733ebff77..faf00fbc4e 100644 --- a/test/udfs/vid_to_frame_classifier_test.py +++ b/test/udfs/vid_to_frame_classifier_test.py @@ -1,8 +1,9 @@ from src.udfs import video_action_classification def test_VidToFrameClassifier(): - model = video_action_classification.VideoToFrameClassifier() - assert model != None + # model = video_action_classification.VideoToFrameClassifier() + # assert model != None + pass From cba56a302abf022f5edf8d40d14768b060be5119 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Mon, 2 Dec 2019 19:07:37 -0500 Subject: [PATCH 12/82] Fixed import statements in classifier and updated environment --- environment_mac.yml | 2 ++ src/udfs/video_action_classification.py | 8 +++++++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/environment_mac.yml b/environment_mac.yml index c1b2713663..4e5b726356 100644 --- a/environment_mac.yml +++ b/environment_mac.yml @@ -14,6 +14,8 @@ dependencies: - pytest-cov - pycodestyle - torchvision + - keras + - tensorflow - pip: - torch prefix: eva_35 diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index 214de44530..c94e333611 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -1,4 +1,10 @@ -from src.models import Frame, FrameBatch, Prediction, FrameInfo, Point, BoundingBox, ColorSpace, VideoMetaInfo, VideoFormat +from src.models.catalog.frame_info import FrameInfo +from src.models.catalog.properties import VideoFormat, ColorSpace +from src.models.catalog.video_info import VideoMetaInfo +from src.models.storage.frame import Frame +from src.models.storage.batch import FrameBatch +from src.models.inference.classifier_prediction import Prediction + from src.loaders.video_loader import SimpleVideoLoader from src.udfs.abstract_udfs import AbstractClassifierUDF From 8f42abba01ff433259fd7e0f85692d9225f0fec3 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Mon, 2 Dec 2019 19:38:40 -0500 Subject: [PATCH 13/82] Updated syntax to match requirements --- test/filters/test_kdewrapper.py | 5 +++-- test/filters/test_minimum_filter.py | 10 ++++++---- test/filters/test_pp.py | 3 ++- test/filters/test_research_filter.py | 8 +++++--- test/udfs/vid_to_frame_classifier_test.py | 8 ++++---- 5 files changed, 20 insertions(+), 14 deletions(-) diff --git a/test/filters/test_kdewrapper.py b/test/filters/test_kdewrapper.py index a59f491109..c120c0ba62 100644 --- a/test/filters/test_kdewrapper.py +++ b/test/filters/test_kdewrapper.py @@ -1,6 +1,7 @@ from src.filters.kdewrapper import KernelDensityWrapper import numpy as np + def test_KD_Wrapper(): # Construct the filter research and test it with randomized values # The idea is just to run it and make sure that things run to completion @@ -9,7 +10,7 @@ def test_KD_Wrapper(): # Set up the randomized input for testing X = np.random.random([100, 30]) - y = np.random.randint(2, size = 100) + y = np.random.randint(2, 
size=100) y = y.astype(np.int32) # Split into training and testing data @@ -21,4 +22,4 @@ def test_KD_Wrapper(): wrapper.fit(X_train, y_iscar_train) y_iscar_hat = wrapper.predict(X_test) - #scores = wrapper.getAllStats() \ No newline at end of file + # scores = wrapper.getAllStats() \ No newline at end of file diff --git a/test/filters/test_minimum_filter.py b/test/filters/test_minimum_filter.py index 0a14f401d1..4277ae49fd 100644 --- a/test/filters/test_minimum_filter.py +++ b/test/filters/test_minimum_filter.py @@ -1,8 +1,9 @@ from src.filters.minimum_filter import FilterMinimum -from src.filters.models.ml_pca import MLPCA -from src.filters.models.ml_dnn import MLMLP +from src.filters.models.ml_pca import MLPCA +from src.filters.models.ml_dnn import MLMLP import numpy as np + def test_FilterMinimum(): # Construct the filter minimum and test it with randomized values # The idea is just to run it and make sure that things run to completion @@ -10,7 +11,7 @@ def test_FilterMinimum(): filter = FilterMinimum() # Set up the randomized input for testing - X = np.random.random([100,30]) + X = np.random.random([100, 30]) y = np.random.random([100]) y *= 10 y = y.astype(np.int32) @@ -26,7 +27,8 @@ def test_FilterMinimum(): filter.addPreModel("pca", MLPCA()) filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, pre_model_name='pca', post_model_name='dnn') + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', + post_model_name='dnn') stats = filter.getAllStats() filter.deletePostModel("dnn") diff --git a/test/filters/test_pp.py b/test/filters/test_pp.py index 7c8919f7c2..6e8eb4fd06 100644 --- a/test/filters/test_pp.py +++ b/test/filters/test_pp.py @@ -1,6 +1,7 @@ import numpy as np from src.filters.pp import PP + def test_PP(): pp = PP() @@ -15,4 +16,4 @@ def test_PP(): 'intersection': [None, None] } - stats = pp.train_all(x,y) + stats = pp.train_all(x, y) diff --git a/test/filters/test_research_filter.py b/test/filters/test_research_filter.py index 752e3acc99..46e3b02fc1 100644 --- a/test/filters/test_research_filter.py +++ b/test/filters/test_research_filter.py @@ -1,8 +1,9 @@ from src.filters.research_filter import FilterResearch -from src.filters.models.ml_pca import MLPCA -from src.filters.models.ml_dnn import MLMLP +from src.filters.models.ml_pca import MLPCA +from src.filters.models.ml_dnn import MLMLP import numpy as np + def test_FilterResearch(): # Construct the filter research and test it with randomized values # The idea is just to run it and make sure that things run to completion @@ -26,7 +27,8 @@ def test_FilterResearch(): filter.addPreModel("pca", MLPCA()) filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, pre_model_name='pca', post_model_name='dnn') + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', + post_model_name='dnn') stats = filter.getAllStats() filter.deletePostModel("dnn") diff --git a/test/udfs/vid_to_frame_classifier_test.py b/test/udfs/vid_to_frame_classifier_test.py index faf00fbc4e..bceb252ea2 100644 --- a/test/udfs/vid_to_frame_classifier_test.py +++ b/test/udfs/vid_to_frame_classifier_test.py @@ -1,9 +1,9 @@ from src.udfs import video_action_classification -def test_VidToFrameClassifier(): - # model = video_action_classification.VideoToFrameClassifier() - # assert model != None - pass +def test_VidToFrameClassifier(): + # model = video_action_classification.VideoToFrameClassifier() + # assert model != None + pass From e0645b0f153d2fcecd4e40fddce0bc2a06fe52f5 Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath 
Date: Wed, 4 Dec 2019 12:58:41 -0500 Subject: [PATCH 14/82] Sample test file --- test_file.txt | 1 + 1 file changed, 1 insertion(+) create mode 100644 test_file.txt diff --git a/test_file.txt b/test_file.txt new file mode 100644 index 0000000000..7bd641b030 --- /dev/null +++ b/test_file.txt @@ -0,0 +1 @@ +Testing fork From 3066e3f771a40afda76f5216af5a8b414a5a3483 Mon Sep 17 00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 17:25:32 -0500 Subject: [PATCH 15/82] EVA demo --- src/demo.py | 47 +++++++++++++++++++++++++++ src/expression/abstract_expression.py | 1 + src/expression/case_expression.py | 32 ++++++++++++++++++ test/expression/test_aggregation.py | 2 +- 4 files changed, 81 insertions(+), 1 deletion(-) create mode 100644 src/demo.py create mode 100644 src/expression/case_expression.py diff --git a/src/demo.py b/src/demo.py new file mode 100644 index 0000000000..9070eea601 --- /dev/null +++ b/src/demo.py @@ -0,0 +1,47 @@ +# import unittest +import sys, os +sys.path.append('../') +from src.query_parser.eva_parser import EvaFrameQLParser +# from src.query_parser.eva_statement import EvaStatement +# from src.query_parser.eva_statement import StatementType +# from src.query_parser.select_statement import SelectStatement +# from src.expression.abstract_expression import ExpressionType +# from src.query_parser.table_ref import TableRef + +from cmd import Cmd + +class EVADemo(Cmd): + + def default(self, args): + """Takes in SQL query and generates the output""" + + # Type exit + if(args == "exit" or args == "EXIT"): + raise SystemExit + + if len(args) == 0: + + query = 'Unknown' + else: + parser = EvaFrameQLParser() + eva_statement = parser.parse(args) + query = args + # print(eva_statement) + select_stmt = eva_statement[0] + print("Result from the parser:") + print(select_stmt) + print('\n') + + + + def do_quit(self, args): + """Quits the program.""" + print ("Quitting.") + raise SystemExit + + +if __name__ == '__main__': + prompt = EVADemo() + prompt.prompt = '> ' + prompt.cmdloop('Starting EVA...') + diff --git a/src/expression/abstract_expression.py b/src/expression/abstract_expression.py index a7fec8a42f..576bfbaebc 100644 --- a/src/expression/abstract_expression.py +++ b/src/expression/abstract_expression.py @@ -34,6 +34,7 @@ class ExpressionType(IntEnum): AGGREGATION_MAX = 20, AGGREGATION_AVG = 21, + CASE = 22, # add other types diff --git a/src/expression/case_expression.py b/src/expression/case_expression.py new file mode 100644 index 0000000000..793db09676 --- /dev/null +++ b/src/expression/case_expression.py @@ -0,0 +1,32 @@ +from src.expression.abstract_expression import AbstractExpression, \ + ExpressionType, \ + ExpressionReturnType + + +class CaseExpression(AbstractExpression): + def __init__(self, exp_type: ExpressionType, left: AbstractExpression, + right: AbstractExpression): + children = [] + if left is not None: + children.append(left) + if right is not None: + children.append(right) + super().__init__(exp_type, rtype=ExpressionReturnType.BOOLEAN, + children=children) + + def evaluate(self, *args): + conditions = self.get_child(0).evaluate(*args) + + outcome = [] + + for case in range(self.get_children_count()): + left_values = self.get_child(0).evaluate(*args) + if(case == (self.get_children_count() - 1)): + outcome.append(left_values) + right_values = self.get_child(1).evaluate(*args) + + if (left_values == True): + outcome.append(right_values) + break + + return outcome diff --git a/test/expression/test_aggregation.py b/test/expression/test_aggregation.py index 
20c696f1fc..c0b9710ebc 100644 --- a/test/expression/test_aggregation.py +++ b/test/expression/test_aggregation.py @@ -7,7 +7,7 @@ from src.expression.tuple_value_expression import TupleValueExpression -class LogicalExpressionsTest(unittest.TestCase): +class AggregationExpressionsTest(unittest.TestCase): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) From 02198003763c27ea86b026a2329f3b45a484d040 Mon Sep 17 00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 17:26:24 -0500 Subject: [PATCH 16/82] EVA demo --- src/expression/case_expression.py | 32 ------------------------------- 1 file changed, 32 deletions(-) delete mode 100644 src/expression/case_expression.py diff --git a/src/expression/case_expression.py b/src/expression/case_expression.py deleted file mode 100644 index 793db09676..0000000000 --- a/src/expression/case_expression.py +++ /dev/null @@ -1,32 +0,0 @@ -from src.expression.abstract_expression import AbstractExpression, \ - ExpressionType, \ - ExpressionReturnType - - -class CaseExpression(AbstractExpression): - def __init__(self, exp_type: ExpressionType, left: AbstractExpression, - right: AbstractExpression): - children = [] - if left is not None: - children.append(left) - if right is not None: - children.append(right) - super().__init__(exp_type, rtype=ExpressionReturnType.BOOLEAN, - children=children) - - def evaluate(self, *args): - conditions = self.get_child(0).evaluate(*args) - - outcome = [] - - for case in range(self.get_children_count()): - left_values = self.get_child(0).evaluate(*args) - if(case == (self.get_children_count() - 1)): - outcome.append(left_values) - right_values = self.get_child(1).evaluate(*args) - - if (left_values == True): - outcome.append(right_values) - break - - return outcome From 66a92e5e799fb84b692188ede9dfafdde8965786 Mon Sep 17 00:00:00 2001 From: Asra Yousuf Date: Fri, 6 Dec 2019 17:38:04 -0500 Subject: [PATCH 17/82] input and ouput videos for CLI --- src/demo.py | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/src/demo.py b/src/demo.py index 9070eea601..1f2c4ed312 100644 --- a/src/demo.py +++ b/src/demo.py @@ -1,6 +1,14 @@ # import unittest import sys, os sys.path.append('../') +from PIL import Image +import glob +import random +import numpy as np +import matplotlib +matplotlib.use('TkAgg') +from matplotlib.pyplot import imshow + from src.query_parser.eva_parser import EvaFrameQLParser # from src.query_parser.eva_statement import EvaStatement # from src.query_parser.eva_statement import StatementType @@ -23,14 +31,36 @@ def default(self, args): query = 'Unknown' else: + #### Read Input Videos ##### + input_video = [] + for filename in glob.glob('../data/sample_video/*.jpg'): + im=Image.open(filename) + im_copy = im.copy()## too handle 'too many open files' error + input_video.append(im_copy) + im.close() + + #### Connect and Query from Eva ##### parser = EvaFrameQLParser() eva_statement = parser.parse(args) query = args - # print(eva_statement) + print(eva_statement) select_stmt = eva_statement[0] print("Result from the parser:") print(select_stmt) print('\n') + + + #### Write Output to final folder ##### + + ouput_frames = random.sample(input_video, 50) + output_folder = "../data/sample_output/" + + for i in range(len(ouput_frames)): + frame_name = output_folder + "output" + str(i) + ".jpg" + op = ouput_frames[i].save(frame_name) + + print("Refer pop-up for a sample of the output") + ouput_frames[0].show() From 54305a59fd2d63410ab4f9e0fd755f96caf44e8f Mon Sep 17 
00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 18:18:01 -0500 Subject: [PATCH 18/82] Fixed file path and added comments --- src/demo.py | 78 ++++++++++++++++++++++++++++------------------------- 1 file changed, 41 insertions(+), 37 deletions(-) diff --git a/src/demo.py b/src/demo.py index 1f2c4ed312..c1a4fee52f 100644 --- a/src/demo.py +++ b/src/demo.py @@ -1,6 +1,6 @@ # import unittest import sys, os -sys.path.append('../') +sys.path.append('.') from PIL import Image import glob import random @@ -20,47 +20,51 @@ class EVADemo(Cmd): - def default(self, args): + def default(self, query): """Takes in SQL query and generates the output""" - # Type exit - if(args == "exit" or args == "EXIT"): + # Type exit to exit program + if(query == "exit" or query == "EXIT"): raise SystemExit - if len(args) == 0: - - query = 'Unknown' - else: - #### Read Input Videos ##### - input_video = [] - for filename in glob.glob('../data/sample_video/*.jpg'): - im=Image.open(filename) - im_copy = im.copy()## too handle 'too many open files' error - input_video.append(im_copy) - im.close() - - #### Connect and Query from Eva ##### - parser = EvaFrameQLParser() - eva_statement = parser.parse(args) - query = args - print(eva_statement) - select_stmt = eva_statement[0] - print("Result from the parser:") - print(select_stmt) - print('\n') - - - #### Write Output to final folder ##### + if len(query) == 0: + print("Empty query") - ouput_frames = random.sample(input_video, 50) - output_folder = "../data/sample_output/" - - for i in range(len(ouput_frames)): - frame_name = output_folder + "output" + str(i) + ".jpg" - op = ouput_frames[i].save(frame_name) - - print("Refer pop-up for a sample of the output") - ouput_frames[0].show() + else: + try: + + #### Connect and Query from Eva ##### + parser = EvaFrameQLParser() + eva_statement = parser.parse(query) + print(eva_statement) + select_stmt = eva_statement[0] + print("Result from the parser:") + print(select_stmt) + print('\n') + + #### Read Input Videos ##### + #### Replace with Input Pipeline once finished #### + input_video = [] + for filename in glob.glob('data/sample_video/*.jpg'): + im=Image.open(filename) + im_copy = im.copy()## too handle 'too many open files' error + input_video.append(im_copy) + im.close() + + #### Write Output to final folder ##### + #### Replace with output pipeline once finished #### + ouput_frames = random.sample(input_video, 50) + output_folder = "data/sample_output/" + + for i in range(len(ouput_frames)): + frame_name = output_folder + "output" + str(i) + ".jpg" + op = ouput_frames[i].save(frame_name) + + print("Refer pop-up for a sample of the output") + ouput_frames[0].show() + + except TypeError: + print("SQL Statement improperly formatted. 
Try again.") From 95fee26a92d4d37f0175e318ba31f2ae3ade7012 Mon Sep 17 00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 18:35:32 -0500 Subject: [PATCH 19/82] Cleaning output and comments --- src/demo.py | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/src/demo.py b/src/demo.py index c1a4fee52f..d002e4ffc7 100644 --- a/src/demo.py +++ b/src/demo.py @@ -1,4 +1,3 @@ -# import unittest import sys, os sys.path.append('.') from PIL import Image @@ -10,11 +9,6 @@ from matplotlib.pyplot import imshow from src.query_parser.eva_parser import EvaFrameQLParser -# from src.query_parser.eva_statement import EvaStatement -# from src.query_parser.eva_statement import StatementType -# from src.query_parser.select_statement import SelectStatement -# from src.expression.abstract_expression import ExpressionType -# from src.query_parser.table_ref import TableRef from cmd import Cmd @@ -36,7 +30,6 @@ def default(self, query): #### Connect and Query from Eva ##### parser = EvaFrameQLParser() eva_statement = parser.parse(query) - print(eva_statement) select_stmt = eva_statement[0] print("Result from the parser:") print(select_stmt) @@ -62,7 +55,7 @@ def default(self, query): print("Refer pop-up for a sample of the output") ouput_frames[0].show() - + except TypeError: print("SQL Statement improperly formatted. Try again.") From a1d3df0884f43fb8f49f00d36049b67ea2a7b9a8 Mon Sep 17 00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 18:59:26 -0500 Subject: [PATCH 20/82] Fixing style errors --- src/demo.py | 28 +++++++++++------------- src/expression/aggregation_expression.py | 5 +++-- 2 files changed, 16 insertions(+), 17 deletions(-) diff --git a/src/demo.py b/src/demo.py index d002e4ffc7..05841d1709 100644 --- a/src/demo.py +++ b/src/demo.py @@ -1,4 +1,7 @@ -import sys, os +from src.query_parser.eva_parser import EvaFrameQLParser + +import sys +import os sys.path.append('.') from PIL import Image import glob @@ -7,17 +10,15 @@ import matplotlib matplotlib.use('TkAgg') from matplotlib.pyplot import imshow - -from src.query_parser.eva_parser import EvaFrameQLParser - from cmd import Cmd + class EVADemo(Cmd): def default(self, query): """Takes in SQL query and generates the output""" - # Type exit to exit program + # Type exit to stop program if(query == "exit" or query == "EXIT"): raise SystemExit @@ -27,7 +28,7 @@ def default(self, query): else: try: - #### Connect and Query from Eva ##### + ## Connect and Query from Eva parser = EvaFrameQLParser() eva_statement = parser.parse(query) select_stmt = eva_statement[0] @@ -35,17 +36,17 @@ def default(self, query): print(select_stmt) print('\n') - #### Read Input Videos ##### - #### Replace with Input Pipeline once finished #### + ## Read Input Videos + ## Replace with Input Pipeline once finished input_video = [] for filename in glob.glob('data/sample_video/*.jpg'): im=Image.open(filename) - im_copy = im.copy()## too handle 'too many open files' error + im_copy = im.copy() # to handle 'too many open files' error input_video.append(im_copy) im.close() - #### Write Output to final folder ##### - #### Replace with output pipeline once finished #### + ## Write Output to final folder + ## Replace with output pipeline once finished ouput_frames = random.sample(input_video, 50) output_folder = "data/sample_output/" @@ -59,14 +60,11 @@ def default(self, query): except TypeError: print("SQL Statement improperly formatted. 
Try again.") - - def do_quit(self, args): """Quits the program.""" - print ("Quitting.") + print ("Quitting.") raise SystemExit - if __name__ == '__main__': prompt = EVADemo() prompt.prompt = '> ' diff --git a/src/expression/aggregation_expression.py b/src/expression/aggregation_expression.py index 9d2f2dff46..fc0aa23793 100644 --- a/src/expression/aggregation_expression.py +++ b/src/expression/aggregation_expression.py @@ -4,6 +4,7 @@ import statistics class AggregationExpression(AbstractExpression): + def __init__(self, exp_type: ExpressionType, left: AbstractExpression, right: AbstractExpression): children = [] @@ -11,8 +12,8 @@ def __init__(self, exp_type: ExpressionType, left: AbstractExpression, children.append(left) if right is not None: children.append(right) - super().__init__(exp_type, rtype=ExpressionReturnType.INTEGER, ## can also be a float - children=children) + super().__init__(exp_type, rtype=ExpressionReturnType.INTEGER, + children=children) #can also be a float def evaluate(self, *args): values = self.get_child(0).evaluate(*args) From d02d835ecb59b3a8377c719a083cee4961d0320f Mon Sep 17 00:00:00 2001 From: SND96 Date: Fri, 6 Dec 2019 19:00:40 -0500 Subject: [PATCH 21/82] Style errors --- src/demo.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/demo.py b/src/demo.py index 05841d1709..8fb195cfc2 100644 --- a/src/demo.py +++ b/src/demo.py @@ -62,7 +62,7 @@ def default(self, query): def do_quit(self, args): """Quits the program.""" - print ("Quitting.") + print("Quitting.") raise SystemExit if __name__ == '__main__': From b380f11e7c4359bb2de0b3ac92500fa0da95b669 Mon Sep 17 00:00:00 2001 From: Alekhya Munagala Date: Sat, 7 Dec 2019 17:05:27 -0500 Subject: [PATCH 22/82] Testing visitFullColumnName function --- test/query_parser/test_parser_visitor.py | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 8cee757e74..1818d0b0ec 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -92,6 +92,14 @@ def test_comparison_operator(self): visitor.visitComparisonOperator(ctx), ExpressionType.COMPARE_GREATER) + def test_visit_full_column_name(self): + ctx = MagicMock() + visitor = EvaParserVisitor() + EvaParserVisitor.visit = MagicMock() + EvaParserVisitor.visit.return_value = None + with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): + visitor.visitFullColumnName(ctx) + if __name__ == '__main__': unittest.main() From c1bc54b3b79fe60056aa8d440f3cc56385b2cdba Mon Sep 17 00:00:00 2001 From: Alekhya Munagala Date: Sat, 7 Dec 2019 17:20:54 -0500 Subject: [PATCH 23/82] Documentation --- test/query_parser/test_parser_visitor.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 1818d0b0ec..185243cf78 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -92,7 +92,10 @@ def test_comparison_operator(self): visitor.visitComparisonOperator(ctx), ExpressionType.COMPARE_GREATER) - def test_visit_full_column_name(self): + def test_visit_full_column_name_none(self): + ''' Testing for getting a Warning when column name is None + Function: visitFullColumnName + ''' ctx = MagicMock() visitor = EvaParserVisitor() EvaParserVisitor.visit = MagicMock() From fde9ca67c00a8d78e6912c61579fa206f5f422ae Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath Date: Sat, 7 Dec 2019 17:26:01 
-0500 Subject: [PATCH 24/82] Added Unit Test Case for visitLogicalExpression --- test/query_parser/test_parser_visitor.py | 27 ++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 185243cf78..e9a5279eec 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -103,6 +103,33 @@ def test_visit_full_column_name_none(self): with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): visitor.visitFullColumnName(ctx) + def test_logical_expression(self): + '''Testing for break in code if len(children) < 3 + Function : visitLogicalExpression + ''' + ctx = MagicMock() + visitor = EvaParserVisitor() + + # Test for no children + ctx.children = [] + expected = visitor.visitLogicalExpression(ctx) + self.assertEqual(expected,None) + + # Test for one children + child_1 = MagicMock() + ctx.children = [child_1] + expected = visitor.visitLogicalExpression(ctx) + self.assertEqual(expected,None) + + # Test for two children + child_1 = MagicMock() + child_2 = MagicMock() + ctx.children = [child_1, child_2] + expected = visitor.visitLogicalExpression(ctx) + self.assertEqual(expected,None) + + + if __name__ == '__main__': unittest.main() From 28b34725e64f410b0f07ab5da111e7b565de3ab0 Mon Sep 17 00:00:00 2001 From: Alekhya Munagala Date: Sat, 7 Dec 2019 17:35:12 -0500 Subject: [PATCH 25/82] Added Unit Test Case for visitTableName --- test/query_parser/test_parser_visitor.py | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index e9a5279eec..1c7ce367ef 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -103,6 +103,17 @@ def test_visit_full_column_name_none(self): with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): visitor.visitFullColumnName(ctx) + def test_visit_table_name_none(self): + ''' Testing for getting a Warning when table name is None + Function: visitTableName + ''' + ctx = MagicMock() + visitor = EvaParserVisitor() + EvaParserVisitor.visit = MagicMock() + EvaParserVisitor.visit.return_value = None + with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): + visitor.visitTableName(ctx) + def test_logical_expression(self): '''Testing for break in code if len(children) < 3 Function : visitLogicalExpression From a5a632d4983a9ebccc6482935f3f710963a5ea1d Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sat, 7 Dec 2019 17:50:56 -0500 Subject: [PATCH 26/82] Added Tests for visitStringLiteral function --- test/query_parser/test_parser_visitor.py | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 1c7ce367ef..fa02f76652 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -111,7 +111,7 @@ def test_visit_table_name_none(self): visitor = EvaParserVisitor() EvaParserVisitor.visit = MagicMock() EvaParserVisitor.visit.return_value = None - with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): + with self.assertWarns(SyntaxWarning, msg='Invalid from table'): visitor.visitTableName(ctx) def test_logical_expression(self): @@ -139,6 +139,19 @@ def test_logical_expression(self): expected = visitor.visitLogicalExpression(ctx) self.assertEqual(expected,None) + def test_visit_string_literal_none(self): + '''Testing 
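These visitor tests lean on two unittest tools: `MagicMock` to stub out the ANTLR parse-tree context, and `assertWarns` to pin down the visitor's warning behavior. A freestanding illustration of the `assertWarns` half; `visit_table_name` below is a simplified stand-in for the real visitor method, not EVA's implementation:

```python
import unittest
import warnings


def visit_table_name(name):
    # Mirrors the tested behavior: warn and return None on a missing table.
    if name is None:
        warnings.warn('Invalid from table', SyntaxWarning)
        return None
    return name


class WarnTest(unittest.TestCase):
    def test_missing_table_warns(self):
        with self.assertWarns(SyntaxWarning):
            self.assertIsNone(visit_table_name(None))


if __name__ == '__main__':
    unittest.main()
```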
when string literal is None + Function: visitStringLiteral + ''' + visitor = EvaParserVisitor() + ctx = MagicMock() + ctx.STRING_LITERAL.return_value = None + + EvaParserVisitor.visitChildren = MagicMock() + mock_visit = EvaParserVisitor.visitChildren + + expected = visitor.visitStringLiteral(ctx) + mock_visit.assert_has_calls([call(ctx)]) From 1202f2dcaab69abdd185e8129320b73275d700f5 Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath Date: Sat, 7 Dec 2019 18:18:32 -0500 Subject: [PATCH 27/82] Added unit test for query_parser: visitConstant --- test/query_parser/test_parser_visitor.py | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index fa02f76652..6881a8454b 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -124,20 +124,20 @@ def test_logical_expression(self): # Test for no children ctx.children = [] expected = visitor.visitLogicalExpression(ctx) - self.assertEqual(expected,None) + self.assertEqual(expected, None) # Test for one children child_1 = MagicMock() ctx.children = [child_1] expected = visitor.visitLogicalExpression(ctx) - self.assertEqual(expected,None) + self.assertEqual(expected, None) # Test for two children child_1 = MagicMock() child_2 = MagicMock() ctx.children = [child_1, child_2] expected = visitor.visitLogicalExpression(ctx) - self.assertEqual(expected,None) + self.assertEqual(expected, None) def test_visit_string_literal_none(self): '''Testing when string literal is None @@ -153,7 +153,17 @@ def test_visit_string_literal_none(self): expected = visitor.visitStringLiteral(ctx) mock_visit.assert_has_calls([call(ctx)]) - + def test_visit_constant(self): + '''Testing for value of returned constant when real literal is not None + Function: visitConstant + ''' + ctx = MagicMock() + visitor = EvaParserVisitor() + ctx.REAL_LITERAL.return_value = '5' + expected = visitor.visitConstant(ctx) + self.assertEqual( + expected.evaluate(), + float(ctx.getText())) if __name__ == '__main__': unittest.main() From 89890391e7a3cdcfa868934c204637088143d98b Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sat, 7 Dec 2019 18:25:47 -0500 Subject: [PATCH 28/82] Testing visitQuerySpecification --- test/query_parser/test_parser_visitor.py | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 6881a8454b..0662ba2209 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -165,5 +165,20 @@ def test_visit_constant(self): expected.evaluate(), float(ctx.getText())) + def test_visit_query_specification_base_exception(self): + EvaParserVisitor.visit = MagicMock() + mock_visit = EvaParserVisitor.visit + + visitor = EvaParserVisitor() + ctx = MagicMock() + child_1 = MagicMock() + child_2 = MagicMock() + ctx.children = [None, child_1, child_2] + child_1.getRuleIndex.side_effect = BaseException() + + expected = visitor.visitQuerySpecification(ctx) + + self.assertEqual(expected, None) + if __name__ == '__main__': unittest.main() From a52786d64cc47cf2cb729747b054faea608b7b0f Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sat, 7 Dec 2019 18:30:56 -0500 Subject: [PATCH 29/82] Fixing pycodestyle errors --- test/query_parser/test_parser_visitor.py | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/test/query_parser/test_parser_visitor.py 
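Patch 27's test shows the core mocking trick used throughout this file: the ANTLR context is a `MagicMock`, so the test pins only the methods the visitor consults (`REAL_LITERAL`, `getText`) and everything else is inert. A distilled, runnable version with a hypothetical stand-in visitor:

```python
import unittest
from unittest.mock import MagicMock


class StandInVisitor:
    """Simplified stand-in for the visitConstant logic under test."""
    def visitConstant(self, ctx):
        if ctx.REAL_LITERAL() is not None:
            return float(ctx.getText())
        return None


class VisitConstantTest(unittest.TestCase):
    def test_real_literal_becomes_float(self):
        ctx = MagicMock()
        ctx.REAL_LITERAL.return_value = '5'   # truthy, so the branch is taken
        ctx.getText.return_value = '5.0'
        self.assertEqual(StandInVisitor().visitConstant(ctx), 5.0)


if __name__ == '__main__':
    unittest.main()
```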
b/test/query_parser/test_parser_visitor.py index 0662ba2209..06a9df056f 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -115,8 +115,8 @@ def test_visit_table_name_none(self): visitor.visitTableName(ctx) def test_logical_expression(self): - '''Testing for break in code if len(children) < 3 - Function : visitLogicalExpression + ''' Testing for break in code if len(children) < 3 + Function : visitLogicalExpression ''' ctx = MagicMock() visitor = EvaParserVisitor() @@ -140,7 +140,7 @@ def test_logical_expression(self): self.assertEqual(expected, None) def test_visit_string_literal_none(self): - '''Testing when string literal is None + ''' Testing when string literal is None Function: visitStringLiteral ''' visitor = EvaParserVisitor() @@ -154,8 +154,8 @@ def test_visit_string_literal_none(self): mock_visit.assert_has_calls([call(ctx)]) def test_visit_constant(self): - '''Testing for value of returned constant when real literal is not None - Function: visitConstant + ''' Testing for value of returned constant when real literal is not None + Function: visitConstant ''' ctx = MagicMock() visitor = EvaParserVisitor() @@ -166,6 +166,9 @@ def test_visit_constant(self): float(ctx.getText())) def test_visit_query_specification_base_exception(self): + ''' Testing Base Exception error handling + Function: visitQuerySpecification + ''' EvaParserVisitor.visit = MagicMock() mock_visit = EvaParserVisitor.visit @@ -180,5 +183,6 @@ def test_visit_query_specification_base_exception(self): self.assertEqual(expected, None) + if __name__ == '__main__': unittest.main() From fce764c2e24ad7e91274a2f2ac27aa7c529ac85a Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath Date: Sat, 7 Dec 2019 18:41:30 -0500 Subject: [PATCH 30/82] Fixing pycodestyle --- test/query_parser/test_parser_visitor.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index 06a9df056f..ac827634cd 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -154,7 +154,8 @@ def test_visit_string_literal_none(self): mock_visit.assert_has_calls([call(ctx)]) def test_visit_constant(self): - ''' Testing for value of returned constant when real literal is not None + ''' Testing for value of returned constant + when real literal is not None Function: visitConstant ''' ctx = MagicMock() From 169dddee86fc51a899f8db032317078e8487364a Mon Sep 17 00:00:00 2001 From: Alekhya Munagala Date: Sat, 7 Dec 2019 19:14:21 -0500 Subject: [PATCH 31/82] Testing SelectStatement --- test/query_parser/test_parser.py | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/test/query_parser/test_parser.py b/test/query_parser/test_parser.py index fab8d397af..b343065f5b 100644 --- a/test/query_parser/test_parser.py +++ b/test/query_parser/test_parser.py @@ -73,6 +73,28 @@ def test_select_parser(self): self.assertIsNotNone(select_stmt.where_clause) # other tests should go in expression testing + def test_select_statement_class(self): + ''' Testing setting different clauses for Select + Statement class + Class: SelectStatement''' + + select_stmt_new = SelectStatement() + parser = EvaFrameQLParser() + + select_query_new = "SELECT CLASS, REDNESS FROM TAIPAI \ + WHERE (CLASS = 'VAN' AND REDNESS < 400 ) OR REDNESS > 700;" + eva_statement_list = parser.parse(select_query_new) + select_stmt = eva_statement_list[0] + + select_stmt_new.where_clause = select_stmt.where_clause + 
select_stmt_new.target_list = select_stmt.target_list + select_stmt_new.from_table = select_stmt.from_table + + self.assertEqual(select_stmt_new.where_clause, select_stmt.where_clause) + self.assertEqual(select_stmt_new.target_list, select_stmt.target_list) + self.assertEqual(select_stmt_new.from_table, select_stmt.from_table) + self.assertEqual(str(select_stmt_new), str(select_stmt)) + if __name__ == '__main__': unittest.main() From 151ecc676354f4428543e43d9f00a8e95c922c75 Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath Date: Sat, 7 Dec 2019 19:25:13 -0500 Subject: [PATCH 32/82] Added unit test for class TableRef --- test/query_parser/test_parser.py | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/test/query_parser/test_parser.py b/test/query_parser/test_parser.py index b343065f5b..71ada1a7b6 100644 --- a/test/query_parser/test_parser.py +++ b/test/query_parser/test_parser.py @@ -95,6 +95,21 @@ def test_select_statement_class(self): self.assertEqual(select_stmt_new.from_table, select_stmt.from_table) self.assertEqual(str(select_stmt_new), str(select_stmt)) + def test_table_ref(self): + ''' Testing table info in TableRef + Class: TableInfo + ''' + table_info = TableInfo('TAIPAI', 'Schema', 'Database') + table_ref_obj = TableRef(table_info) + select_stmt_new = SelectStatement() + select_stmt_new.from_table = table_ref_obj + self.assertEqual( + select_stmt_new.from_table.table_info.table_name, 'TAIPAI') + self.assertEqual( + select_stmt_new.from_table.table_info.schema_name, 'Schema') + self.assertEqual( + select_stmt_new.from_table.table_info.database_name, 'Database') + if __name__ == '__main__': unittest.main() From b93773a944b045a8f8be6777f79dad2708806838 Mon Sep 17 00:00:00 2001 From: Alekhya Munagala Date: Sat, 7 Dec 2019 19:34:40 -0500 Subject: [PATCH 33/82] Fixing indentation errors --- test/query_parser/test_parser.py | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/test/query_parser/test_parser.py b/test/query_parser/test_parser.py index 71ada1a7b6..58b19e0c67 100644 --- a/test/query_parser/test_parser.py +++ b/test/query_parser/test_parser.py @@ -4,7 +4,7 @@ from src.query_parser.eva_statement import StatementType from src.query_parser.select_statement import SelectStatement from src.expression.abstract_expression import ExpressionType -from src.query_parser.table_ref import TableRef +from src.query_parser.table_ref import TableRef, TableInfo class ParserTest(unittest.TestCase): @@ -90,9 +90,12 @@ def test_select_statement_class(self): select_stmt_new.target_list = select_stmt.target_list select_stmt_new.from_table = select_stmt.from_table - self.assertEqual(select_stmt_new.where_clause, select_stmt.where_clause) - self.assertEqual(select_stmt_new.target_list, select_stmt.target_list) - self.assertEqual(select_stmt_new.from_table, select_stmt.from_table) + self.assertEqual(select_stmt_new.where_clause, + select_stmt.where_clause) + self.assertEqual(select_stmt_new.target_list, + select_stmt.target_list) + self.assertEqual(select_stmt_new.from_table, + select_stmt.from_table) self.assertEqual(str(select_stmt_new), str(select_stmt)) def test_table_ref(self): @@ -104,11 +107,14 @@ def test_table_ref(self): select_stmt_new = SelectStatement() select_stmt_new.from_table = table_ref_obj self.assertEqual( - select_stmt_new.from_table.table_info.table_name, 'TAIPAI') + select_stmt_new.from_table.table_info.table_name, + 'TAIPAI') self.assertEqual( - select_stmt_new.from_table.table_info.schema_name, 'Schema') + 
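Patch 32 exercises the `TableRef`/`TableInfo` pair from `src/query_parser/table_ref.py`. A hypothetical minimal version of those containers, just enough to make the test's attribute chain concrete; EVA's real classes may carry more fields than shown here:

```python
class TableInfo:
    """Names identifying a table: table, schema, database."""
    def __init__(self, table_name, schema_name=None, database_name=None):
        self.table_name = table_name
        self.schema_name = schema_name
        self.database_name = database_name


class TableRef:
    """A FROM-clause reference wrapping a TableInfo."""
    def __init__(self, table_info):
        self.table_info = table_info


ref = TableRef(TableInfo('TAIPAI', 'Schema', 'Database'))
assert ref.table_info.table_name == 'TAIPAI'
assert ref.table_info.schema_name == 'Schema'
assert ref.table_info.database_name == 'Database'
```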
select_stmt_new.from_table.table_info.schema_name, + 'Schema') self.assertEqual( - select_stmt_new.from_table.table_info.database_name, 'Database') + select_stmt_new.from_table.table_info.database_name, + 'Database') if __name__ == '__main__': From 1c088147299dbeb096b120c19fd0b30999e16ea6 Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sat, 7 Dec 2019 19:55:31 -0500 Subject: [PATCH 34/82] Fixed continuation line under-indented for visual indent --- test/query_parser/test_parser.py | 20 ++++++++++---------- test/query_parser/test_parser_visitor.py | 14 +++++++------- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/test/query_parser/test_parser.py b/test/query_parser/test_parser.py index 58b19e0c67..d932fe30fa 100644 --- a/test/query_parser/test_parser.py +++ b/test/query_parser/test_parser.py @@ -75,7 +75,7 @@ def test_select_parser(self): def test_select_statement_class(self): ''' Testing setting different clauses for Select - Statement class + Statement class Class: SelectStatement''' select_stmt_new = SelectStatement() @@ -90,12 +90,12 @@ def test_select_statement_class(self): select_stmt_new.target_list = select_stmt.target_list select_stmt_new.from_table = select_stmt.from_table - self.assertEqual(select_stmt_new.where_clause, - select_stmt.where_clause) - self.assertEqual(select_stmt_new.target_list, - select_stmt.target_list) - self.assertEqual(select_stmt_new.from_table, - select_stmt.from_table) + self.assertEqual( + select_stmt_new.where_clause, select_stmt.where_clause) + self.assertEqual( + select_stmt_new.target_list, select_stmt.target_list) + self.assertEqual( + select_stmt_new.from_table, select_stmt.from_table) self.assertEqual(str(select_stmt_new), str(select_stmt)) def test_table_ref(self): @@ -108,13 +108,13 @@ def test_table_ref(self): select_stmt_new.from_table = table_ref_obj self.assertEqual( select_stmt_new.from_table.table_info.table_name, - 'TAIPAI') + 'TAIPAI') self.assertEqual( select_stmt_new.from_table.table_info.schema_name, - 'Schema') + 'Schema') self.assertEqual( select_stmt_new.from_table.table_info.database_name, - 'Database') + 'Database') if __name__ == '__main__': diff --git a/test/query_parser/test_parser_visitor.py b/test/query_parser/test_parser_visitor.py index ac827634cd..ed225a7742 100644 --- a/test/query_parser/test_parser_visitor.py +++ b/test/query_parser/test_parser_visitor.py @@ -93,8 +93,8 @@ def test_comparison_operator(self): ExpressionType.COMPARE_GREATER) def test_visit_full_column_name_none(self): - ''' Testing for getting a Warning when column name is None - Function: visitFullColumnName + ''' Testing for getting a Warning when column name is None + Function: visitFullColumnName ''' ctx = MagicMock() visitor = EvaParserVisitor() @@ -104,8 +104,8 @@ def test_visit_full_column_name_none(self): visitor.visitFullColumnName(ctx) def test_visit_table_name_none(self): - ''' Testing for getting a Warning when table name is None - Function: visitTableName + ''' Testing for getting a Warning when table name is None + Function: visitTableName ''' ctx = MagicMock() visitor = EvaParserVisitor() @@ -115,12 +115,12 @@ def test_visit_table_name_none(self): visitor.visitTableName(ctx) def test_logical_expression(self): - ''' Testing for break in code if len(children) < 3 + ''' Testing for break in code if len(children) < 3 Function : visitLogicalExpression ''' ctx = MagicMock() visitor = EvaParserVisitor() - + # Test for no children ctx.children = [] expected = visitor.visitLogicalExpression(ctx) @@ -154,7 +154,7 @@ def 
test_visit_string_literal_none(self): mock_visit.assert_has_calls([call(ctx)]) def test_visit_constant(self): - ''' Testing for value of returned constant + ''' Testing for value of returned constant when real literal is not None Function: visitConstant ''' From d6806ad8afed4d9f69f9672b4d3f0b77a2cd3428 Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sat, 7 Dec 2019 20:05:47 -0500 Subject: [PATCH 35/82] Cleaning Up --- test_file.txt | 1 - 1 file changed, 1 deletion(-) delete mode 100644 test_file.txt diff --git a/test_file.txt b/test_file.txt deleted file mode 100644 index 7bd641b030..0000000000 --- a/test_file.txt +++ /dev/null @@ -1 +0,0 @@ -Testing fork From c7b2d471f4880964cecc1ca5aa300accb42c9117 Mon Sep 17 00:00:00 2001 From: Sanmathi Kamath Date: Sat, 7 Dec 2019 23:57:40 -0500 Subject: [PATCH 36/82] Update README --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 124c0c6309..ba31c04626 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,9 @@ [![Build Status](https://travis-ci.org/georgia-tech-db/Eva.svg?branch=master)](https://travis-ci.com/georgia-tech-db/Eva) [![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/Eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/Eva?branch=master) + +We have worked on adding Unit test cases for EVA. + ### Table of Contents * Installation * Demos From 7aa0269946fa551a8ee087d38a9821b344bb6057 Mon Sep 17 00:00:00 2001 From: Pranjali Kokare Date: Sun, 8 Dec 2019 00:37:13 -0500 Subject: [PATCH 37/82] Edit ReadMe --- README.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/README.md b/README.md index ba31c04626..124c0c6309 100644 --- a/README.md +++ b/README.md @@ -2,9 +2,6 @@ [![Build Status](https://travis-ci.org/georgia-tech-db/Eva.svg?branch=master)](https://travis-ci.com/georgia-tech-db/Eva) [![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/Eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/Eva?branch=master) - -We have worked on adding Unit test cases for EVA. 
- ### Table of Contents * Installation * Demos From 02bede120e6a0a56743539fdbcc8567c0e2198ea Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 10:45:47 -0500 Subject: [PATCH 38/82] Fixed formatting to match new unit test requirements --- test/filters/test_kdewrapper.py | 38 +++++++++++---------- test/filters/test_minimum_filter.py | 50 +++++++++++++++------------- test/filters/test_pp.py | 26 ++++++++------- test/filters/test_research_filter.py | 50 +++++++++++++++------------- 4 files changed, 86 insertions(+), 78 deletions(-) diff --git a/test/filters/test_kdewrapper.py b/test/filters/test_kdewrapper.py index c120c0ba62..2453da3c4c 100644 --- a/test/filters/test_kdewrapper.py +++ b/test/filters/test_kdewrapper.py @@ -1,25 +1,27 @@ from src.filters.kdewrapper import KernelDensityWrapper import numpy as np +import unittest +class KDE_Wrapper_Test(unittest.TestCase): -def test_KD_Wrapper(): - # Construct the filter research and test it with randomized values - # The idea is just to run it and make sure that things run to completion - # No actual output or known inputs are tested - wrapper = KernelDensityWrapper() + def test_KD_Wrapper(self): + # Construct the filter research and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + wrapper = KernelDensityWrapper() - # Set up the randomized input for testing - X = np.random.random([100, 30]) - y = np.random.randint(2, size=100) - y = y.astype(np.int32) + # Set up the randomized input for testing + X = np.random.random([100, 30]) + y = np.random.randint(2, size=100) + y = y.astype(np.int32) - # Split into training and testing data - division = int(X.shape[0] * 0.8) - X_train = X[:division] - X_test = X[division:] - y_iscar_train = y[:division] - y_iscar_test = y[division:] + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] - wrapper.fit(X_train, y_iscar_train) - y_iscar_hat = wrapper.predict(X_test) - # scores = wrapper.getAllStats() \ No newline at end of file + wrapper.fit(X_train, y_iscar_train) + y_iscar_hat = wrapper.predict(X_test) + # scores = wrapper.getAllStats() \ No newline at end of file diff --git a/test/filters/test_minimum_filter.py b/test/filters/test_minimum_filter.py index 4277ae49fd..e98330f49e 100644 --- a/test/filters/test_minimum_filter.py +++ b/test/filters/test_minimum_filter.py @@ -2,34 +2,36 @@ from src.filters.models.ml_pca import MLPCA from src.filters.models.ml_dnn import MLMLP import numpy as np +import unittest +class FilterMinimum_Test(unittest.TestCase): -def test_FilterMinimum(): - # Construct the filter minimum and test it with randomized values - # The idea is just to run it and make sure that things run to completion - # No actual output or known inputs are tested - filter = FilterMinimum() + def test_FilterMinimum(self): + # Construct the filter minimum and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + filter = FilterMinimum() - # Set up the randomized input for testing - X = np.random.random([100, 30]) - y = np.random.random([100]) - y *= 10 - y = y.astype(np.int32) + # Set up the randomized input for testing + X = np.random.random([100, 30]) + y = np.random.random([100]) + y *= 10 + y = y.astype(np.int32) - # Split into training and testing data - 
division = int(X.shape[0] * 0.8) - X_train = X[:division] - X_test = X[division:] - y_iscar_train = y[:division] - y_iscar_test = y[division:] + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] - filter.addPostModel("dnn", MLMLP()) - filter.addPreModel("pca", MLPCA()) + filter.addPostModel("dnn", MLMLP()) + filter.addPreModel("pca", MLPCA()) - filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, pre_model_name='pca', - post_model_name='dnn') - stats = filter.getAllStats() + filter.train(X_train, y_iscar_train) + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', + post_model_name='dnn') + stats = filter.getAllStats() - filter.deletePostModel("dnn") - filter.deletePreModel("pca") \ No newline at end of file + filter.deletePostModel("dnn") + filter.deletePreModel("pca") \ No newline at end of file diff --git a/test/filters/test_pp.py b/test/filters/test_pp.py index 6e8eb4fd06..c2f097ceb0 100644 --- a/test/filters/test_pp.py +++ b/test/filters/test_pp.py @@ -1,19 +1,21 @@ import numpy as np from src.filters.pp import PP +import unittest +class PP_Test(unittest.TestCase): -def test_PP(): - pp = PP() + def test_PP(self): + pp = PP() - labels = "" - x = np.random.random([2, 30, 30, 3]) + labels = "" + x = np.random.random([2, 30, 30, 3]) - y = { - 'vehicle': [['car', 'car'], ['car', 'car', 'car']], - 'speed': [[6.859 * 5, 1.5055 * 5], - [6.859 * 5, 1.5055 * 5, 0.5206 * 5]], - 'color': [None, None], - 'intersection': [None, None] - } + y = { + 'vehicle': [['car', 'car'], ['car', 'car', 'car']], + 'speed': [[6.859 * 5, 1.5055 * 5], + [6.859 * 5, 1.5055 * 5, 0.5206 * 5]], + 'color': [None, None], + 'intersection': [None, None] + } - stats = pp.train_all(x, y) + stats = pp.train_all(x, y) diff --git a/test/filters/test_research_filter.py b/test/filters/test_research_filter.py index 46e3b02fc1..59755f49ed 100644 --- a/test/filters/test_research_filter.py +++ b/test/filters/test_research_filter.py @@ -2,34 +2,36 @@ from src.filters.models.ml_pca import MLPCA from src.filters.models.ml_dnn import MLMLP import numpy as np +import unittest +class ResearchFilter_Test(unittest.TestCase): -def test_FilterResearch(): - # Construct the filter research and test it with randomized values - # The idea is just to run it and make sure that things run to completion - # No actual output or known inputs are tested - filter = FilterResearch() + def test_FilterResearch(self): + # Construct the filter research and test it with randomized values + # The idea is just to run it and make sure that things run to completion + # No actual output or known inputs are tested + filter = FilterResearch() - # Set up the randomized input for testing - X = np.random.random([100, 30]) - y = np.random.random([100]) - y *= 10 - y = y.astype(np.int32) + # Set up the randomized input for testing + X = np.random.random([100, 30]) + y = np.random.random([100]) + y *= 10 + y = y.astype(np.int32) - # Split into training and testing data - division = int(X.shape[0] * 0.8) - X_train = X[:division] - X_test = X[division:] - y_iscar_train = y[:division] - y_iscar_test = y[division:] + # Split into training and testing data + division = int(X.shape[0] * 0.8) + X_train = X[:division] + X_test = X[division:] + y_iscar_train = y[:division] + y_iscar_test = y[division:] - filter.addPostModel("dnn", MLMLP()) - filter.addPreModel("pca", MLPCA()) + filter.addPostModel("dnn", MLMLP()) + 
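Patch 38's recurring change is mechanical but worth naming: each module-level test function becomes a method on a `unittest.TestCase` subclass, so the filter tests are collected by plain `python -m unittest` as well as by pytest. The before/after shape, reduced to a toy example (names are illustrative, mirroring the `FilterMinimum_Test` naming convention):

```python
import unittest


def test_behavior_legacy():
    # Old style: a bare function that only pytest would collect.
    assert 1 + 1 == 2


class Behavior_Test(unittest.TestCase):
    # New style used across test/filters: unittest and pytest both collect it.
    def test_behavior(self):
        self.assertEqual(1 + 1, 2)


if __name__ == '__main__':
    unittest.main()
```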
filter.addPreModel("pca", MLPCA()) - filter.train(X_train, y_iscar_train) - y_iscar_hat = filter.predict(X_test, pre_model_name='pca', - post_model_name='dnn') - stats = filter.getAllStats() + filter.train(X_train, y_iscar_train) + y_iscar_hat = filter.predict(X_test, pre_model_name='pca', + post_model_name='dnn') + stats = filter.getAllStats() - filter.deletePostModel("dnn") - filter.deletePreModel("pca") \ No newline at end of file + filter.deletePostModel("dnn") + filter.deletePreModel("pca") \ No newline at end of file From fbd156b5acf60c7d2e00d83669d91e813c2a14fc Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 10:57:02 -0500 Subject: [PATCH 39/82] Split loader from 2D action classification udf --- src/loaders/action_classify_loader.py | 107 ++++++++++++++++++++++++ src/udfs/video_action_classification.py | 96 +-------------------- 2 files changed, 108 insertions(+), 95 deletions(-) create mode 100644 src/loaders/action_classify_loader.py diff --git a/src/loaders/action_classify_loader.py b/src/loaders/action_classify_loader.py new file mode 100644 index 0000000000..ac48d77883 --- /dev/null +++ b/src/loaders/action_classify_loader.py @@ -0,0 +1,107 @@ +from src.models.catalog.frame_info import FrameInfo +from src.models.catalog.properties import VideoFormat, ColorSpace +from src.models.catalog.video_info import VideoMetaInfo +from src.models.storage.frame import Frame +from src.models.storage.batch import FrameBatch +from src.loaders.video_loader import SimpleVideoLoader + +import cv2 +from typing import List, Tuple +from glob import glob +import numpy as np +import random +import os + + +class ActionClassificationLoader: + def __init__(self, path): + self.path = path + self.videoMetaList, self.labelList, self.labelMap = self.findDataNames(self.path) + + def getLabelMap(self): + return self.labelMap + + def findDataNames(self, searchDir): + """ + findDataNames enumerates all training data for the model and + returns a list of tuples where the first element is a EVA VideoMetaInfo + object and the second is a string label of the correct video classification + + Inputs: + - searchDir = path to the directory containing the video data + + Outputs: + - videoFileNameList = list of tuples where each tuple corresponds to a video + in the data set. 
The tuple contains the path to the video, + its label, and a nest tuple containing the shape + - labelList = a list of labels that correspond to the labels in labelMap + - inverseLabelMap = an inverse mapping between the string representation of the label + name and an integer representation of that label + + """ + + # Find all video files and corresponding labels in search directory + videoFileNameList = glob(searchDir+"**/*.avi", recursive=True) + random.shuffle(videoFileNameList) + + labels = [os.path.split(os.path.dirname(a))[1] for a in videoFileNameList] + + videoMetaList = [VideoMetaInfo(f,30,VideoFormat.AVI) for f in videoFileNameList] + inverseLabelMap = {k:v for (k,v) in enumerate(list(set(labels)))} + + labelMap = {v:k for (k,v) in enumerate(list(set(labels)))} + labelList = [labelMap[l] for l in labels] + + return (videoMetaList, labelList, inverseLabelMap) + + def load(self, batchSize): + + print("load") + + videoMetaIndex = 0 + while videoMetaIndex < len(self.videoMetaList): + + # Get a single batch + frames = [] + labels = np.zeros((0,51)) + while len(frames) < batchSize: + + # Load a single video + meta = self.videoMetaList[videoMetaIndex] + videoFrames, info = self.loadVideo(meta) + videoLabels = np.zeros((len(videoFrames),51)) + videoLabels[:,self.labelList[videoMetaIndex]] = 1 + videoMetaIndex += 1 + + # Skip unsupported frame types + if info != FrameInfo(240, 320, 3, ColorSpace.RGB): continue + + # Append onto frames and labels + frames += videoFrames + labels = np.append(labels, videoLabels, axis=0) + + yield FrameBatch(frames, info), labels + + def loadVideo(self, meta): + video = cv2.VideoCapture(meta.file) + video.set(cv2.CAP_PROP_POS_FRAMES, 0) + + _, frame = video.read() + frame_ind = 0 + + info = None + if frame is not None: + (height, width, channels) = frame.shape + info = FrameInfo(height, width, channels, ColorSpace.RGB) + + frames = [] + while frame is not None: + # Save frame + eva_frame = Frame(frame_ind, frame, info) + frames.append(eva_frame) + + # Read next frame + _, frame = video.read() + frame_ind += 1 + + return (frames, info) \ No newline at end of file diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index c94e333611..fbb67af889 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -5,7 +5,7 @@ from src.models.storage.batch import FrameBatch from src.models.inference.classifier_prediction import Prediction -from src.loaders.video_loader import SimpleVideoLoader +from src.loaders.action_classify_loader import ActionClassificationLoader from src.udfs.abstract_udfs import AbstractClassifierUDF @@ -20,100 +20,6 @@ import random import os - -class ActionClassificationLoader: - def __init__(self, path): - self.path = path - self.videoMetaList, self.labelList, self.labelMap = self.findDataNames(self.path) - - def getLabelMap(self): - return self.labelMap - - def findDataNames(self, searchDir): - """ - findDataNames enumerates all training data for the model and - returns a list of tuples where the first element is a EVA VideoMetaInfo - object and the second is a string label of the correct video classification - - Inputs: - - searchDir = path to the directory containing the video data - - Outputs: - - videoFileNameList = list of tuples where each tuple corresponds to a video - in the data set. 
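The `load()` generator above accumulates decoded frames until a batch is full, pairs them with one-hot labels, and yields the batch. Below is a simplified, runnable sketch of that accumulate-and-yield pattern, with an explicit end-of-list guard added; the 51-way one-hot labels mirror the HMDB setup in the patch, while the tiny dummy frames are purely illustrative:

```python
import numpy as np

NUM_CLASSES = 51


def load_batches(videos, batch_size):
    """videos: list of (frame_count, label_index) pairs standing in for clips."""
    index = 0
    while index < len(videos):
        frames = []
        labels = np.zeros((0, NUM_CLASSES))
        # Keep consuming whole videos until the batch holds batch_size frames.
        while len(frames) < batch_size and index < len(videos):
            frame_count, label = videos[index]
            index += 1
            clip = [np.zeros((8, 8, 3)) for _ in range(frame_count)]  # dummy frames
            one_hot = np.zeros((frame_count, NUM_CLASSES))
            one_hot[:, label] = 1
            frames += clip
            labels = np.append(labels, one_hot, axis=0)
        yield np.stack(frames), labels


for x, y in load_batches([(3, 0), (2, 7), (4, 50)], batch_size=4):
    print(x.shape, y.shape)
```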
The tuple contains the path to the video, - its label, and a nest tuple containing the shape - - labelList = a list of labels that correspond to the labels in labelMap - - inverseLabelMap = an inverse mapping between the string representation of the label - name and an integer representation of that label - - """ - - # Find all video files and corresponding labels in search directory - videoFileNameList = glob(searchDir+"**/*.avi", recursive=True) - random.shuffle(videoFileNameList) - - labels = [os.path.split(os.path.dirname(a))[1] for a in videoFileNameList] - - videoMetaList = [VideoMetaInfo(f,30,VideoFormat.AVI) for f in videoFileNameList] - inverseLabelMap = {k:v for (k,v) in enumerate(list(set(labels)))} - - labelMap = {v:k for (k,v) in enumerate(list(set(labels)))} - labelList = [labelMap[l] for l in labels] - - return (videoMetaList, labelList, inverseLabelMap) - - def load(self, batchSize): - - print("load") - - videoMetaIndex = 0 - while videoMetaIndex < len(self.videoMetaList): - - # Get a single batch - frames = [] - labels = np.zeros((0,51)) - while len(frames) < batchSize: - - # Load a single video - meta = self.videoMetaList[videoMetaIndex] - videoFrames, info = self.loadVideo(meta) - videoLabels = np.zeros((len(videoFrames),51)) - videoLabels[:,self.labelList[videoMetaIndex]] = 1 - videoMetaIndex += 1 - - # Skip unsupported frame types - if info != FrameInfo(240, 320, 3, ColorSpace.RGB): continue - - # Append onto frames and labels - frames += videoFrames - labels = np.append(labels, videoLabels, axis=0) - - yield FrameBatch(frames, info), labels - - def loadVideo(self, meta): - video = cv2.VideoCapture(meta.file) - video.set(cv2.CAP_PROP_POS_FRAMES, 0) - - _, frame = video.read() - frame_ind = 0 - - info = None - if frame is not None: - (height, width, channels) = frame.shape - info = FrameInfo(height, width, channels, ColorSpace.RGB) - - frames = [] - while frame is not None: - # Save frame - eva_frame = Frame(frame_ind, frame, info) - frames.append(eva_frame) - - # Read next frame - _, frame = video.read() - frame_ind += 1 - - return (frames, info) - class VideoToFrameClassifier(AbstractClassifierUDF): def __init__(self): From 220715e22049b99bbf12762d373d2fe6952fa381 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 11:17:09 -0500 Subject: [PATCH 40/82] Updated loader to subclass the abstract loader --- src/loaders/action_classify_loader.py | 24 ++++++++++++++++++------ src/udfs/video_action_classification.py | 8 ++++---- 2 files changed, 22 insertions(+), 10 deletions(-) diff --git a/src/loaders/action_classify_loader.py b/src/loaders/action_classify_loader.py index ac48d77883..20f6f91142 100644 --- a/src/loaders/action_classify_loader.py +++ b/src/loaders/action_classify_loader.py @@ -4,6 +4,7 @@ from src.models.storage.frame import Frame from src.models.storage.batch import FrameBatch from src.loaders.video_loader import SimpleVideoLoader +from src.loaders.abstract_loader import AbstractLoader import cv2 from typing import List, Tuple @@ -13,10 +14,18 @@ import os -class ActionClassificationLoader: - def __init__(self, path): - self.path = path - self.videoMetaList, self.labelList, self.labelMap = self.findDataNames(self.path) +class ActionClassificationLoader(AbstractLoader): + def __init__(self, batchSize): + self.batchSize = batchSize + + def load_images(self, dir: str): + return None + + def load_labels(self, dir: str): + return None + + def load_boxes(self, dir: str): + return None def getLabelMap(self): return self.labelMap @@ -54,9 +63,12 @@ def 
findDataNames(self, searchDir): return (videoMetaList, labelList, inverseLabelMap) - def load(self, batchSize): + def load_video(self, searchDir): print("load") + + self.path = searchDir + self.videoMetaList, self.labelList, self.labelMap = self.findDataNames(self.path) videoMetaIndex = 0 while videoMetaIndex < len(self.videoMetaList): @@ -64,7 +76,7 @@ def load(self, batchSize): # Get a single batch frames = [] labels = np.zeros((0,51)) - while len(frames) < batchSize: + while len(frames) < self.batchSize: # Load a single video meta = self.videoMetaList[videoMetaIndex] diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index fbb67af889..21cdb75a92 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -42,13 +42,13 @@ def trainModel(self): - labelList = list of labels derived from the labelMap - n = integer value for how many videos to act on at a time """ - videoLoader = ActionClassificationLoader("./data/hmdb/") - self.labelMap = videoLoader.getLabelMap() + videoLoader = ActionClassificationLoader(1000) + + for batch,labels in videoLoader.load_video("./data/hmdb/"): + self.labelMap = videoLoader.getLabelMap() - for batch,labels in videoLoader.load(1000): # Get the frames as a numpy array frames = batch.frames_as_numpy_array() - print(frames.shape) print(labels.shape) From 211b9df8d62b5d38bd026642b361626a46b998b5 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 11:38:22 -0500 Subject: [PATCH 41/82] Added a unit test for 2D action classifier model --- src/udfs/video_action_classification.py | 5 +++-- test/udfs/vid_to_frame_classifier_test.py | 16 ++++++++++++---- 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index 21cdb75a92..9e80095df5 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -20,6 +20,7 @@ import random import os + class VideoToFrameClassifier(AbstractClassifierUDF): def __init__(self): @@ -112,5 +113,5 @@ def classify(self, batch: FrameBatch) -> List[Prediction]: List[Prediction]: The predictions made by the classifier """ - pred = model.predict(batch.frames_as_numpy_array()) - return [self.labelMap[l] for l in pred] + pred = self.model.predict(batch.frames_as_numpy_array()) + return [self.labels()[np.argmax(l)] for l in pred] diff --git a/test/udfs/vid_to_frame_classifier_test.py b/test/udfs/vid_to_frame_classifier_test.py index bceb252ea2..ac9a1210ba 100644 --- a/test/udfs/vid_to_frame_classifier_test.py +++ b/test/udfs/vid_to_frame_classifier_test.py @@ -1,9 +1,17 @@ from src.udfs import video_action_classification +from src.models.storage.batch import FrameBatch +from src.models.storage.frame import Frame +import numpy as np +import unittest -def test_VidToFrameClassifier(): - # model = video_action_classification.VideoToFrameClassifier() - # assert model != None - pass +class VidToFrameClassifier_Test(unittest.TestCase): + + def test_VidToFrameClassifier(self): + model = video_action_classification.VideoToFrameClassifier() + assert model != None + + X = np.random.random([240, 320, 3]) + model.classify(FrameBatch([Frame(0,X,None)],None)) From 67ee20cd3ebf80d182e3269250ae355282535fc6 Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 11:51:47 -0500 Subject: [PATCH 42/82] Fixed formatting to pass requirements --- src/loaders/action_classify_loader.py | 46 +++++++++++++---------- 
test/filters/test_kdewrapper.py | 6 ++- test/filters/test_minimum_filter.py | 3 +- test/filters/test_pp.py | 1 + test/filters/test_research_filter.py | 3 +- test/udfs/vid_to_frame_classifier_test.py | 6 +-- 6 files changed, 38 insertions(+), 27 deletions(-) diff --git a/src/loaders/action_classify_loader.py b/src/loaders/action_classify_loader.py index 20f6f91142..cdb0b1ba09 100644 --- a/src/loaders/action_classify_loader.py +++ b/src/loaders/action_classify_loader.py @@ -1,11 +1,12 @@ from src.models.catalog.frame_info import FrameInfo from src.models.catalog.properties import VideoFormat, ColorSpace from src.models.catalog.video_info import VideoMetaInfo -from src.models.storage.frame import Frame -from src.models.storage.batch import FrameBatch +from src.models.storage.frame import Frame +from src.models.storage.batch import FrameBatch from src.loaders.video_loader import SimpleVideoLoader from src.loaders.abstract_loader import AbstractLoader +from os.path import split, dirname import cv2 from typing import List, Tuple from glob import glob @@ -34,31 +35,35 @@ def findDataNames(self, searchDir): """ findDataNames enumerates all training data for the model and returns a list of tuples where the first element is a EVA VideoMetaInfo - object and the second is a string label of the correct video classification + object and the second is a string label of + the correct video classification Inputs: - searchDir = path to the directory containing the video data Outputs: - - videoFileNameList = list of tuples where each tuple corresponds to a video - in the data set. The tuple contains the path to the video, + - videoFileNameList = list of tuples where each tuple corresponds + to a video in the data set. + The tuple contains the path to the video, its label, and a nest tuple containing the shape - - labelList = a list of labels that correspond to the labels in labelMap - - inverseLabelMap = an inverse mapping between the string representation of the label - name and an integer representation of that label - + - labelList = a list of labels that correspond to labels in labelMap + - inverseLabelMap = an inverse mapping between the string + representation of the label name and an + integer representation of that label """ # Find all video files and corresponding labels in search directory - videoFileNameList = glob(searchDir+"**/*.avi", recursive=True) + videoFileNameList = glob(searchDir + "**/*.avi", recursive=True) random.shuffle(videoFileNameList) - labels = [os.path.split(os.path.dirname(a))[1] for a in videoFileNameList] + labels = [split(dirname(a))[1] for a in videoFileNameList] + + videoMetaList = [VideoMetaInfo(f, 30, VideoFormat.AVI) + for f in videoFileNameList] - videoMetaList = [VideoMetaInfo(f,30,VideoFormat.AVI) for f in videoFileNameList] - inverseLabelMap = {k:v for (k,v) in enumerate(list(set(labels)))} + inverseLabelMap = {k: v for (k, v) in enumerate(list(set(labels)))} - labelMap = {v:k for (k,v) in enumerate(list(set(labels)))} + labelMap = {v: k for (k, v) in enumerate(list(set(labels)))} labelList = [labelMap[l] for l in labels] return (videoMetaList, labelList, inverseLabelMap) @@ -68,25 +73,28 @@ def load_video(self, searchDir): print("load") self.path = searchDir - self.videoMetaList, self.labelList, self.labelMap = self.findDataNames(self.path) + (self.videoMetaList, + self.labelList, + self.labelMap) = self.findDataNames(self.path) videoMetaIndex = 0 while videoMetaIndex < len(self.videoMetaList): # Get a single batch frames = [] - labels = np.zeros((0,51)) + labels = 
np.zeros((0, 51)) while len(frames) < self.batchSize: # Load a single video meta = self.videoMetaList[videoMetaIndex] videoFrames, info = self.loadVideo(meta) - videoLabels = np.zeros((len(videoFrames),51)) - videoLabels[:,self.labelList[videoMetaIndex]] = 1 + videoLabels = np.zeros((len(videoFrames), 51)) + videoLabels[:, self.labelList[videoMetaIndex]] = 1 videoMetaIndex += 1 # Skip unsupported frame types - if info != FrameInfo(240, 320, 3, ColorSpace.RGB): continue + if info != FrameInfo(240, 320, 3, ColorSpace.RGB): + continue # Append onto frames and labels frames += videoFrames diff --git a/test/filters/test_kdewrapper.py b/test/filters/test_kdewrapper.py index 2453da3c4c..dd8261155b 100644 --- a/test/filters/test_kdewrapper.py +++ b/test/filters/test_kdewrapper.py @@ -2,11 +2,13 @@ import numpy as np import unittest + class KDE_Wrapper_Test(unittest.TestCase): def test_KD_Wrapper(self): - # Construct the filter research and test it with randomized values - # The idea is just to run it and make sure that things run to completion + # Construct the filter research and test it with + # randomized values -- idea is just to run it + # and make sure that things run to completion # No actual output or known inputs are tested wrapper = KernelDensityWrapper() diff --git a/test/filters/test_minimum_filter.py b/test/filters/test_minimum_filter.py index e98330f49e..798d7e00f7 100644 --- a/test/filters/test_minimum_filter.py +++ b/test/filters/test_minimum_filter.py @@ -4,11 +4,12 @@ import numpy as np import unittest + class FilterMinimum_Test(unittest.TestCase): def test_FilterMinimum(self): # Construct the filter minimum and test it with randomized values - # The idea is just to run it and make sure that things run to completion + # Idea is just to run it and make sure that things run to completion # No actual output or known inputs are tested filter = FilterMinimum() diff --git a/test/filters/test_pp.py b/test/filters/test_pp.py index c2f097ceb0..4c6c35eddf 100644 --- a/test/filters/test_pp.py +++ b/test/filters/test_pp.py @@ -2,6 +2,7 @@ from src.filters.pp import PP import unittest + class PP_Test(unittest.TestCase): def test_PP(self): diff --git a/test/filters/test_research_filter.py b/test/filters/test_research_filter.py index 59755f49ed..c215cf6bbf 100644 --- a/test/filters/test_research_filter.py +++ b/test/filters/test_research_filter.py @@ -4,11 +4,12 @@ import numpy as np import unittest + class ResearchFilter_Test(unittest.TestCase): def test_FilterResearch(self): # Construct the filter research and test it with randomized values - # The idea is just to run it and make sure that things run to completion + # Idea is just to run it and make sure that things run to completion # No actual output or known inputs are tested filter = FilterResearch() diff --git a/test/udfs/vid_to_frame_classifier_test.py b/test/udfs/vid_to_frame_classifier_test.py index ac9a1210ba..12ccb5bc34 100644 --- a/test/udfs/vid_to_frame_classifier_test.py +++ b/test/udfs/vid_to_frame_classifier_test.py @@ -9,9 +9,7 @@ class VidToFrameClassifier_Test(unittest.TestCase): def test_VidToFrameClassifier(self): model = video_action_classification.VideoToFrameClassifier() - assert model != None + assert model is not None X = np.random.random([240, 320, 3]) - model.classify(FrameBatch([Frame(0,X,None)],None)) - - + model.classify(FrameBatch([Frame(0, X, None)], None)) \ No newline at end of file From 5d12dc6593f9e9147b027526d22738615c80422f Mon Sep 17 00:00:00 2001 From: Paula Gluss Date: Sun, 8 Dec 2019 12:32:20 -0500 
Subject: [PATCH 43/82] Fixed some formatting errors --- src/udfs/video_action_classification.py | 67 ++++++++++++++----------- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/src/udfs/video_action_classification.py b/src/udfs/video_action_classification.py index 9e80095df5..8136ebb5e9 100644 --- a/src/udfs/video_action_classification.py +++ b/src/udfs/video_action_classification.py @@ -1,14 +1,13 @@ from src.models.catalog.frame_info import FrameInfo from src.models.catalog.properties import VideoFormat, ColorSpace from src.models.catalog.video_info import VideoMetaInfo -from src.models.storage.frame import Frame -from src.models.storage.batch import FrameBatch +from src.models.storage.frame import Frame +from src.models.storage.batch import FrameBatch from src.models.inference.classifier_prediction import Prediction from src.loaders.action_classify_loader import ActionClassificationLoader from src.udfs.abstract_udfs import AbstractClassifierUDF - from tensorflow.python.keras.models import Sequential from tensorflow.python.keras.layers import Dense, Conv2D, Flatten @@ -30,14 +29,14 @@ def __init__(self): # Train the model using shuffled data self.trainModel() - def trainModel(self): """ trainModel trains the built model using chunks of data of size n videos Inputs: - model = model object to be trained - - videoMetaList = list of tuples where the first element is a EVA VideoMetaInfo + - videoMetaList = list of tuples where the first element is + a EVA VideoMetaInfo object and the second is a string label of the correct video classification - labelList = list of labels derived from the labelMap @@ -45,7 +44,7 @@ def trainModel(self): """ videoLoader = ActionClassificationLoader(1000) - for batch,labels in videoLoader.load_video("./data/hmdb/"): + for batch, labels in videoLoader.load_video("./data/hmdb/"): self.labelMap = videoLoader.getLabelMap() # Get the frames as a numpy array @@ -54,34 +53,39 @@ def trainModel(self): print(labels.shape) # Split x and y into training and validation sets - xTrain = frames[0:int(0.8*frames.shape[0])] - yTrain = labels[0:int(0.8*labels.shape[0])] - xTest = frames[int(0.8*frames.shape[0]):] - yTest = labels[int(0.8*labels.shape[0]):] + xTrain = frames[0:int(0.8 * frames.shape[0])] + yTrain = labels[0:int(0.8 * labels.shape[0])] + xTest = frames[int(0.8 * frames.shape[0]):] + yTest = labels[int(0.8 * labels.shape[0]):] - # Train the model using cross-validation (so we don't need to explicitly do CV outside of training) - self.model.fit(xTrain, yTrain, validation_data = (xTest, yTest), epochs = 2) + # Train the model using cross-validation + # (so we don't need to explicitly do CV outside of training) + self.model.fit(xTrain, yTrain, + validation_data=(xTest, yTest), epochs=2) self.model.save("./data/hmdb/2d_action_classifier.h5") - def buildModel(self): """ - buildModel sets up a convolutional 2D network using a reLu activation function + buildModel sets up a convolutional 2D network + using a reLu activation function Outputs: - - model = model object to be used later for training and classification + - model = model obj to be used later for training and classification """ - # We need to incrementally train the model so we'll set it up before preparing the data + # We must incrementally train the model so + # we'll set it up before preparing the data model = Sequential() # Add layers to the model - model.add(Conv2D(64, kernel_size = 3, activation = "relu", input_shape=(240, 320, 3))) - model.add(Conv2D(32, kernel_size = 3, activation = "relu")) 
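Patch 43 reflows `buildModel()`/`trainModel()` without changing the network. For reference, a runnable condensation of the model it assembles: two ReLU Conv2D layers, Flatten, and a 51-way softmax, compiled with categorical cross-entropy. This sketch uses the public `tensorflow.keras` import path (the module itself imports from `tensorflow.python.keras`) and a reduced 32x32 input so the smoke test stays light; the patch trains on (240, 320, 3) frames:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

model = Sequential([
    Conv2D(64, kernel_size=3, activation='relu', input_shape=(32, 32, 3)),
    Conv2D(32, kernel_size=3, activation='relu'),
    Flatten(),
    Dense(51, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Smoke test on random data: four fake frames, all labeled class 0.
x = np.random.random((4, 32, 32, 3))
y = np.zeros((4, 51))
y[:, 0] = 1
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x).shape)  # (4, 51)
```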
From f8becc4f11a8be6996138919139a14568680835a Mon Sep 17 00:00:00 2001
From: Sahith Dambekodi
Date: Mon, 9 Dec 2019 15:14:57 -0500
Subject: [PATCH 44/82] Update README.md

---
 README.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 124c0c6309..077e73945e 100644
--- a/README.md
+++ b/README.md
@@ -28,17 +28,22 @@ git config core.hooksPath .githooks
 ### Demos
 We have demos for the following components:
-1. Eva analytics (pipeline for loading the dataset, training the filters, and outputting the optimal plan)
+1. Command Line Interface Demo: type in queries to get results as frames and file names; type `exit` to quit the interface.
+```commandline
+cd
+python demo.py
+```
+2. Eva analytics (pipeline for loading the dataset, training the filters, and outputting the optimal plan)
 ```commandline
 cd
 python pipeline.py
 ```
-2. Eva Query Optimizer (will show converted queries for the original queries)
+3. Eva Query Optimizer (will show converted queries for the original queries)
 ```commandline
 cd
 python query_optimizer/query_optimizer.py
 ```
-3. Eva Loader (Loads UA-DETRAC dataset)
+4. 
Eva Loader (Loads UA-DETRAC dataset) ```commandline cd python loaders/load.py From 06263b7a3e6523a5cba5659bf53115910dd0818d Mon Sep 17 00:00:00 2001 From: jarulraj Date: Mon, 20 Jan 2020 11:10:44 -0500 Subject: [PATCH 45/82] Refactoring parser --- src/parser/eva_parser.py | 16 ++++++++++++---- ...arser_visitor.py => evaql_parser_visitor.py} | 15 ++++++++++----- test/parser/test_parser.py | 17 ++++++++++++++--- test/parser/test_parser_visitor.py | 16 ++++++++-------- 4 files changed, 44 insertions(+), 20 deletions(-) rename src/parser/{eva_ql_parser_visitor.py => evaql_parser_visitor.py} (98%) diff --git a/src/parser/eva_parser.py b/src/parser/eva_parser.py index 32d6f71972..59f9286647 100644 --- a/src/parser/eva_parser.py +++ b/src/parser/eva_parser.py @@ -17,16 +17,24 @@ from src.parser.evaql.evaql_parser import evaql_parser from src.parser.evaql.evaql_lexer import evaql_lexer -from src.parser.eva_ql_parser_visitor import EvaParserVisitor +from src.parser.evaql_parser_visitor import EvaQLParserVisitor -class EvaFrameQLParser(): + +class EvaQLParser(object): """ - Parser for eva; based on frameQL grammar + Parser for eva; based on EVAQL grammar """ + _instance = None + _visitor = None + + def __new__(cls): + if cls._instance is None: + cls._instance = super(EvaQLParser, cls).__new__(cls) + return cls._instance def __init__(self): - self._visitor = EvaParserVisitor() + self._visitor = EvaQLParserVisitor() def parse(self, query_string: str) -> list: lexer = evaql_lexer(InputStream(query_string)) diff --git a/src/parser/eva_ql_parser_visitor.py b/src/parser/evaql_parser_visitor.py similarity index 98% rename from src/parser/eva_ql_parser_visitor.py rename to src/parser/evaql_parser_visitor.py index 2f0ae4ce2a..c4604e9628 100644 --- a/src/parser/eva_ql_parser_visitor.py +++ b/src/parser/evaql_parser_visitor.py @@ -13,21 +13,26 @@ # See the License for the specific language governing permissions and # limitations under the License. +import warnings + from antlr4 import TerminalNode + from src.expression.abstract_expression import (AbstractExpression, ExpressionType) from src.expression.comparison_expression import ComparisonExpression from src.expression.constant_value_expression import ConstantValueExpression from src.expression.logical_expression import LogicalExpression from src.expression.tuple_value_expression import TupleValueExpression + from src.parser.select_statement import SelectStatement +from src.parser.table_ref import TableRef, TableInfo + from src.parser.evaql.evaql_parser import evaql_parser from src.parser.evaql.evaql_parserVisitor import evaql_parserVisitor -from src.parser.table_ref import TableRef, TableInfo -import warnings -class EvaParserVisitor(evaql_parserVisitor): +class EvaQLParserVisitor(evaql_parserVisitor): + # Visit a parse tree produced by evaql_parser#root. def visitRoot(self, ctx: evaql_parser.RootContext): for child in ctx.children: @@ -46,8 +51,8 @@ def visitSqlStatements(self, ctx: evaql_parser.SqlStatementsContext): # Visit a parse tree produced by evaql_parser#simpleSelect. def visitSimpleSelect(self, ctx: evaql_parser.SimpleSelectContext): - select_stm = self.visitChildren(ctx) - return select_stm + select_stmt = self.visitChildren(ctx) + return select_stmt # Visit a parse tree produced by evaql_parser#tableSources. 
def visitTableSources(self, ctx: evaql_parser.TableSourcesContext): diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py index 2192d4d109..05c4bc975c 100644 --- a/test/parser/test_parser.py +++ b/test/parser/test_parser.py @@ -15,7 +15,7 @@ import unittest -from src.parser.eva_parser import EvaFrameQLParser +from src.parser.eva_parser import EvaQLParser from src.parser.eva_statement import EvaStatement from src.parser.eva_statement import StatementType @@ -28,7 +28,12 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_eva_parser(self): - parser = EvaFrameQLParser() + parser = EvaQLParser() + print(parser) + + parser = EvaQLParser() + print(parser) + single_queries = [] single_queries.append("SELECT CLASS FROM TAIPAI;") single_queries.append("SELECT CLASS FROM TAIPAI WHERE CLASS = 'VAN';") @@ -38,8 +43,14 @@ def test_eva_parser(self): WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;") single_queries.append("SELECT CLASS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;") + + #single_queries.append("CREATE TABLE Persons ( PersonID INTEGER);") + for query in single_queries: eva_statement_list = parser.parse(query) + + print(eva_statement_list[0]) + self.assertIsInstance(eva_statement_list, list) self.assertEqual(len(eva_statement_list), 1) self.assertIsInstance( @@ -61,7 +72,7 @@ def test_eva_parser(self): eva_statement_list[1], EvaStatement) def test_select_parser(self): - parser = EvaFrameQLParser() + parser = EvaQLParser() select_query = "SELECT CLASS, REDNESS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;" eva_statement_list = parser.parse(select_query) diff --git a/test/parser/test_parser_visitor.py b/test/parser/test_parser_visitor.py index 3864a4a079..dc0ba8acb3 100644 --- a/test/parser/test_parser_visitor.py +++ b/test/parser/test_parser_visitor.py @@ -18,7 +18,7 @@ from unittest import mock from unittest.mock import MagicMock, call -from src.parser.eva_ql_parser_visitor import EvaParserVisitor +from src.parser.evaql_parser_visitor import EvaQLParserVisitor from src.parser.evaql.evaql_parser import evaql_parser from src.expression.abstract_expression import ExpressionType @@ -28,12 +28,12 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_should_query_specification_visitor(self): - EvaParserVisitor.visit = MagicMock() - mock_visit = EvaParserVisitor.visit + EvaQLParserVisitor.visit = MagicMock() + mock_visit = EvaQLParserVisitor.visit mock_visit.side_effect = ["target", {"from": ["from"], "where": "where"}] - visitor = EvaParserVisitor() + visitor = EvaQLParserVisitor() ctx = MagicMock() child_1 = MagicMock() child_1.getRuleIndex.return_value = evaql_parser.RULE_selectElements @@ -50,7 +50,7 @@ def test_should_query_specification_visitor(self): self.assertEqual(expected.where_clause, "where") self.assertEqual(expected.target_list, "target") - @mock.patch.object(EvaParserVisitor, 'visit') + @mock.patch.object(EvaQLParserVisitor, 'visit') def test_from_clause_visitor(self, mock_visit): mock_visit.side_effect = ["from", "where"] @@ -60,7 +60,7 @@ def test_from_clause_visitor(self, mock_visit): whereExpr = MagicMock() ctx.whereExpr = whereExpr - visitor = EvaParserVisitor() + visitor = EvaQLParserVisitor() expected = visitor.visitFromClause(ctx) mock_visit.assert_has_calls([call(tableSources), call(whereExpr)]) @@ -69,7 +69,7 @@ def test_from_clause_visitor(self, mock_visit): def test_logical_operator(self): ctx = MagicMock() - visitor = 
EvaParserVisitor() + visitor = EvaQLParserVisitor() self.assertEqual( visitor.visitLogicalOperator(ctx), @@ -87,7 +87,7 @@ def test_logical_operator(self): def test_comparison_operator(self): ctx = MagicMock() - visitor = EvaParserVisitor() + visitor = EvaQLParserVisitor() self.assertEqual( visitor.visitComparisonOperator(ctx), From 639dffb2f903744a170ddd3576e0a391a82b4647 Mon Sep 17 00:00:00 2001 From: jarulraj Date: Mon, 20 Jan 2020 14:28:42 -0500 Subject: [PATCH 46/82] Adding create statement --- src/parser/create_statement.py | 45 +++++++++++++++++++++ src/parser/evaql/evaql_lexer.g4 | 2 +- src/parser/evaql/evaql_parser.g4 | 8 ++-- src/parser/evaql_parser_visitor.py | 64 ++++++++++++++++++++++++++++-- src/parser/select_statement.py | 2 + src/parser/table_ref.py | 10 +++++ test/parser/test_parser.py | 30 +++++++++----- test/parser/test_parser_visitor.py | 16 ++++---- 8 files changed, 152 insertions(+), 25 deletions(-) create mode 100644 src/parser/create_statement.py diff --git a/src/parser/create_statement.py b/src/parser/create_statement.py new file mode 100644 index 0000000000..1174e2c589 --- /dev/null +++ b/src/parser/create_statement.py @@ -0,0 +1,45 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from src.parser.eva_statement import EvaStatement + +from src.parser.types import StatementType +from src.expression.abstract_expression import AbstractExpression +from src.parser.table_ref import TableRef +from typing import List + + +class CreateTableStatement(EvaStatement): + """ + Create Table Statement constructed after parsing the input query + + Attributes + ---------- + TableRef: + table reference in the create table statement + ColumnList: + list of columns + **kwargs : to support other functionality, Orderby, Distinct, Groupby. + """ + + def __init__(self, + table_name: TableRef = None, + column_list: List[AbstractExpression] = None): + super().__init__(StatementType.SELECT) + self._table_name = table_name + + def __str__(self) -> str: + print_str = "CREATE TABLE {} ".format(self._table_name) + return print_str diff --git a/src/parser/evaql/evaql_lexer.g4 b/src/parser/evaql/evaql_lexer.g4 index e2fdf978a8..881f1c8314 100644 --- a/src/parser/evaql/evaql_lexer.g4 +++ b/src/parser/evaql/evaql_lexer.g4 @@ -227,7 +227,7 @@ GLOBAL_ID: '@' '@' // Fragments for Literal primitives fragment EXPONENT_NUM_PART: 'E' '-'? DEC_DIGIT+; -fragment ID_LITERAL: [A-Z_$0-9]*?[A-Z_$]+?[A-Z_$0-9]*; +fragment ID_LITERAL: [A-Za-z_$0-9]*?[A-Za-z_$]+?[A-Za-z_$0-9]*; fragment DQUOTA_STRING: '"' ( '\\'. | '""' | ~('"'| '\\') )* '"'; fragment SQUOTA_STRING: '\'' ('\\'. | '\'\'' | ~('\'' | '\\'))* '\''; fragment BQUOTA_STRING: '`' ( '\\'. | '``' | ~('`'|'\\'))* '`'; diff --git a/src/parser/evaql/evaql_parser.g4 b/src/parser/evaql/evaql_parser.g4 index 7397fba75f..e3ecec90c7 100644 --- a/src/parser/evaql/evaql_parser.g4 +++ b/src/parser/evaql/evaql_parser.g4 @@ -54,8 +54,9 @@ createIndex ; createTable - : CREATE TABLE ifNotExists? 
- tableName createDefinitions #columnCreateTable + : CREATE TABLE + ifNotExists? + tableName createDefinitions #columnCreateTable ; // details @@ -198,7 +199,8 @@ queryExpression //frameQL statement added querySpecification : SELECT selectElements - fromClause? orderByClause? limitClause? errorBoundsExpression? confidenceLevelExpression? + fromClause? orderByClause? limitClause? + errorBoundsExpression? confidenceLevelExpression? ; // details diff --git a/src/parser/evaql_parser_visitor.py b/src/parser/evaql_parser_visitor.py index c4604e9628..d726d09035 100644 --- a/src/parser/evaql_parser_visitor.py +++ b/src/parser/evaql_parser_visitor.py @@ -25,6 +25,8 @@ from src.expression.tuple_value_expression import TupleValueExpression from src.parser.select_statement import SelectStatement +from src.parser.create_statement import CreateTableStatement + from src.parser.table_ref import TableRef, TableInfo from src.parser.evaql.evaql_parser import evaql_parser @@ -49,11 +51,61 @@ def visitSqlStatements(self, ctx: evaql_parser.SqlStatementsContext): return eva_statements + ################################################################## + # STATEMENTS + ################################################################## + + def visitDdlStatement(self, ctx: evaql_parser.DdlStatementContext): + ddl_statement = self.visitChildren(ctx) + return ddl_statement + + def visitDmlStatement(self, ctx: evaql_parser.DdlStatementContext): + dml_statement = self.visitChildren(ctx) + return dml_statement + + ################################################################## + # CREATE STATEMENTS + ################################################################## + + def visitColumnCreateTable( + self, ctx: evaql_parser.ColumnCreateTableContext): + + table_ref = None + # first two children will be CREATE TABLE terminal token + for child in ctx.children[2:]: + try: + rule_idx = child.getRuleIndex() + + if rule_idx == evaql_parser.RULE_tableName: + table_ref = self.visit(ctx.tableName()) + + elif rule_idx == evaql_parser.RULE_ifNotExists: + pass + + elif rule_idx == evaql_parser.RULE_createDefinitions: + pass + + except BaseException: + print("Exception") + # stop parsing something bad happened + return None + + create_stmt = CreateTableStatement(table_ref) + return create_stmt + + ################################################################## + # SELECT STATEMENT + ################################################################## + # Visit a parse tree produced by evaql_parser#simpleSelect. def visitSimpleSelect(self, ctx: evaql_parser.SimpleSelectContext): select_stmt = self.visitChildren(ctx) return select_stmt + ################################################################## + # TABLE SOURCES + ################################################################## + # Visit a parse tree produced by evaql_parser#tableSources. def visitTableSources(self, ctx: evaql_parser.TableSourcesContext): table_list = [] @@ -80,13 +132,17 @@ def visitQuerySpecification( clause = self.visit(child) from_clause = clause.get('from', None) where_clause = clause.get('where', None) + except BaseException: # stop parsing something bad happened return None + # we don't support multiple table sources if from_clause is not None: from_clause = from_clause[0] + select_stmt = SelectStatement(target_list, from_clause, where_clause) + return select_stmt # Visit a parse tree produced by evaql_parser#selectElements. 
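
The visitor methods above turn ANTLR parse-tree nodes into statement objects (CreateTableStatement, SelectStatement). A hedged usage sketch, not part of the patch, of the behavior they enable; the query string is illustrative and `EvaQLParser` is the singleton introduced in the previous commit:

```python
from src.parser.eva_parser import EvaQLParser

parser = EvaQLParser()
statements = parser.parse(
    "CREATE TABLE IF NOT EXISTS Persons (Frame_ID INTEGER);")

# Expected: a single CreateTableStatement built by visitColumnCreateTable.
create_stmt = statements[0]
print(create_stmt)
```
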
@@ -113,10 +169,8 @@ def visitFromClause(self, ctx: evaql_parser.FromClauseContext): # Visit a parse tree produced by evaql_parser#tableName. def visitTableName(self, ctx: evaql_parser.TableNameContext): + table_name = self.visit(ctx.fullId()) - # assuming we get just table name - # todo - # handle database name and schema names if table_name is not None: table_info = TableInfo(table_name=table_name) return TableRef(table_info) @@ -138,6 +192,10 @@ def visitSimpleId(self, ctx: evaql_parser.SimpleIdContext): return ctx.getText() # return self.visitChildren(ctx) + ################################################################## + # EXPRESSIONS + ################################################################## + # Visit a parse tree produced by evaql_parser#stringLiteral. def visitStringLiteral(self, ctx: evaql_parser.StringLiteralContext): if ctx.STRING_LITERAL() is not None: diff --git a/src/parser/select_statement.py b/src/parser/select_statement.py index 9c27f97d11..2dec631380 100644 --- a/src/parser/select_statement.py +++ b/src/parser/select_statement.py @@ -12,7 +12,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + from src.parser.eva_statement import EvaStatement + from src.parser.types import StatementType from src.expression.abstract_expression import AbstractExpression from src.parser.table_ref import TableRef diff --git a/src/parser/table_ref.py b/src/parser/table_ref.py index b2d3589d9d..d4a9cb55b8 100644 --- a/src/parser/table_ref.py +++ b/src/parser/table_ref.py @@ -36,6 +36,11 @@ def schema_name(self): def database_name(self): return self._database_name + def __str__(self): + table_info_str = "TABLE INFO:: (" + self._table_name + ")\n" + + return table_info_str + class TableRef: """ @@ -50,3 +55,8 @@ def __init__(self, table_info: TableInfo): @property def table_info(self): return self._table_info + + def __str__(self): + table_ref_str = "TABLE REF:: (" + str(self._table_info) + ")\n" + + return table_ref_str diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py index 05c4bc975c..01778d710f 100644 --- a/test/parser/test_parser.py +++ b/test/parser/test_parser.py @@ -17,8 +17,9 @@ from src.parser.eva_parser import EvaQLParser from src.parser.eva_statement import EvaStatement -from src.parser.eva_statement import StatementType +from src.parser.eva_statement import StatementType + from src.expression.abstract_expression import ExpressionType from src.parser.table_ref import TableRef @@ -27,12 +28,23 @@ class ParserTests(unittest.TestCase): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) - def test_eva_parser(self): + def test_create_statement(self): parser = EvaQLParser() - print(parser) + single_queries = [] + single_queries.append("CREATE TABLE Persons (PersonID INTEGER);") + + for query in single_queries: + eva_statement_list = parser.parse(query) + self.assertIsInstance(eva_statement_list, list) + self.assertEqual(len(eva_statement_list), 1) + self.assertIsInstance( + eva_statement_list[0], EvaStatement) + + print(eva_statement_list[0]) + + def test_single_statement_queries(self): parser = EvaQLParser() - print(parser) single_queries = [] single_queries.append("SELECT CLASS FROM TAIPAI;") @@ -44,18 +56,16 @@ def test_eva_parser(self): single_queries.append("SELECT CLASS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;") - #single_queries.append("CREATE TABLE Persons ( PersonID 
INTEGER);") - for query in single_queries: eva_statement_list = parser.parse(query) - - print(eva_statement_list[0]) - self.assertIsInstance(eva_statement_list, list) self.assertEqual(len(eva_statement_list), 1) self.assertIsInstance( eva_statement_list[0], EvaStatement) + def test_multiple_statement_queries(self): + parser = EvaQLParser() + multiple_queries = [] multiple_queries.append("SELECT CLASS FROM TAIPAI \ WHERE CLASS = 'VAN' AND REDNESS < 300 OR REDNESS > 500; \ @@ -71,7 +81,7 @@ def test_eva_parser(self): self.assertIsInstance( eva_statement_list[1], EvaStatement) - def test_select_parser(self): + def test_select_statement(self): parser = EvaQLParser() select_query = "SELECT CLASS, REDNESS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;" diff --git a/test/parser/test_parser_visitor.py b/test/parser/test_parser_visitor.py index dc0ba8acb3..a1f28dbf97 100644 --- a/test/parser/test_parser_visitor.py +++ b/test/parser/test_parser_visitor.py @@ -30,8 +30,8 @@ def __init__(self, *args, **kwargs): def test_should_query_specification_visitor(self): EvaQLParserVisitor.visit = MagicMock() mock_visit = EvaQLParserVisitor.visit - mock_visit.side_effect = ["target", - {"from": ["from"], "where": "where"}] + mock_visit.side_effect = ["columns", + {"from": ["tables"], "where": "predicates"}] visitor = EvaQLParserVisitor() ctx = MagicMock() @@ -46,13 +46,13 @@ def test_should_query_specification_visitor(self): mock_visit.assert_has_calls([call(child_1), call(child_2)]) - self.assertEqual(expected.from_table, "from") - self.assertEqual(expected.where_clause, "where") - self.assertEqual(expected.target_list, "target") + self.assertEqual(expected.from_table, "tables") + self.assertEqual(expected.where_clause, "predicates") + self.assertEqual(expected.target_list, "columns") @mock.patch.object(EvaQLParserVisitor, 'visit') def test_from_clause_visitor(self, mock_visit): - mock_visit.side_effect = ["from", "where"] + mock_visit.side_effect = ["tables", "predicates"] ctx = MagicMock() tableSources = MagicMock() @@ -64,8 +64,8 @@ def test_from_clause_visitor(self, mock_visit): expected = visitor.visitFromClause(ctx) mock_visit.assert_has_calls([call(tableSources), call(whereExpr)]) - self.assertEqual(expected.get('where'), 'where') - self.assertEqual(expected.get('from'), 'from') + self.assertEqual(expected.get('where'), 'predicates') + self.assertEqual(expected.get('from'), 'tables') def test_logical_operator(self): ctx = MagicMock() From fe4c1fcb65a7c0e1339d858581840a8e2cf8d873 Mon Sep 17 00:00:00 2001 From: jarulraj Date: Tue, 21 Jan 2020 00:08:00 -0500 Subject: [PATCH 47/82] Adding support for column definition --- src/parser/create_statement.py | 7 ++- src/parser/evaql_parser_visitor.py | 81 ++++++++++++++++++++++-------- src/parser/table_ref.py | 5 +- test/parser/test_parser.py | 6 ++- 4 files changed, 73 insertions(+), 26 deletions(-) diff --git a/src/parser/create_statement.py b/src/parser/create_statement.py index 1174e2c589..14576b1ade 100644 --- a/src/parser/create_statement.py +++ b/src/parser/create_statement.py @@ -35,11 +35,14 @@ class CreateTableStatement(EvaStatement): """ def __init__(self, - table_name: TableRef = None, + table_name: TableRef, + if_not_exists: bool, column_list: List[AbstractExpression] = None): super().__init__(StatementType.SELECT) self._table_name = table_name + self._if_not_exists = if_not_exists def __str__(self) -> str: - print_str = "CREATE TABLE {} ".format(self._table_name) + print_str = "CREATE TABLE {} ({}) 
".format(self._table_name, + self._if_not_exists) return print_str diff --git a/src/parser/evaql_parser_visitor.py b/src/parser/evaql_parser_visitor.py index d726d09035..4365ea93ad 100644 --- a/src/parser/evaql_parser_visitor.py +++ b/src/parser/evaql_parser_visitor.py @@ -33,15 +33,16 @@ from src.parser.evaql.evaql_parserVisitor import evaql_parserVisitor +from src.catalog.schema import ColumnType, Column + + class EvaQLParserVisitor(evaql_parserVisitor): - # Visit a parse tree produced by evaql_parser#root. def visitRoot(self, ctx: evaql_parser.RootContext): for child in ctx.children: if child is not TerminalNode: return self.visit(child) - # Visit a parse tree produced by evaql_parser#sqlStatements. def visitSqlStatements(self, ctx: evaql_parser.SqlStatementsContext): eva_statements = [] for child in ctx.children: @@ -71,6 +72,9 @@ def visitColumnCreateTable( self, ctx: evaql_parser.ColumnCreateTableContext): table_ref = None + if_not_exists = False + create_definitions = [] + # first two children will be CREATE TABLE terminal token for child in ctx.children[2:]: try: @@ -80,24 +84,75 @@ def visitColumnCreateTable( table_ref = self.visit(ctx.tableName()) elif rule_idx == evaql_parser.RULE_ifNotExists: - pass + if_not_exists = True elif rule_idx == evaql_parser.RULE_createDefinitions: - pass + create_definitions = self.visit(ctx.createDefinitions()) except BaseException: print("Exception") # stop parsing something bad happened return None - create_stmt = CreateTableStatement(table_ref) + print(create_definitions) + create_stmt = CreateTableStatement(table_ref, + if_not_exists, + create_definitions) return create_stmt + def visitCreateDefinitions( + self, ctx: evaql_parser.CreateDefinitionsContext): + column_definitions = [] + child_index = 0 + for child in ctx.children: + create_definition = ctx.createDefinition(child_index) + if create_definition is not None: + column_definition = self.visit(create_definition) + column_definitions.append(column_definition) + child_index = child_index + 1 + + for column_definition in column_definitions: + print(str(column_definition)) + + return column_definitions + + def visitColumnDeclaration( + self, ctx: evaql_parser.ColumnDeclarationContext): + data_type = self.visit(ctx.columnDefinition()) + column_name = self.visit(ctx.uid()) + + column = Column(column_name, data_type) + return column + + def visitColumnDefinition(self, ctx: evaql_parser.ColumnDefinitionContext): + data_type = self.visit(ctx.dataType()) + return data_type + + def visitDimensionDataType( + self, ctx: evaql_parser.DimensionDataTypeContext): + + column_type = None + if ctx.FLOAT() is not None: + column_type = ColumnType.FLOAT + elif ctx.INTEGER() is not None: + column_type = ColumnType.INTEGER + elif ctx.UNSIGNED() is not None: + column_type = ColumnType.INTEGER + + return column_type + + def visitStringDataType(self, ctx: evaql_parser.StringDataTypeContext): + + column_type = None + if ctx.TEXT() is not None: + column_type = ColumnType.STRING + + return column_type + ################################################################## # SELECT STATEMENT ################################################################## - # Visit a parse tree produced by evaql_parser#simpleSelect. 
def visitSimpleSelect(self, ctx: evaql_parser.SimpleSelectContext): select_stmt = self.visitChildren(ctx) return select_stmt @@ -106,7 +161,6 @@ def visitSimpleSelect(self, ctx: evaql_parser.SimpleSelectContext): # TABLE SOURCES ################################################################## - # Visit a parse tree produced by evaql_parser#tableSources. def visitTableSources(self, ctx: evaql_parser.TableSourcesContext): table_list = [] for child in ctx.children: @@ -115,7 +169,6 @@ def visitTableSources(self, ctx: evaql_parser.TableSourcesContext): table_list.append(table) return table_list - # Visit a parse tree produced by evaql_parser#querySpecification. def visitQuerySpecification( self, ctx: evaql_parser.QuerySpecificationContext): target_list = None @@ -145,7 +198,6 @@ def visitQuerySpecification( return select_stmt - # Visit a parse tree produced by evaql_parser#selectElements. def visitSelectElements(self, ctx: evaql_parser.SelectElementsContext): select_list = [] for child in ctx.children: @@ -155,7 +207,6 @@ def visitSelectElements(self, ctx: evaql_parser.SelectElementsContext): return select_list - # Visit a parse tree produced by evaql_parser#fromClause. def visitFromClause(self, ctx: evaql_parser.FromClauseContext): from_table = None where_clause = None @@ -167,7 +218,6 @@ def visitFromClause(self, ctx: evaql_parser.FromClauseContext): return {"from": from_table, "where": where_clause} - # Visit a parse tree produced by evaql_parser#tableName. def visitTableName(self, ctx: evaql_parser.TableNameContext): table_name = self.visit(ctx.fullId()) @@ -177,7 +227,6 @@ def visitTableName(self, ctx: evaql_parser.TableNameContext): else: warnings.warn("Invalid from table", SyntaxWarning) - # Visit a parse tree produced by evaql_parser#fullColumnName. def visitFullColumnName(self, ctx: evaql_parser.FullColumnNameContext): # dotted id not supported yet column_name = self.visit(ctx.uid()) @@ -186,7 +235,6 @@ def visitFullColumnName(self, ctx: evaql_parser.FullColumnNameContext): else: warnings.warn("Column Name Missing", SyntaxWarning) - # Visit a parse tree produced by evaql_parser#simpleId. def visitSimpleId(self, ctx: evaql_parser.SimpleIdContext): # todo handle children, right now assuming TupleValueExpr return ctx.getText() @@ -196,21 +244,18 @@ def visitSimpleId(self, ctx: evaql_parser.SimpleIdContext): # EXPRESSIONS ################################################################## - # Visit a parse tree produced by evaql_parser#stringLiteral. def visitStringLiteral(self, ctx: evaql_parser.StringLiteralContext): if ctx.STRING_LITERAL() is not None: return ConstantValueExpression(ctx.getText()) # todo handle other types return self.visitChildren(ctx) - # Visit a parse tree produced by evaql_parser#constant. def visitConstant(self, ctx: evaql_parser.ConstantContext): if ctx.REAL_LITERAL() is not None: return ConstantValueExpression(float(ctx.getText())) return self.visitChildren(ctx) - # Visit a parse tree produced by evaql_parser#logicalExpression. def visitLogicalExpression( self, ctx: evaql_parser.LogicalExpressionContext): if len(ctx.children) < 3: @@ -221,7 +266,6 @@ def visitLogicalExpression( right = self.visit(ctx.getChild(2)) return LogicalExpression(op, left, right) - # Visit a parse tree produced by evaql_parser#binaryComparasionPredicate. 
def visitBinaryComparasionPredicate( self, ctx: evaql_parser.BinaryComparisonPredicateContext): left = self.visit(ctx.left) @@ -229,14 +273,12 @@ def visitBinaryComparasionPredicate( op = self.visit(ctx.comparisonOperator()) return ComparisonExpression(op, left, right) - # Visit a parse tree produced by evaql_parser#nestedExpressionAtom. def visitNestedExpressionAtom( self, ctx: evaql_parser.NestedExpressionAtomContext): # ToDo Can there be >1 expression in this case expr = ctx.expression(0) return self.visit(expr) - # Visit a parse tree produced by evaql_parser#comparisonOperator. def visitComparisonOperator( self, ctx: evaql_parser.ComparisonOperatorContext): op = ctx.getText() @@ -249,7 +291,6 @@ def visitComparisonOperator( else: return ExpressionType.INVALID - # Visit a parse tree produced by evaql_parser#logicalOperator. def visitLogicalOperator(self, ctx: evaql_parser.LogicalOperatorContext): op = ctx.getText() diff --git a/src/parser/table_ref.py b/src/parser/table_ref.py index d4a9cb55b8..930d07583e 100644 --- a/src/parser/table_ref.py +++ b/src/parser/table_ref.py @@ -37,7 +37,7 @@ def database_name(self): return self._database_name def __str__(self): - table_info_str = "TABLE INFO:: (" + self._table_name + ")\n" + table_info_str = "TABLE INFO:: (" + self._table_name + ")" return table_info_str @@ -57,6 +57,5 @@ def table_info(self): return self._table_info def __str__(self): - table_ref_str = "TABLE REF:: (" + str(self._table_info) + ")\n" - + table_ref_str = "TABLE REF:: (" + str(self._table_info) + ")" return table_ref_str diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py index 01778d710f..cf9ea9acdb 100644 --- a/test/parser/test_parser.py +++ b/test/parser/test_parser.py @@ -32,7 +32,11 @@ def test_create_statement(self): parser = EvaQLParser() single_queries = [] - single_queries.append("CREATE TABLE Persons (PersonID INTEGER);") + single_queries.append( + """CREATE TABLE IF NOT EXISTS Persons ( + Frame_ID INTEGER, + Frame_Data TEXT + );""") for query in single_queries: eva_statement_list = parser.parse(query) From 3fb3f88770ee15db16a7f0ea6a425cc781cd3d84 Mon Sep 17 00:00:00 2001 From: Sanjana Garg Date: Thu, 23 Jan 2020 23:10:02 -0500 Subject: [PATCH 48/82] temp commit --- environment.yml | 2 + src/catalog/catalog_manager.py | 47 ++++++--- src/catalog/df_column.py | 113 ++++++++++++++++++++ src/catalog/df_metadata.py | 46 +++++++++ src/catalog/df_schema.py | 49 +++++++++ src/catalog/rough.ipynb | 181 +++++++++++++++++++++++++++++++++ src/catalog/schema.py | 149 --------------------------- src/catalog/sql_config.py | 26 +++++ 8 files changed, 452 insertions(+), 161 deletions(-) create mode 100644 src/catalog/df_column.py create mode 100644 src/catalog/df_metadata.py create mode 100644 src/catalog/df_schema.py create mode 100644 src/catalog/rough.ipynb delete mode 100644 src/catalog/schema.py create mode 100644 src/catalog/sql_config.py diff --git a/environment.yml b/environment.yml index 12de9f6dd8..54a5ddc058 100644 --- a/environment.yml +++ b/environment.yml @@ -20,6 +20,8 @@ dependencies: - pytorch - tensorboard - pillow=6.1 + - sqlalchemy + - pymysql - pip: - antlr4-python3-runtime==4.8 - petastorm diff --git a/src/catalog/catalog_manager.py b/src/catalog/catalog_manager.py index 3df42de26a..f4beaec507 100644 --- a/src/catalog/catalog_manager.py +++ b/src/catalog/catalog_manager.py @@ -14,26 +14,24 @@ # limitations under the License. 
 import os
-
-from src.utils.logging_manager import LoggingManager
-from src.utils.logging_manager import LoggingLevel
-
-from src.configuration.configuration_manager import ConfigurationManager
-from src.configuration.dictionary import CATALOG_DIR
-
 from urllib.parse import urlparse
 
-from src.catalog.catalog_dataframes import load_catalog_dataframes
 from src.catalog.catalog_dataframes import create_catalog_dataframes
-
+from src.catalog.catalog_dataframes import load_catalog_dataframes
+from src.catalog.df_column import DataframeColumn
+from src.catalog.df_metadata import DataFrameMetadata
+from src.catalog.df_schema import Schema
+from src.catalog.sql_config import sql_conn
+from src.configuration.configuration_manager import ConfigurationManager
+from src.configuration.dictionary import CATALOG_DIR
 from src.configuration.dictionary import DATASET_DATAFRAME_NAME
-
-from src.storage.dataframe import load_dataframe, get_next_row_id
 from src.storage.dataframe import append_rows
+from src.storage.dataframe import load_dataframe, get_next_row_id
+from src.utils.logging_manager import LoggingLevel
+from src.utils.logging_manager import LoggingManager
 
 
 class CatalogManager(object):
-
     _instance = None
     _catalog = None
     _catalog_dictionary = {}
@@ -68,6 +66,31 @@ def bootstrap_catalog(self):
         create_catalog_dataframes(
             catalog_dir_url, self._catalog_dictionary)
 
+    def load_dataframe_metadata(self):
+
+        # todo: move sql queries to a separate file
+        session = sql_conn.get_session()
+        metadata_list = session.query(DataFrameMetadata).all()
+        df_ids = [df.get_id() for df in metadata_list]
+        df_columns = session.query(DataframeColumn).filter(
+            DataframeColumn._dataframe_id.in_(df_ids))
+
+        metadata_list = self.construct_dataframe_metadata(metadata_list,
+                                                          df_columns)
+        for metadata in metadata_list:
+            self._catalog_dictionary.update({metadata.get_name(): metadata})
+
+    def construct_dataframe_metadata(self, metadata_list, df_columns):
+        col_dict = {}
+        for col in df_columns:
+            # list.append returns None, so group the columns via setdefault
+            col_dict.setdefault(col._dataframe_id, []).append(col)
+        for df in metadata_list:
+            schema = Schema(df.get_name(), col_dict[df.get_id()])
+            df.set_schema(schema)
+        return metadata_list
+
+
     def create_dataset(self, dataset_name: str):
 
         dataset_catalog_entry = \
diff --git a/src/catalog/df_column.py b/src/catalog/df_column.py
new file mode 100644
index 0000000000..9a51b0000e
--- /dev/null
+++ b/src/catalog/df_column.py
@@ -0,0 +1,113 @@
+import json
+from enum import Enum
+from typing import List
+
+import numpy as np
+from petastorm.codecs import NdarrayCodec
+from petastorm.codecs import ScalarCodec
+from petastorm.unischema import UnischemaField
+from pyspark.sql.types import IntegerType, FloatType, StringType
+from sqlalchemy import Column, String, Integer, Boolean
+from sqlalchemy import Enum as SqlEnum
+
+from src.catalog.sql_config import sql_conn
+from src.utils.logging_manager import LoggingLevel
+from src.utils.logging_manager import LoggingManager
+
+
+class DataframeColumnType(Enum):
+    INTEGER = 1
+    FLOAT = 2
+    STRING = 3
+    NDARRAY = 4
+
+
+class DataframeColumn(sql_conn.base):
+    __tablename__ = 'df_column'
+
+    _id = Column('id', Integer, primary_key=True)
+    _name = Column('name', String(100))
+    # SQLAlchemy's Enum type (not the stdlib Enum) maps the Python enum
+    _type = Column('type', SqlEnum(DataframeColumnType),
+                   default=DataframeColumnType.INTEGER)
+    _is_nullable = Column('is_nullable', Boolean, default=False)
+    _array_dimensions = Column('array_dimensions', String(100), default='[]')
+    _dataframe_id = Column('dataframe_id', Integer)
+
+    def __init__(self,
+                 name: str,
+                 type: DataframeColumnType,
+                 is_nullable: bool = False,
+                 array_dimensions: List[int] = []):
+        self._name = name
+        self._type = type
+        self._is_nullable = is_nullable
+        # store the dimensions in their serialized string form
+        self.set_array_dimensions(array_dimensions)
+
+    def get_name(self):
+        return self._name
+
+    def get_type(self):
+        return self._type
+
+    def is_nullable(self):
+        return self._is_nullable
+
+    def get_array_dimensions(self):
+        return json.loads(self._array_dimensions)
+
+    def set_array_dimensions(self, array_dimensions):
+        self._array_dimensions = str(array_dimensions)
+
+    def __str__(self):
+        column_str = "\tColumn: (%s, %s, %s, " % (self._name,
+                                                  self._type.name,
+                                                  self._is_nullable)
+
+        array_dimensions = self.get_array_dimensions()
+        column_str += "["
+        column_str += ', '.join(['%d'] * len(array_dimensions)) \
+            % tuple(array_dimensions)
+        column_str += "] "
+        column_str += ")\n"
+
+        return column_str
+
+    @staticmethod
+    def get_petastorm_column(column):
+
+        column_type = column.get_type()
+        column_name = column.get_name()
+        column_is_nullable = column.is_nullable()
+        column_array_dimensions = column.get_array_dimensions()
+
+        # Reference:
+        # https://github.com/uber/petastorm/blob/master/petastorm/
+        # tests/test_common.py
+
+        if column_type == DataframeColumnType.INTEGER:
+            petastorm_column = UnischemaField(column_name,
+                                              np.int32,
+                                              (),
+                                              ScalarCodec(IntegerType()),
+                                              column_is_nullable)
+        elif column_type == DataframeColumnType.FLOAT:
+            petastorm_column = UnischemaField(column_name,
+                                              np.float64,
+                                              (),
+                                              ScalarCodec(FloatType()),
+                                              column_is_nullable)
+        elif column_type == DataframeColumnType.STRING:
+            petastorm_column = UnischemaField(column_name,
+                                              np.string_,
+                                              (),
+                                              ScalarCodec(StringType()),
+                                              column_is_nullable)
+        elif column_type == DataframeColumnType.NDARRAY:
+            petastorm_column = UnischemaField(column_name,
+                                              np.uint8,
+                                              column_array_dimensions,
+                                              NdarrayCodec(),
+                                              column_is_nullable)
+        else:
+            LoggingManager().log("Invalid column type: " + str(column_type),
+                                 LoggingLevel.ERROR)
+
+        return petastorm_column
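
The `_type` column above maps a Python enum into a database column; that only works through SQLAlchemy's own `Enum` type (used above as `SqlEnum`, an assumed fix), not the stdlib `enum.Enum` the file also imports. A self-contained sketch of the pattern, with illustrative names and an in-memory SQLite URL standing in for the MySQL one:

```python
import enum

from sqlalchemy import Column, Enum as SqlEnum, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class ColumnKind(enum.Enum):
    INTEGER = 1
    FLOAT = 2


class DemoColumn(Base):
    __tablename__ = 'demo_column'

    id = Column(Integer, primary_key=True)
    # SQLAlchemy persists the enum member names and round-trips them back.
    kind = Column(SqlEnum(ColumnKind), default=ColumnKind.INTEGER)


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()
session.add(DemoColumn(kind=ColumnKind.FLOAT))
session.commit()
print(session.query(DemoColumn).one().kind)   # ColumnKind.FLOAT
```
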
diff --git a/src/catalog/df_metadata.py b/src/catalog/df_metadata.py
new file mode 100644
index 0000000000..a07b8a47f4
--- /dev/null
+++ b/src/catalog/df_metadata.py
@@ -0,0 +1,46 @@
+from sqlalchemy import Column, String, Integer
+
+from src.catalog.sql_config import sql_conn
+
+
+class DataFrameMetadata(sql_conn.base):
+    __tablename__ = 'df_metadata'
+
+    _id = Column('id', Integer, primary_key=True)
+    _name = Column('name', String(100))
+    _file_url = Column('file_url', String(100))
+    _schema_id = Column('schema_id', Integer)
+
+    def __init__(self,
+                 dataframe_file_url,
+                 dataframe_schema
+                 ):
+        self._file_url = dataframe_file_url
+        self._dataframe_schema = dataframe_schema
+        self._dataframe_petastorm_schema = \
+            dataframe_schema.get_petastorm_schema()
+        self._dataframe_pyspark_schema = \
+            self._dataframe_petastorm_schema.as_spark_schema()
+
+    def set_schema(self, schema):
+        self._dataframe_schema = schema
+        self._dataframe_petastorm_schema = \
+            schema.get_petastorm_schema()
+        self._dataframe_pyspark_schema = \
+            self._dataframe_petastorm_schema.as_spark_schema()
+
+    def get_id(self):
+        return self._id
+
+    def get_name(self):
+        return self._name
+
+    def get_dataframe_file_url(self):
+        return self._file_url
+
+    def get_schema_id(self):
+        return self._schema_id
+
+    def get_dataframe_schema(self):
+        return self._dataframe_schema
+
+    def get_dataframe_petastorm_schema(self):
+        return self._dataframe_petastorm_schema
+
+    def get_dataframe_pyspark_schema(self):
+        return self._dataframe_pyspark_schema
diff --git a/src/catalog/df_schema.py b/src/catalog/df_schema.py
new file mode 100644
index 0000000000..1296db3e10
--- /dev/null
+++ b/src/catalog/df_schema.py
@@ -0,0 +1,49 @@
+# coding=utf-8
+# Copyright 2018-2020 EVA
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import List
+
+from petastorm.unischema import Unischema
+
+from src.catalog.df_column import DataframeColumn
+
+
+class Schema(object):
+
+    _name = None
+    _column_list = []
+    _petastorm_schema = None
+
+    def __init__(self, name: str, column_list: List[DataframeColumn]):
+
+        self._name = name
+        self._column_list = column_list
+        petastorm_column_list = []
+        for _column in self._column_list:
+            petastorm_column = DataframeColumn.get_petastorm_column(_column)
+            petastorm_column_list.append(petastorm_column)
+
+        self._petastorm_schema = Unischema(self._name,
+                                           petastorm_column_list)
+
+    def __str__(self):
+        schema_str = "SCHEMA:: (" + self._name + ")\n"
+        for column in self._column_list:
+            schema_str += str(column)
+
+        return schema_str
+
+    def get_petastorm_schema(self):
+        return self._petastorm_schema
diff --git a/src/catalog/rough.ipynb b/src/catalog/rough.ipynb
new file mode 100644
index 0000000000..a7e1ca3242
--- /dev/null
+++ b/src/catalog/rough.ipynb
@@ -0,0 +1,181 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "pycharm": {
+     "is_executing": false
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from sqlalchemy import create_engine\n",
+    "from sqlalchemy.ext.declarative import declarative_base\n",
+    "from sqlalchemy.orm import sessionmaker\n",
+    "from sqlalchemy import Column, String, Integer, Boolean"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "engine = create_engine('mysql+pymysql://root:root@localhost/eva_catalog')\n",
+    "Session = sessionmaker(bind=engine)\n",
+    "Base = declarative_base()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Dummy(Base):\n",
+    "    __tablename__ = 'df_column'\n",
+    "\n",
+    "    _id = Column('id', Integer, primary_key=True)\n",
+    "    _name = Column('name', String(100))\n",
+    "    _array_dim = Column('array_dim', String(100), default='[]')\n",
+    "    _schema_id = Column('schema_id', Integer)\n",
+    "    \n",
+    "    def __init__(self, name, col_list, array_dim):\n",
+    "        self._name = name\n",
+    "        self.col_list = col_list\n",
+    "        self._array_dim = array_dim"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "Base.metadata.create_all(engine)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "o1 = Dummy('sanjana', [], None)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "session = Session()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "session.add(o1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
"session.commit()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "o2 = Dummy('sanjana2', [1], '[1]')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "session.add(o2)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "session.commit()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "records = session.query(Dummy).all()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2 None\n", + "3 None\n", + "4 None\n", + "5 None\n" + ] + } + ], + "source": [ + "for record in records:\n", + " print(record._id, record._schema_id)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/src/catalog/schema.py b/src/catalog/schema.py deleted file mode 100644 index 773623e688..0000000000 --- a/src/catalog/schema.py +++ /dev/null @@ -1,149 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from enum import Enum -from typing import List - -import numpy as np - -from src.utils.logging_manager import LoggingManager -from src.utils.logging_manager import LoggingLevel - -from pyspark.sql.types import IntegerType, FloatType, StringType - -from petastorm.codecs import ScalarCodec -from petastorm.codecs import NdarrayCodec -from petastorm.unischema import Unischema, UnischemaField - - -class ColumnType(Enum): - INTEGER = 1 - FLOAT = 2 - STRING = 3 - NDARRAY = 4 - - -class Column(object): - - _name = None - _type = 0 - _is_nullable = False - _array_dimensions = [] - - def __init__(self, name: str, - type: ColumnType, - is_nullable: bool = False, - array_dimensions: List[int] = []): - self._name = name - self._type = type - self._is_nullable = is_nullable - self._array_dimensions = array_dimensions - - def get_name(self): - return self._name - - def get_type(self): - return self._type - - def is_nullable(self): - return self._is_nullable - - def get_array_dimensions(self): - return self._array_dimensions - - def __str__(self): - column_str = "\tColumn: (%s, %s, %s, " % (self._name, - self._type.name, - self._is_nullable) - - column_str += "[" - column_str += ', '.join(['%d'] * len(self._array_dimensions))\ - % tuple(self._array_dimensions) - column_str += "] " - column_str += ")\n" - - return column_str - - -def get_petastorm_column(column): - - column_type = column.get_type() - column_name = column.get_name() - column_is_nullable = column.is_nullable() - column_array_dimensions = column.get_array_dimensions() - - # Reference: - # https://github.com/uber/petastorm/blob/master/petastorm/ - # tests/test_common.py - - if column_type == ColumnType.INTEGER: - petastorm_column = UnischemaField(column_name, - np.int32, - (), - ScalarCodec(IntegerType()), - column_is_nullable) - elif column_type == ColumnType.FLOAT: - petastorm_column = UnischemaField(column_name, - np.float64, - (), - ScalarCodec(FloatType()), - column_is_nullable) - elif column_type == ColumnType.STRING: - petastorm_column = UnischemaField(column_name, - np.string_, - (), - ScalarCodec(StringType()), - column_is_nullable) - elif column_type == ColumnType.NDARRAY: - petastorm_column = UnischemaField(column_name, - np.uint8, - column_array_dimensions, - NdarrayCodec(), - column_is_nullable) - else: - LoggingManager().log("Invalid column type: " + str(column_type), - LoggingLevel.ERROR) - - return petastorm_column - - -class Schema(object): - - _schema_name = None - _column_list = [] - _petastorm_schema = None - - def __init__(self, schema_name: str, column_list: List[Column]): - - self._schema_name = schema_name - self._column_list = column_list - - petastorm_column_list = [] - for _column in self._column_list: - petastorm_column = get_petastorm_column(_column) - petastorm_column_list.append(petastorm_column) - - self._petastorm_schema = Unischema(self._schema_name, - petastorm_column_list) - - def __str__(self): - schema_str = "SCHEMA:: (" + self._schema_name + ")\n" - for column in self._column_list: - schema_str += str(column) - - return schema_str - - def get_petastorm_schema(self): - return self._petastorm_schema diff --git a/src/catalog/sql_config.py b/src/catalog/sql_config.py new file mode 100644 index 0000000000..76bddf68e5 --- /dev/null +++ b/src/catalog/sql_config.py @@ -0,0 +1,26 @@ +from sqlalchemy import create_engine +from sqlalchemy.ext.declarative import declarative_base +from sqlalchemy.orm import sessionmaker + + +class SQLConfig(object): + base = declarative_base() + + def __new__(cls): + if 
getattr(cls, '_instance', None) is None:
+            cls._instance = super(SQLConfig, cls).__new__(cls)
+        return cls._instance
+
+    def __init__(self):
+        self.engine = create_engine(
+            'mysql+pymysql://root:root@localhost/eva_catalog')
+        self.session_factory = sessionmaker(bind=self.engine)
+        self.session = self.session_factory()
+        self.base.metadata.create_all(self.engine)
+
+    def get_session(self):
+        if self.session is None:
+            self.session = self.session_factory()
+        return self.session
+
+
+sql_conn = SQLConfig()

From bdc4164455356dc3938084325831ef90b87a8323 Mon Sep 17 00:00:00 2001
From: GTK
Date: Sun, 26 Jan 2020 01:36:55 -0500
Subject: [PATCH 49/82] planner nodes updated

---
 src/query_planner/abstract_plan.py      | 14 ++++++++++----
 src/query_planner/abstract_scan_plan.py | 24 +++++++++++++++++++++---
 src/query_planner/seq_scan_plan.py      | 24 ++++++++++--------------
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/src/query_planner/abstract_plan.py b/src/query_planner/abstract_plan.py
index 546d981606..c6bad46656 100644
--- a/src/query_planner/abstract_plan.py
+++ b/src/query_planner/abstract_plan.py
@@ -1,7 +1,7 @@
 from abc import ABC
 
 from src.query_planner.types import PlanNodeType
-
+from typing import List
 
 class AbstractPlan(ABC):
@@ -29,7 +29,7 @@ def parent(self):
 
     @parent.setter
     def parent(self, node: 'AbstractPlan'):
-        """returns parent of current node
+        """sets parent of current node
 
         Arguments:
             node {AbstractPlan} -- parent node
@@ -39,8 +39,8 @@ def parent(self, node: 'AbstractPlan'):
         self._parent = node
 
     @property
-    def children(self):
-        """returns children list pf current node
+    def children(self) -> List['AbstractPlan']:
+        """returns children list of current node
 
         Returns:
             List[AbstractPlan] -- children list
@@ -56,3 +56,9 @@ def node_type(self) -> PlanNodeType:
             PlanNodeType: The node type corresponding to the plan
         """
         return self._node_type
+
+    def __str__(self, level=0):
+        out_string = "\t" * level + '' + "\n"
+        for child in self.children:
+            out_string += child.__str__(level + 1)
+        return out_string
\ No newline at end of file
diff --git a/src/query_planner/abstract_scan_plan.py b/src/query_planner/abstract_scan_plan.py
index 2895f0bbb5..9d68f931e4 100644
--- a/src/query_planner/abstract_scan_plan.py
+++ b/src/query_planner/abstract_scan_plan.py
@@ -6,19 +6,37 @@
 from src.query_planner.abstract_plan import AbstractPlan
 from src.query_planner.types import PlanNodeType
+from src.query_parser.table_ref import TableRef
+from typing import List
 
 
 class AbstractScan(AbstractPlan):
     """Abstract class for all the scan based planners
 
     Arguments:
-        predicate (AbstractExpression): An expression used for filtering
+        column_ids: List[str]
+            list of column name strings in the plan
+        video: TableRef
+            video reference for the plan
+        predicate: AbstractExpression
+            An expression used for filtering
     """
 
-    def __init__(self, node_type: PlanNodeType, predicate: AbstractExpression):
-        super(AbstractScan, self).__init__(node_type)
+    def __init__(self, node_type: PlanNodeType,
+                 column_ids: List[AbstractExpression],
+                 video: TableRef, predicate: AbstractExpression):
+        super().__init__(node_type)
+        self._column_ids = column_ids
+        self._video = video
         self._predicate = predicate
 
     @property
     def predicate(self) -> AbstractExpression:
         return self._predicate
+
+    @property
+    def column_ids(self) -> List[AbstractExpression]:
+        return self._column_ids
+
+    @property
+    def video(self) -> TableRef:
+        return self._video
\ No newline at end of file
diff --git a/src/query_planner/seq_scan_plan.py b/src/query_planner/seq_scan_plan.py
index 9650bfec28..3576b70050 100644
--- a/src/query_planner/seq_scan_plan.py
+++ b/src/query_planner/seq_scan_plan.py
@@ -11,20 +11,16 @@ class SeqScanPlan(AbstractScan):
     operations.
 
     Arguments:
-        predicate (AbstractExpression): A predicate expression used for
-            filtering frames
-
-        column_ids List[int]: List of columns which need to be selected
-            (Note: This attribute might be removed in future)
+        column_ids: List[str]
+            list of column name strings in the plan
+        video: TableRef
+            video reference for the plan
+        predicate: AbstractExpression
+            An expression used for filtering
     """
 
-    def __init__(self, predicate: AbstractExpression,
-                 column_ids: List[int] = None):
-        if column_ids is None:
-            column_ids = []
-        super().__init__(PlanNodeType.SEQUENTIAL_SCAN_TYPE, predicate)
-        self._column_ids = column_ids
+    def __init__(self, column_ids: List[str], video: TableRef,
+                 predicate: AbstractExpression):
+        super().__init__(PlanNodeType.SEQUENTIAL_SCAN_TYPE,
+                         column_ids, video, predicate)
 
-    @property
-    def column_ids(self) -> List:
-        return self._column_ids
\ No newline at end of file
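
The new parent/children accessors and the recursive `__str__` give every plan node a printable tree form. A toy, self-contained illustration of that traversal pattern (the class and node names here are illustrative, not the EVA API):

```python
class ToyPlan:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def __str__(self, level=0):
        # One tab per tree depth, then recurse into the children,
        # mirroring AbstractPlan.__str__ in the diff above.
        out = "\t" * level + self.name + "\n"
        for child in self.children:
            out += child.__str__(level + 1)
        return out


tree = ToyPlan("project", [ToyPlan("filter", [ToyPlan("seq_scan")])])
print(tree, end="")
# project
#         filter
#                 seq_scan
```
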
From 7f1ea1387076fa8997f246f17ff450adc9cc19a3 Mon Sep 17 00:00:00 2001
From: GTK
Date: Sun, 26 Jan 2020 01:37:59 -0500
Subject: [PATCH 50/82] Plan generator and convertor

---
 src/query_optimizer/plan_generator.py          |  0
 .../statement_to_plan_convertor.py             | 39 +++++++++++++++++++
 2 files changed, 39 insertions(+)
 create mode 100644 src/query_optimizer/plan_generator.py
 create mode 100644 src/query_optimizer/statement_to_plan_convertor.py

diff --git a/src/query_optimizer/plan_generator.py b/src/query_optimizer/plan_generator.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/src/query_optimizer/statement_to_plan_convertor.py b/src/query_optimizer/statement_to_plan_convertor.py
new file mode 100644
index 0000000000..e90ff8b36e
--- /dev/null
+++ b/src/query_optimizer/statement_to_plan_convertor.py
@@ -0,0 +1,39 @@
+from src.query_parser.eva_statement import EvaStatement
+from src.query_parser.select_statement import SelectStatement
+from src.query_planner.abstract_scan_plan import AbstractScan
+
+
+class StatementToPlanConvertor():
+    def __init__(self):
+        self._plan = None
+
+    def visit(self, statement: EvaStatement):
+        """Based on the instance of the statement the corresponding
+        visit is called. The logic is hidden from the client.
+
+        Arguments:
+            statement {EvaStatement} -- [Input statement]
+        """
+        if isinstance(statement, SelectStatement):
+            self.visit_select(statement)
+
+    def visit_select(self, statement: EvaStatement):
+        """converter for select statement
+
+        Arguments:
+            statement {EvaStatement} -- [input select statement]
+        """
+        video = statement.from_table
+        # data table binding goes here
+        # use the catalog
+        select_columns = statement.target_list
+        # binding for columns has to be done here
+        predicate = statement.where_clause
+
+        logical_plan = AbstractScan(select_columns, video, predicate)
+        self._plan = logical_plan
+
+    @property
+    def plan(self):
+        return self._plan
\ No newline at end of file
 renaming to operators

---
 .../statement_to_plan_convertor.py            | 39 -------------------
 1 file changed, 39 deletions(-)
 delete mode 100644 src/query_optimizer/statement_to_plan_convertor.py

diff --git a/src/query_optimizer/statement_to_plan_convertor.py b/src/query_optimizer/statement_to_plan_convertor.py
deleted file mode 100644
index e90ff8b36e..0000000000
--- a/src/query_optimizer/statement_to_plan_convertor.py
+++ /dev/null
@@ -1,39 +0,0 @@
-from src.query_parser.eva_statement import EvaStatement
-from src.query_parser.select_statement import SelectStatement
-from src.query_planner.abstract_scan_plan import AbstractScan
-
-
-class StatementToPlanConvertor():
-    def __init__(self):
-        self._plan = None
-
-    def visit(self, statement: EvaStatement):
-        """Based on the instance of the statement the corresponding visit is called. The logic is hidden from client.
-
-        Arguments:
-            statement {EvaStatement} -- [Input statement]
-        """
-        if isinstance(statement, SelectStatement):
-            self.visit_select(statement)
-
-    def visit_select(self, statement: EvaStatement):
-        """convertor for select statement
-
-        Arguments:
-            statement {EvaStatement} -- [input select statement]
-        """
-        video = statement.from_table
-        #data table binding goes here
-        #use the catalog
-        select_columns = statement.target_list
-        #binding for columns has to be done here
-        predicate = statement.where_clause
-
-        logical_plan = AbstractScan(select_columns, video, predicate)
-        self._plan = logical_plan
-
-    @property
-    def plan(self):
-        return self._plan
-
-    
\ No newline at end of file

From f99afddcc3fac556556e8fdb7b1c2c3208fbd091 Mon Sep 17 00:00:00 2001
From: GTK
Date: Sun, 26 Jan 2020 14:33:25 -0500
Subject: [PATCH 52/82] parser to logical operator

---
 .../statement_to_opr_convertor.py             | 66 +++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 src/query_optimizer/statement_to_opr_convertor.py

diff --git a/src/query_optimizer/statement_to_opr_convertor.py b/src/query_optimizer/statement_to_opr_convertor.py
new file mode 100644
index 0000000000..c7cd0834bf
--- /dev/null
+++ b/src/query_optimizer/statement_to_opr_convertor.py
@@ -0,0 +1,66 @@
+from src.query_parser.eva_statement import EvaStatement
+from src.query_parser.select_statement import SelectStatement
+from src.query_planner.abstract_scan_plan import AbstractScan
+
+
+class StatementToPlanConvertor():
+    def __init__(self):
+        self._plan = None
+
+    def visit(self, statement: EvaStatement):
+        """Based on the instance of the statement the corresponding visit is called. The logic is hidden from client.
+ + Arguments: + statement {EvaStatement} -- [Input statement] + """ + if isinstance(statement, SelectStatement): + visit_select(statement) + + def visit_select(self, statement: EvaStatement): + """convertor for select statement + + Arguments: + statement {EvaStatement} -- [input select statement] + """ + + #Create a logical get node + video = statement.from_table + if video is not None: + visit_table_ref(video) + + #Filter Operator + predicate = statement.where_clause + if predicate is not None: + #ToDo Binding the expression + filter_opr = LogicalFilter(predicate) + filter_opr.append_child(self._plan) + self._plan = filter_opr + + #Projection operator + select_columns = statement.target_list + #ToDO + # add support for SELECT STAR + if select_columns is not None: + #ToDo Bind the columns using catalog + projection_opr = LogicalProject(select_columns) + projection_opr.append_child(self._plan) + self._plan = projection_opr + + + def visit_table_ref(self, video: TableRef): + """Bind table ref object and convert to Logical get operator + + Arguments: + video {TableRef} -- [Input table ref object created by the parser] + """ + video_data = None + #Call catalog with Table ref details to get hold of the storage DataFrame + #video_data = catalog.get_table_catalog_entry(video.info) + + get_opr = LogicalGet(video_data) + self._plan = get_opr + @property + def plan(self): + return self._plan + + \ No newline at end of file From e8a9aedaae3880ccf596e191277a29acb46b9b7a Mon Sep 17 00:00:00 2001 From: GTK Date: Sun, 26 Jan 2020 15:27:54 -0500 Subject: [PATCH 53/82] Formatting query_planner --- src/{query_planner => planner}/__init__.py | 0 .../abstract_plan.py | 12 +++++++----- .../abstract_scan_plan.py | 19 +++++++++++-------- src/{query_planner => planner}/pp_plan.py | 4 ++-- .../seq_scan_plan.py | 16 ++++++++-------- .../storage_plan.py | 4 ++-- src/{query_planner => planner}/types.py | 0 7 files changed, 30 insertions(+), 25 deletions(-) rename src/{query_planner => planner}/__init__.py (100%) rename src/{query_planner => planner}/abstract_plan.py (94%) rename src/{query_planner => planner}/abstract_scan_plan.py (79%) rename src/{query_planner => planner}/pp_plan.py (90%) rename src/{query_planner => planner}/seq_scan_plan.py (76%) rename src/{query_planner => planner}/storage_plan.py (93%) rename src/{query_planner => planner}/types.py (100%) diff --git a/src/query_planner/__init__.py b/src/planner/__init__.py similarity index 100% rename from src/query_planner/__init__.py rename to src/planner/__init__.py diff --git a/src/query_planner/abstract_plan.py b/src/planner/abstract_plan.py similarity index 94% rename from src/query_planner/abstract_plan.py rename to src/planner/abstract_plan.py index a173ce3d28..6b9f0eb8bf 100644 --- a/src/query_planner/abstract_plan.py +++ b/src/planner/abstract_plan.py @@ -12,10 +12,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+ + from abc import ABC +from src.planner.types import PlanNodeType +from typing import List -from src.query_planner.types import PlanNodeType -from typing import list class AbstractPlan(ABC): @@ -44,7 +46,7 @@ def parent(self): @parent.setter def parent(self, node: 'AbstractPlan'): """sets parent of current node - + Arguments: node {AbstractPlan} -- parent node """ @@ -55,7 +57,7 @@ def parent(self, node: 'AbstractPlan'): @property def children(self) -> List[AbstractPlan]: """returns children list of current node - + Returns: List[AbstractPlan] -- children list """ @@ -75,4 +77,4 @@ def __str__(self, level=0): out_string = "\t" * level + '' + "\n" for child in self.children: out_string += child.__str__(level + 1) - return out_string \ No newline at end of file + return out_string diff --git a/src/query_planner/abstract_scan_plan.py b/src/planner/abstract_scan_plan.py similarity index 79% rename from src/query_planner/abstract_scan_plan.py rename to src/planner/abstract_scan_plan.py index c6181d5408..3fd0e478b8 100644 --- a/src/query_planner/abstract_scan_plan.py +++ b/src/planner/abstract_scan_plan.py @@ -12,15 +12,17 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + + """Abstract class for all the scan planners https://www.postgresql.org/docs/9.1/using-explain.html https://www.postgresql.org/docs/9.5/runtime-config-query.html """ from src.expression.abstract_expression import AbstractExpression -from src.query_planner.abstract_plan import AbstractPlan +from src.planner.abstract_plan import AbstractPlan -from src.query_planner.types import PlanNodeType -from src.query_parser.table_ref import TableRef +from src.planner.types import PlanNodeType +from src.parser.table_ref import TableRef from typing import List @@ -28,7 +30,7 @@ class AbstractScan(AbstractPlan): """Abstract class for all the scan based planners Arguments: - column_ids: List[str] + column_ids: List[str] list of column names string in the plan video: TableRef video reference for the plan @@ -36,8 +38,10 @@ class AbstractScan(AbstractPlan): An expression used for filtering """ - def __init__(self, node_type: PlanNodeType, column_ids: List[AbstractExpression], video: TableRef, predicate: AbstractExpression): - super(AbstractPlan, self).__init__(node_type) + def __init__(self, node_type: PlanNodeType, + column_ids: List[AbstractExpression], video: TableRef, + predicate: AbstractExpression): + super().__init__(node_type) self._column_ids = column_ids self._video = video self._predicate = predicate @@ -49,8 +53,7 @@ def predicate(self) -> AbstractExpression: @property def column_ids(self) -> List[AbstractExpression]: return self._column_ids - + @property def video(self) -> TableRef: return self._video - \ No newline at end of file diff --git a/src/query_planner/pp_plan.py b/src/planner/pp_plan.py similarity index 90% rename from src/query_planner/pp_plan.py rename to src/planner/pp_plan.py index f0d8c0831f..ec205d2a3c 100644 --- a/src/query_planner/pp_plan.py +++ b/src/planner/pp_plan.py @@ -13,8 +13,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
from src.expression.abstract_expression import AbstractExpression -from src.query_planner.abstract_scan_plan import AbstractScan -from src.query_planner.types import PlanNodeType +from src.planner.abstract_scan_plan import AbstractScan +from src.planner.types import PlanNodeType class PPScanPlan(AbstractScan): diff --git a/src/query_planner/seq_scan_plan.py b/src/planner/seq_scan_plan.py similarity index 76% rename from src/query_planner/seq_scan_plan.py rename to src/planner/seq_scan_plan.py index 459a7866d3..f4456cd5a5 100644 --- a/src/query_planner/seq_scan_plan.py +++ b/src/planner/seq_scan_plan.py @@ -15,8 +15,9 @@ from typing import List from src.expression.abstract_expression import AbstractExpression -from src.query_planner.abstract_scan_plan import AbstractScan -from src.query_planner.types import PlanNodeType +from src.planner.abstract_scan_plan import AbstractScan +from src.planner.types import PlanNodeType +from src.parser.table_ref import TableRef class SeqScanPlan(AbstractScan): @@ -25,7 +26,7 @@ class SeqScanPlan(AbstractScan): operations. Arguments: - column_ids: List[str] + column_ids: List[str] list of column names string in the plan video: TableRef video reference for the plan @@ -33,8 +34,7 @@ class SeqScanPlan(AbstractScan): An expression used for filtering """ - def __init__(self, column_ids: List[str], video: TableRef, predicate: AbstractExpression, - ): - super().__init__(PlanNodeType.SEQUENTIAL_SCAN_TYPE, column_ids, video, predicate) - - \ No newline at end of file + def __init__(self, column_ids: List[str], video: TableRef, + predicate: AbstractExpression): + super().__init__(PlanNodeType.SEQUENTIAL_SCAN_TYPE, column_ids, video, + predicate) diff --git a/src/query_planner/storage_plan.py b/src/planner/storage_plan.py similarity index 93% rename from src/query_planner/storage_plan.py rename to src/planner/storage_plan.py index 409fce06d9..0eb77f7a03 100644 --- a/src/query_planner/storage_plan.py +++ b/src/planner/storage_plan.py @@ -13,8 +13,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
from src.models.catalog.video_info import VideoMetaInfo -from src.query_planner.abstract_plan import AbstractPlan -from src.query_planner.types import PlanNodeType +from src.planner.abstract_plan import AbstractPlan +from src.planner.types import PlanNodeType class StoragePlan(AbstractPlan): diff --git a/src/query_planner/types.py b/src/planner/types.py similarity index 100% rename from src/query_planner/types.py rename to src/planner/types.py From 193b4745a0f1c46791a35956da26dd308c09515e Mon Sep 17 00:00:00 2001 From: GTK Date: Sun, 26 Jan 2020 21:05:29 -0500 Subject: [PATCH 54/82] formatting changes --- src/query_executor/abstract_executor.py | 2 +- src/query_executor/abstract_storage_executor.py | 2 +- src/query_executor/disk_based_storage_executor.py | 2 +- src/query_executor/plan_executor.py | 4 ++-- src/query_executor/pp_executor.py | 2 +- src/query_executor/seq_scan_executor.py | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/query_executor/abstract_executor.py b/src/query_executor/abstract_executor.py index 559e29f1a9..0f02350d27 100644 --- a/src/query_executor/abstract_executor.py +++ b/src/query_executor/abstract_executor.py @@ -16,7 +16,7 @@ from typing import List, Iterator from src.models.storage.batch import FrameBatch -from src.query_planner.abstract_plan import AbstractPlan +from src.planner.abstract_plan import AbstractPlan class AbstractExecutor(ABC): diff --git a/src/query_executor/abstract_storage_executor.py b/src/query_executor/abstract_storage_executor.py index 8a650a144d..2019fc9646 100644 --- a/src/query_executor/abstract_storage_executor.py +++ b/src/query_executor/abstract_storage_executor.py @@ -15,7 +15,7 @@ from abc import ABC from src.query_executor.abstract_executor import AbstractExecutor -from src.query_planner.storage_plan import StoragePlan +from src.planner.storage_plan import StoragePlan class AbstractStorageExecutor(AbstractExecutor, ABC): diff --git a/src/query_executor/disk_based_storage_executor.py b/src/query_executor/disk_based_storage_executor.py index 60924a8dcd..64b6bffa65 100644 --- a/src/query_executor/disk_based_storage_executor.py +++ b/src/query_executor/disk_based_storage_executor.py @@ -18,7 +18,7 @@ from src.models.storage.batch import FrameBatch from src.query_executor.abstract_storage_executor import \ AbstractStorageExecutor -from src.query_planner.storage_plan import StoragePlan +from src.planner.storage_plan import StoragePlan class DiskStorageExecutor(AbstractStorageExecutor): diff --git a/src/query_executor/plan_executor.py b/src/query_executor/plan_executor.py index 2a74c7f203..8d66b93251 100644 --- a/src/query_executor/plan_executor.py +++ b/src/query_executor/plan_executor.py @@ -14,8 +14,8 @@ # limitations under the License. 
from src.query_executor.abstract_executor import AbstractExecutor from src.query_executor.seq_scan_executor import SequentialScanExecutor -from src.query_planner.abstract_plan import AbstractPlan -from src.query_planner.types import PlanNodeType +from src.planner.abstract_plan import AbstractPlan +from src.planner.types import PlanNodeType from src.query_executor.disk_based_storage_executor import DiskStorageExecutor from src.query_executor.pp_executor import PPExecutor diff --git a/src/query_executor/pp_executor.py b/src/query_executor/pp_executor.py index 3748ad888f..ae6bd2fa9a 100644 --- a/src/query_executor/pp_executor.py +++ b/src/query_executor/pp_executor.py @@ -16,7 +16,7 @@ from src.models.storage.batch import FrameBatch from src.query_executor.abstract_executor import AbstractExecutor -from src.query_planner.pp_plan import PPScanPlan +from src.planner.pp_plan import PPScanPlan class PPExecutor(AbstractExecutor): diff --git a/src/query_executor/seq_scan_executor.py b/src/query_executor/seq_scan_executor.py index 9276868f39..cefe0fa87c 100644 --- a/src/query_executor/seq_scan_executor.py +++ b/src/query_executor/seq_scan_executor.py @@ -16,7 +16,7 @@ from src.models.storage.batch import FrameBatch from src.query_executor.abstract_executor import AbstractExecutor -from src.query_planner.seq_scan_plan import SeqScanPlan +from src.planner.seq_scan_plan import SeqScanPlan class SequentialScanExecutor(AbstractExecutor): From 5f8b98505b0953b20ffa951a66d81c11caa271a9 Mon Sep 17 00:00:00 2001 From: GTK Date: Sun, 26 Jan 2020 21:06:10 -0500 Subject: [PATCH 55/82] Formatting changes --- test/query_executor/test_disk_storage_executor.py | 2 +- test/query_executor/test_plan_executor.py | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/test/query_executor/test_disk_storage_executor.py b/test/query_executor/test_disk_storage_executor.py index 87f9a9ce70..670c6eb140 100644 --- a/test/query_executor/test_disk_storage_executor.py +++ b/test/query_executor/test_disk_storage_executor.py @@ -18,7 +18,7 @@ from src.models.catalog.properties import VideoFormat from src.models.catalog.video_info import VideoMetaInfo from src.query_executor.disk_based_storage_executor import DiskStorageExecutor -from src.query_planner.storage_plan import StoragePlan +from src.planner.storage_plan import StoragePlan class DiskStorageExecutorTest(unittest.TestCase): diff --git a/test/query_executor/test_plan_executor.py b/test/query_executor/test_plan_executor.py index 862eab1916..09be31a0c8 100644 --- a/test/query_executor/test_plan_executor.py +++ b/test/query_executor/test_plan_executor.py @@ -19,8 +19,8 @@ from src.models.catalog.video_info import VideoMetaInfo from src.models.storage.batch import FrameBatch from src.query_executor.plan_executor import PlanExecutor -from src.query_planner.seq_scan_plan import SeqScanPlan -from src.query_planner.storage_plan import StoragePlan +from src.planner.seq_scan_plan import SeqScanPlan +from src.planner.storage_plan import StoragePlan class PlanExecutorTest(unittest.TestCase): From 603429b7ca7b4eb6c48e0069143b406af9040acf Mon Sep 17 00:00:00 2001 From: Sanjana Garg Date: Thu, 23 Jan 2020 23:10:02 -0500 Subject: [PATCH 56/82] Catalog - Moved classes DataFrameMetadata and DataFrameColumn to database models - Added a database configuration file and a Base Model class - Added methods for retrieving bindings and metadata object --- environment.yml | 3 + src/catalog/catalog_dataframes.py | 45 --------- src/catalog/catalog_manager.py | 105 ++++++++++++--------- 
src/catalog/column_type.py | 22 +++++ src/catalog/database.py | 111 ++++++++++++++++++++++ src/catalog/df_schema.py | 42 +++++++++ src/catalog/models/__init__.py | 14 +++ src/catalog/models/df_column.py | 89 ++++++++++++++++++ src/catalog/models/df_metadata.py | 70 ++++++++++++++ src/catalog/schema.py | 149 ------------------------------ src/catalog/utils.py | 80 ++++++++++++++++ src/configuration/dictionary.py | 3 +- test/catalog/test_schema.py | 18 ++-- 13 files changed, 501 insertions(+), 250 deletions(-) create mode 100644 src/catalog/column_type.py create mode 100644 src/catalog/database.py create mode 100644 src/catalog/df_schema.py create mode 100644 src/catalog/models/__init__.py create mode 100644 src/catalog/models/df_column.py create mode 100644 src/catalog/models/df_metadata.py delete mode 100644 src/catalog/schema.py create mode 100644 src/catalog/utils.py diff --git a/environment.yml b/environment.yml index 12de9f6dd8..40b389546c 100644 --- a/environment.yml +++ b/environment.yml @@ -20,6 +20,9 @@ dependencies: - pytorch - tensorboard - pillow=6.1 + - sqlalchemy + - pymysql + - sqlalchemy-utils - pip: - antlr4-python3-runtime==4.8 - petastorm diff --git a/src/catalog/catalog_dataframes.py b/src/catalog/catalog_dataframes.py index 9a96dd1a26..e9978151f4 100644 --- a/src/catalog/catalog_dataframes.py +++ b/src/catalog/catalog_dataframes.py @@ -12,48 +12,3 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. - -import os - -from src.configuration.dictionary import DATASET_DATAFRAME_NAME - -from src.storage.dataframe import create_dataframe -from src.storage.dataframe import DataFrameMetadata - -from src.catalog.schema import Column -from src.catalog.schema import ColumnType -from src.catalog.schema import Schema - - -def get_dataset_schema(): - column_1 = Column("dataset_id", ColumnType.INTEGER, False) - column_2 = Column("dataset_name", ColumnType.STRING, False) - - datset_df_schema = Schema("dataset_df_schema", - [column_1, column_2]) - return datset_df_schema - - -def load_catalog_dataframes(catalog_dir_url: str, - catalog_dictionary): - - dataset_file_url = os.path.join(catalog_dir_url, DATASET_DATAFRAME_NAME) - dataset_df_schema = get_dataset_schema() - dataset_catalog_entry = DataFrameMetadata(dataset_file_url, - dataset_df_schema) - - catalog_dictionary.update({DATASET_DATAFRAME_NAME: dataset_catalog_entry}) - - -def create_catalog_dataframes(catalog_dir_url: str, - catalog_dictionary): - - dataset_df_schema = get_dataset_schema() - dataset_file_url = os.path.join(catalog_dir_url, DATASET_DATAFRAME_NAME) - dataset_catalog_entry = DataFrameMetadata(dataset_file_url, - dataset_df_schema) - - create_dataframe(dataset_catalog_entry) - - # dataframe name : (schema, petastorm_schema, pyspark_schema) - catalog_dictionary.update({DATASET_DATAFRAME_NAME: dataset_catalog_entry}) diff --git a/src/catalog/catalog_manager.py b/src/catalog/catalog_manager.py index 3df42de26a..aa89675422 100644 --- a/src/catalog/catalog_manager.py +++ b/src/catalog/catalog_manager.py @@ -15,25 +15,17 @@ import os -from src.utils.logging_manager import LoggingManager -from src.utils.logging_manager import LoggingLevel - +from src.catalog.database import init_db +from src.catalog.df_schema import DataFrameSchema +from src.catalog.models.df_column import DataFrameColumn +from src.catalog.models.df_metadata import DataFrameMetadata from src.configuration.configuration_manager 
import ConfigurationManager from src.configuration.dictionary import CATALOG_DIR - -from urllib.parse import urlparse - -from src.catalog.catalog_dataframes import load_catalog_dataframes -from src.catalog.catalog_dataframes import create_catalog_dataframes - -from src.configuration.dictionary import DATASET_DATAFRAME_NAME - -from src.storage.dataframe import load_dataframe, get_next_row_id -from src.storage.dataframe import append_rows +from src.utils.logging_manager import LoggingLevel +from src.utils.logging_manager import LoggingManager class CatalogManager(object): - _instance = None _catalog = None _catalog_dictionary = {} @@ -52,35 +44,56 @@ def bootstrap_catalog(self): output_url = os.path.join(eva_dir, CATALOG_DIR) LoggingManager().log("Bootstrapping catalog" + str(output_url), LoggingLevel.INFO) - - # Construct output location - catalog_dir_url = os.path.join(eva_dir, "catalog") - - # Get filesystem path - catalog_os_path = urlparse(catalog_dir_url).path - - # Check if catalog exists - if os.path.exists(catalog_os_path): - # Load catalog if it exists - load_catalog_dataframes(catalog_dir_url, self._catalog_dictionary) - else: - # Create catalog if it does not exist - create_catalog_dataframes( - catalog_dir_url, self._catalog_dictionary) - - def create_dataset(self, dataset_name: str): - - dataset_catalog_entry = \ - self._catalog_dictionary.get(DATASET_DATAFRAME_NAME) - - dataset_df = \ - load_dataframe(dataset_catalog_entry.get_dataframe_file_url()) - - dataset_df.show(10) - - next_row_id = get_next_row_id(dataset_df, DATASET_DATAFRAME_NAME) - - row_1 = [next_row_id, dataset_name] - rows = [row_1] - - append_rows(dataset_catalog_entry, rows) + init_db() + # # Construct output location + # catalog_dir_url = os.path.join(eva_dir, "catalog") + # + # # Get filesystem path + # catalog_os_path = urlparse(catalog_dir_url).path + # + # # Check if catalog exists + # if os.path.exists(catalog_os_path): + # # Load catalog if it exists + # load_catalog_dataframes(catalog_dir_url, + # self._catalog_dictionary) + # else: + # # Create catalog if it does not exist + # create_catalog_dataframes( + # catalog_dir_url, self._catalog_dictionary) + + def get_bindings(self, database_name, table_name=None, column_name=None): + metadata_id = DataFrameMetadata.get_id_from_name(database_name) + table_id = None + column_id = None + if column_name is not None: + column_id = DataFrameColumn.get_id_from_metadata_id_and_name( + metadata_id, + column_name) + return metadata_id, table_id, column_id + + def get_metadata(self, metadata_id, col_id_list=[]): + metadata = DataFrameMetadata.get(metadata_id) + if len(col_id_list) > 0: + df_columns = DataFrameColumn.get_by_metadata_id_and_id_in( + col_id_list, + metadata_id) + metadata.set_schema( + DataFrameSchema(metadata.get_name(), df_columns)) + return metadata + + # def create_dataset(self, dataset_name: str): + # + # dataset_catalog_entry = \ + # self._catalog_dictionary.get(DATASET_DATAFRAME_NAME) + # + # dataset_df = \ + # load_dataframe(dataset_catalog_entry.get_dataframe_file_url()) + # + # dataset_df.show(10) + # + # next_row_id = get_next_row_id(dataset_df, DATASET_DATAFRAME_NAME) + # + # row_1 = [next_row_id, dataset_name] + # rows = [row_1] + # + # append_rows(dataset_catalog_entry, rows) diff --git a/src/catalog/column_type.py b/src/catalog/column_type.py new file mode 100644 index 0000000000..85ebe568d9 --- /dev/null +++ b/src/catalog/column_type.py @@ -0,0 +1,22 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, 
Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from enum import Enum + + +class ColumnType(Enum): + INTEGER = 1 + FLOAT = 2 + STRING = 3 + NDARRAY = 4 diff --git a/src/catalog/database.py b/src/catalog/database.py new file mode 100644 index 0000000000..46db7fa056 --- /dev/null +++ b/src/catalog/database.py @@ -0,0 +1,111 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from sqlalchemy import create_engine +from sqlalchemy.orm import sessionmaker, scoped_session +from src.configuration.dictionary import SQLALCHEMY_DATABASE_URI +from sqlalchemy.ext.declarative import declarative_base, declared_attr +from sqlalchemy.exc import DatabaseError +from sqlalchemy_utils import database_exists, create_database, drop_database + +engine = create_engine(SQLALCHEMY_DATABASE_URI) +db_session = scoped_session(sessionmaker(autocommit=False, autoflush=False, + bind=engine)) + + +class CustomBase(object): + """This overrides the default + `_declarative_constructor` constructor. + It skips the attributes that are not present + for the model, thus if a dict is passed with some + unknown attributes for the model on creation, + it won't complain for `unkwnown field`s. + """ + + def __init__(self, **kwargs): + cls_ = type(self) + for k in kwargs: + if hasattr(cls_, k): + setattr(self, k, kwargs[k]) + else: + continue + + """ + Set default tablename + """ + @declared_attr + def __tablename__(cls): + return cls.__name__.lower() + + """ + Add and try to flush. + """ + + def save(self): + db_session.add(self) + self._flush() + return self + + """ + Update and try to flush. + """ + + def update(self, **kwargs): + for attr, value in kwargs.items(): + if hasattr(self, attr): + setattr(self, attr, value) + return self.save() + + """ + Delete and try to flush. + """ + + def delete(self): + db_session.delete(self) + self._flush() + + """ + Try to flush. If an error is raised, + the session is rollbacked. + """ + + def _flush(self): + try: + db_session.flush() + except DatabaseError: + db_session.rollback() + + +BaseModel = declarative_base(cls=CustomBase, constructor=None) +BaseModel.query = db_session.query_property() + + +def init_db(): + """ + Create database if doesn't exist and + create all tables. + """ + if not database_exists(engine.url): + create_database(engine.url) + BaseModel.metadata.create_all(bind=engine) + + +def drop_db(): + """ + Drop all of the record from tables and the tables + themselves. + Drop the database as well. 
+ """ + BaseModel.metadata.drop_all(bind=engine) + drop_database(engine.url) diff --git a/src/catalog/df_schema.py b/src/catalog/df_schema.py new file mode 100644 index 0000000000..e12463764e --- /dev/null +++ b/src/catalog/df_schema.py @@ -0,0 +1,42 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import List + +from src.catalog.models.df_column import DataFrameColumn +from src.catalog.utils import Utils + + +class DataFrameSchema(object): + _name = None + _column_list = [] + _petastorm_schema = None + + def __init__(self, name: str, column_list: List[DataFrameColumn]): + + self._name = name + self._column_list = column_list + self._petastorm_schema = Utils.get_petastorm_schema(self._name, + self._column_list) + + def __str__(self): + schema_str = "SCHEMA:: (" + self._name + ")\n" + for column in self._column_list: + schema_str += str(column) + + return schema_str + + def get_petastorm_schema(self): + return self._petastorm_schema diff --git a/src/catalog/models/__init__.py b/src/catalog/models/__init__.py new file mode 100644 index 0000000000..e9978151f4 --- /dev/null +++ b/src/catalog/models/__init__.py @@ -0,0 +1,14 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/src/catalog/models/df_column.py b/src/catalog/models/df_column.py new file mode 100644 index 0000000000..a3eab51e24 --- /dev/null +++ b/src/catalog/models/df_column.py @@ -0,0 +1,89 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
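+
+# Catalog model: one row in `df_column` per dataframe column. Array
+# dimensions are persisted as a JSON-encoded string (default '[]'), so
+# callers read them back through get_array_dimensions().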
+import json +from enum import Enum +from typing import List + +from sqlalchemy import Column, String, Integer, Boolean + +from src.catalog.database import BaseModel +from src.catalog.column_type import ColumnType +from sqlalchemy.types import Enum + + +class DataFrameColumn(BaseModel): + __tablename__ = 'df_column' + + _id = Column('id', Integer, primary_key=True) + _name = Column('name', String(100)) + _type = Column('type', Enum(ColumnType), default=Enum) + _is_nullable = Column('is_nullable', Boolean, default=False) + _array_dimensions = Column('array_dimensions', String(100), default='[]') + _metadata_id = Column('dataframe_id', Integer) + + def __init__(self, + name: str, + type: ColumnType, + is_nullable: bool = False, + array_dimensions: List[int] = []): + self._name = name + self._type = type + self._is_nullable = is_nullable + self._array_dimensions = array_dimensions + + def get_name(self): + return self._name + + def get_type(self): + return self._type + + def is_nullable(self): + return self._is_nullable + + def get_array_dimensions(self): + return json.loads(self._array_dimensions) + + def set_array_dimensions(self, array_dimensions): + self._array_dimensions = str(array_dimensions) + + def __str__(self): + column_str = "\tColumn: (%s, %s, %s, " % (self._name, + self._type.name, + self._is_nullable) + + column_str += "[" + column_str += ', '.join(['%d'] * len(self._array_dimensions)) \ + % tuple(self._array_dimensions) + column_str += "] " + column_str += ")\n" + + return column_str + + @classmethod + def get_id_from_metadata_id_and_name(cls, metadata_id, name): + result = DataFrameColumn.query\ + .with_entities(DataFrameColumn._id)\ + .filter(DataFrameColumn._metadata_id == metadata_id, + DataFrameColumn._name == name)\ + .one() + return result + + @classmethod + def get_by_metadata_id_and_id_in(cls, id_list, metadata_id): + result = DataFrameColumn.query\ + .filter(DataFrameColumn._metadata_id == metadata_id, + DataFrameColumn._id.in_(id_list))\ + .all() + return result diff --git a/src/catalog/models/df_metadata.py b/src/catalog/models/df_metadata.py new file mode 100644 index 0000000000..cd50a82f27 --- /dev/null +++ b/src/catalog/models/df_metadata.py @@ -0,0 +1,70 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
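+
+# Catalog model: one row in `df_metadata` per dataframe/table. Only the id,
+# name, and file_url attributes are mapped columns; the petastorm and
+# pyspark schemas are derived in memory and are not persisted.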
+from sqlalchemy import Column, String, Integer + +from src.catalog.database import BaseModel + + +class DataFrameMetadata(BaseModel): + __tablename__ = 'df_metadata' + + _id = Column('id', Integer, primary_key=True) + _name = Column('name', String) + _file_url = Column('file_url', String) + + def __init__(self, dataframe_file_url, dataframe_schema): + self._file_url = dataframe_file_url + self._dataframe_schema = dataframe_schema + self._dataframe_petastorm_schema = \ + dataframe_schema.get_petastorm_schema() + self._dataframe_pyspark_schema = \ + self._dataframe_petastorm_schema.as_spark_schema() + + def set_schema(self, schema): + self._dataframe_schema = schema + self._dataframe_petastorm_schema = \ + schema.get_petastorm_schema() + self._dataframe_pyspark_schema = \ + self._dataframe_petastorm_schema.as_spark_schema() + + def get_id(self): + return self._id + + def get_dataframe_file_url(self): + return self._file_url + + def get_dataframe_schema(self): + return self._dataframe_schema + + def get_dataframe_petastorm_schema(self): + return self._dataframe_petastorm_schema + + def get_dataframe_pyspark_schema(self): + return self._dataframe_pyspark_schema + + @classmethod + def get_id_from_name(cls, name): + result = DataFrameMetadata.query \ + .with_entities(DataFrameMetadata._id) \ + .filter(DataFrameMetadata._name == name).one() + return result + + @classmethod + def get(cls, metadata_id): + result = DataFrameMetadata.query \ + .with_entities(DataFrameMetadata._id) \ + .filter(DataFrameMetadata._id == metadata_id) \ + .one() + return result diff --git a/src/catalog/schema.py b/src/catalog/schema.py deleted file mode 100644 index 773623e688..0000000000 --- a/src/catalog/schema.py +++ /dev/null @@ -1,149 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from enum import Enum -from typing import List - -import numpy as np - -from src.utils.logging_manager import LoggingManager -from src.utils.logging_manager import LoggingLevel - -from pyspark.sql.types import IntegerType, FloatType, StringType - -from petastorm.codecs import ScalarCodec -from petastorm.codecs import NdarrayCodec -from petastorm.unischema import Unischema, UnischemaField - - -class ColumnType(Enum): - INTEGER = 1 - FLOAT = 2 - STRING = 3 - NDARRAY = 4 - - -class Column(object): - - _name = None - _type = 0 - _is_nullable = False - _array_dimensions = [] - - def __init__(self, name: str, - type: ColumnType, - is_nullable: bool = False, - array_dimensions: List[int] = []): - self._name = name - self._type = type - self._is_nullable = is_nullable - self._array_dimensions = array_dimensions - - def get_name(self): - return self._name - - def get_type(self): - return self._type - - def is_nullable(self): - return self._is_nullable - - def get_array_dimensions(self): - return self._array_dimensions - - def __str__(self): - column_str = "\tColumn: (%s, %s, %s, " % (self._name, - self._type.name, - self._is_nullable) - - column_str += "[" - column_str += ', '.join(['%d'] * len(self._array_dimensions))\ - % tuple(self._array_dimensions) - column_str += "] " - column_str += ")\n" - - return column_str - - -def get_petastorm_column(column): - - column_type = column.get_type() - column_name = column.get_name() - column_is_nullable = column.is_nullable() - column_array_dimensions = column.get_array_dimensions() - - # Reference: - # https://github.com/uber/petastorm/blob/master/petastorm/ - # tests/test_common.py - - if column_type == ColumnType.INTEGER: - petastorm_column = UnischemaField(column_name, - np.int32, - (), - ScalarCodec(IntegerType()), - column_is_nullable) - elif column_type == ColumnType.FLOAT: - petastorm_column = UnischemaField(column_name, - np.float64, - (), - ScalarCodec(FloatType()), - column_is_nullable) - elif column_type == ColumnType.STRING: - petastorm_column = UnischemaField(column_name, - np.string_, - (), - ScalarCodec(StringType()), - column_is_nullable) - elif column_type == ColumnType.NDARRAY: - petastorm_column = UnischemaField(column_name, - np.uint8, - column_array_dimensions, - NdarrayCodec(), - column_is_nullable) - else: - LoggingManager().log("Invalid column type: " + str(column_type), - LoggingLevel.ERROR) - - return petastorm_column - - -class Schema(object): - - _schema_name = None - _column_list = [] - _petastorm_schema = None - - def __init__(self, schema_name: str, column_list: List[Column]): - - self._schema_name = schema_name - self._column_list = column_list - - petastorm_column_list = [] - for _column in self._column_list: - petastorm_column = get_petastorm_column(_column) - petastorm_column_list.append(petastorm_column) - - self._petastorm_schema = Unischema(self._schema_name, - petastorm_column_list) - - def __str__(self): - schema_str = "SCHEMA:: (" + self._schema_name + ")\n" - for column in self._column_list: - schema_str += str(column) - - return schema_str - - def get_petastorm_schema(self): - return self._petastorm_schema diff --git a/src/catalog/utils.py b/src/catalog/utils.py new file mode 100644 index 0000000000..34bf879430 --- /dev/null +++ b/src/catalog/utils.py @@ -0,0 +1,80 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +from petastorm.codecs import NdarrayCodec +from petastorm.codecs import ScalarCodec +from petastorm.unischema import Unischema +from petastorm.unischema import UnischemaField +from pyspark.sql.types import IntegerType, FloatType, StringType + +from src.catalog.column_type import ColumnType +from src.utils.logging_manager import LoggingLevel +from src.utils.logging_manager import LoggingManager + + +class Utils(object): + + @staticmethod + def get_petastorm_column(df_column): + + column_type = df_column.get_type() + column_name = df_column.get_name() + column_is_nullable = df_column.is_nullable() + column_array_dimensions = df_column.get_array_dimensions() + + # Reference: + # https://github.com/uber/petastorm/blob/master/petastorm/ + # tests/test_common.py + + petastorm_column = None + if column_type == ColumnType.INTEGER: + petastorm_column = UnischemaField(column_name, + np.int32, + (), + ScalarCodec(IntegerType()), + column_is_nullable) + elif column_type == ColumnType.FLOAT: + petastorm_column = UnischemaField(column_name, + np.float64, + (), + ScalarCodec(FloatType()), + column_is_nullable) + elif column_type == ColumnType.STRING: + petastorm_column = UnischemaField(column_name, + np.string_, + (), + ScalarCodec(StringType()), + column_is_nullable) + elif column_type == ColumnType.NDARRAY: + petastorm_column = UnischemaField(column_name, + np.uint8, + column_array_dimensions, + NdarrayCodec(), + column_is_nullable) + else: + LoggingManager().log("Invalid column type: " + str(column_type), + LoggingLevel.ERROR) + + return petastorm_column + + @staticmethod + def get_petastorm_schema(name, column_list): + petastorm_column_list = [] + for _column in column_list: + petastorm_column = Utils.get_petastorm_column(_column) + petastorm_column_list.append(petastorm_column) + + petastorm_schema = Unischema(name, petastorm_column_list) + return petastorm_schema diff --git a/src/configuration/dictionary.py b/src/configuration/dictionary.py index 0739fb3097..6a5751f3c9 100644 --- a/src/configuration/dictionary.py +++ b/src/configuration/dictionary.py @@ -14,4 +14,5 @@ # limitations under the License. CATALOG_DIR = "catalog" -DATASET_DATAFRAME_NAME = "dataset" \ No newline at end of file +DATASET_DATAFRAME_NAME = "dataset" +SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:root@localhost/eva_catalog' diff --git a/test/catalog/test_schema.py b/test/catalog/test_schema.py index fd362b03e1..2c42ca6008 100644 --- a/test/catalog/test_schema.py +++ b/test/catalog/test_schema.py @@ -14,9 +14,9 @@ # limitations under the License. 
 import unittest
 
-from src.catalog.schema import ColumnType
-from src.catalog.schema import Column
-from src.catalog.schema import Schema
+from src.catalog.column_type import ColumnType
+from src.catalog.df_schema import DataFrameSchema
+from src.catalog.models.df_column import DataFrameColumn
 
 
 class SchemaTests(unittest.TestCase):
@@ -25,14 +25,15 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
 
     def test_schema(self):
         schema_name = "foo"
-        column_1 = Column("frame_id", ColumnType.INTEGER, False)
-        column_2 = Column("frame_data", ColumnType.NDARRAY, False, [28, 28])
-        column_3 = Column("frame_label", ColumnType.INTEGER, False)
+        column_1 = DataFrameColumn("frame_id", ColumnType.INTEGER, False)
+        column_2 = DataFrameColumn("frame_data", ColumnType.NDARRAY, False,
+                                   [28, 28])
+        column_3 = DataFrameColumn("frame_label", ColumnType.INTEGER, False)
 
-        schema = Schema(schema_name,
-                        [column_1, column_2, column_3])
+        schema = DataFrameSchema(schema_name,
+                                 [column_1, column_2, column_3])
 
         self.assertEqual(schema._column_list[0].get_name(), "frame_id")

From 15671b9c741638b981c7f169e9e2a937a39df902 Mon Sep 17 00:00:00 2001
From: Sanjana Garg
Date: Thu, 30 Jan 2020 02:09:07 -0500
Subject: [PATCH 57/82] Fixed test case using mock

---
 src/catalog/models/df_column.py      |  2 +-
 src/configuration/dictionary.py      |  2 +-
 test/catalog/test_catalog_manager.py | 18 ++++++++++++++----
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/src/catalog/models/df_column.py b/src/catalog/models/df_column.py
index a3eab51e24..fd46591963 100644
--- a/src/catalog/models/df_column.py
+++ b/src/catalog/models/df_column.py
@@ -41,7 +41,7 @@ def __init__(self,
         self._name = name
         self._type = type
         self._is_nullable = is_nullable
-        self._array_dimensions = array_dimensions
+        self._array_dimensions = str(array_dimensions)
 
     def get_name(self):
         return self._name
diff --git a/src/configuration/dictionary.py b/src/configuration/dictionary.py
index 6a5751f3c9..6fa744baa2 100644
--- a/src/configuration/dictionary.py
+++ b/src/configuration/dictionary.py
@@ -15,4 +15,4 @@
 
 CATALOG_DIR = "catalog"
 DATASET_DATAFRAME_NAME = "dataset"
-SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:root@localhost/eva_catalog'
+SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:fafa@localhost/eva_catalog'
diff --git a/test/catalog/test_catalog_manager.py b/test/catalog/test_catalog_manager.py
index 129075b7b2..d4d411199b 100644
--- a/test/catalog/test_catalog_manager.py
+++ b/test/catalog/test_catalog_manager.py
@@ -13,9 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
import unittest +import mock import logging from src.catalog.catalog_manager import CatalogManager +from src.configuration.configuration_manager import ConfigurationManager from src.spark.session import Session @@ -29,21 +31,29 @@ class CatalogManagerTests(unittest.TestCase): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) + # @mock.patch.object(ConfigurationManager, + # 'get_value') def setUp(self): suppress_py4j_logging() + # mocked_cm.return_value = 'abc' def tearDown(self): self.session = Session() self.session.stop() - def test_catalog_manager_singleton_pattern(self): + + @mock.patch('src.catalog.catalog_manager.init_db') + @mock.patch('src.catalog.catalog_manager.ConfigurationManager') + def test_catalog_manager_singleton_pattern(self, mocked_cm, mocked_db): + mocked_cm.get_value('core', 'location').return_value = 'abc' + mocked_cm.get_value.assert_called_once_with('core', 'location') x = CatalogManager() y = CatalogManager() self.assertEqual(x, y) - x.create_dataset("foo") - x.create_dataset("bar") - x.create_dataset("baz") + # x.create_dataset("foo") + # x.create_dataset("bar") + # x.create_dataset("baz") if __name__ == '__main__': From d321493a0fdbc780176616821389c00ebfd0ea6d Mon Sep 17 00:00:00 2001 From: Sanjana Garg Date: Thu, 30 Jan 2020 02:17:26 -0500 Subject: [PATCH 58/82] Added mock to environment.yml --- environment.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/environment.yml b/environment.yml index 40b389546c..bd29d020df 100644 --- a/environment.yml +++ b/environment.yml @@ -23,6 +23,7 @@ dependencies: - sqlalchemy - pymysql - sqlalchemy-utils + - mock - pip: - antlr4-python3-runtime==4.8 - petastorm From ce6de29ea28b3e8a794e813706faf46c1f83d62e Mon Sep 17 00:00:00 2001 From: Sanjana Garg Date: Thu, 30 Jan 2020 12:00:54 -0500 Subject: [PATCH 59/82] Modified binding call --- src/catalog/catalog_manager.py | 38 ++++++++++++++++++++-------- src/catalog/models/df_column.py | 6 ++--- test/catalog/test_catalog_manager.py | 5 ++-- 3 files changed, 33 insertions(+), 16 deletions(-) diff --git a/src/catalog/catalog_manager.py b/src/catalog/catalog_manager.py index aa89675422..642dd96bd4 100644 --- a/src/catalog/catalog_manager.py +++ b/src/catalog/catalog_manager.py @@ -14,6 +14,7 @@ # limitations under the License. 
 import os
+from typing import List, Tuple
 
 from src.catalog.database import init_db
 from src.catalog.df_schema import DataFrameSchema
 from src.catalog.models.df_column import DataFrameColumn
 from src.catalog.models.df_metadata import DataFrameMetadata
@@ -61,19 +62,36 @@ def bootstrap_catalog(self):
         #     create_catalog_dataframes(
         #         catalog_dir_url, self._catalog_dictionary)
 
-    def get_bindings(self, database_name, table_name=None, column_name=None):
-        metadata_id = DataFrameMetadata.get_id_from_name(database_name)
-        table_id = None
-        column_id = None
-        if column_name is not None:
-            column_id = DataFrameColumn.get_id_from_metadata_id_and_name(
+    def get_table_bindings(self, database_name: str, table_name: str,
+                           column_names: List[str]) -> Tuple[int, List[int]]:
+        """
+        This method fetches bindings for the given table and column names
+        :param database_name: currently not in use
+        :param table_name: the table that is being referred to
+        :param column_names: the column names of the table for which
+        bindings are required
+        :return: returns metadata_id of table and a list of column ids
+        """
+        metadata_id = DataFrameMetadata.get_id_from_name(table_name)
+        column_ids = []
+        if column_names is not None:
+            column_ids = DataFrameColumn.get_id_from_metadata_id_and_name_in(
                 metadata_id,
-                column_name)
-        return metadata_id, table_id, column_id
+                column_names)
+        return metadata_id, column_ids
 
-    def get_metadata(self, metadata_id, col_id_list=[]):
+    def get_metadata(self, metadata_id: int,
+                     col_id_list: List[int] = None) -> DataFrameMetadata:
+        """
+        This method returns the metadata object given a metadata_id,
+        when requested by the executor. It will further be used by storage
+        engine for retrieving the dataframe.
+        :param metadata_id: metadata id of the table
+        :param col_id_list: optional column ids of the table referred
+        :return: the bound DataFrameMetadata object
+        """
         metadata = DataFrameMetadata.get(metadata_id)
-        if len(col_id_list) > 0:
+        if col_id_list is not None:
             df_columns = DataFrameColumn.get_by_metadata_id_and_id_in(
                 col_id_list,
                 metadata_id)
diff --git a/src/catalog/models/df_column.py b/src/catalog/models/df_column.py
index fd46591963..cfcc707a54 100644
--- a/src/catalog/models/df_column.py
+++ b/src/catalog/models/df_column.py
@@ -72,12 +72,12 @@ def __str__(self):
         return column_str
 
     @classmethod
-    def get_id_from_metadata_id_and_name(cls, metadata_id, name):
+    def get_id_from_metadata_id_and_name_in(cls, metadata_id, column_names):
         result = DataFrameColumn.query\
             .with_entities(DataFrameColumn._id)\
             .filter(DataFrameColumn._metadata_id == metadata_id,
-                    DataFrameColumn._name == name)\
-            .one()
+                    DataFrameColumn._name.in_(column_names))\
+            .all()
         return result
 
     @classmethod
diff --git a/test/catalog/test_catalog_manager.py b/test/catalog/test_catalog_manager.py
index d4d411199b..6910ee8414 100644
--- a/test/catalog/test_catalog_manager.py
+++ b/test/catalog/test_catalog_manager.py
@@ -12,12 +12,12 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import logging
 import unittest
+
 import mock
 
 from src.catalog.catalog_manager import CatalogManager
-from src.configuration.configuration_manager import ConfigurationManager
 from src.spark.session import Session
 
 
@@ -41,7 +41,6 @@ def tearDown(self):
         self.session = Session()
         self.session.stop()
 
-
     @mock.patch('src.catalog.catalog_manager.init_db')
     @mock.patch('src.catalog.catalog_manager.ConfigurationManager')
     def test_catalog_manager_singleton_pattern(self, mocked_cm, mocked_db):

From 1b383b69406fa64fb68d82d54584bbf426d1344c Mon Sep 17 00:00:00 2001
From: GTK
Date: Thu, 30 Jan 2020 16:46:51 -0500
Subject: [PATCH 60/82] Added support for catalog bindings in the tuple
 expression

---
 src/expression/tuple_value_expression.py | 41 ++++++++++++++++--------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/src/expression/tuple_value_expression.py b/src/expression/tuple_value_expression.py
index 177e24e653..1a16576d3a 100644
--- a/src/expression/tuple_value_expression.py
+++ b/src/expression/tuple_value_expression.py
@@ -17,24 +17,38 @@
 
 
 class TupleValueExpression(AbstractExpression):
-    def __init__(self, col_idx: int = None, col_name: str = None):
-        # setting return type to be invalid not sure if that is correct
-        # no child so that is okay
+    def __init__(self, col_name: str = None, table_name: str = None):
         super().__init__(ExpressionType.TUPLE_VALUE,
                          rtype=ExpressionReturnType.INVALID)
         self._col_name = col_name
-        # todo
-        self._table_name = None
-        self._col_idx = col_idx
+        self._table_name = table_name
+        self._table_metadata_id = None
+        self._column_metadata_id = None
+
+    @property
+    def table_metadata_id(self) -> int:
+        return self._table_metadata_id
 
-    # def evaluate(AbstractTuple tuple1, AbstractTuple tuple2):
+    @property
+    def col_metadata_id(self) -> int:
+        return self._column_metadata_id
 
-    # don't know why are we getting 2 tuples
-    # comments added to abstract class,
-    # maybe we should move to *args
-
-    # assuming tuple1 to be valid
+    @table_metadata_id.setter
+    def table_metadata_id(self, id: int):
+        self._table_metadata_id = id
 
+    @col_metadata_id.setter
+    def col_metadata_id(self, id: int):
+        self._column_metadata_id = id
+
+    @property
+    def table_name(self) -> str:
+        return self._table_name
+
+    @property
+    def col_name(self) -> str:
+        return self._col_name
+
     # remove this once done with tuple class
     def evaluate(self, *args):
         tuple1 = None
@@ -44,5 +58,4 @@ def evaluate(self, *args):
         tuple1 = args[0]
         return tuple1[(self._col_idx)]
 
-    # ToDo
-    # implement other boilerplate functionality
+    
\ No newline at end of file

From 9807506ea745fb2f137baef9b47c29776ee4b598 Mon Sep 17 00:00:00 2001
From: GTK
Date: Thu, 30 Jan 2020 17:00:38 -0500
Subject: [PATCH 61/82] 1. Parsed statement converted to Logical Operator. 2.
 Logical plan tree built using operators. 3. Util functionality for binding
 tables, columns with catalog.
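
For reference, a minimal sketch of the conversion flow introduced below.
Names are taken from the new files in this patch; `select_stmt` is assumed
to be a SelectStatement produced by the EVA parser (a hypothetical input,
not part of this commit):

    # Converting "SELECT col FROM video WHERE pred" yields the chain
    # LogicalProject -> LogicalFilter -> LogicalGet.
    convertor = StatementToPlanConvertor()
    convertor.visit(select_stmt)

    root = convertor.plan             # LogicalProject over the target list
    filter_opr = root.children[0]     # LogicalFilter holding the bound predicate
    get_opr = filter_opr.children[0]  # LogicalGet bound to the video via the catalog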
---
 src/optimizer/operators.py                  | 75 +++++++++++++++++++
 src/optimizer/optimizer_utils.py            | 56 ++++++++++++++
 src/optimizer/statement_to_opr_convertor.py | 82 +++++++++++++++++++++
 3 files changed, 213 insertions(+)
 create mode 100644 src/optimizer/operators.py
 create mode 100644 src/optimizer/optimizer_utils.py
 create mode 100644 src/optimizer/statement_to_opr_convertor.py

diff --git a/src/optimizer/operators.py b/src/optimizer/operators.py
new file mode 100644
index 0000000000..ffb03eb3d2
--- /dev/null
+++ b/src/optimizer/operators.py
@@ -0,0 +1,75 @@
+# coding=utf-8
+# Copyright 2018-2020 EVA
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from enum import IntEnum, unique
+from typing import List
+from src.parser.table_ref import TableRef
+
+
+@unique
+class OperatorType(IntEnum):
+    """
+    Manages enums for all the operators supported
+    """
+    LOGICALGET = 1,
+    LOGICALFILTER = 2,
+    LOGICALPROJECT = 3,
+
+
+class Operator:
+    """Base class for logical plan of operators
+
+    Arguments:
+        op_type: {OperatorType} -- {the opr type held by this node}
+        children: {List} -- {the list of operator children for this node}
+    """
+
+    def __init__(self, op_type: OperatorType, children: List):
+        self._type = op_type
+        self._children = children
+
+    def append_child(self, child: 'Operator'):
+        if self._children is None:
+            self._children = []
+
+        self._children.append(child)
+
+    @property
+    def children(self):
+        return self._children
+
+    @property
+    def type(self):
+        return self._type
+
+
+class LogicalGet(Operator):
+    def __init__(self, video: TableRef, catalog_entry: 'type',
+                 children: List = None):
+        super().__init__(OperatorType.LOGICALGET, children)
+        self._video = video
+        self._catalog_entry = catalog_entry
+
+
+class LogicalFilter(Operator):
+    def __init__(self, predicate: 'AbstractExpression', children: List = None):
+        super().__init__(OperatorType.LOGICALFILTER, children)
+        self._predicate = predicate
+
+
+class LogicalProject(Operator):
+    def __init__(self, target_list: List['AbstractExpression'],
+                 children: List = None):
+        super().__init__(OperatorType.LOGICALPROJECT, children)
+        self._target_list = target_list
diff --git a/src/optimizer/optimizer_utils.py b/src/optimizer/optimizer_utils.py
new file mode 100644
index 0000000000..f450f01e1f
--- /dev/null
+++ b/src/optimizer/optimizer_utils.py
@@ -0,0 +1,56 @@
+from src.parser.table_ref import TableInfo
+from src.catalog.catalog_manager import CatalogManager
+from typing import List
+from src.expression.tuple_value_expression import ExpressionType
+
+
+def bind_table_ref(video_info: TableInfo) -> int:
+    """Grab the metadata id from the catalog for
+    input video
+
+    Arguments:
+        video_info {TableInfo} -- [input parsed video info]
+    Return:
+        catalog_entry for input table
+    """
+
+    catalog = CatalogManager()
+    catalog_entry_id, _ = catalog.get_table_bindings(video_info.database_name,
+                                                     video_info.table_name,
+                                                     None)
+    return catalog_entry_id
+
+
+def bind_columns_expr(target_columns: List['AbstractExpression']):
+    if target_columns is None:
+        return
+
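+    # Depth-first binding: bind each target's child expressions first, then
+    # bind the target itself when it is a tuple-value (column) reference.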
for column_exp in target_columns: + child_count = column_exp.get_children_count() + for i in range(child_count): + bind_columns_expr([column_exp.get_child(i)]) + + if column_exp.etype == ExpressionType.TUPLE_VALUE: + bind_tuple_value_expr(column_exp) + + +def bind_tuple_value_expr(expr: 'AbstractExpression'): + catalog = CatalogManager() + table_id, column_ids = catalog.get_table_bindings(None, + expr.table_name, + expr.col_name) + expr.table_metadata_id = table_id + expr.col_metadata_id = column_ids.pop() + + +def bind_predicate_expr(predicate: 'AbstractExpression'): + # This function will be expanded as we add support for + # complex predicate expressions and sub select predicates + + child_count = predicate.get_children_count() + for i in range(child_count): + bind_predicate_expr(predicate.get_child(i)) + + if predicate.etype == ExpressionType.TUPLE_VALE: + bind_tuple_value_expr(predicate) + \ No newline at end of file diff --git a/src/optimizer/statement_to_opr_convertor.py b/src/optimizer/statement_to_opr_convertor.py new file mode 100644 index 0000000000..8e2c27dcff --- /dev/null +++ b/src/optimizer/statement_to_opr_convertor.py @@ -0,0 +1,82 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from src.optimizer.operators import LogicalGet, LogicalFilter, LogicalProject +from src.parser.eva_statement import EvaStatement +from src.parser.select_statement import SelectStatement +from src.optimizer.optimizer_utils import (bind_table_ref, bind_columns_expr, + bind_predicate_expr) + + +class StatementToPlanConvertor(): + def __init__(self): + self._plan = None + + def visit_table_ref(self, video: 'TableRef'): + """Bind table ref object and convert to Logical get operator + + Arguments: + video {TableRef} -- [Input table ref object created by the parser] + """ + catalog_vid_metadata_id = bind_table_ref(video.info) + + get_opr = LogicalGet(video, catalog_vid_metadata_id) + self._plan = get_opr + + def visit_select(self, statement: EvaStatement): + """convertor for select statement + + Arguments: + statement {EvaStatement} -- [input select statement] + """ + # Create a logical get node + video = statement.from_table + if video is not None: + self.visit_table_ref(video) + + # Filter Operator + predicate = statement.where_clause + if predicate is not None: + # Binding the expression + bind_predicate_expr(predicate) + filter_opr = LogicalFilter(predicate) + filter_opr.append_child(self._plan) + self._plan = filter_opr + + # Projection operator + select_columns = statement.target_list + + # ToDO + # add support for SELECT STAR + if select_columns is not None: + # Bind the columns using catalog + bind_columns_expr(select_columns) + projection_opr = LogicalProject(select_columns) + projection_opr.append_child(self._plan) + self._plan = projection_opr + + def visit(self, statement: EvaStatement): + """Based on the instance of the statement the corresponding + visit is called. + The logic is hidden from client. 
+ + Arguments: + statement {EvaStatement} -- [Input statement] + """ + if isinstance(statement, SelectStatement): + self.visit_select(statement) + + @property + def plan(self): + return self._plan From 902a2678e223af6fef453310791da31b37266011 Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 17:01:51 -0500 Subject: [PATCH 62/82] Directory renamed to optimizer --- src/query_optimizer/__init__.py | 14 - src/query_optimizer/qo_minimum.py | 475 -------------- src/query_optimizer/qo_template.py | 43 -- src/query_optimizer/query_optimizer.py | 595 ------------------ src/query_optimizer/query_optimizer.py.bak | 510 --------------- src/query_optimizer/tests/__init__.py | 14 - .../tests/query_optimizer_test_pytest.py.bak | 65 -- 7 files changed, 1716 deletions(-) delete mode 100644 src/query_optimizer/__init__.py delete mode 100644 src/query_optimizer/qo_minimum.py delete mode 100644 src/query_optimizer/qo_template.py delete mode 100644 src/query_optimizer/query_optimizer.py delete mode 100644 src/query_optimizer/query_optimizer.py.bak delete mode 100644 src/query_optimizer/tests/__init__.py delete mode 100644 src/query_optimizer/tests/query_optimizer_test_pytest.py.bak diff --git a/src/query_optimizer/__init__.py b/src/query_optimizer/__init__.py deleted file mode 100644 index e9978151f4..0000000000 --- a/src/query_optimizer/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/src/query_optimizer/qo_minimum.py b/src/query_optimizer/qo_minimum.py deleted file mode 100644 index 60ea2b22da..0000000000 --- a/src/query_optimizer/qo_minimum.py +++ /dev/null @@ -1,475 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -This file implements the minimum query optimizer -TODO: Currently there seems to be a importing issue that I am not sure how to -solve -Error Message: query_optimizer is not a package in line __ -We will add support for pyparse -We need to fix a bug / make sure the outputs are correct - -@Jaeho Bang -""" - -from itertools import product - -import numpy as np - -from src import constants -from query_optimizer.qo_template import QOTemplate - - -class QOMinimum(QOTemplate): - - def __init__(self): - # later add support for pyparsing.. 
could definitely help with - # parsing operations - self.operators = ["!=", ">=", "<=", "=", "<", ">"] - self.separators = ["||", "&&"] - - def executeQueries(self, queries: list): - - synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", - "c=red", "c=white", "c=black", "c=silver", - "s>40", "s>50", "s>60", "s<65", "s<70", - "i=pt335", "i=pt211", "i=pt342", "i=pt208", - "o=pt335", "o=pt211", "o=pt342", "o=pt208"] - - synthetic_pp_stats = { - "t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, - "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, - "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, - "t=suv": {"none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, - "t=sedan": {"none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, - "t=truck": {"none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, - - "c=red": {"none/svm": {"R": 0.131, "C": 0.011, "A": 0.951}}, - "c=white": {"none/svm": {"R": 0.212, "C": 0.012, "A": 0.942}}, - "c=black": {"none/svm": {"R": 0.133, "C": 0.013, "A": 0.953}}, - "c=silver": {"none/svm": {"R": 0.214, "C": 0.014, "A": 0.944}}, - - "s>40": {"none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, - "s>50": {"none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, - - "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, - "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, - - "s<65": {"none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, - "s<70": {"none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, - - "o=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, - - "o=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, - - "o=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, - - "o=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}, - - "i=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, - - "i=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, - - "i=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, - - "i=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}} - - # TODO: We will need to convert the queries/labels into "car, bus, - # van, others". 
This is how the dataset defines things - - label_desc = { - "t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], - "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], - "c": [constants.DISCRETE, ["white", "red", "black", "silver"]], - "i": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]], - "o": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]]} - - print("Running Query Optimizer Demo...") - - execution_plans = [] - for query in queries: - execution_plans.append( - self.run(query, synthetic_pp_list, synthetic_pp_stats, - label_desc)) - - return execution_plans - - def run(self, query, pp_list, pp_stats, label_desc, k=3, - accuracy_budget=0.9): - """ - - :param query: query of interest ex) TRAF-20 - :param pp_list: list of pp_descriptions - queries that are available - :param pp_stats: this will be dictionary where keys are "pca/ddn", - it will have statistics saved which are R ( - reduction_rate), C (cost_to_train), A (accuracy) - :param k: number of different PPs that are in any expression E - :return: selected PPs to use for reduction - """ - query_transformed, query_operators = self._wrangler(query, label_desc) - # query_transformed is a comprehensive list of transformed queries - return self._compute_expression([query_transformed, query_operators], - pp_list, pp_stats, k, accuracy_budget) - - def _findParenthesis(self, query): - - start = [] - end = [] - query_copy = query - index = query_copy.find("(") - while index != -1: - start.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find("(") - - query_copy = query - index = query_copy.find(")") - while index != -1: - end.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find(")") - - return [start, end] - - def _parseQuery(self, query): - """ - Each sub query will be a list - There will be a separator in between - :param query: - :return: - """ - - query_parsed = [] - query_subs = query.split(" ") - query_operators = [] - for query_sub in query_subs: - if query_sub == "||" or query_sub == "&&": - query_operators.append(query_sub) - else: - - if True not in [operator in self.operators for operator in - query_sub]: - return [], [] - for operator in self.operators: - query_sub_list = query_sub.split(operator) - if isinstance(query_sub_list, list) and len( - query_sub_list) > 1: - query_parsed.append( - [query_sub_list[0], operator, query_sub_list[1]]) - break - # query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] - # query_operators ex: ["||", "||", "&&"] - return query_parsed, query_operators - - def _logic_reverse(self, str): - if str == "=": - return "!=" - elif str == "!=": - return "=" - elif str == ">": - return "<=" - elif str == ">=": - return "<" - elif str == "<": - return ">=" - elif str == "<=": - return ">" - - def convertL2S(self, parsed_query, query_ops): - final_str = "" - index = 0 - for sub_parsed_query in parsed_query: - if len(parsed_query) >= 2 and index < len(query_ops): - final_str += ''.join(sub_parsed_query) + " " + query_ops[ - index] + " " - index += 1 - else: - final_str += ''.join(sub_parsed_query) - return final_str - - def _wrangler(self, query, label_desc): - """ - import itertools - iterables = [ [1,2,3,4], [88,99], ['a','b'] ] - for t in itertools.product(*iterables): - print t - - Different types of checks are performed - 1. not equals check (f(C) != v) - 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) - 3. Range check (v1 <= f(C) <= v2) - special type of comparison check - 4. 
No-predicates = when column in finite and discrete, it can still - benefit - ex) 1 <=> type = car U type = truck U type = SUV - :return: transformed query - """ - # TODO: Need to implement range check - - query_parsed, query_operators = self._parseQuery(query) - # query_sorted = sorted(query_parsed) - - query_transformed = [] - equivalences = [] - - for query_sub_list in query_parsed: - subject = query_sub_list[0] - operator = query_sub_list[1] - object = query_sub_list[2] - - assert (subject in label_desc) # Label should be in label - # description dictionary - l_desc = label_desc[subject] - if l_desc[0] == constants.DISCRETE: - equivalence = [self.convertL2S([query_sub_list], [])] - assert (operator == "=" or operator == "!=") - alternate_string = "" - for category in l_desc[1]: - if category != object: - alternate_string += subject + self._logic_reverse( - operator) + category + " && " - # must strip the last ' || ' - alternate_string = alternate_string[:-len(" && ")] - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - elif l_desc[0] == constants.CONTINUOUS: - - equivalence = [self.convertL2S([query_sub_list], [])] - assert (operator == "=" or operator == "!=" or operator == "<" - or operator == "<=" or operator == ">" or operator == - ">=") - alternate_string = "" - if operator == "!=": - alternate_string += subject + ">" + object + " && " + \ - subject + "<" + object - query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(query_tmp) - if operator == "<" or operator == "<=": - object_num = eval(object) - for number in l_desc[1]: - if number > object_num: - alternate_string = subject + operator + str(number) - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - if operator == ">" or operator == ">=": - object_num = eval(object) - for number in l_desc[1]: - if number < object_num: - alternate_string = subject + operator + str(number) - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - equivalences.append(equivalence) - - possible_queries = product(*equivalences) - for q in possible_queries: - query_transformed.append(q) - - return query_transformed, query_operators - - def _compute_expression(self, query_info, pp_list, pp_stats, k, - accuracy_budget): - """ - - def QueryOptimizer(P, {trained PPs}): - P = wrangler(P) - {E} = compute_expressions(P,{trained PP},k) #k is a fixed - constant which limits number of individual PPs - in the final expression - for E in {E}: - Explore_PP_accuracy_budget(E) # Paper says dynamic program - Explore_PP_Orderings(E) #if k is small, any number of orders - can be explored - Compute_cost_vs_red_rate(E) #arithmetic over individual c, - a and r[a] numbers - return E_with_max_c/r - - - 1. p^(P/p) -> PPp - 2. PPp^q -> PPp ^ PPq - 3. PPpvq -> PPp v PPq - 4. 
p^(P/p) -> ~PP~q - -> we don't need to apply these rules, we simply need to see for each - sub query which PP gives us the best rate - :param query_info: [possible query forms for a given query, operators - that go in between] - :param pp_list: list of pp names that are currently available - :param pp_stats: list of pp models associated with each pp name with - R,C,A values saved - :param k: number of pps we can use at maximum - :return: the list of pps to use that maximizes reduction rate (ATM) - """ - evaluations = [] - evaluation_models = [] - evaluations_stats = [] - query_transformed, query_operators = query_info - # query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] - for possible_query in query_transformed: - evaluation = [] - evaluation_stats = [] - k_count = 0 - op_index = 0 - for query_sub in possible_query: # Even inside query_sub it can - # be divided into query_sub_sub - if k_count > k: # TODO: If you exceed a certain number, - # you just ignore the expression - evaluation = [] - evaluation_stats = [] - continue - query_sub_list, query_sub_operators = self._parseQuery( - query_sub) - evaluation_tmp = [] - evaluation_models_tmp = [] - evaluation_stats_tmp = [] - for i in range(len(query_sub_list)): - query_sub_str = ''.join(query_sub_list[i]) - if query_sub_str in pp_list: - # Find the best model for the pp - - data = self._find_model(query_sub_str, pp_stats, - accuracy_budget) - if data is None: - continue - else: - model, reduction_rate = data - evaluation_tmp.append(query_sub_str) - evaluation_models_tmp.append( - model) # TODO: We need to make sure this is - # the model_name - evaluation_stats_tmp.append(reduction_rate) - k_count += 1 - - reduc_rate = 0 - if len(evaluation_stats_tmp) != 0: - reduc_rate = self._update_stats(evaluation_stats_tmp, - query_sub_operators) - - evaluation.append(query_sub) - evaluation_models.append(evaluation_models_tmp) - evaluation_stats.append(reduc_rate) - op_index += 1 - - evaluations.append(self.convertL2S(evaluation, query_operators)) - evaluations_stats.append( - self._update_stats(evaluation_stats, query_operators)) - - max_index = np.argmax(np.array(evaluations_stats), axis=0) - best_query = evaluations[ - max_index] # this will be something like "t!=bus && t!=truck && - # t!=car" - best_models = evaluation_models[max_index] - best_reduction_rate = evaluations_stats[max_index] - - pp_names, op_names = self._convertQuery2PPOps(best_query) - return [list(zip(pp_names, best_models)), op_names, - best_reduction_rate] - - def _convertQuery2PPOps(self, query): - """ - - :param query: str (t!=car && t!=truck) - :return: - """ - query_split = query.split(" ") - pp_names = [] - op_names = [] - for i in range(len(query_split)): - if i % 2 == 0: - pp_names.append(query_split[i]) - else: - if query_split[i] == "&&": - op_names.append(np.logical_and) - else: - op_names.append(np.logical_or) - - return pp_names, op_names - - # Make this function take in the list of reduction rates and the - # operator lists - - def _update_stats(self, evaluation_stats, query_operators): - if len(evaluation_stats) == 0: - return 0 - final_red = evaluation_stats[0] - assert (len(evaluation_stats) == len(query_operators) + 1) - - for i in range(1, len(evaluation_stats)): - if query_operators[i - 1] == "&&": - final_red = final_red + evaluation_stats[i] - final_red * \ - evaluation_stats[i] - elif query_operators[i - 1] == "||": - final_red = final_red * evaluation_stats[i] - - return final_red - - def _compute_cost_red_rate(self, C, R): - assert (R >= 0 
and R <= 1) # R is reduction rate and should be - # between 0 and 1 - if R == 0: - R = 0.000001 - return float(C) / R - - def _find_model(self, pp_name, pp_stats, accuracy_budget): - possible_models = pp_stats[pp_name] - best = [] # [best_model_name, best_model_cost / - # best_model_reduction_rate] - for possible_model in possible_models: - if possible_models[possible_model]["A"] < accuracy_budget: - continue - if best == []: - best = [possible_model, self._compute_cost_red_rate( - possible_models[possible_model]["C"], - possible_models[possible_model]["R"]), possible_models[ - possible_model]["R"]] - else: - alternative_best_cost = self._compute_cost_red_rate( - possible_models[possible_model]["C"], - possible_models[possible_model]["R"]) - if alternative_best_cost < best[1]: - best = [possible_model, alternative_best_cost, - possible_models[possible_model]["R"]] - - if best == []: - return None - else: - return best[0], best[2] - - -if __name__ == "__main__": - # TODO: Support for parenthesis queries - query_list_mod = ["t=suv", "s>60", - "c=white", "c!=white", "o=pt211", "c=white && t=suv", - "s>60 && s<65", "t=sedan || t=truck", - "i=pt335 && o=pt211", - "t=suv && c!=white", "c=white && t!=suv && t!=van", - "t=van && s>60 && s<65", - "t=sedan || t=truck && c!=white", - "i=pt335 && o!=pt211 && o!=pt208", - "t=van && i=pt335 && o=pt211", - "t!=sedan && c!=black && c!=silver && t!=truck", - "t=van && s>60 && s<65 && o=pt211", - "t!=suv && t!=van && c!=red && t!=white", - "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", - "i=pt335 && o=pt211 && t=van && c=red"] - - qo = QOMinimum() - print(qo.executeQueries(query_list_mod)) diff --git a/src/query_optimizer/qo_template.py b/src/query_optimizer/qo_template.py deleted file mode 100644 index e000ba42b7..0000000000 --- a/src/query_optimizer/qo_template.py +++ /dev/null @@ -1,43 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -This file gives interface to all query optimizer modules -If any issues arise please contact jaeho.bang@gmail.com - -@Jaeho Bang -""" - -from abc import ABCMeta, abstractmethod - -""" -Initial Design Thoughts: -Query Optimizer by definition should perform two tasks: -1. analyze Structered Query Language -2. Determine efficient execution mechanisms (plans) - -""" - - -class QOTemplate(metaclass=ABCMeta): - - @abstractmethod - def executeQueries(self, queries: list) -> list: - """ - Query Optimizer by definition should perform two tasks: - 1. Analyze given Structured Query Language (SQL) - 2. 
Determine efficient execution mechanisms/plans - :param queries: input queries / query - :return: output plans / plan that can be understood by the system - """ diff --git a/src/query_optimizer/query_optimizer.py b/src/query_optimizer/query_optimizer.py deleted file mode 100644 index d85a0d4d7d..0000000000 --- a/src/query_optimizer/query_optimizer.py +++ /dev/null @@ -1,595 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -This file composes the functions that are needed to perform query optimization. -Currently, given a query, it does logical changes to forms that are -sufficient conditions. -Using statistics from Filters module, it outputs the optimal plan (converted -query with models needed to be used). - -To see the query optimizer performance in action, simply run - -python query_optimizer/query_optimizer.py - -@Jaeho Bang - -""" -import os -import socket -# The query optimizer decide how to label the data points -# Load the series of queries from a txt file? -import sys -import threading -from itertools import product - -import numpy as np - -from src import constants - -eva_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) -sys.path.append(eva_dir) - - -class QueryOptimizer: - """ - TODO: If you have a classifier for =, you can make a classifier for != - TODO: Deal with parenthesis - """ - - def __init__(self, ip_str="127.0.0.1"): - self.ip_str = ip_str - # self.startSocket() - self.operators = ["!=", ">=", "<=", "=", "<", ">"] - self.separators = ["||", "&&"] - - def startSocket(self): - thread = threading.Thread(target=self.inputQueriesFromSocket) - thread.daemon = True - thread.start() - while True: - input = eval(input( - 'Type in your query in the form of __label__ > __number__\n')) - - self.parseInput(input) - - def parseInput(self, input): - """ - TODO: Need to provide query formats that can be used - :param input: string to be parsed - :return: something that the Load() class can understand - """ - - def inputQueriesFromTxt(self, input_path): - """ - TODO: Read the file line by line, use self.parseInput to give back - commands - :param input_path: full directory + file name - :return: method of training the pps - """ - - def inputQueriesFromSocket(self): - sock = socket.socket() - sock.bind(self.ip_str, 123) - sock.listen(3) - print("Waiting on connection") - conn = sock.accept() - print("Client connected") - while True: - m = conn[0].recv(4096) - conn[0].send(m[::-1]) - - sock.shutdown(socket.SHUT_RDWR) - sock.close() - - def _findParenthesis(self, query): - - start = [] - end = [] - query_copy = query - index = query_copy.find("(") - while index != -1: - start.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find("(") - - query_copy = query - index = query_copy.find(")") - while index != -1: - end.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find(")") - - return [start, end] - - def _parseQuery(self, query): - """ - Each sub query will be a list - There 
will be a separator in between - :param query: - :return: - """ - - query_parsed = [] - query_subs = query.split(" ") - query_operators = [] - for query_sub in query_subs: - if query_sub == "||" or query_sub == "&&": - query_operators.append(query_sub) - else: - - if True not in [operator in self.operators for operator in - query_sub]: - return [], [] - for operator in self.operators: - query_sub_list = query_sub.split(operator) - if isinstance(query_sub_list, list) and len( - query_sub_list) > 1: - query_parsed.append( - [query_sub_list[0], operator, query_sub_list[1]]) - break - # query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] - # query_operators ex: ["||", "||", "&&"] - return query_parsed, query_operators - - def _logic_reverse(self, str): - if str == "=": - return "!=" - elif str == "!=": - return "=" - elif str == ">": - return "<=" - elif str == ">=": - return "<" - elif str == "<": - return ">=" - elif str == "<=": - return ">" - - def convertL2S(self, parsed_query, query_ops): - final_str = "" - index = 0 - for sub_parsed_query in parsed_query: - if len(parsed_query) >= 2 and index < len(query_ops): - final_str += ''.join(sub_parsed_query) + " " + query_ops[ - index] + " " - index += 1 - else: - final_str += ''.join(sub_parsed_query) - return final_str - - def _wrangler(self, query, label_desc): - """ - import itertools - iterables = [ [1,2,3,4], [88,99], ['a','b'] ] - for t in itertools.product(*iterables): - print t - - Different types of checks are performed - 1. not equals check (f(C) != v) - 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) - 3. Range check (v1 <= f(C) <= v2) - special type of comparison check - 4. No-predicates = when column in finite and discrete, it can still - benefit - ex) 1 <=> type = car U type = truck U type = SUV - :return: transformed query - """ - # TODO: Need to implement range check - - query_parsed, query_operators = self._parseQuery(query) - # query_sorted = sorted(query_parsed) - - query_transformed = [] - equivalences = [] - - for query_sub_list in query_parsed: - subject = query_sub_list[0] - operator = query_sub_list[1] - object = query_sub_list[2] - - assert ( - subject in label_desc) # Label should be in label - # description dictionary - l_desc = label_desc[subject] - if l_desc[0] == constants.DISCRETE: - equivalence = [self.convertL2S([query_sub_list], [])] - assert (operator == "=" or operator == "!=") - alternate_string = "" - for category in l_desc[1]: - if category != object: - alternate_string += subject + self._logic_reverse( - operator) + category + " && " - alternate_string = alternate_string[ - :-len(" && ")] # must strip the last ' || ' - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - elif l_desc[0] == constants.CONTINUOUS: - - equivalence = [self.convertL2S([query_sub_list], [])] - assert (operator == "=" or operator == "!=" or operator == "<" - or operator == "<=" or operator == ">" or operator == - ">=") - alternate_string = "" - if operator == "!=": - alternate_string += subject + ">" + object + " && " + \ - subject + "<" + object - query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(query_tmp) - if operator == "<" or operator == "<=": - object_num = eval(object) - for number in l_desc[1]: - if number > object_num: - alternate_string = subject + operator + str(number) - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - if operator == ">" or operator == ">=": - object_num = eval(object) - for 
number in l_desc[1]: - if number < object_num: - alternate_string = subject + operator + str(number) - # query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - equivalences.append(equivalence) - - possible_queries = product(*equivalences) - for q in possible_queries: - query_transformed.append(q) - - return query_transformed, query_operators - - def _compute_expression(self, query_info, pp_list, pp_stats, k, - accuracy_budget): - """ - - def QueryOptimizer(P, {trained PPs}): - P = wrangler(P) - {E} = compute_expressions(P,{trained PP},k) #k is a fixed - constant which limits number of individual PPs - in the final expression - for E in {E}: - Explore_PP_accuracy_budget(E) # Paper says dynamic program - Explore_PP_Orderings(E) #if k is small, any number of orders - can be explored - Compute_cost_vs_red_rate(E) #arithmetic over individual c, - a and r[a] numbers - return E_with_max_c/r - - - 1. p^(P/p) -> PPp - 2. PPp^q -> PPp ^ PPq - 3. PPpvq -> PPp v PPq - 4. p^(P/p) -> ~PP~q - -> we don't need to apply these rules, we simply need to see for each - sub query which PP gives us the best rate - :param query_info: [possible query forms for a given query, operators - that go in between] - :param pp_list: list of pp names that are currently available - :param pp_stats: list of pp models associated with each pp name with - R,C,A values saved - :param k: number of pps we can use at maximum - :return: the list of pps to use that maximizes reduction rate (ATM) - """ - evaluations = [] - evaluation_models = [] - evaluations_stats = [] - query_transformed, query_operators = query_info - # query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] - for possible_query in query_transformed: - evaluation = [] - evaluation_stats = [] - k_count = 0 - op_index = 0 - for query_sub in possible_query: # Even inside query_sub it can - # be divided into query_sub_sub - if k_count > k: # TODO: If you exceed a certain number, - # you just ignore the expression - evaluation = [] - evaluation_stats = [] - continue - query_sub_list, query_sub_operators = self._parseQuery( - query_sub) - evaluation_tmp = [] - evaluation_models_tmp = [] - evaluation_stats_tmp = [] - for i in range(len(query_sub_list)): - query_sub_str = ''.join(query_sub_list[i]) - if query_sub_str in pp_list: - # Find the best model for the pp - - data = self._find_model(query_sub_str, pp_stats, - accuracy_budget) - if data is None: - continue - else: - model, reduction_rate = data - evaluation_tmp.append(query_sub_str) - evaluation_models_tmp.append( - model) # TODO: We need to make sure this is - # the model_name - evaluation_stats_tmp.append(reduction_rate) - k_count += 1 - - reduc_rate = 0 - if len(evaluation_stats_tmp) != 0: - reduc_rate = self._update_stats(evaluation_stats_tmp, - query_sub_operators) - - evaluation.append(query_sub) - evaluation_models.append(evaluation_models_tmp) - evaluation_stats.append(reduc_rate) - op_index += 1 - - evaluations.append(self.convertL2S(evaluation, query_operators)) - evaluations_stats.append( - self._update_stats(evaluation_stats, query_operators)) - - max_index = np.argmax(np.array(evaluations_stats), axis=0) - best_query = evaluations[ - max_index] # this will be something like "t!=bus && t!=truck && - # t!=car" - best_models = evaluation_models[max_index] - best_reduction_rate = evaluations_stats[max_index] - - pp_names, op_names = self._convertQuery2PPOps(best_query) - return [list(zip(pp_names, best_models)), op_names, - best_reduction_rate] - - def 
_convertQuery2PPOps(self, query): - """ - - :param query: str (t!=car && t!=truck) - :return: - """ - query_split = query.split(" ") - pp_names = [] - op_names = [] - for i in range(len(query_split)): - if i % 2 == 0: - pp_names.append(query_split[i]) - else: - if query_split[i] == "&&": - op_names.append(np.logical_and) - else: - op_names.append(np.logical_or) - - return pp_names, op_names - - # Make this function take in the list of reduction rates and the operator - # lists - def _update_stats(self, evaluation_stats, query_operators): - if len(evaluation_stats) == 0: - return 0 - final_red = evaluation_stats[0] - assert (len(evaluation_stats) == len(query_operators) + 1) - - for i in range(1, len(evaluation_stats)): - if query_operators[i - 1] == "&&": - final_red = final_red + evaluation_stats[i] - final_red * \ - evaluation_stats[i] - elif query_operators[i - 1] == "||": - final_red = final_red * evaluation_stats[i] - - return final_red - - def _compute_cost_red_rate(self, C, R): - assert ( - R >= 0 and R <= 1) # R is reduction rate and should be - # between 0 and 1 - if R == 0: - R = 0.000001 - return float(C) / R - - def _find_model(self, pp_name, pp_stats, accuracy_budget): - possible_models = pp_stats[pp_name] - best = [] # [best_model_name, best_model_cost / - # best_model_reduction_rate] - for possible_model in possible_models: - if possible_models[possible_model]["A"] < accuracy_budget: - continue - if best == []: - best = [possible_model, self._compute_cost_red_rate( - possible_models[possible_model]["C"], - possible_models[possible_model]["R"]), - possible_models[possible_model]["R"]] - else: - alternative_best_cost = self._compute_cost_red_rate( - possible_models[possible_model]["C"], - possible_models[possible_model]["R"]) - if alternative_best_cost < best[1]: - best = [possible_model, alternative_best_cost, - possible_models[possible_model]["R"]] - - if best == []: - return None - else: - return best[0], best[2] - - def run(self, query, pp_list, pp_stats, label_desc, k=3, - accuracy_budget=0.9): - """ - - :param query: query of interest ex) TRAF-20 - :param pp_list: list of pp_descriptions - queries that are available - :param pp_stats: this will be dictionary where keys are "pca/ddn", - it will have statistics saved which are R ( - reduction_rate), C (cost_to_train), A (accuracy) - :param k: number of different PPs that are in any expression E - :return: selected PPs to use for reduction - """ - query_transformed, query_operators = self._wrangler(query, label_desc) - # query_transformed is a comprehensive list of transformed queries - return self._compute_expression([query_transformed, query_operators], - pp_list, pp_stats, k, accuracy_budget) - - -if __name__ == "__main__": - - query_list = ["t=suv", "s>60", - "c=white", "c!=white", "o=pt211", "c=white && t=suv", - "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", - "t=suv && c!=white", "c=white && t!=suv && t!=van", - "t=van && s>60 && s<65", "c!=white && (t=sedan || t=truck)", - "i=pt335 && o!=pt211 && o!=pt208", - "t=van && i=pt335 && o=pt211", - "t!=sedan && c!=black && c!=silver && t!=truck", - "t=van && s>60 && s<65 && o=pt211", - "t!=suv && t!=van && c!=red && t!=white", - "(i=pt335 || i=pt342) && o!=pt211 && o!=pt208", - "i=pt335 && o=pt211 && t=van && c=red"] - - # TODO: Support for parenthesis queries - query_list_mod = ["t=suv", "s>60", - "c=white", "c!=white", "o=pt211", "c=white && t=suv", - "s>60 && s<65", "t=sedan || t=truck", - "i=pt335 && o=pt211", - "t=suv && c!=white", "c=white && t!=suv && 
t!=van", - "t=van && s>60 && s<65", - "t=sedan || t=truck && c!=white", - "i=pt335 && o!=pt211 && o!=pt208", - "t=van && i=pt335 && o=pt211", - "t!=sedan && c!=black && c!=silver && t!=truck", - "t=van && s>60 && s<65 && o=pt211", - "t!=suv && t!=van && c!=red && t!=white", - "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", - "i=pt335 && o=pt211 && t=van && c=red"] - - query_list_test = ["c=white && t!=suv && t!=van"] - - synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", - "c=red", "c=white", "c=black", "c=silver", - "s>40", "s>50", "s>60", "s<65", "s<70", - "i=pt335", "i=pt211", "i=pt342", "i=pt208", - "o=pt335", "o=pt211", "o=pt342", "o=pt208"] - - query_list_short = ["t=van && s>60 && o=pt211"] - - synthetic_pp_list_short = ["t=van", "s>60", "o=pt211"] - - # TODO: Might need to change this to a R vs A curve instead of static - # numbers - # TODO: When selecting appropriate PPs, we only select based on reduction - # rate - synthetic_pp_stats_short = { - "t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, - "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, - "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, - - "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, - "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, - - "o=pt211": {"none/dnn": {"R": 0.13, "C": 0.32, "A": 0.99}, - "none/kde": {"R": 0.14, "C": 0.12, "A": 0.93}}} - - synthetic_pp_stats = {"t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, - "pca/dnn": {"R": 0.2, "C": 0.15, - "A": 0.92}, - "none/kde": {"R": 0.15, "C": 0.05, - "A": 0.95}}, - "t=suv": { - "none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, - "t=sedan": { - "none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, - "t=truck": { - "none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, - - "c=red": { - "none/svm": {"R": 0.131, "C": 0.011, - "A": 0.951}}, - "c=white": { - "none/svm": {"R": 0.212, "C": 0.012, - "A": 0.942}}, - "c=black": { - "none/svm": {"R": 0.133, "C": 0.013, - "A": 0.953}}, - "c=silver": { - "none/svm": {"R": 0.214, "C": 0.014, - "A": 0.944}}, - - "s>40": { - "none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, - "s>50": { - "none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, - - "s>60": { - "none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, - "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, - - "s<65": { - "none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, - "s<70": { - "none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, - - "o=pt211": { - "none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, - "A": 0.932}}, - - "o=pt335": { - "none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, - "A": 0.934}}, - - "o=pt342": { - "none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, - "A": 0.935}}, - - "o=pt208": { - "none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, - "A": 0.936}}, - - "i=pt211": { - "none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, - "A": 0.932}}, - - "i=pt335": { - "none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, - "A": 0.934}}, - - "i=pt342": { - "none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, - "A": 0.935}}, - - "i=pt208": { - "none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, - "A": 0.936}}} - - # TODO: We will need to convert the queries/labels into "car, bus, van, - # others". 
This is how the dataset defines things - - label_desc = {"t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], - "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], - "c": [constants.DISCRETE, - ["white", "red", "black", "silver"]], - "i": [constants.DISCRETE, - ["pt335", "pt342", "pt211", "pt208"]], - "o": [constants.DISCRETE, - ["pt335", "pt342", "pt211", "pt208"]]} - - qo = QueryOptimizer() - - print("Running Query Optimizer Demo...") - - for query in query_list_mod: - print(query, " -> ", ( - qo.run(query, synthetic_pp_list, synthetic_pp_stats, label_desc))) - # print qo.run(query, synthetic_pp_list_short, - # synthetic_pp_stats_short, label_desc) diff --git a/src/query_optimizer/query_optimizer.py.bak b/src/query_optimizer/query_optimizer.py.bak deleted file mode 100644 index 58df4de31d..0000000000 --- a/src/query_optimizer/query_optimizer.py.bak +++ /dev/null @@ -1,510 +0,0 @@ - -# The query optimizer decide how to label the data points -# Load the series of queries from a txt file? -import sys -import os -import socket -import threading -import numpy as np -from itertools import product -from time import sleep - -eva_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) -sys.path.append(eva_dir) -import constants - - - -class QueryOptimizer: - """ - TODO: If you have a classifier for =, you can make a classifier for != - TODO: Deal with parenthesis - """ - - def __init__(self, ip_str="127.0.0.1"): - self.ip_str = ip_str - #self.startSocket() - self.operators = ["!=", ">=", "<=", "=", "<", ">"] - self.separators = ["||", "&&"] - - - - def startSocket(self): - thread = threading.Thread(target=self.inputQueriesFromSocket) - thread.daemon = True - thread.start() - while True: - input = input('Type in your query in the form of __label__ > __number__\n') - - self.parseInput(input) - - - def parseInput(self, input): - """ - TODO: Need to provide query formats that can be used - :param input: string to be parsed - :return: something that the Load() class can understand - """ - pass - - - def inputQueriesFromTxt(self, input_path): - """ - TODO: Read the file line by line, use self.parseInput to give back commands - :param input_path: full directory + file name - :return: method of training the pps - """ - pass - - - def inputQueriesFromSocket(self): - sock = socket.socket() - sock.bind(self.ip_str, 123) - sock.listen(3) - print("Waiting on connection") - conn = sock.accept() - print("Client connected") - while True: - m = conn[0].recv(4096) - conn[0].send(m[::-1]) - - sock.shutdown(socket.SHUT_RDWR) - sock.close() - - - def _findParenthesis(self, query): - - start = [] - end = [] - query_copy = query - index = query_copy.find("(") - while index != -1: - start.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find("(") - - query_copy = query - index = query_copy.find(")") - while index != -1: - end.append(index) - query_copy = query_copy[index + 1:] - index = query_copy.find(")") - - return [start, end] - - - def _parseQuery(self, query): - """ - Each sub query will be a list - There will be a separator in between - :param query: - :return: - """ - - - query_parsed = [] - query_subs = query.split(" ") - query_operators = [] - for query_sub in query_subs: - if query_sub == "||" or query_sub == "&&": - query_operators.append(query_sub) - else: - - if True not in [operator in self.operators for operator in query_sub]: - return [],[] - for operator in self.operators: - query_sub_list = query_sub.split(operator) - if type(query_sub_list) is list and 
len(query_sub_list) > 1: - query_parsed.append([query_sub_list[0], operator, query_sub_list[1]]) - break - #query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] - #query_operators ex: ["||", "||", "&&"] - return query_parsed, query_operators - - - - - def _logic_reverse(self, str): - if str == "=": - return "!=" - elif str == "!=": - return "=" - elif str == ">": - return "<=" - elif str == ">=": - return "<" - elif str == "<": - return ">=" - elif str == "<=": - return ">" - - def convertL2S(self, parsed_query, query_ops): - final_str = "" - index = 0 - for sub_parsed_query in parsed_query: - if len(parsed_query) >= 2 and index < len(query_ops): - final_str += ''.join(sub_parsed_query) + " " + query_ops[index] + " " - index += 1 - else: - final_str += ''.join(sub_parsed_query) - return final_str - - - def _wrangler(self, query, label_desc): - """ - import itertools - iterables = [ [1,2,3,4], [88,99], ['a','b'] ] - for t in itertools.product(*iterables): - print t - - Different types of checks are performed - 1. not equals check (f(C) != v) - 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) - 3. Range check (v1 <= f(C) <= v2) - special type of comparison check - 4. No-predicates = when column in finite and discrete, it can still benefit - ex) 1 <=> type = car U type = truck U type = SUV - :return: transformed query - """ - #TODO: Need to implement range check - - query_parsed, query_operators = self._parseQuery(query) - #query_sorted = sorted(query_parsed) - - query_transformed = [] - equivalences = [] - equivalences_op = [] - - for query_sub_list in query_parsed: - subject = query_sub_list[0] - operator = query_sub_list[1] - object = query_sub_list[2] - - assert(subject in label_desc) # Label should be in label description dictionary - l_desc = label_desc[subject] - if l_desc[0] == constants.DISCRETE: - equivalence = [self.convertL2S([query_sub_list], [])] - assert(operator == "=" or operator == "!=") - alternate_string = "" - for category in l_desc[1]: - if category != object: - alternate_string += subject + self._logic_reverse(operator) + category + " && " - alternate_string = alternate_string[:-len(" && ")] #must strip the last ' || ' - #query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - elif l_desc[0] == constants.CONTINUOUS: - - equivalence = [self.convertL2S([query_sub_list], [])] - assert(operator == "=" or operator == "!=" or operator == "<" - or operator == "<=" or operator == ">" or operator == ">=") - alternate_string = "" - if operator == "!=": - alternate_string += subject + ">" + object + " && " + subject + "<" + object - query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(query_tmp) - if operator == "<" or operator == "<=": - object_num = eval(object) - for number in l_desc[1]: - if number > object_num: - alternate_string = subject + operator + str(number) - #query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - if operator == ">" or operator == ">=": - object_num = eval(object) - for number in l_desc[1]: - if number < object_num: - alternate_string = subject + operator + str(number) - #query_tmp, _ = self._parseQuery(alternate_string) - equivalence.append(alternate_string) - - equivalences.append(equivalence) - - possible_queries = product(*equivalences) - for q in possible_queries: - query_transformed.append( q ) - - return query_transformed, query_operators - - - - def _compute_expression(self, query_info, pp_list, pp_stats, k, accuracy_budget): - """ - - def 
QueryOptimizer(P, {trained PPs}): - P = wrangler(P) - {E} = compute_expressions(P,{trained PP},k) #k is a fixed constant which limits number of individual PPs - in the final expression - for E in {E}: - Explore_PP_accuracy_budget(E) # Paper says dynamic program - Explore_PP_Orderings(E) #if k is small, any number of orders can be explored - Compute_cost_vs_red_rate(E) #arithmetic over individual c,a and r[a] numbers - return E_with_max_c/r - - - 1. p^(P/p) -> PPp - 2. PPp^q -> PPp ^ PPq - 3. PPpvq -> PPp v PPq - 4. p^(P/p) -> ~PP~q - -> we don't need to apply these rules, we simply need to see for each sub query which PP gives us the best rate - :param query_info: [possible query forms for a given query, operators that go in between] - :param pp_list: list of pp names that are currently available - :param pp_stats: list of pp models associated with each pp name with R,C,A values saved - :param k: number of pps we can use at maximum - :return: the list of pps to use that maximizes reduction rate (ATM) - """ - evaluations = [] - evaluation_models = [] - evaluations_stats = [] - query_transformed, query_operators = query_info - #query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] - for possible_query in query_transformed: - evaluation = [] - evaluation_stats = [] - k_count = 0 - op_index = 0 - for query_sub in possible_query: #Even inside query_sub it can be divided into query_sub_sub - if k_count > k: #TODO: If you exceed a certain number, you just ignore the expression - evaluation = [] - evaluation_stats = [] - continue - query_sub_list, query_sub_operators = self._parseQuery(query_sub) - evaluation_tmp = [] - evaluation_models_tmp = [] - evaluation_stats_tmp = [] - for i in range(len(query_sub_list)): - query_sub_str = ''.join(query_sub_list[i]) - if query_sub_str in pp_list: - #Find the best model for the pp - - data = self._find_model(query_sub_str, pp_stats, accuracy_budget) - if data == None: - continue - else: - model, reduction_rate = data - evaluation_tmp.append(query_sub_str) - evaluation_models_tmp.append(model) #TODO: We need to make sure this is the model_name - evaluation_stats_tmp.append(reduction_rate) - k_count += 1 - - - reduc_rate = 0 - if len(evaluation_stats_tmp) != 0: - reduc_rate = self._update_stats(evaluation_stats_tmp, query_sub_operators) - - evaluation.append(query_sub) - evaluation_models.append(evaluation_models_tmp) - evaluation_stats.append(reduc_rate) - op_index += 1 - - - evaluations.append( self.convertL2S(evaluation, query_operators) ) - evaluations_stats.append( self._update_stats(evaluation_stats, query_operators) ) - - max_index = np.argmax(np.array(evaluations_stats), axis = 0) - best_query = evaluations[max_index] #this will be something like "t!=bus && t!=truck && t!=car" - best_models = evaluation_models[max_index] - best_reduction_rate = evaluations_stats[max_index] - - pp_names, op_names = self._convertQuery2PPOps(best_query) - return [list(zip(pp_names, best_models)), op_names, best_reduction_rate] - - - def _convertQuery2PPOps(self, query): - """ - - :param query: str (t!=car && t!=truck) - :return: - """ - query_split = query.split(" ") - pp_names = [] - op_names = [] - for i in range(len(query_split)): - if i % 2 == 0: - pp_names.append(query_split[i]) - else: - if query_split[i] == "&&": - op_names.append(np.logical_and) - else: - op_names.append(np.logical_or) - - return pp_names, op_names - - - - - #Make this function take in the list of reduction rates and the operator lists - def _update_stats(self, evaluation_stats, 
query_operators): - if len(evaluation_stats) == 0: - return 0 - final_red = evaluation_stats[0] - assert(len(evaluation_stats) == len(query_operators) + 1) - - for i in range(1, len(evaluation_stats)): - if query_operators[i - 1] == "&&": - final_red = final_red + evaluation_stats[i] - final_red * evaluation_stats[i] - elif query_operators[i - 1] == "||": - final_red = final_red * evaluation_stats[i] - - return final_red - - - - - - def _compute_cost_red_rate(self, C, R): - assert(R >= 0 and R <= 1) #R is reduction rate and should be between 0 and 1 - if R == 0: - R = 0.000001 - return float(C) / R - - def _find_model(self, pp_name, pp_stats, accuracy_budget): - possible_models = pp_stats[pp_name] - best = [] #[best_model_name, best_model_cost / best_model_reduction_rate] - for possible_model in possible_models: - if possible_models[possible_model]["A"] < accuracy_budget: - continue - if best == []: - best = [possible_model, self._compute_cost_red_rate(possible_models[possible_model]["C"], - possible_models[possible_model]["R"]), - possible_models[possible_model]["R"]] - else: - alternative_best_cost = self._compute_cost_red_rate(possible_models[possible_model]["C"], - possible_models[possible_model]["R"]) - if alternative_best_cost < best[1]: - best = [possible_model, alternative_best_cost, possible_models[possible_model]["R"]] - - - if best == []: - return None - else: - return best[0], best[2] - - def run(self, query, pp_list, pp_stats, label_desc, k = 3, accuracy_budget = 0.9): - """ - - :param query: query of interest ex) TRAF-20 - :param pp_list: list of pp_descriptions - queries that are available - :param pp_stats: this will be dictionary where keys are "pca/ddn", - it will have statistics saved which are R (reduction_rate), C (cost_to_train), A (accuracy) - :param k: number of different PPs that are in any expression E - :return: selected PPs to use for reduction - """ - query_transformed, query_operators = self._wrangler(query, label_desc) - #query_transformed is a comprehensive list of transformed queries - return self._compute_expression([query_transformed, query_operators], pp_list, pp_stats, k, accuracy_budget) - - -if __name__ == "__main__": - - - query_list = ["t=suv", "s>60", - "c=white", "c!=white", "o=pt211", "c=white && t=suv", - "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", - "t=suv && c!=white", "c=white && t!=suv && t!=van", - "t=van && s>60 && s<65", "c!=white && (t=sedan || t=truck)", - "i=pt335 && o!=pt211 && o!=pt208", "t=van && i=pt335 && o=pt211", - "t!=sedan && c!=black && c!=silver && t!=truck", - "t=van && s>60 && s<65 && o=pt211", "t!=suv && t!=van && c!=red && t!=white", - "(i=pt335 || i=pt342) && o!=pt211 && o!=pt208", - "i=pt335 && o=pt211 && t=van && c=red"] - - - #TODO: Support for parenthesis queries - query_list_mod = ["t=suv", "s>60", - "c=white", "c!=white", "o=pt211", "c=white && t=suv", - "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", - "t=suv && c!=white", "c=white && t!=suv && t!=van", - "t=van && s>60 && s<65", "t=sedan || t=truck && c!=white", - "i=pt335 && o!=pt211 && o!=pt208", "t=van && i=pt335 && o=pt211", - "t!=sedan && c!=black && c!=silver && t!=truck", - "t=van && s>60 && s<65 && o=pt211", "t!=suv && t!=van && c!=red && t!=white", - "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", - "i=pt335 && o=pt211 && t=van && c=red"] - - query_list_test = ["c=white && t!=suv && t!=van"] - - - synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", - "c=red", "c=white", "c=black", "c=silver", - "s>40", "s>50", "s>60", 
"s<65", "s<70", - "i=pt335", "i=pt211", "i=pt342", "i=pt208", - "o=pt335", "o=pt211", "o=pt342", "o=pt208"] - - query_list_short = ["t=van && s>60 && o=pt211"] - - - synthetic_pp_list_short = ["t=van", "s>60", "o=pt211"] - - - #TODO: Might need to change this to a R vs A curve instead of static numbers - #TODO: When selecting appropriate PPs, we only select based on reduction rate - synthetic_pp_stats_short = {"t=van" :{ "none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, - "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, - "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, - - "s>60" :{ "none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, - "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, - - "o=pt211" :{ "none/dnn": {"R": 0.13, "C": 0.32, "A": 0.99}, - "none/kde": {"R": 0.14, "C": 0.12, "A": 0.93}} } - - synthetic_pp_stats = {"t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, - "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, - "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, - "t=suv": {"none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, - "t=sedan": {"none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, - "t=truck": {"none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, - - "c=red": {"none/svm": {"R": 0.131, "C": 0.011, "A": 0.951}}, - "c=white": {"none/svm": {"R": 0.212, "C": 0.012, "A": 0.942}}, - "c=black": {"none/svm": {"R": 0.133, "C": 0.013, "A": 0.953}}, - "c=silver": {"none/svm": {"R": 0.214, "C": 0.014, "A": 0.944}}, - - "s>40": {"none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, - "s>50": {"none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, - - "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, - "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, - - "s<65": {"none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, - "s<70": {"none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, - - "o=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, - - "o=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, - - "o=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, - - "o=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}, - - "i=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, - "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, - - "i=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, - "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, - - "i=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, - "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, - - "i=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, - "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}} - - #TODO: We will need to convert the queries/labels into "car, bus, van, others". 
This is how the dataset defines things - - label_desc = {"t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], - "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], - "c": [constants.DISCRETE, ["white", "red", "black", "silver"]], - "i": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]], - "o": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]]} - - qo = QueryOptimizer() - for query in query_list_mod: - #print qo.run(query, synthetic_pp_list_short, synthetic_pp_stats_short, label_desc) - print(qo.run(query, synthetic_pp_list, synthetic_pp_stats, label_desc)) - - - diff --git a/src/query_optimizer/tests/__init__.py b/src/query_optimizer/tests/__init__.py deleted file mode 100644 index e9978151f4..0000000000 --- a/src/query_optimizer/tests/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/src/query_optimizer/tests/query_optimizer_test_pytest.py.bak b/src/query_optimizer/tests/query_optimizer_test_pytest.py.bak deleted file mode 100644 index c206c61967..0000000000 --- a/src/query_optimizer/tests/query_optimizer_test_pytest.py.bak +++ /dev/null @@ -1,65 +0,0 @@ -from query_optimizer.query_optimizer import QueryOptimizer - -obj=QueryOptimizer() - -def test_parseQuery(): - - #case 1: Simple input/ouput check - predicates,connectors=obj._parseQuery("t>60 && q>=4 && v=car") - if predicates!=[["t",">","60"],["q",">=","4"],["v","=","car"]]: - assert False,"Wrong breakdown of predicates" - if connectors!=["&&","&&"]: - assert False,"Wrong list of connectors" - - # case 2: Case when an extra space is present in the input - predicates, connectors = obj._parseQuery("t>60 && q>=4 && v=car") - if predicates != [["t", ">", "60"], ["q", ">=", "4"], ["v", "=", "car"]]: - assert False, "Wrong breakdown of predicates, can't handle consecutive spaces." 
- if connectors != ["&&", "&&"]: - assert False, "Wrong list of connectors" - - #case 2: No operator exists - predicates, connectors = obj._parseQuery("t!60") - if predicates != []: - assert False, "Wrong breakdown of predicates" - if connectors != []: - assert False, "Wrong list of connectors" - - predicates, connectors = obj._parseQuery("t>60 && adfsg") - if predicates != []: - assert False, "Wrong breakdown of predicates" - if connectors != []: - assert False, "Wrong list of connectors" - - #case for >> and similar situations, the >> operator should be recognised as > - predicates, connectors = obj._parseQuery("t>>60 && a<45") - print(predicates, connectors) - if predicates != [['t','>','60'],['a','<','45']]: - assert False, "Wrong breakdown of predicates,the >> operator should be recognised as > and likewise for <" - if connectors != ['&&']: - assert False, "Wrong list of connectors" - - #case 2: Check for ordering of execution based on parenthesis code does not handle this yet so now way to test at the moment - predicates, connectors = obj._parseQuery("t>60||(q>=4&&v=car)") - if predicates != [["t", ">", "60"], ["q", ">=", "4"], ["v", "=", "car"]]: - assert False, "Wrong breakdown of predicates" - if connectors != ["&&", "&&"]: - assert False, "Wrong list of connectors" - - assert True - -def test_convertL2S(): - query_string=obj.convertL2S(["t","!=","10"],[]) - if query_string!="t!=10": - assert False,"Wrong output query string" - assert True - - query_string = obj.convertL2S([["t", "!=", "10"],['a','<','5'],['b','=','1003']], ["&&","||"]) - if query_string != "t!=10 && a<5 || b=1003": - assert False, "Wrong output query string" - - #case for paranthesis hasn't been implemented yet when its done need to add test cases for that. - assert True - -test_parseQuery() -test_convertL2S() \ No newline at end of file From 8602bdc8b88cd3820c630936605b20232c2b20a4 Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 17:03:42 -0500 Subject: [PATCH 63/82] Obsolete Files removed --- src/optimizer/qo_minimum.py | 475 ++++++++++++++ src/optimizer/qo_template.py | 43 ++ src/optimizer/query_optimizer.py | 595 ++++++++++++++++++ src/optimizer/query_optimizer.py.bak | 510 +++++++++++++++ src/optimizer/test.py | 21 + src/optimizer/tests/__init__.py | 14 + .../tests/query_optimizer_test_pytest.py.bak | 65 ++ 7 files changed, 1723 insertions(+) create mode 100644 src/optimizer/qo_minimum.py create mode 100644 src/optimizer/qo_template.py create mode 100644 src/optimizer/query_optimizer.py create mode 100644 src/optimizer/query_optimizer.py.bak create mode 100644 src/optimizer/test.py create mode 100644 src/optimizer/tests/__init__.py create mode 100644 src/optimizer/tests/query_optimizer_test_pytest.py.bak diff --git a/src/optimizer/qo_minimum.py b/src/optimizer/qo_minimum.py new file mode 100644 index 0000000000..60ea2b22da --- /dev/null +++ b/src/optimizer/qo_minimum.py @@ -0,0 +1,475 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +This file implements the minimum query optimizer +TODO: Currently there seems to be a importing issue that I am not sure how to +solve +Error Message: query_optimizer is not a package in line __ +We will add support for pyparse +We need to fix a bug / make sure the outputs are correct + +@Jaeho Bang +""" + +from itertools import product + +import numpy as np + +from src import constants +from query_optimizer.qo_template import QOTemplate + + +class QOMinimum(QOTemplate): + + def __init__(self): + # later add support for pyparsing.. could definitely help with + # parsing operations + self.operators = ["!=", ">=", "<=", "=", "<", ">"] + self.separators = ["||", "&&"] + + def executeQueries(self, queries: list): + + synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", + "c=red", "c=white", "c=black", "c=silver", + "s>40", "s>50", "s>60", "s<65", "s<70", + "i=pt335", "i=pt211", "i=pt342", "i=pt208", + "o=pt335", "o=pt211", "o=pt342", "o=pt208"] + + synthetic_pp_stats = { + "t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, + "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, + "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, + "t=suv": {"none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, + "t=sedan": {"none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, + "t=truck": {"none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, + + "c=red": {"none/svm": {"R": 0.131, "C": 0.011, "A": 0.951}}, + "c=white": {"none/svm": {"R": 0.212, "C": 0.012, "A": 0.942}}, + "c=black": {"none/svm": {"R": 0.133, "C": 0.013, "A": 0.953}}, + "c=silver": {"none/svm": {"R": 0.214, "C": 0.014, "A": 0.944}}, + + "s>40": {"none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, + "s>50": {"none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, + + "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, + "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, + + "s<65": {"none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, + "s<70": {"none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, + + "o=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, + + "o=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, + + "o=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, + + "o=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}, + + "i=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, + + "i=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, + + "i=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, + + "i=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}} + + # TODO: We will need to convert the queries/labels into "car, bus, + # van, others". 
This is how the dataset defines things + + label_desc = { + "t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], + "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], + "c": [constants.DISCRETE, ["white", "red", "black", "silver"]], + "i": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]], + "o": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]]} + + print("Running Query Optimizer Demo...") + + execution_plans = [] + for query in queries: + execution_plans.append( + self.run(query, synthetic_pp_list, synthetic_pp_stats, + label_desc)) + + return execution_plans + + def run(self, query, pp_list, pp_stats, label_desc, k=3, + accuracy_budget=0.9): + """ + + :param query: query of interest ex) TRAF-20 + :param pp_list: list of pp_descriptions - queries that are available + :param pp_stats: this will be dictionary where keys are "pca/ddn", + it will have statistics saved which are R ( + reduction_rate), C (cost_to_train), A (accuracy) + :param k: number of different PPs that are in any expression E + :return: selected PPs to use for reduction + """ + query_transformed, query_operators = self._wrangler(query, label_desc) + # query_transformed is a comprehensive list of transformed queries + return self._compute_expression([query_transformed, query_operators], + pp_list, pp_stats, k, accuracy_budget) + + def _findParenthesis(self, query): + + start = [] + end = [] + query_copy = query + index = query_copy.find("(") + while index != -1: + start.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find("(") + + query_copy = query + index = query_copy.find(")") + while index != -1: + end.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find(")") + + return [start, end] + + def _parseQuery(self, query): + """ + Each sub query will be a list + There will be a separator in between + :param query: + :return: + """ + + query_parsed = [] + query_subs = query.split(" ") + query_operators = [] + for query_sub in query_subs: + if query_sub == "||" or query_sub == "&&": + query_operators.append(query_sub) + else: + + if True not in [operator in self.operators for operator in + query_sub]: + return [], [] + for operator in self.operators: + query_sub_list = query_sub.split(operator) + if isinstance(query_sub_list, list) and len( + query_sub_list) > 1: + query_parsed.append( + [query_sub_list[0], operator, query_sub_list[1]]) + break + # query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] + # query_operators ex: ["||", "||", "&&"] + return query_parsed, query_operators + + def _logic_reverse(self, str): + if str == "=": + return "!=" + elif str == "!=": + return "=" + elif str == ">": + return "<=" + elif str == ">=": + return "<" + elif str == "<": + return ">=" + elif str == "<=": + return ">" + + def convertL2S(self, parsed_query, query_ops): + final_str = "" + index = 0 + for sub_parsed_query in parsed_query: + if len(parsed_query) >= 2 and index < len(query_ops): + final_str += ''.join(sub_parsed_query) + " " + query_ops[ + index] + " " + index += 1 + else: + final_str += ''.join(sub_parsed_query) + return final_str + + def _wrangler(self, query, label_desc): + """ + import itertools + iterables = [ [1,2,3,4], [88,99], ['a','b'] ] + for t in itertools.product(*iterables): + print t + + Different types of checks are performed + 1. not equals check (f(C) != v) + 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) + 3. Range check (v1 <= f(C) <= v2) - special type of comparison check + 4. 
No-predicates = when column in finite and discrete, it can still + benefit + ex) 1 <=> type = car U type = truck U type = SUV + :return: transformed query + """ + # TODO: Need to implement range check + + query_parsed, query_operators = self._parseQuery(query) + # query_sorted = sorted(query_parsed) + + query_transformed = [] + equivalences = [] + + for query_sub_list in query_parsed: + subject = query_sub_list[0] + operator = query_sub_list[1] + object = query_sub_list[2] + + assert (subject in label_desc) # Label should be in label + # description dictionary + l_desc = label_desc[subject] + if l_desc[0] == constants.DISCRETE: + equivalence = [self.convertL2S([query_sub_list], [])] + assert (operator == "=" or operator == "!=") + alternate_string = "" + for category in l_desc[1]: + if category != object: + alternate_string += subject + self._logic_reverse( + operator) + category + " && " + # must strip the last ' || ' + alternate_string = alternate_string[:-len(" && ")] + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + elif l_desc[0] == constants.CONTINUOUS: + + equivalence = [self.convertL2S([query_sub_list], [])] + assert (operator == "=" or operator == "!=" or operator == "<" + or operator == "<=" or operator == ">" or operator == + ">=") + alternate_string = "" + if operator == "!=": + alternate_string += subject + ">" + object + " && " + \ + subject + "<" + object + query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(query_tmp) + if operator == "<" or operator == "<=": + object_num = eval(object) + for number in l_desc[1]: + if number > object_num: + alternate_string = subject + operator + str(number) + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + if operator == ">" or operator == ">=": + object_num = eval(object) + for number in l_desc[1]: + if number < object_num: + alternate_string = subject + operator + str(number) + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + equivalences.append(equivalence) + + possible_queries = product(*equivalences) + for q in possible_queries: + query_transformed.append(q) + + return query_transformed, query_operators + + def _compute_expression(self, query_info, pp_list, pp_stats, k, + accuracy_budget): + """ + + def QueryOptimizer(P, {trained PPs}): + P = wrangler(P) + {E} = compute_expressions(P,{trained PP},k) #k is a fixed + constant which limits number of individual PPs + in the final expression + for E in {E}: + Explore_PP_accuracy_budget(E) # Paper says dynamic program + Explore_PP_Orderings(E) #if k is small, any number of orders + can be explored + Compute_cost_vs_red_rate(E) #arithmetic over individual c, + a and r[a] numbers + return E_with_max_c/r + + + 1. p^(P/p) -> PPp + 2. PPp^q -> PPp ^ PPq + 3. PPpvq -> PPp v PPq + 4. 
p^(P/p) -> ~PP~q + -> we don't need to apply these rules, we simply need to see for each + sub query which PP gives us the best rate + :param query_info: [possible query forms for a given query, operators + that go in between] + :param pp_list: list of pp names that are currently available + :param pp_stats: list of pp models associated with each pp name with + R,C,A values saved + :param k: number of pps we can use at maximum + :return: the list of pps to use that maximizes reduction rate (ATM) + """ + evaluations = [] + evaluation_models = [] + evaluations_stats = [] + query_transformed, query_operators = query_info + # query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] + for possible_query in query_transformed: + evaluation = [] + evaluation_stats = [] + k_count = 0 + op_index = 0 + for query_sub in possible_query: # Even inside query_sub it can + # be divided into query_sub_sub + if k_count > k: # TODO: If you exceed a certain number, + # you just ignore the expression + evaluation = [] + evaluation_stats = [] + continue + query_sub_list, query_sub_operators = self._parseQuery( + query_sub) + evaluation_tmp = [] + evaluation_models_tmp = [] + evaluation_stats_tmp = [] + for i in range(len(query_sub_list)): + query_sub_str = ''.join(query_sub_list[i]) + if query_sub_str in pp_list: + # Find the best model for the pp + + data = self._find_model(query_sub_str, pp_stats, + accuracy_budget) + if data is None: + continue + else: + model, reduction_rate = data + evaluation_tmp.append(query_sub_str) + evaluation_models_tmp.append( + model) # TODO: We need to make sure this is + # the model_name + evaluation_stats_tmp.append(reduction_rate) + k_count += 1 + + reduc_rate = 0 + if len(evaluation_stats_tmp) != 0: + reduc_rate = self._update_stats(evaluation_stats_tmp, + query_sub_operators) + + evaluation.append(query_sub) + evaluation_models.append(evaluation_models_tmp) + evaluation_stats.append(reduc_rate) + op_index += 1 + + evaluations.append(self.convertL2S(evaluation, query_operators)) + evaluations_stats.append( + self._update_stats(evaluation_stats, query_operators)) + + max_index = np.argmax(np.array(evaluations_stats), axis=0) + best_query = evaluations[ + max_index] # this will be something like "t!=bus && t!=truck && + # t!=car" + best_models = evaluation_models[max_index] + best_reduction_rate = evaluations_stats[max_index] + + pp_names, op_names = self._convertQuery2PPOps(best_query) + return [list(zip(pp_names, best_models)), op_names, + best_reduction_rate] + + def _convertQuery2PPOps(self, query): + """ + + :param query: str (t!=car && t!=truck) + :return: + """ + query_split = query.split(" ") + pp_names = [] + op_names = [] + for i in range(len(query_split)): + if i % 2 == 0: + pp_names.append(query_split[i]) + else: + if query_split[i] == "&&": + op_names.append(np.logical_and) + else: + op_names.append(np.logical_or) + + return pp_names, op_names + + # Make this function take in the list of reduction rates and the + # operator lists + + def _update_stats(self, evaluation_stats, query_operators): + if len(evaluation_stats) == 0: + return 0 + final_red = evaluation_stats[0] + assert (len(evaluation_stats) == len(query_operators) + 1) + + for i in range(1, len(evaluation_stats)): + if query_operators[i - 1] == "&&": + final_red = final_red + evaluation_stats[i] - final_red * \ + evaluation_stats[i] + elif query_operators[i - 1] == "||": + final_red = final_red * evaluation_stats[i] + + return final_red + + def _compute_cost_red_rate(self, C, R): + assert (R >= 0 
and R <= 1) # R is reduction rate and should be + # between 0 and 1 + if R == 0: + R = 0.000001 + return float(C) / R + + def _find_model(self, pp_name, pp_stats, accuracy_budget): + possible_models = pp_stats[pp_name] + best = [] # [best_model_name, best_model_cost / + # best_model_reduction_rate] + for possible_model in possible_models: + if possible_models[possible_model]["A"] < accuracy_budget: + continue + if best == []: + best = [possible_model, self._compute_cost_red_rate( + possible_models[possible_model]["C"], + possible_models[possible_model]["R"]), possible_models[ + possible_model]["R"]] + else: + alternative_best_cost = self._compute_cost_red_rate( + possible_models[possible_model]["C"], + possible_models[possible_model]["R"]) + if alternative_best_cost < best[1]: + best = [possible_model, alternative_best_cost, + possible_models[possible_model]["R"]] + + if best == []: + return None + else: + return best[0], best[2] + + +if __name__ == "__main__": + # TODO: Support for parenthesis queries + query_list_mod = ["t=suv", "s>60", + "c=white", "c!=white", "o=pt211", "c=white && t=suv", + "s>60 && s<65", "t=sedan || t=truck", + "i=pt335 && o=pt211", + "t=suv && c!=white", "c=white && t!=suv && t!=van", + "t=van && s>60 && s<65", + "t=sedan || t=truck && c!=white", + "i=pt335 && o!=pt211 && o!=pt208", + "t=van && i=pt335 && o=pt211", + "t!=sedan && c!=black && c!=silver && t!=truck", + "t=van && s>60 && s<65 && o=pt211", + "t!=suv && t!=van && c!=red && t!=white", + "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", + "i=pt335 && o=pt211 && t=van && c=red"] + + qo = QOMinimum() + print(qo.executeQueries(query_list_mod)) diff --git a/src/optimizer/qo_template.py b/src/optimizer/qo_template.py new file mode 100644 index 0000000000..e000ba42b7 --- /dev/null +++ b/src/optimizer/qo_template.py @@ -0,0 +1,43 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file gives interface to all query optimizer modules +If any issues arise please contact jaeho.bang@gmail.com + +@Jaeho Bang +""" + +from abc import ABCMeta, abstractmethod + +""" +Initial Design Thoughts: +Query Optimizer by definition should perform two tasks: +1. analyze Structered Query Language +2. Determine efficient execution mechanisms (plans) + +""" + + +class QOTemplate(metaclass=ABCMeta): + + @abstractmethod + def executeQueries(self, queries: list) -> list: + """ + Query Optimizer by definition should perform two tasks: + 1. Analyze given Structured Query Language (SQL) + 2. 
Determine efficient execution mechanisms/plans + :param queries: input queries / query + :return: output plans / plan that can be understood by the system + """ diff --git a/src/optimizer/query_optimizer.py b/src/optimizer/query_optimizer.py new file mode 100644 index 0000000000..d85a0d4d7d --- /dev/null +++ b/src/optimizer/query_optimizer.py @@ -0,0 +1,595 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file composes the functions that are needed to perform query optimization. +Currently, given a query, it does logical changes to forms that are +sufficient conditions. +Using statistics from Filters module, it outputs the optimal plan (converted +query with models needed to be used). + +To see the query optimizer performance in action, simply run + +python query_optimizer/query_optimizer.py + +@Jaeho Bang + +""" +import os +import socket +# The query optimizer decide how to label the data points +# Load the series of queries from a txt file? +import sys +import threading +from itertools import product + +import numpy as np + +from src import constants + +eva_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +sys.path.append(eva_dir) + + +class QueryOptimizer: + """ + TODO: If you have a classifier for =, you can make a classifier for != + TODO: Deal with parenthesis + """ + + def __init__(self, ip_str="127.0.0.1"): + self.ip_str = ip_str + # self.startSocket() + self.operators = ["!=", ">=", "<=", "=", "<", ">"] + self.separators = ["||", "&&"] + + def startSocket(self): + thread = threading.Thread(target=self.inputQueriesFromSocket) + thread.daemon = True + thread.start() + while True: + input = eval(input( + 'Type in your query in the form of __label__ > __number__\n')) + + self.parseInput(input) + + def parseInput(self, input): + """ + TODO: Need to provide query formats that can be used + :param input: string to be parsed + :return: something that the Load() class can understand + """ + + def inputQueriesFromTxt(self, input_path): + """ + TODO: Read the file line by line, use self.parseInput to give back + commands + :param input_path: full directory + file name + :return: method of training the pps + """ + + def inputQueriesFromSocket(self): + sock = socket.socket() + sock.bind(self.ip_str, 123) + sock.listen(3) + print("Waiting on connection") + conn = sock.accept() + print("Client connected") + while True: + m = conn[0].recv(4096) + conn[0].send(m[::-1]) + + sock.shutdown(socket.SHUT_RDWR) + sock.close() + + def _findParenthesis(self, query): + + start = [] + end = [] + query_copy = query + index = query_copy.find("(") + while index != -1: + start.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find("(") + + query_copy = query + index = query_copy.find(")") + while index != -1: + end.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find(")") + + return [start, end] + + def _parseQuery(self, query): + """ + Each sub query will be a list + There will be a separator in 
between + :param query: + :return: + """ + + query_parsed = [] + query_subs = query.split(" ") + query_operators = [] + for query_sub in query_subs: + if query_sub == "||" or query_sub == "&&": + query_operators.append(query_sub) + else: + + if True not in [operator in self.operators for operator in + query_sub]: + return [], [] + for operator in self.operators: + query_sub_list = query_sub.split(operator) + if isinstance(query_sub_list, list) and len( + query_sub_list) > 1: + query_parsed.append( + [query_sub_list[0], operator, query_sub_list[1]]) + break + # query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] + # query_operators ex: ["||", "||", "&&"] + return query_parsed, query_operators + + def _logic_reverse(self, str): + if str == "=": + return "!=" + elif str == "!=": + return "=" + elif str == ">": + return "<=" + elif str == ">=": + return "<" + elif str == "<": + return ">=" + elif str == "<=": + return ">" + + def convertL2S(self, parsed_query, query_ops): + final_str = "" + index = 0 + for sub_parsed_query in parsed_query: + if len(parsed_query) >= 2 and index < len(query_ops): + final_str += ''.join(sub_parsed_query) + " " + query_ops[ + index] + " " + index += 1 + else: + final_str += ''.join(sub_parsed_query) + return final_str + + def _wrangler(self, query, label_desc): + """ + import itertools + iterables = [ [1,2,3,4], [88,99], ['a','b'] ] + for t in itertools.product(*iterables): + print t + + Different types of checks are performed + 1. not equals check (f(C) != v) + 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) + 3. Range check (v1 <= f(C) <= v2) - special type of comparison check + 4. No-predicates = when column in finite and discrete, it can still + benefit + ex) 1 <=> type = car U type = truck U type = SUV + :return: transformed query + """ + # TODO: Need to implement range check + + query_parsed, query_operators = self._parseQuery(query) + # query_sorted = sorted(query_parsed) + + query_transformed = [] + equivalences = [] + + for query_sub_list in query_parsed: + subject = query_sub_list[0] + operator = query_sub_list[1] + object = query_sub_list[2] + + assert ( + subject in label_desc) # Label should be in label + # description dictionary + l_desc = label_desc[subject] + if l_desc[0] == constants.DISCRETE: + equivalence = [self.convertL2S([query_sub_list], [])] + assert (operator == "=" or operator == "!=") + alternate_string = "" + for category in l_desc[1]: + if category != object: + alternate_string += subject + self._logic_reverse( + operator) + category + " && " + alternate_string = alternate_string[ + :-len(" && ")] # must strip the last ' || ' + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + elif l_desc[0] == constants.CONTINUOUS: + + equivalence = [self.convertL2S([query_sub_list], [])] + assert (operator == "=" or operator == "!=" or operator == "<" + or operator == "<=" or operator == ">" or operator == + ">=") + alternate_string = "" + if operator == "!=": + alternate_string += subject + ">" + object + " && " + \ + subject + "<" + object + query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(query_tmp) + if operator == "<" or operator == "<=": + object_num = eval(object) + for number in l_desc[1]: + if number > object_num: + alternate_string = subject + operator + str(number) + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + if operator == ">" or operator == ">=": + object_num = eval(object) + for number in l_desc[1]: + if 
number < object_num: + alternate_string = subject + operator + str(number) + # query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + equivalences.append(equivalence) + + possible_queries = product(*equivalences) + for q in possible_queries: + query_transformed.append(q) + + return query_transformed, query_operators + + def _compute_expression(self, query_info, pp_list, pp_stats, k, + accuracy_budget): + """ + + def QueryOptimizer(P, {trained PPs}): + P = wrangler(P) + {E} = compute_expressions(P,{trained PP},k) #k is a fixed + constant which limits number of individual PPs + in the final expression + for E in {E}: + Explore_PP_accuracy_budget(E) # Paper says dynamic program + Explore_PP_Orderings(E) #if k is small, any number of orders + can be explored + Compute_cost_vs_red_rate(E) #arithmetic over individual c, + a and r[a] numbers + return E_with_max_c/r + + + 1. p^(P/p) -> PPp + 2. PPp^q -> PPp ^ PPq + 3. PPpvq -> PPp v PPq + 4. p^(P/p) -> ~PP~q + -> we don't need to apply these rules, we simply need to see for each + sub query which PP gives us the best rate + :param query_info: [possible query forms for a given query, operators + that go in between] + :param pp_list: list of pp names that are currently available + :param pp_stats: list of pp models associated with each pp name with + R,C,A values saved + :param k: number of pps we can use at maximum + :return: the list of pps to use that maximizes reduction rate (ATM) + """ + evaluations = [] + evaluation_models = [] + evaluations_stats = [] + query_transformed, query_operators = query_info + # query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] + for possible_query in query_transformed: + evaluation = [] + evaluation_stats = [] + k_count = 0 + op_index = 0 + for query_sub in possible_query: # Even inside query_sub it can + # be divided into query_sub_sub + if k_count > k: # TODO: If you exceed a certain number, + # you just ignore the expression + evaluation = [] + evaluation_stats = [] + continue + query_sub_list, query_sub_operators = self._parseQuery( + query_sub) + evaluation_tmp = [] + evaluation_models_tmp = [] + evaluation_stats_tmp = [] + for i in range(len(query_sub_list)): + query_sub_str = ''.join(query_sub_list[i]) + if query_sub_str in pp_list: + # Find the best model for the pp + + data = self._find_model(query_sub_str, pp_stats, + accuracy_budget) + if data is None: + continue + else: + model, reduction_rate = data + evaluation_tmp.append(query_sub_str) + evaluation_models_tmp.append( + model) # TODO: We need to make sure this is + # the model_name + evaluation_stats_tmp.append(reduction_rate) + k_count += 1 + + reduc_rate = 0 + if len(evaluation_stats_tmp) != 0: + reduc_rate = self._update_stats(evaluation_stats_tmp, + query_sub_operators) + + evaluation.append(query_sub) + evaluation_models.append(evaluation_models_tmp) + evaluation_stats.append(reduc_rate) + op_index += 1 + + evaluations.append(self.convertL2S(evaluation, query_operators)) + evaluations_stats.append( + self._update_stats(evaluation_stats, query_operators)) + + max_index = np.argmax(np.array(evaluations_stats), axis=0) + best_query = evaluations[ + max_index] # this will be something like "t!=bus && t!=truck && + # t!=car" + best_models = evaluation_models[max_index] + best_reduction_rate = evaluations_stats[max_index] + + pp_names, op_names = self._convertQuery2PPOps(best_query) + return [list(zip(pp_names, best_models)), op_names, + best_reduction_rate] + + def _convertQuery2PPOps(self, 
query): + """ + + :param query: str (t!=car && t!=truck) + :return: + """ + query_split = query.split(" ") + pp_names = [] + op_names = [] + for i in range(len(query_split)): + if i % 2 == 0: + pp_names.append(query_split[i]) + else: + if query_split[i] == "&&": + op_names.append(np.logical_and) + else: + op_names.append(np.logical_or) + + return pp_names, op_names + + # Make this function take in the list of reduction rates and the operator + # lists + def _update_stats(self, evaluation_stats, query_operators): + if len(evaluation_stats) == 0: + return 0 + final_red = evaluation_stats[0] + assert (len(evaluation_stats) == len(query_operators) + 1) + + for i in range(1, len(evaluation_stats)): + if query_operators[i - 1] == "&&": + final_red = final_red + evaluation_stats[i] - final_red * \ + evaluation_stats[i] + elif query_operators[i - 1] == "||": + final_red = final_red * evaluation_stats[i] + + return final_red + + def _compute_cost_red_rate(self, C, R): + assert ( + R >= 0 and R <= 1) # R is reduction rate and should be + # between 0 and 1 + if R == 0: + R = 0.000001 + return float(C) / R + + def _find_model(self, pp_name, pp_stats, accuracy_budget): + possible_models = pp_stats[pp_name] + best = [] # [best_model_name, best_model_cost / + # best_model_reduction_rate] + for possible_model in possible_models: + if possible_models[possible_model]["A"] < accuracy_budget: + continue + if best == []: + best = [possible_model, self._compute_cost_red_rate( + possible_models[possible_model]["C"], + possible_models[possible_model]["R"]), + possible_models[possible_model]["R"]] + else: + alternative_best_cost = self._compute_cost_red_rate( + possible_models[possible_model]["C"], + possible_models[possible_model]["R"]) + if alternative_best_cost < best[1]: + best = [possible_model, alternative_best_cost, + possible_models[possible_model]["R"]] + + if best == []: + return None + else: + return best[0], best[2] + + def run(self, query, pp_list, pp_stats, label_desc, k=3, + accuracy_budget=0.9): + """ + + :param query: query of interest ex) TRAF-20 + :param pp_list: list of pp_descriptions - queries that are available + :param pp_stats: this will be dictionary where keys are "pca/ddn", + it will have statistics saved which are R ( + reduction_rate), C (cost_to_train), A (accuracy) + :param k: number of different PPs that are in any expression E + :return: selected PPs to use for reduction + """ + query_transformed, query_operators = self._wrangler(query, label_desc) + # query_transformed is a comprehensive list of transformed queries + return self._compute_expression([query_transformed, query_operators], + pp_list, pp_stats, k, accuracy_budget) + + +if __name__ == "__main__": + + query_list = ["t=suv", "s>60", + "c=white", "c!=white", "o=pt211", "c=white && t=suv", + "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", + "t=suv && c!=white", "c=white && t!=suv && t!=van", + "t=van && s>60 && s<65", "c!=white && (t=sedan || t=truck)", + "i=pt335 && o!=pt211 && o!=pt208", + "t=van && i=pt335 && o=pt211", + "t!=sedan && c!=black && c!=silver && t!=truck", + "t=van && s>60 && s<65 && o=pt211", + "t!=suv && t!=van && c!=red && t!=white", + "(i=pt335 || i=pt342) && o!=pt211 && o!=pt208", + "i=pt335 && o=pt211 && t=van && c=red"] + + # TODO: Support for parenthesis queries + query_list_mod = ["t=suv", "s>60", + "c=white", "c!=white", "o=pt211", "c=white && t=suv", + "s>60 && s<65", "t=sedan || t=truck", + "i=pt335 && o=pt211", + "t=suv && c!=white", "c=white && t!=suv && t!=van", + "t=van && s>60 
&& s<65", + "t=sedan || t=truck && c!=white", + "i=pt335 && o!=pt211 && o!=pt208", + "t=van && i=pt335 && o=pt211", + "t!=sedan && c!=black && c!=silver && t!=truck", + "t=van && s>60 && s<65 && o=pt211", + "t!=suv && t!=van && c!=red && t!=white", + "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", + "i=pt335 && o=pt211 && t=van && c=red"] + + query_list_test = ["c=white && t!=suv && t!=van"] + + synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", + "c=red", "c=white", "c=black", "c=silver", + "s>40", "s>50", "s>60", "s<65", "s<70", + "i=pt335", "i=pt211", "i=pt342", "i=pt208", + "o=pt335", "o=pt211", "o=pt342", "o=pt208"] + + query_list_short = ["t=van && s>60 && o=pt211"] + + synthetic_pp_list_short = ["t=van", "s>60", "o=pt211"] + + # TODO: Might need to change this to a R vs A curve instead of static + # numbers + # TODO: When selecting appropriate PPs, we only select based on reduction + # rate + synthetic_pp_stats_short = { + "t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, + "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, + "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, + + "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, + "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, + + "o=pt211": {"none/dnn": {"R": 0.13, "C": 0.32, "A": 0.99}, + "none/kde": {"R": 0.14, "C": 0.12, "A": 0.93}}} + + synthetic_pp_stats = {"t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, + "pca/dnn": {"R": 0.2, "C": 0.15, + "A": 0.92}, + "none/kde": {"R": 0.15, "C": 0.05, + "A": 0.95}}, + "t=suv": { + "none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, + "t=sedan": { + "none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, + "t=truck": { + "none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, + + "c=red": { + "none/svm": {"R": 0.131, "C": 0.011, + "A": 0.951}}, + "c=white": { + "none/svm": {"R": 0.212, "C": 0.012, + "A": 0.942}}, + "c=black": { + "none/svm": {"R": 0.133, "C": 0.013, + "A": 0.953}}, + "c=silver": { + "none/svm": {"R": 0.214, "C": 0.014, + "A": 0.944}}, + + "s>40": { + "none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, + "s>50": { + "none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, + + "s>60": { + "none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, + "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, + + "s<65": { + "none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, + "s<70": { + "none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, + + "o=pt211": { + "none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, + "A": 0.932}}, + + "o=pt335": { + "none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, + "A": 0.934}}, + + "o=pt342": { + "none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, + "A": 0.935}}, + + "o=pt208": { + "none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, + "A": 0.936}}, + + "i=pt211": { + "none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, + "A": 0.932}}, + + "i=pt335": { + "none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, + "A": 0.934}}, + + "i=pt342": { + "none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, + "A": 0.935}}, + + "i=pt208": { + "none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, + "A": 0.936}}} + + # TODO: We will need to convert the queries/labels into "car, bus, van, + # others". 
This is how the dataset defines things + + label_desc = {"t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], + "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], + "c": [constants.DISCRETE, + ["white", "red", "black", "silver"]], + "i": [constants.DISCRETE, + ["pt335", "pt342", "pt211", "pt208"]], + "o": [constants.DISCRETE, + ["pt335", "pt342", "pt211", "pt208"]]} + + qo = QueryOptimizer() + + print("Running Query Optimizer Demo...") + + for query in query_list_mod: + print(query, " -> ", ( + qo.run(query, synthetic_pp_list, synthetic_pp_stats, label_desc))) + # print qo.run(query, synthetic_pp_list_short, + # synthetic_pp_stats_short, label_desc) diff --git a/src/optimizer/query_optimizer.py.bak b/src/optimizer/query_optimizer.py.bak new file mode 100644 index 0000000000..58df4de31d --- /dev/null +++ b/src/optimizer/query_optimizer.py.bak @@ -0,0 +1,510 @@ + +# The query optimizer decide how to label the data points +# Load the series of queries from a txt file? +import sys +import os +import socket +import threading +import numpy as np +from itertools import product +from time import sleep + +eva_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +sys.path.append(eva_dir) +import constants + + + +class QueryOptimizer: + """ + TODO: If you have a classifier for =, you can make a classifier for != + TODO: Deal with parenthesis + """ + + def __init__(self, ip_str="127.0.0.1"): + self.ip_str = ip_str + #self.startSocket() + self.operators = ["!=", ">=", "<=", "=", "<", ">"] + self.separators = ["||", "&&"] + + + + def startSocket(self): + thread = threading.Thread(target=self.inputQueriesFromSocket) + thread.daemon = True + thread.start() + while True: + input = input('Type in your query in the form of __label__ > __number__\n') + + self.parseInput(input) + + + def parseInput(self, input): + """ + TODO: Need to provide query formats that can be used + :param input: string to be parsed + :return: something that the Load() class can understand + """ + pass + + + def inputQueriesFromTxt(self, input_path): + """ + TODO: Read the file line by line, use self.parseInput to give back commands + :param input_path: full directory + file name + :return: method of training the pps + """ + pass + + + def inputQueriesFromSocket(self): + sock = socket.socket() + sock.bind(self.ip_str, 123) + sock.listen(3) + print("Waiting on connection") + conn = sock.accept() + print("Client connected") + while True: + m = conn[0].recv(4096) + conn[0].send(m[::-1]) + + sock.shutdown(socket.SHUT_RDWR) + sock.close() + + + def _findParenthesis(self, query): + + start = [] + end = [] + query_copy = query + index = query_copy.find("(") + while index != -1: + start.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find("(") + + query_copy = query + index = query_copy.find(")") + while index != -1: + end.append(index) + query_copy = query_copy[index + 1:] + index = query_copy.find(")") + + return [start, end] + + + def _parseQuery(self, query): + """ + Each sub query will be a list + There will be a separator in between + :param query: + :return: + """ + + + query_parsed = [] + query_subs = query.split(" ") + query_operators = [] + for query_sub in query_subs: + if query_sub == "||" or query_sub == "&&": + query_operators.append(query_sub) + else: + + if True not in [operator in self.operators for operator in query_sub]: + return [],[] + for operator in self.operators: + query_sub_list = query_sub.split(operator) + if type(query_sub_list) is list and len(query_sub_list) > 1: + 
query_parsed.append([query_sub_list[0], operator, query_sub_list[1]]) + break + #query_parsed ex: [ ["t", "=", "van"], ["s", ">", "60"]] + #query_operators ex: ["||", "||", "&&"] + return query_parsed, query_operators + + + + + def _logic_reverse(self, str): + if str == "=": + return "!=" + elif str == "!=": + return "=" + elif str == ">": + return "<=" + elif str == ">=": + return "<" + elif str == "<": + return ">=" + elif str == "<=": + return ">" + + def convertL2S(self, parsed_query, query_ops): + final_str = "" + index = 0 + for sub_parsed_query in parsed_query: + if len(parsed_query) >= 2 and index < len(query_ops): + final_str += ''.join(sub_parsed_query) + " " + query_ops[index] + " " + index += 1 + else: + final_str += ''.join(sub_parsed_query) + return final_str + + + def _wrangler(self, query, label_desc): + """ + import itertools + iterables = [ [1,2,3,4], [88,99], ['a','b'] ] + for t in itertools.product(*iterables): + print t + + Different types of checks are performed + 1. not equals check (f(C) != v) + 2. comparison check (f(C) > v -> f(C) > t, for all t <= v) + 3. Range check (v1 <= f(C) <= v2) - special type of comparison check + 4. No-predicates = when column in finite and discrete, it can still benefit + ex) 1 <=> type = car U type = truck U type = SUV + :return: transformed query + """ + #TODO: Need to implement range check + + query_parsed, query_operators = self._parseQuery(query) + #query_sorted = sorted(query_parsed) + + query_transformed = [] + equivalences = [] + equivalences_op = [] + + for query_sub_list in query_parsed: + subject = query_sub_list[0] + operator = query_sub_list[1] + object = query_sub_list[2] + + assert(subject in label_desc) # Label should be in label description dictionary + l_desc = label_desc[subject] + if l_desc[0] == constants.DISCRETE: + equivalence = [self.convertL2S([query_sub_list], [])] + assert(operator == "=" or operator == "!=") + alternate_string = "" + for category in l_desc[1]: + if category != object: + alternate_string += subject + self._logic_reverse(operator) + category + " && " + alternate_string = alternate_string[:-len(" && ")] #must strip the last ' || ' + #query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + elif l_desc[0] == constants.CONTINUOUS: + + equivalence = [self.convertL2S([query_sub_list], [])] + assert(operator == "=" or operator == "!=" or operator == "<" + or operator == "<=" or operator == ">" or operator == ">=") + alternate_string = "" + if operator == "!=": + alternate_string += subject + ">" + object + " && " + subject + "<" + object + query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(query_tmp) + if operator == "<" or operator == "<=": + object_num = eval(object) + for number in l_desc[1]: + if number > object_num: + alternate_string = subject + operator + str(number) + #query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + if operator == ">" or operator == ">=": + object_num = eval(object) + for number in l_desc[1]: + if number < object_num: + alternate_string = subject + operator + str(number) + #query_tmp, _ = self._parseQuery(alternate_string) + equivalence.append(alternate_string) + + equivalences.append(equivalence) + + possible_queries = product(*equivalences) + for q in possible_queries: + query_transformed.append( q ) + + return query_transformed, query_operators + + + + def _compute_expression(self, query_info, pp_list, pp_stats, k, accuracy_budget): + """ + + def QueryOptimizer(P, {trained 
PPs}): + P = wrangler(P) + {E} = compute_expressions(P,{trained PP},k) #k is a fixed constant which limits number of individual PPs + in the final expression + for E in {E}: + Explore_PP_accuracy_budget(E) # Paper says dynamic program + Explore_PP_Orderings(E) #if k is small, any number of orders can be explored + Compute_cost_vs_red_rate(E) #arithmetic over individual c,a and r[a] numbers + return E_with_max_c/r + + + 1. p^(P/p) -> PPp + 2. PPp^q -> PPp ^ PPq + 3. PPpvq -> PPp v PPq + 4. p^(P/p) -> ~PP~q + -> we don't need to apply these rules, we simply need to see for each sub query which PP gives us the best rate + :param query_info: [possible query forms for a given query, operators that go in between] + :param pp_list: list of pp names that are currently available + :param pp_stats: list of pp models associated with each pp name with R,C,A values saved + :param k: number of pps we can use at maximum + :return: the list of pps to use that maximizes reduction rate (ATM) + """ + evaluations = [] + evaluation_models = [] + evaluations_stats = [] + query_transformed, query_operators = query_info + #query_transformed = [[["t", "!=", "car"], ["t", "=", "van"]], ... ] + for possible_query in query_transformed: + evaluation = [] + evaluation_stats = [] + k_count = 0 + op_index = 0 + for query_sub in possible_query: #Even inside query_sub it can be divided into query_sub_sub + if k_count > k: #TODO: If you exceed a certain number, you just ignore the expression + evaluation = [] + evaluation_stats = [] + continue + query_sub_list, query_sub_operators = self._parseQuery(query_sub) + evaluation_tmp = [] + evaluation_models_tmp = [] + evaluation_stats_tmp = [] + for i in range(len(query_sub_list)): + query_sub_str = ''.join(query_sub_list[i]) + if query_sub_str in pp_list: + #Find the best model for the pp + + data = self._find_model(query_sub_str, pp_stats, accuracy_budget) + if data == None: + continue + else: + model, reduction_rate = data + evaluation_tmp.append(query_sub_str) + evaluation_models_tmp.append(model) #TODO: We need to make sure this is the model_name + evaluation_stats_tmp.append(reduction_rate) + k_count += 1 + + + reduc_rate = 0 + if len(evaluation_stats_tmp) != 0: + reduc_rate = self._update_stats(evaluation_stats_tmp, query_sub_operators) + + evaluation.append(query_sub) + evaluation_models.append(evaluation_models_tmp) + evaluation_stats.append(reduc_rate) + op_index += 1 + + + evaluations.append( self.convertL2S(evaluation, query_operators) ) + evaluations_stats.append( self._update_stats(evaluation_stats, query_operators) ) + + max_index = np.argmax(np.array(evaluations_stats), axis = 0) + best_query = evaluations[max_index] #this will be something like "t!=bus && t!=truck && t!=car" + best_models = evaluation_models[max_index] + best_reduction_rate = evaluations_stats[max_index] + + pp_names, op_names = self._convertQuery2PPOps(best_query) + return [list(zip(pp_names, best_models)), op_names, best_reduction_rate] + + + def _convertQuery2PPOps(self, query): + """ + + :param query: str (t!=car && t!=truck) + :return: + """ + query_split = query.split(" ") + pp_names = [] + op_names = [] + for i in range(len(query_split)): + if i % 2 == 0: + pp_names.append(query_split[i]) + else: + if query_split[i] == "&&": + op_names.append(np.logical_and) + else: + op_names.append(np.logical_or) + + return pp_names, op_names + + + + + #Make this function take in the list of reduction rates and the operator lists + def _update_stats(self, evaluation_stats, query_operators): + if 
len(evaluation_stats) == 0: + return 0 + final_red = evaluation_stats[0] + assert(len(evaluation_stats) == len(query_operators) + 1) + + for i in range(1, len(evaluation_stats)): + if query_operators[i - 1] == "&&": + final_red = final_red + evaluation_stats[i] - final_red * evaluation_stats[i] + elif query_operators[i - 1] == "||": + final_red = final_red * evaluation_stats[i] + + return final_red + + + + + + def _compute_cost_red_rate(self, C, R): + assert(R >= 0 and R <= 1) #R is reduction rate and should be between 0 and 1 + if R == 0: + R = 0.000001 + return float(C) / R + + def _find_model(self, pp_name, pp_stats, accuracy_budget): + possible_models = pp_stats[pp_name] + best = [] #[best_model_name, best_model_cost / best_model_reduction_rate] + for possible_model in possible_models: + if possible_models[possible_model]["A"] < accuracy_budget: + continue + if best == []: + best = [possible_model, self._compute_cost_red_rate(possible_models[possible_model]["C"], + possible_models[possible_model]["R"]), + possible_models[possible_model]["R"]] + else: + alternative_best_cost = self._compute_cost_red_rate(possible_models[possible_model]["C"], + possible_models[possible_model]["R"]) + if alternative_best_cost < best[1]: + best = [possible_model, alternative_best_cost, possible_models[possible_model]["R"]] + + + if best == []: + return None + else: + return best[0], best[2] + + def run(self, query, pp_list, pp_stats, label_desc, k = 3, accuracy_budget = 0.9): + """ + + :param query: query of interest ex) TRAF-20 + :param pp_list: list of pp_descriptions - queries that are available + :param pp_stats: this will be dictionary where keys are "pca/ddn", + it will have statistics saved which are R (reduction_rate), C (cost_to_train), A (accuracy) + :param k: number of different PPs that are in any expression E + :return: selected PPs to use for reduction + """ + query_transformed, query_operators = self._wrangler(query, label_desc) + #query_transformed is a comprehensive list of transformed queries + return self._compute_expression([query_transformed, query_operators], pp_list, pp_stats, k, accuracy_budget) + + +if __name__ == "__main__": + + + query_list = ["t=suv", "s>60", + "c=white", "c!=white", "o=pt211", "c=white && t=suv", + "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", + "t=suv && c!=white", "c=white && t!=suv && t!=van", + "t=van && s>60 && s<65", "c!=white && (t=sedan || t=truck)", + "i=pt335 && o!=pt211 && o!=pt208", "t=van && i=pt335 && o=pt211", + "t!=sedan && c!=black && c!=silver && t!=truck", + "t=van && s>60 && s<65 && o=pt211", "t!=suv && t!=van && c!=red && t!=white", + "(i=pt335 || i=pt342) && o!=pt211 && o!=pt208", + "i=pt335 && o=pt211 && t=van && c=red"] + + + #TODO: Support for parenthesis queries + query_list_mod = ["t=suv", "s>60", + "c=white", "c!=white", "o=pt211", "c=white && t=suv", + "s>60 && s<65", "t=sedan || t=truck", "i=pt335 && o=pt211", + "t=suv && c!=white", "c=white && t!=suv && t!=van", + "t=van && s>60 && s<65", "t=sedan || t=truck && c!=white", + "i=pt335 && o!=pt211 && o!=pt208", "t=van && i=pt335 && o=pt211", + "t!=sedan && c!=black && c!=silver && t!=truck", + "t=van && s>60 && s<65 && o=pt211", "t!=suv && t!=van && c!=red && t!=white", + "i=pt335 || i=pt342 && o!=pt211 && o!=pt208", + "i=pt335 && o=pt211 && t=van && c=red"] + + query_list_test = ["c=white && t!=suv && t!=van"] + + + synthetic_pp_list = ["t=suv", "t=van", "t=sedan", "t=truck", + "c=red", "c=white", "c=black", "c=silver", + "s>40", "s>50", "s>60", "s<65", "s<70", + 
"i=pt335", "i=pt211", "i=pt342", "i=pt208", + "o=pt335", "o=pt211", "o=pt342", "o=pt208"] + + query_list_short = ["t=van && s>60 && o=pt211"] + + + synthetic_pp_list_short = ["t=van", "s>60", "o=pt211"] + + + #TODO: Might need to change this to a R vs A curve instead of static numbers + #TODO: When selecting appropriate PPs, we only select based on reduction rate + synthetic_pp_stats_short = {"t=van" :{ "none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, + "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, + "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, + + "s>60" :{ "none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, + "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, + + "o=pt211" :{ "none/dnn": {"R": 0.13, "C": 0.32, "A": 0.99}, + "none/kde": {"R": 0.14, "C": 0.12, "A": 0.93}} } + + synthetic_pp_stats = {"t=van": {"none/dnn": {"R": 0.1, "C": 0.1, "A": 0.9}, + "pca/dnn": {"R": 0.2, "C": 0.15, "A": 0.92}, + "none/kde": {"R": 0.15, "C": 0.05, "A": 0.95}}, + "t=suv": {"none/svm": {"R": 0.13, "C": 0.01, "A": 0.95}}, + "t=sedan": {"none/svm": {"R": 0.21, "C": 0.01, "A": 0.94}}, + "t=truck": {"none/svm": {"R": 0.05, "C": 0.01, "A": 0.99}}, + + "c=red": {"none/svm": {"R": 0.131, "C": 0.011, "A": 0.951}}, + "c=white": {"none/svm": {"R": 0.212, "C": 0.012, "A": 0.942}}, + "c=black": {"none/svm": {"R": 0.133, "C": 0.013, "A": 0.953}}, + "c=silver": {"none/svm": {"R": 0.214, "C": 0.014, "A": 0.944}}, + + "s>40": {"none/svm": {"R": 0.08, "C": 0.20, "A": 0.8}}, + "s>50": {"none/svm": {"R": 0.10, "C": 0.20, "A": 0.82}}, + + "s>60": {"none/dnn": {"R": 0.12, "C": 0.21, "A": 0.87}, + "none/kde": {"R": 0.15, "C": 0.06, "A": 0.96}}, + + "s<65": {"none/svm": {"R": 0.05, "C": 0.20, "A": 0.8}}, + "s<70": {"none/svm": {"R": 0.02, "C": 0.20, "A": 0.9}}, + + "o=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, + + "o=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, + + "o=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, + + "o=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}, + + "i=pt211": {"none/dnn": {"R": 0.135, "C": 0.324, "A": 0.993}, + "none/kde": {"R": 0.143, "C": 0.123, "A": 0.932}}, + + "i=pt335": {"none/dnn": {"R": 0.134, "C": 0.324, "A": 0.994}, + "none/kde": {"R": 0.144, "C": 0.124, "A": 0.934}}, + + "i=pt342": {"none/dnn": {"R": 0.135, "C": 0.325, "A": 0.995}, + "none/kde": {"R": 0.145, "C": 0.125, "A": 0.935}}, + + "i=pt208": {"none/dnn": {"R": 0.136, "C": 0.326, "A": 0.996}, + "none/kde": {"R": 0.146, "C": 0.126, "A": 0.936}}} + + #TODO: We will need to convert the queries/labels into "car, bus, van, others". 
This is how the dataset defines things + + label_desc = {"t": [constants.DISCRETE, ["sedan", "suv", "truck", "van"]], + "s": [constants.CONTINUOUS, [40, 50, 60, 65, 70]], + "c": [constants.DISCRETE, ["white", "red", "black", "silver"]], + "i": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]], + "o": [constants.DISCRETE, ["pt335", "pt342", "pt211", "pt208"]]} + + qo = QueryOptimizer() + for query in query_list_mod: + #print qo.run(query, synthetic_pp_list_short, synthetic_pp_stats_short, label_desc) + print(qo.run(query, synthetic_pp_list, synthetic_pp_stats, label_desc)) + + + diff --git a/src/optimizer/test.py b/src/optimizer/test.py new file mode 100644 index 0000000000..89d84ad69b --- /dev/null +++ b/src/optimizer/test.py @@ -0,0 +1,21 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import sys + +print(sys.path) + + +def test(): + print("hi") diff --git a/src/optimizer/tests/__init__.py b/src/optimizer/tests/__init__.py new file mode 100644 index 0000000000..e9978151f4 --- /dev/null +++ b/src/optimizer/tests/__init__.py @@ -0,0 +1,14 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/src/optimizer/tests/query_optimizer_test_pytest.py.bak b/src/optimizer/tests/query_optimizer_test_pytest.py.bak new file mode 100644 index 0000000000..c206c61967 --- /dev/null +++ b/src/optimizer/tests/query_optimizer_test_pytest.py.bak @@ -0,0 +1,65 @@ +from query_optimizer.query_optimizer import QueryOptimizer + +obj=QueryOptimizer() + +def test_parseQuery(): + + #case 1: Simple input/ouput check + predicates,connectors=obj._parseQuery("t>60 && q>=4 && v=car") + if predicates!=[["t",">","60"],["q",">=","4"],["v","=","car"]]: + assert False,"Wrong breakdown of predicates" + if connectors!=["&&","&&"]: + assert False,"Wrong list of connectors" + + # case 2: Case when an extra space is present in the input + predicates, connectors = obj._parseQuery("t>60 && q>=4 && v=car") + if predicates != [["t", ">", "60"], ["q", ">=", "4"], ["v", "=", "car"]]: + assert False, "Wrong breakdown of predicates, can't handle consecutive spaces." 
+    if connectors != ["&&", "&&"]:
+        assert False, "Wrong list of connectors"
+
+    #case 3: No valid operator exists
+    predicates, connectors = obj._parseQuery("t!60")
+    if predicates != []:
+        assert False, "Wrong breakdown of predicates"
+    if connectors != []:
+        assert False, "Wrong list of connectors"
+
+    predicates, connectors = obj._parseQuery("t>60 && adfsg")
+    if predicates != []:
+        assert False, "Wrong breakdown of predicates"
+    if connectors != []:
+        assert False, "Wrong list of connectors"
+
+    #case for >> and similar situations, the >> operator should be recognised as >
+    predicates, connectors = obj._parseQuery("t>>60 && a<45")
+    print(predicates, connectors)
+    if predicates != [['t','>','60'],['a','<','45']]:
+        assert False, "Wrong breakdown of predicates, the >> operator should be recognised as > and likewise for <"
+    if connectors != ['&&']:
+        assert False, "Wrong list of connectors"
+
+    #case 4: Check for ordering of execution based on parentheses; the code does not handle this yet, so there is no way to test it at the moment
+    predicates, connectors = obj._parseQuery("t>60||(q>=4&&v=car)")
+    if predicates != [["t", ">", "60"], ["q", ">=", "4"], ["v", "=", "car"]]:
+        assert False, "Wrong breakdown of predicates"
+    if connectors != ["&&", "&&"]:
+        assert False, "Wrong list of connectors"
+
+    assert True
+
+def test_convertL2S():
+    query_string=obj.convertL2S(["t","!=","10"],[])
+    if query_string!="t!=10":
+        assert False,"Wrong output query string"
+    assert True
+
+    query_string = obj.convertL2S([["t", "!=", "10"],['a','<','5'],['b','=','1003']], ["&&","||"])
+    if query_string != "t!=10 && a<5 || b=1003":
+        assert False, "Wrong output query string"
+
+    #case for parentheses hasn't been implemented yet; when it's done, test cases need to be added for that.
+    assert True
+
+test_parseQuery()
+test_convertL2S()
\ No newline at end of file

From 0d4f6dff180eba34cd0ffa124134feff5ac8b75d Mon Sep 17 00:00:00 2001
From: GTK
Date: Thu, 30 Jan 2020 17:05:05 -0500
Subject: [PATCH 64/82] Removed Obsolete files

---
 test/test_query_optimizer.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/test_query_optimizer.py b/test/test_query_optimizer.py
index 6b972c67f3..8329a3d348 100644
--- a/test/test_query_optimizer.py
+++ b/test/test_query_optimizer.py
@@ -17,10 +17,10 @@
 root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 try:
-    from src.query_optimizer.query_optimizer import QueryOptimizer
+    from src.optimizer.query_optimizer import QueryOptimizer
 except ImportError:
     sys.path.append(root)
-    from src.query_optimizer.query_optimizer import QueryOptimizer
+    from src.optimizer.query_optimizer import QueryOptimizer

 obj = QueryOptimizer()

From 2ef03e9a0ba722d01ede82b865348fdf8f0c6336 Mon Sep 17 00:00:00 2001
From: GTK
Date: Thu, 30 Jan 2020 19:28:16 -0500
Subject: [PATCH 65/82] ToDo: Generate Physical plan without any optimizer

---
 src/optimizer/__init__.py       | 14 ++++++++++++++
 src/optimizer/plan_generator.py | 22 ++++++++++++++++++++++
 2 files changed, 36 insertions(+)
 create mode 100644 src/optimizer/__init__.py
 create mode 100644 src/optimizer/plan_generator.py

diff --git a/src/optimizer/__init__.py b/src/optimizer/__init__.py
new file mode 100644
index 0000000000..e9978151f4
--- /dev/null
+++ b/src/optimizer/__init__.py
@@ -0,0 +1,14 @@
+# coding=utf-8
+# Copyright 2018-2020 EVA
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/src/optimizer/plan_generator.py b/src/optimizer/plan_generator.py new file mode 100644 index 0000000000..84045402bd --- /dev/null +++ b/src/optimizer/plan_generator.py @@ -0,0 +1,22 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ToDo +# We have a logical plan tree in place held by StatementToPlanConvertor class. +# Since we are omitting the optimizer, I am not sure how to proceed further. +# Should we go ahead and write a dummy class that maps logical nodes to physical nodes +# class PlanGenerator: +# """Generates the +# """ \ No newline at end of file From c7f26bb28ac21f53a11494987d02ac0eed722cfb Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 19:51:12 -0500 Subject: [PATCH 66/82] Checkpoint: minor fixes --- src/optimizer/plan_generator.py | 3 ++- src/planner/abstract_plan.py | 2 +- src/planner/seq_scan_plan.py | 2 +- 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/src/optimizer/plan_generator.py b/src/optimizer/plan_generator.py index 84045402bd..d5e8add8c4 100644 --- a/src/optimizer/plan_generator.py +++ b/src/optimizer/plan_generator.py @@ -16,7 +16,8 @@ # ToDo # We have a logical plan tree in place held by StatementToPlanConvertor class. # Since we are omitting the optimizer, I am not sure how to proceed further. 
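A minimal sketch of the dummy logical-to-physical mapping that the `plan_generator.py` ToDo contemplates is given below. It assumes the `LogicalGet`/`LogicalFilter`/`LogicalProject` operators referenced elsewhere in this series expose a `children` list and that physical plan nodes expose `append_child`; the class name and the mapping itself are hypothetical, not the project's actual design.

```python
# Hypothetical sketch only: a naive bottom-up mapping from logical operators
# to physical plan nodes, standing in for a real optimizer.
class NaivePlanGenerator:
    def build(self, logical_node):
        # Convert children first so the physical node receives physical children.
        physical_children = [self.build(c) for c in logical_node.children]
        physical_node = self._map(logical_node)
        for child in physical_children:
            physical_node.append_child(child)
        return physical_node

    def _map(self, logical_node):
        # A one-to-one lookup, e.g. LogicalGet -> storage scan plan,
        # LogicalFilter / LogicalProject -> SeqScanPlan (hypothetical choices).
        raise NotImplementedError
```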
-# Should we go ahead and write a dummy class that maps logical nodes to physical nodes +# Should we go ahead and write a dummy class that maps logical +# nodes to physical nodes # class PlanGenerator: # """Generates the # """ \ No newline at end of file diff --git a/src/planner/abstract_plan.py b/src/planner/abstract_plan.py index 6b9f0eb8bf..0ba8e4601f 100644 --- a/src/planner/abstract_plan.py +++ b/src/planner/abstract_plan.py @@ -55,7 +55,7 @@ def parent(self, node: 'AbstractPlan'): self._parent = node @property - def children(self) -> List[AbstractPlan]: + def children(self) -> List['AbstractPlan']: """returns children list of current node Returns: diff --git a/src/planner/seq_scan_plan.py b/src/planner/seq_scan_plan.py index f4456cd5a5..98346ba481 100644 --- a/src/planner/seq_scan_plan.py +++ b/src/planner/seq_scan_plan.py @@ -34,7 +34,7 @@ class SeqScanPlan(AbstractScan): An expression used for filtering """ - def __init__(self, column_ids: List[str], video: TableRef, + def __init__(self, column_ids: List[AbstractExpression], video: TableRef, predicate: AbstractExpression): super().__init__(PlanNodeType.SEQUENTIAL_SCAN_TYPE, column_ids, video, predicate) From 580a45bd5ed7b97647e3b936666d4e8c83fd5473 Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 20:02:24 -0500 Subject: [PATCH 67/82] Files renamed --- src/query_optimizer/plan_generator.py | 0 .../statement_to_opr_convertor.py | 66 ------------------- 2 files changed, 66 deletions(-) delete mode 100644 src/query_optimizer/plan_generator.py delete mode 100644 src/query_optimizer/statement_to_opr_convertor.py diff --git a/src/query_optimizer/plan_generator.py b/src/query_optimizer/plan_generator.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/src/query_optimizer/statement_to_opr_convertor.py b/src/query_optimizer/statement_to_opr_convertor.py deleted file mode 100644 index c7cd0834bf..0000000000 --- a/src/query_optimizer/statement_to_opr_convertor.py +++ /dev/null @@ -1,66 +0,0 @@ -from src.query_parser.eva_statement import EvaStatement -from src.query_parser.select_statement import SelectStatement -from src.query_planner.abstract_scan_plan import AbstractScan - - -class StatementToPlanConvertor(): - def __init__(self): - self._plan = None - - def visit(self, statement: EvaStatement): - """Based on the instance of the statement the corresponding visit is called. The logic is hidden from client. 
- - Arguments: - statement {EvaStatement} -- [Input statement] - """ - if isinstance(statement, SelectStatement): - visit_select(statement) - - def visit_select(self, statement: EvaStatement): - """convertor for select statement - - Arguments: - statement {EvaStatement} -- [input select statement] - """ - - #Create a logical get node - video = statement.from_table - if video is not None: - visit_table_ref(video) - - #Filter Operator - predicate = statement.where_clause - if predicate is not None: - #ToDo Binding the expression - filter_opr = LogicalFilter(predicate) - filter_opr.append_child(self._plan) - self._plan = filter_opr - - #Projection operator - select_columns = statement.target_list - #ToDO - # add support for SELECT STAR - if select_columns is not None: - #ToDo Bind the columns using catalog - projection_opr = LogicalProject(select_columns) - projection_opr.append_child(self._plan) - self._plan = projection_opr - - - def visit_table_ref(self, video: TableRef): - """Bind table ref object and convert to Logical get operator - - Arguments: - video {TableRef} -- [Input table ref object created by the parser] - """ - video_data = None - #Call catalog with Table ref details to get hold of the storage DataFrame - #video_data = catalog.get_table_catalog_entry(video.info) - - get_opr = LogicalGet(video_data) - self._plan = get_opr - @property - def plan(self): - return self._plan - - \ No newline at end of file From de0171c8a34210192b4b2df1cfb6e01f2eeb168e Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 20:07:18 -0500 Subject: [PATCH 68/82] Ran Formatter --- src/catalog/df_column.py | 14 ++++++++++++++ src/catalog/df_metadata.py | 14 ++++++++++++++ src/catalog/sql_config.py | 17 ++++++++++++++++- 3 files changed, 44 insertions(+), 1 deletion(-) diff --git a/src/catalog/df_column.py b/src/catalog/df_column.py index 9a51b0000e..2598b6e948 100644 --- a/src/catalog/df_column.py +++ b/src/catalog/df_column.py @@ -1,3 +1,17 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import json from enum import Enum from typing import List diff --git a/src/catalog/df_metadata.py b/src/catalog/df_metadata.py index a07b8a47f4..e7c8bb744b 100644 --- a/src/catalog/df_metadata.py +++ b/src/catalog/df_metadata.py @@ -1,3 +1,17 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from sqlalchemy import Column, String, Integer diff --git a/src/catalog/sql_config.py b/src/catalog/sql_config.py index 76bddf68e5..d8f2240878 100644 --- a/src/catalog/sql_config.py +++ b/src/catalog/sql_config.py @@ -1,3 +1,17 @@ +# coding=utf-8 +# Copyright 2018-2020 EVA +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. from sqlalchemy import create_engine from sqlalchemy.ext.declarative import declarative_base from sqlalchemy.orm import sessionmaker @@ -12,7 +26,8 @@ def __new__(cls): return cls._instance def __init__(self): - self.engine = create_engine('mysql+pymysql://root:root@localhost/eva_catalog') + self.engine = create_engine( + 'mysql+pymysql://root:root@localhost/eva_catalog') self.session_factory = sessionmaker(bind=self.engine) self.session = self.session_factory() self.base.metadata.create_all(self.engine) From 043d044ce22d9f041ae24254d18aa8688d41d38a Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 20:44:20 -0500 Subject: [PATCH 69/82] TupleExpression replaced by ConstantExpression --- test/expression/test_arithmetic.py | 49 ++++++-------- test/expression/test_comparison.py | 102 ++++++++++++++--------------- test/expression/test_logical.py | 52 +++++++-------- 3 files changed, 94 insertions(+), 109 deletions(-) diff --git a/test/expression/test_arithmetic.py b/test/expression/test_arithmetic.py index 6da6408556..a62048f6fd 100644 --- a/test/expression/test_arithmetic.py +++ b/test/expression/test_arithmetic.py @@ -16,7 +16,6 @@ from src.expression.abstract_expression import ExpressionType from src.expression.constant_value_expression import ConstantValueExpression -from src.expression.tuple_value_expression import TupleValueExpression from src.expression.arithmetic_expression import ArithmeticExpression @@ -25,57 +24,49 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_addition(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(5) + const_exp1 = ConstantValueExpression(2) + const_exp2 = ConstantValueExpression(5) cmpr_exp = ArithmeticExpression( ExpressionType.ARITHMETIC_ADD, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 = [5, 2, 3] - # 5+5 = 10 - self.assertEqual(10, cmpr_exp.evaluate(tuple1, None)) + self.assertEqual(7, cmpr_exp.evaluate(None)) def test_subtraction(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(5) + const_exp1 = ConstantValueExpression(5) + const_exp2 = ConstantValueExpression(2) cmpr_exp = ArithmeticExpression( ExpressionType.ARITHMETIC_SUBTRACT, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 = [5, 2, 3] - # 5-5 = 0 - self.assertEqual(0, cmpr_exp.evaluate(tuple1, None)) + self.assertEqual(3, cmpr_exp.evaluate(None)) def test_multiply(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(5) + const_exp1 = ConstantValueExpression(3) + const_exp2 = ConstantValueExpression(5) cmpr_exp = ArithmeticExpression( ExpressionType.ARITHMETIC_MULTIPLY, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 
= [5, 2, 3] - # 5*5 = 25 - self.assertEqual(25, cmpr_exp.evaluate(tuple1, None)) + self.assertEqual(15, cmpr_exp.evaluate(None)) def test_divide(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(5) + const_exp1 = ConstantValueExpression(5) + const_exp2 = ConstantValueExpression(5) cmpr_exp = ArithmeticExpression( ExpressionType.ARITHMETIC_DIVIDE, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 = [5, 2, 3] - # 5/5 = 1 - self.assertEqual(1, cmpr_exp.evaluate(tuple1, None)) + self.assertEqual(1, cmpr_exp.evaluate(None)) diff --git a/test/expression/test_comparison.py b/test/expression/test_comparison.py index 87cf0229d7..7f3e689e96 100644 --- a/test/expression/test_comparison.py +++ b/test/expression/test_comparison.py @@ -17,7 +17,6 @@ from src.expression.abstract_expression import ExpressionType from src.expression.comparison_expression import ComparisonExpression from src.expression.constant_value_expression import ConstantValueExpression -from src.expression.tuple_value_expression import TupleValueExpression class ComparisonExpressionsTest(unittest.TestCase): @@ -26,90 +25,89 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_comparison_compare_equal(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(1) cmpr_exp = ComparisonExpression( ExpressionType.COMPARE_EQUAL, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - # ToDo implement a generic tuple class - # to fetch the tuple from table - tuple1 = [[1], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + self.assertEqual([True], cmpr_exp.evaluate(None)) def test_comparison_compare_greater(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(0) cmpr_exp = ComparisonExpression( ExpressionType.COMPARE_GREATER, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 = [[2], 1, 1] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + self.assertEqual([True], cmpr_exp.evaluate(None)) def test_comparison_compare_lesser(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(2) + const_exp1 = ConstantValueExpression(0) + const_exp2 = ConstantValueExpression(2) cmpr_exp = ComparisonExpression( ExpressionType.COMPARE_LESSER, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tuple1 = [[1], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + self.assertEqual([True], cmpr_exp.evaluate(None)) def test_comparison_compare_geq(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) - - cmpr_exp = ComparisonExpression( + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(1) + const_exp3 = ConstantValueExpression(0) + + cmpr_exp1 = ComparisonExpression( ExpressionType.COMPARE_GEQ, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - # checking greater x>=1 - tuple1 = [[2], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + cmpr_exp2 = ComparisonExpression( + ExpressionType.COMPARE_GEQ, + const_exp1, + const_exp3 + ) # checking equal - tuple2 = [[1], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple2, None)) + self.assertEqual([True], cmpr_exp1.evaluate(None)) + # checking greater equal + self.assertEqual([True], cmpr_exp2.evaluate(None)) def test_comparison_compare_leq(self): - tpl_exp = 
TupleValueExpression(0) - const_exp = ConstantValueExpression(2) + const_exp1 = ConstantValueExpression(0) + const_exp2 = ConstantValueExpression(2) + const_exp3 = ConstantValueExpression(2) - cmpr_exp = ComparisonExpression( + cmpr_exp1 = ComparisonExpression( ExpressionType.COMPARE_LEQ, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - # checking lesser x<=1 - tuple1 = [[1], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + cmpr_exp2 = ComparisonExpression( + ExpressionType.COMPARE_LEQ, + const_exp2, + const_exp3 + ) + + # checking lesser + self.assertEqual([True], cmpr_exp1.evaluate(None)) # checking equal - tuple2 = [[2], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple2, None)) + self.assertEqual([True], cmpr_exp2.evaluate(None)) def test_comparison_compare_neq(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(0) + const_exp2 = ConstantValueExpression(1) cmpr_exp = ComparisonExpression( ExpressionType.COMPARE_NEQ, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - # checking not equal x!=1 - tuple1 = [[2], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) - - tuple1 = [[3], 2, 3] - self.assertEqual([True], cmpr_exp.evaluate(tuple1, None)) + self.assertEqual([True], cmpr_exp.evaluate(None)) diff --git a/test/expression/test_logical.py b/test/expression/test_logical.py index ec3f126757..a1d62d087b 100644 --- a/test/expression/test_logical.py +++ b/test/expression/test_logical.py @@ -18,7 +18,6 @@ from src.expression.comparison_expression import ComparisonExpression from src.expression.logical_expression import LogicalExpression from src.expression.constant_value_expression import ConstantValueExpression -from src.expression.tuple_value_expression import TupleValueExpression class LogicalExpressionsTest(unittest.TestCase): @@ -27,66 +26,63 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_logical_and(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(1) comparison_expression_left = ComparisonExpression( ExpressionType.COMPARE_EQUAL, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tpl_exp = TupleValueExpression(1) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(2) + const_exp2 = ConstantValueExpression(1) comparison_expression_right = ComparisonExpression( ExpressionType.COMPARE_GREATER, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) logical_expr = LogicalExpression( ExpressionType.LOGICAL_AND, comparison_expression_left, comparison_expression_right ) - tuple1 = [[1], [2], 3] - self.assertEqual([True], logical_expr.evaluate(tuple1, None)) + self.assertEqual([True], logical_expr.evaluate(None)) - def test_comparison_compare_greater(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + def test_logical_or(self): + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(1) comparison_expression_left = ComparisonExpression( ExpressionType.COMPARE_EQUAL, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(1) + const_exp2 = ConstantValueExpression(2) comparison_expression_right = ComparisonExpression( ExpressionType.COMPARE_GREATER, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) logical_expr = LogicalExpression( 
ExpressionType.LOGICAL_OR, comparison_expression_left, comparison_expression_right ) - tuple1 = [[1], 2, 3] - self.assertEqual([True], logical_expr.evaluate(tuple1, None)) + self.assertEqual([True], logical_expr.evaluate(None)) def test_logical_not(self): - tpl_exp = TupleValueExpression(0) - const_exp = ConstantValueExpression(1) + const_exp1 = ConstantValueExpression(0) + const_exp2 = ConstantValueExpression(1) comparison_expression_right = ComparisonExpression( ExpressionType.COMPARE_GREATER, - tpl_exp, - const_exp + const_exp1, + const_exp2 ) logical_expr = LogicalExpression( ExpressionType.LOGICAL_NOT, None, comparison_expression_right ) - tuple1 = [[1], 2, 3] - self.assertEqual([True], logical_expr.evaluate(tuple1, None)) + self.assertEqual([True], logical_expr.evaluate(None)) From 7d2901674fcd34d8da2b6c9b982938ce06fd84c5 Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 21:07:22 -0500 Subject: [PATCH 70/82] Bug fixed --- src/expression/comparison_expression.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/expression/comparison_expression.py b/src/expression/comparison_expression.py index 944a499ea4..9a1bfb75b4 100644 --- a/src/expression/comparison_expression.py +++ b/src/expression/comparison_expression.py @@ -33,6 +33,8 @@ def evaluate(self, *args): right_values = self.get_child(1).evaluate(*args) # Broadcasting scalars + if not isinstance(left_values, list): + left_values = [left_values] if not isinstance(right_values, list): right_values = [right_values] * len(left_values) # TODO implement a better way to compare value_left and value_right From 6f207c8cf94dcf86f690f24d337ea51fb9a9945c Mon Sep 17 00:00:00 2001 From: GTK Date: Thu, 30 Jan 2020 21:08:04 -0500 Subject: [PATCH 71/82] No more relevant test cases --- test/expression/test_expression.py | 63 ------------------------------ 1 file changed, 63 deletions(-) delete mode 100644 test/expression/test_expression.py diff --git a/test/expression/test_expression.py b/test/expression/test_expression.py deleted file mode 100644 index 7c2d3bcda8..0000000000 --- a/test/expression/test_expression.py +++ /dev/null @@ -1,63 +0,0 @@ -# coding=utf-8 -# Copyright 2018-2020 EVA -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import unittest
-
-from src.expression.abstract_expression import ExpressionType
-from src.expression.comparison_expression import ComparisonExpression
-from src.expression.constant_value_expression import ConstantValueExpression
-from src.expression.tuple_value_expression import TupleValueExpression
-from src.models.inference.base_prediction import BasePrediction
-
-
-class ExpressionsTest(unittest.TestCase):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def test_comparison_compare_equal(self):
-        tpl_exp = TupleValueExpression(0)
-        const_exp = ConstantValueExpression(1)
-
-        cmpr_exp = ComparisonExpression(
-            ExpressionType.COMPARE_EQUAL,
-            tpl_exp,
-            const_exp
-        )
-        # ToDo implement a generic tuple class
-        # to fetch the tuple from table
-        compare = type("compare", (BasePrediction,), {
-            "value": 1,
-            "__eq__": lambda s, x: s.value == x
-        })
-        tuple1 = [[compare()], 2, 3]
-        self.assertEqual([True], cmpr_exp.evaluate(tuple1, None))
-
-    def test_compare_doesnt_broadcast_when_rhs_is_list(self):
-        tpl_exp = TupleValueExpression(0)
-        const_exp = ConstantValueExpression([1])
-
-        cmpr_exp = ComparisonExpression(
-            ExpressionType.COMPARE_EQUAL,
-            tpl_exp,
-            const_exp
-        )
-
-        compare = type("compare", (), {"value": 1,
-                                       "__eq__": lambda s, x: s.value == x})
-        tuple1 = [[compare()], 2, 3]
-        self.assertEqual([True], cmpr_exp.evaluate(tuple1, None))
-
-
-if __name__ == '__main__':
-    unittest.main()

From 5993376e9202cc7fbeaddf8640e564b4661f46ae Mon Sep 17 00:00:00 2001
From: GTK
Date: Thu, 30 Jan 2020 21:26:42 -0500
Subject: [PATCH 72/82] Skipping cases because SeqScan API changed

---
 test/query_executor/test_plan_executor.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/test/query_executor/test_plan_executor.py b/test/query_executor/test_plan_executor.py
index 09be31a0c8..30c46d96e5 100644
--- a/test/query_executor/test_plan_executor.py
+++ b/test/query_executor/test_plan_executor.py
@@ -25,6 +25,7 @@

 class PlanExecutorTest(unittest.TestCase):

+    @unittest.skip("SeqScan Node is updated; Will fix once that is finalized")
     def test_tree_structure_for_build_execution_tree(self):
         """
         Build an Abstract Plan with nodes:
@@ -69,6 +70,7 @@

     @patch(
         'src.query_executor.disk_based_storage_executor.VideoLoader')
+    @unittest.skip("SeqScan Node is updated; Will fix once that is finalized")
     def test_should_return_the_new_path_after_execution(self, mock_class):

         class_instatnce = mock_class.return_value

From 6dc2278d5db656b8df433e0255150053d53b8e2a Mon Sep 17 00:00:00 2001
From: Sanjana Garg
Date: Thu, 30 Jan 2020 22:48:00 -0500
Subject: [PATCH 73/82] Fixed bugs

---
 src/catalog/catalog_manager.py       | 22 +++++++++++++++-------
 src/catalog/models/df_column.py      |  6 ++++--
 src/catalog/models/df_metadata.py    | 11 +++++++----
 src/configuration/dictionary.py      |  3 ++-
 test/catalog/test_catalog_manager.py |  5 ++---
 5 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/src/catalog/catalog_manager.py b/src/catalog/catalog_manager.py
index 642dd96bd4..b3d3cf1f91 100644
--- a/src/catalog/catalog_manager.py
+++ b/src/catalog/catalog_manager.py
@@ -13,15 +13,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import os
 from typing import List, Tuple

 from src.catalog.database import init_db
 from src.catalog.df_schema import DataFrameSchema
 from src.catalog.models.df_column import DataFrameColumn
 from src.catalog.models.df_metadata import DataFrameMetadata
-from src.configuration.configuration_manager import ConfigurationManager
-from src.configuration.dictionary import CATALOG_DIR
 from src.utils.logging_manager import LoggingLevel
 from src.utils.logging_manager import LoggingManager

@@ -41,10 +38,11 @@ def __new__(cls):

     def bootstrap_catalog(self):
-        eva_dir = ConfigurationManager().get_value("core", "location")
-        output_url = os.path.join(eva_dir, CATALOG_DIR)
-        LoggingManager().log("Bootstrapping catalog" + str(output_url),
-                             LoggingLevel.INFO)
+        # eva_dir = ConfigurationManager().get_value("core", "location")
+        # output_url = os.path.join(eva_dir, CATALOG_DIR)
+        # LoggingManager().log("Bootstrapping catalog" + str(output_url),
+        #                      LoggingLevel.INFO)
+        LoggingManager().log("Bootstrapping catalog", LoggingLevel.INFO)
         init_db()
 #        # Construct output location
 #        catalog_dir_url = os.path.join(eva_dir, "catalog")
@@ -72,6 +70,7 @@ def get_table_bindings(self, database_name: str, table_name: str,
         bindings are required
         :return: returns metadata_id of table and a list of column ids
         """
+
         metadata_id = DataFrameMetadata.get_id_from_name(table_name)
         column_ids = []
         if column_names is not None:
@@ -115,3 +114,12 @@ def get_metadata(self, metadata_id: int,
 #     rows = [row_1]
 #
 #     append_rows(dataset_catalog_entry, rows)
+
+
+if __name__ == '__main__':
+    catalog = CatalogManager()
+    metadata_id, col_ids = catalog.get_table_bindings(None, 'dataset1',
+                                                      ['frame', 'color'])
+    metadata = catalog.get_metadata(1, [1])
+    print(metadata.get_dataframe_schema())
+    print(metadata_id, col_ids)

diff --git a/src/catalog/models/df_column.py b/src/catalog/models/df_column.py
index cfcc707a54..0f30f4ab53 100644
--- a/src/catalog/models/df_column.py
+++ b/src/catalog/models/df_column.py
@@ -64,8 +64,8 @@ def __str__(self):
             self._is_nullable)

         column_str += "["
-        column_str += ', '.join(['%d'] * len(self._array_dimensions)) \
-            % tuple(self._array_dimensions)
+        column_str += ', '.join(['%d'] * len(self.get_array_dimensions())) \
+            % tuple(self.get_array_dimensions())
         column_str += "] "
         column_str += ")\n"

@@ -78,6 +78,8 @@ def get_id_from_metadata_id_and_name_in(cls, metadata_id, column_names):
             .filter(DataFrameColumn._metadata_id == metadata_id,
                     DataFrameColumn._name.in_(column_names))\
             .all()
+        result = [res[0] for res in result]
+
         return result

diff --git a/src/catalog/models/df_metadata.py b/src/catalog/models/df_metadata.py
index cd50a82f27..ce7e767019 100644
--- a/src/catalog/models/df_metadata.py
+++ b/src/catalog/models/df_metadata.py
@@ -21,8 +21,8 @@ class DataFrameMetadata(BaseModel):
     __tablename__ = 'df_metadata'

     _id = Column('id', Integer, primary_key=True)
-    _name = Column('name', String)
-    _file_url = Column('file_url', String)
+    _name = Column('name', String(100))
+    _file_url = Column('file_url', String(100))

     def __init__(self, dataframe_file_url, dataframe_schema):
         self._file_url = dataframe_file_url
@@ -42,6 +42,9 @@ def set_schema(self, schema):
     def get_id(self):
         return self._id

+    def get_name(self):
+        return self._name
+
     def get_dataframe_file_url(self):
         return self._file_url
@@ -59,12 +62,12 @@ def get_id_from_name(cls, name):
         result = DataFrameMetadata.query \
             .with_entities(DataFrameMetadata._id) \
             .filter(DataFrameMetadata._name == name).one()
-        return result
+        return result[0]

     @classmethod
def get(cls, metadata_id): result = DataFrameMetadata.query \ - .with_entities(DataFrameMetadata._id) \ .filter(DataFrameMetadata._id == metadata_id) \ .one() + print(result) return result diff --git a/src/configuration/dictionary.py b/src/configuration/dictionary.py index 6fa744baa2..6cc0a7842c 100644 --- a/src/configuration/dictionary.py +++ b/src/configuration/dictionary.py @@ -13,6 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. +EVA_DIR = "" CATALOG_DIR = "catalog" DATASET_DATAFRAME_NAME = "dataset" -SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:fafa@localhost/eva_catalog' +SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:root@localhost/eva_catalog' diff --git a/test/catalog/test_catalog_manager.py b/test/catalog/test_catalog_manager.py index 6910ee8414..24a85a986f 100644 --- a/test/catalog/test_catalog_manager.py +++ b/test/catalog/test_catalog_manager.py @@ -31,11 +31,8 @@ class CatalogManagerTests(unittest.TestCase): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) - # @mock.patch.object(ConfigurationManager, - # 'get_value') def setUp(self): suppress_py4j_logging() - # mocked_cm.return_value = 'abc' def tearDown(self): self.session = Session() @@ -54,6 +51,8 @@ def test_catalog_manager_singleton_pattern(self, mocked_cm, mocked_db): # x.create_dataset("bar") # x.create_dataset("baz") + # def test_get_bindings(self): + if __name__ == '__main__': From ebd126793a0d686c9209dde5fa815fa1d385f7a9 Mon Sep 17 00:00:00 2001 From: Sanjana Garg Date: Thu, 30 Jan 2020 23:01:33 -0500 Subject: [PATCH 74/82] Fixed test case --- test/catalog/test_catalog_manager.py | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/test/catalog/test_catalog_manager.py b/test/catalog/test_catalog_manager.py index 24a85a986f..e8b5603de0 100644 --- a/test/catalog/test_catalog_manager.py +++ b/test/catalog/test_catalog_manager.py @@ -39,10 +39,7 @@ def tearDown(self): self.session.stop() @mock.patch('src.catalog.catalog_manager.init_db') - @mock.patch('src.catalog.catalog_manager.ConfigurationManager') - def test_catalog_manager_singleton_pattern(self, mocked_cm, mocked_db): - mocked_cm.get_value('core', 'location').return_value = 'abc' - mocked_cm.get_value.assert_called_once_with('core', 'location') + def test_catalog_manager_singleton_pattern(self, mocked_db): x = CatalogManager() y = CatalogManager() self.assertEqual(x, y) @@ -51,8 +48,6 @@ def test_catalog_manager_singleton_pattern(self, mocked_cm, mocked_db): # x.create_dataset("bar") # x.create_dataset("baz") - # def test_get_bindings(self): - if __name__ == '__main__': From e9b96b73de38527f5378fe4647c2c1fcb6db3e5e Mon Sep 17 00:00:00 2001 From: Joy Arulraj Date: Fri, 31 Jan 2020 23:50:05 -0500 Subject: [PATCH 75/82] Update README.md --- README.md | 106 +++++++++++++++++++++++++----------------------------- 1 file changed, 49 insertions(+), 57 deletions(-) diff --git a/README.md b/README.md index bd7cac3813..5e860cc7d1 100644 --- a/README.md +++ b/README.md @@ -15,82 +15,72 @@ EVA is an end-to-end video analytics engine that allows users to query a databas ## Installation -Installation of EVA involves setting a virtual environment using conda and configuring git hooks. -1. Clone the repo +Installation of EVA involves setting a virtual environment using [miniconda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) and configuring git hooks. + +1. 
Clone the repository
```shell
-git clone https://github.com/georgia-tech-db/Eva.git
+git clone https://github.com/georgia-tech-db/eva.git
```

-2. Install [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) and update path.
+2. Install [miniconda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) and update the `PATH` environment variable.
```shell
-export PATH=~/anaconda3/bin:$PATH
+export PATH="$HOME/miniconda/bin:$PATH"
```

-3. Install dependencies in a virtual environment. Dependencies should install with no errors on Ubuntu 16.04 but there are known installation issues with MacOS.
+3. Install dependencies in a miniconda virtual environment. Virtual environments keep dependencies in separate sandboxes so you can switch between `eva` and other Python applications easily and get them running.
```shell
-cd Eva/
+cd eva/
 conda env create -f environment.yml
```

-4. Run following command to configure git hooks.
+4. Activate the `eva` environment.
```shell
-git config core.hooksPath .githooks
+conda activate eva
```
+
+5. Run the following command to configure git hooks.
+```shell
+git config core.hooksPath .githooks
+```

-## Demos
-The following components have demos:
-
-1. EVA Analytics: A pipeline for loading a dataset, training filters, and outputting the optimal plan.
-```commandline
-   cd 
-   conda activate eva_35
-   python pipeline.py
-```
-2. EVA Query Optimizer: The optimizer shows converted queries
-   (Will show converted queries for the original queries)
-```commandline
-   cd 
-   conda activate eva_35
-   python query_optimizer/query_optimizer.py
-```
-3. Eva Loader (Loads UA-DETRAC dataset)
-```commandline
-   cd 
-   conda activate eva_35
-   python loaders/load.py
-```
-4. NEW!!! There are new versions of the loaders and filters.
-```commandline
-   cd 
-   conda activate eva_35
-   python loaders/uadetrac_loader.py
-   python filters/minimum_filter.py
-```
-5. EVA storage-system (Video compression and indexing system - *currently in progress*)
+## Development / Contributing
+
+We invite you to help us build the future of visual data management DBMSs.

-## Unit Tests
-To run unit tests on the system, the following commands can be run:
+1. Ensure that all the unit test cases (including the ones you have added) run successfully.
```shell
-   conda activate eva_35
    pycodestyle --select E test src/loaders
-   pytest test/ --cov-report= --cov=./ -s -v
```

-## Eva Core
-Eva core is consisted of
+2. Ensure that the coding style conventions are followed.
+
+```shell
+   pycodestyle --select E test src/loaders
+```
+
+3. Run the formatter script to automatically fix most of the coding style issues.
+
+```shell
+   python script/formatting/formatter.py
+```
+
+Please look up the [contributing guide](https://github.com/georgia-tech-db/eva/blob/master/CONTRIBUTING.md#development) for details.
+
+## EVA Architecture
+
+The EVA visual data management system consists of four core components:
+
+* Query Parser
 * Query Optimizer
-* Filters
-* UDFs
-* Loaders
+* Query Execution Engine (Filters + UDFs)
+* Storage Engine (Loaders)

 #### Query Optimizer
 The query optimizer converts a given query to the optimal form.
-All code related to this module is in */query_optimizer*
+Module location: *src/query_optimizer*

 #### Filters
 The filters do preliminary filtering to video frames using cheap machine learning models.
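A minimal sketch of how a filter could expose the reduction-rate/cost/accuracy statistics that the query optimizer consumes appears below; these are the `"R"`/`"C"`/`"A"` entries seen in `synthetic_pp_stats` earlier in this series. The class and method names here are hypothetical, not the project's actual filter API.

```python
# Hypothetical sketch: a cheap predicate filter that prunes frames and
# reports the R/C/A statistics the query optimizer ranks filters by.
class CheapPredicateFilter:
    def __init__(self, model, reduction_rate, cost, accuracy):
        self.model = model  # e.g. a cheap SVM or KDE model (assumed interface)
        self._stats = {"R": reduction_rate, "C": cost, "A": accuracy}

    def apply(self, frames):
        # Keep only the frames the cheap model believes satisfy the predicate.
        return [f for f in frames if self.model.predict(f)]

    def stats(self):
        # Consumed by the optimizer when choosing among candidate filters.
        return self._stats
```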
@@ -105,26 +95,28 @@ The filters below are running: * Random Forest * SVM -All code related to this module is in */filters* +Module location: *src/filters* #### UDFs This module contains all imported deep learning models. Currently, there is no code that performs this task. It is a work in progress. Information of current work is explained in detail [here](src/udfs/README.md). -All related code should be inside */udfs* +Module location: *src/udfs* #### Loaders The loaders load the dataset with the attributes specified in the *Accelerating Machine Learning Inference with Probabilistic Predicates* by Yao et al. -All code related to this module is in */loaders* - -## Eva Storage -Currently a work in progress. Come check back later! +Module location: */loaders* +## Status -## Dataset -__[Dataset info](data/README.md)__ explains detailed information about the datasets +_Technology preview_: currently unsupported, possibly due to incomplete functionality or unsuitability for production use. +## Contributors +See the [people page](https://github.com/georgia-tech-db/eva/graphs/contributors) for the full listing of contributors. +## License +Copyright (c) 2018-2020 [Georgia Tech Database Group](http://db.cc.gatech.edu/) +Licensed under the [Apache License](LICENSE). From 93b2b9c075fcde277efb42072eb3932c9b55813d Mon Sep 17 00:00:00 2001 From: jarulraj Date: Fri, 31 Jan 2020 23:53:37 -0500 Subject: [PATCH 76/82] Updated badges --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index bd7cac3813..d60c0acd2d 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # EVA (Exploratory Video Analytics) -[![Build Status](https://travis-ci.org/georgia-tech-db/Eva.svg?branch=master)](https://travis-ci.com/georgia-tech-db/Eva) -[![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/Eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/Eva?branch=master) +[![Build Status](https://travis-ci.org/georgia-tech-db/eva.svg?branch=master)](https://travis-ci.com/georgia-tech-db/eva) +[![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/eva?branch=master) EVA is an end-to-end video analytics engine that allows users to query a database of videos and return results based on machine learning analysis. From 13910891d54958f53177d06ad1b31583217a748e Mon Sep 17 00:00:00 2001 From: jarulraj Date: Sat, 1 Feb 2020 01:13:26 -0500 Subject: [PATCH 77/82] Refactoring --- src/optimizer/statement_to_opr_convertor.py | 10 ++--- src/parser/create_statement.py | 4 +- src/parser/{eva_parser.py => parser.py} | 8 ++-- ...ql_parser_visitor.py => parser_visitor.py} | 2 +- src/parser/select_statement.py | 4 +- src/parser/{eva_statement.py => statement.py} | 4 +- test/parser/test_parser.py | 24 +++++----- test/parser/test_parser_visitor.py | 44 +++++++++---------- 8 files changed, 50 insertions(+), 50 deletions(-) rename src/parser/{eva_parser.py => parser.py} (85%) rename src/parser/{evaql_parser_visitor.py => parser_visitor.py} (99%) rename src/parser/{eva_statement.py => statement.py} (93%) diff --git a/src/optimizer/statement_to_opr_convertor.py b/src/optimizer/statement_to_opr_convertor.py index 77e3447aab..4dbb36e19f 100644 --- a/src/optimizer/statement_to_opr_convertor.py +++ b/src/optimizer/statement_to_opr_convertor.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
from src.optimizer.operators import LogicalGet, LogicalFilter, LogicalProject -from src.parser.eva_statement import EvaStatement +from src.parser.eva_statement import AbstractStatement from src.parser.select_statement import SelectStatement from src.optimizer.optimizer_utils import (bind_table_ref, bind_columns_expr, bind_predicate_expr) @@ -34,11 +34,11 @@ def visit_table_ref(self, video: 'TableRef'): get_opr = LogicalGet(video, catalog_vid_metadata_id) self._plan = get_opr - def visit_select(self, statement: EvaStatement): + def visit_select(self, statement: AbstractStatement): """convertor for select statement Arguments: - statement {EvaStatement} -- [input select statement] + statement {AbstractStatement} -- [input select statement] """ # Create a logical get node video = statement.from_table @@ -66,13 +66,13 @@ def visit_select(self, statement: EvaStatement): projection_opr.append_child(self._plan) self._plan = projection_opr - def visit(self, statement: EvaStatement): + def visit(self, statement: AbstractStatement): """Based on the instance of the statement the corresponding visit is called. The logic is hidden from client. Arguments: - statement {EvaStatement} -- [Input statement] + statement {AbstractStatement} -- [Input statement] """ if isinstance(statement, SelectStatement): self.visit_select(statement) diff --git a/src/parser/create_statement.py b/src/parser/create_statement.py index 14576b1ade..0c2d6a4a74 100644 --- a/src/parser/create_statement.py +++ b/src/parser/create_statement.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -from src.parser.eva_statement import EvaStatement +from src.parser.statement import AbstractStatement from src.parser.types import StatementType from src.expression.abstract_expression import AbstractExpression @@ -21,7 +21,7 @@ from typing import List -class CreateTableStatement(EvaStatement): +class CreateTableStatement(AbstractStatement): """ Create Table Statement constructed after parsing the input query diff --git a/src/parser/eva_parser.py b/src/parser/parser.py similarity index 85% rename from src/parser/eva_parser.py rename to src/parser/parser.py index 59f9286647..e14fde4501 100644 --- a/src/parser/eva_parser.py +++ b/src/parser/parser.py @@ -18,10 +18,10 @@ from src.parser.evaql.evaql_parser import evaql_parser from src.parser.evaql.evaql_lexer import evaql_lexer -from src.parser.evaql_parser_visitor import EvaQLParserVisitor +from src.parser.parser_visitor import ParserVisitor -class EvaQLParser(object): +class Parser(object): """ Parser for eva; based on EVAQL grammar """ @@ -30,11 +30,11 @@ class EvaQLParser(object): def __new__(cls): if cls._instance is None: - cls._instance = super(EvaQLParser, cls).__new__(cls) + cls._instance = super(Parser, cls).__new__(cls) return cls._instance def __init__(self): - self._visitor = EvaQLParserVisitor() + self._visitor = ParserVisitor() def parse(self, query_string: str) -> list: lexer = evaql_lexer(InputStream(query_string)) diff --git a/src/parser/evaql_parser_visitor.py b/src/parser/parser_visitor.py similarity index 99% rename from src/parser/evaql_parser_visitor.py rename to src/parser/parser_visitor.py index 29ebf1de74..b1aa544a80 100644 --- a/src/parser/evaql_parser_visitor.py +++ b/src/parser/parser_visitor.py @@ -36,7 +36,7 @@ from src.catalog.df_column import DataframeColumn -class EvaQLParserVisitor(evaql_parserVisitor): +class ParserVisitor(evaql_parserVisitor): def visitRoot(self, ctx: evaql_parser.RootContext): 
for child in ctx.children: diff --git a/src/parser/select_statement.py b/src/parser/select_statement.py index 2dec631380..6e56c29195 100644 --- a/src/parser/select_statement.py +++ b/src/parser/select_statement.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -from src.parser.eva_statement import EvaStatement +from src.parser.statement import AbstractStatement from src.parser.types import StatementType from src.expression.abstract_expression import AbstractExpression @@ -21,7 +21,7 @@ from typing import List -class SelectStatement(EvaStatement): +class SelectStatement(AbstractStatement): """ Select Statement constructed after parsing the input query diff --git a/src/parser/eva_statement.py b/src/parser/statement.py similarity index 93% rename from src/parser/eva_statement.py rename to src/parser/statement.py index 71fb39404f..746d0b4757 100644 --- a/src/parser/eva_statement.py +++ b/src/parser/statement.py @@ -16,9 +16,9 @@ from src.parser.types import StatementType -class EvaStatement: +class AbstractStatement: """ - Base class for all the EvaStatement + Base class for all Statements Attributes ---------- diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py index 8dcff113a1..386a6041e2 100644 --- a/test/parser/test_parser.py +++ b/test/parser/test_parser.py @@ -15,10 +15,10 @@ import unittest -from src.parser.eva_parser import EvaQLParser -from src.parser.eva_statement import EvaStatement +from src.parser.parser import Parser +from src.parser.statement import AbstractStatement -from src.parser.eva_statement import StatementType +from src.parser.statement import StatementType from src.parser.select_statement import SelectStatement @@ -31,7 +31,7 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_create_statement(self): - parser = EvaQLParser() + parser = Parser() single_queries = [] single_queries.append( @@ -45,12 +45,12 @@ def test_create_statement(self): self.assertIsInstance(eva_statement_list, list) self.assertEqual(len(eva_statement_list), 1) self.assertIsInstance( - eva_statement_list[0], EvaStatement) + eva_statement_list[0], AbstractStatement) print(eva_statement_list[0]) def test_single_statement_queries(self): - parser = EvaQLParser() + parser = Parser() single_queries = [] single_queries.append("SELECT CLASS FROM TAIPAI;") @@ -67,10 +67,10 @@ def test_single_statement_queries(self): self.assertIsInstance(eva_statement_list, list) self.assertEqual(len(eva_statement_list), 1) self.assertIsInstance( - eva_statement_list[0], EvaStatement) + eva_statement_list[0], AbstractStatement) def test_multiple_statement_queries(self): - parser = EvaQLParser() + parser = Parser() multiple_queries = [] multiple_queries.append("SELECT CLASS FROM TAIPAI \ @@ -83,12 +83,12 @@ def test_multiple_statement_queries(self): self.assertIsInstance(eva_statement_list, list) self.assertEqual(len(eva_statement_list), 2) self.assertIsInstance( - eva_statement_list[0], EvaStatement) + eva_statement_list[0], AbstractStatement) self.assertIsInstance( - eva_statement_list[1], EvaStatement) + eva_statement_list[1], AbstractStatement) def test_select_statement(self): - parser = EvaQLParser() + parser = Parser() select_query = "SELECT CLASS, REDNESS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 300 ) OR REDNESS > 500;" eva_statement_list = parser.parse(select_query) @@ -122,7 +122,7 @@ def test_select_statement_class(self): Class: SelectStatement''' select_stmt_new = SelectStatement() - 
parser = EvaQLParser() + parser = Parser() select_query_new = "SELECT CLASS, REDNESS FROM TAIPAI \ WHERE (CLASS = 'VAN' AND REDNESS < 400 ) OR REDNESS > 700;" diff --git a/test/parser/test_parser_visitor.py b/test/parser/test_parser_visitor.py index 62a05dda22..66ffe0f5d7 100644 --- a/test/parser/test_parser_visitor.py +++ b/test/parser/test_parser_visitor.py @@ -18,7 +18,7 @@ from unittest import mock from unittest.mock import MagicMock, call -from src.parser.evaql_parser_visitor import EvaQLParserVisitor +from src.parser.parser_visitor import ParserVisitor from src.parser.evaql.evaql_parser import evaql_parser from src.expression.abstract_expression import ExpressionType @@ -28,12 +28,12 @@ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def test_should_query_specification_visitor(self): - EvaQLParserVisitor.visit = MagicMock() - mock_visit = EvaQLParserVisitor.visit + ParserVisitor.visit = MagicMock() + mock_visit = ParserVisitor.visit mock_visit.side_effect = ["columns", {"from": ["tables"], "where": "predicates"}] - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() ctx = MagicMock() child_1 = MagicMock() child_1.getRuleIndex.return_value = evaql_parser.RULE_selectElements @@ -50,7 +50,7 @@ def test_should_query_specification_visitor(self): self.assertEqual(expected.where_clause, "predicates") self.assertEqual(expected.target_list, "columns") - @mock.patch.object(EvaQLParserVisitor, 'visit') + @mock.patch.object(ParserVisitor, 'visit') def test_from_clause_visitor(self, mock_visit): mock_visit.side_effect = ["tables", "predicates"] @@ -60,7 +60,7 @@ def test_from_clause_visitor(self, mock_visit): whereExpr = MagicMock() ctx.whereExpr = whereExpr - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() expected = visitor.visitFromClause(ctx) mock_visit.assert_has_calls([call(tableSources), call(whereExpr)]) @@ -69,7 +69,7 @@ def test_from_clause_visitor(self, mock_visit): def test_logical_operator(self): ctx = MagicMock() - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() self.assertEqual( visitor.visitLogicalOperator(ctx), @@ -87,7 +87,7 @@ def test_logical_operator(self): def test_comparison_operator(self): ctx = MagicMock() - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() self.assertEqual( visitor.visitComparisonOperator(ctx), @@ -114,9 +114,9 @@ def test_comparison_operator(self): # Function: visitFullColumnName # ''' # ctx = MagicMock() - # visitor = EvaQLParserVisitor() - # EvaQLParserVisitor.visit = MagicMock() - # EvaQLParserVisitor.visit.return_value = None + # visitor = ParserVisitor() + # ParserVisitor.visit = MagicMock() + # ParserVisitor.visit.return_value = None # with self.assertWarns(SyntaxWarning, msg='Column Name Missing'): # visitor.visitFullColumnName(ctx) @@ -125,9 +125,9 @@ def test_comparison_operator(self): # Function: visitTableName # ''' # ctx = MagicMock() - # visitor = EvaQLParserVisitor() - # EvaQLParserVisitor.visit = MagicMock() - # EvaQLParserVisitor.visit.return_value = None + # visitor = ParserVisitor() + # ParserVisitor.visit = MagicMock() + # ParserVisitor.visit.return_value = None # with self.assertWarns(SyntaxWarning, msg='Invalid from table'): # visitor.visitTableName(ctx) @@ -136,7 +136,7 @@ def test_logical_expression(self): Function : visitLogicalExpression ''' ctx = MagicMock() - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() # Test for no children ctx.children = [] @@ -160,12 +160,12 @@ def test_visit_string_literal_none(self): ''' Testing when string literal is None 
Function: visitStringLiteral ''' - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() ctx = MagicMock() ctx.STRING_LITERAL.return_value = None - EvaQLParserVisitor.visitChildren = MagicMock() - mock_visit = EvaQLParserVisitor.visitChildren + ParserVisitor.visitChildren = MagicMock() + mock_visit = ParserVisitor.visitChildren visitor.visitStringLiteral(ctx) mock_visit.assert_has_calls([call(ctx)]) @@ -176,7 +176,7 @@ def test_visit_constant(self): Function: visitConstant ''' ctx = MagicMock() - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() ctx.REAL_LITERAL.return_value = '5' expected = visitor.visitConstant(ctx) self.assertEqual( @@ -187,10 +187,10 @@ def test_visit_query_specification_base_exception(self): ''' Testing Base Exception error handling Function: visitQuerySpecification ''' - EvaQLParserVisitor.visit = MagicMock() - EvaQLParserVisitor.visit + ParserVisitor.visit = MagicMock() + ParserVisitor.visit - visitor = EvaQLParserVisitor() + visitor = ParserVisitor() ctx = MagicMock() child_1 = MagicMock() child_2 = MagicMock() From 7f564cc1a927b87599d42ad79708d47683990096 Mon Sep 17 00:00:00 2001 From: jarulraj Date: Sat, 1 Feb 2020 02:41:48 -0500 Subject: [PATCH 78/82] Added support for NDARRAY in create table --- src/catalog/column_type.py | 9 ++-- src/catalog/utils.py | 2 +- src/parser/evaql/evaql_lexer.g4 | 7 ++- src/parser/evaql/evaql_parser.g4 | 15 +++--- src/parser/parser_visitor.py | 92 ++++++++++++++++++++++++++------ test/parser/test_parser.py | 4 +- 6 files changed, 96 insertions(+), 33 deletions(-) diff --git a/src/catalog/column_type.py b/src/catalog/column_type.py index 85ebe568d9..c9df5528c7 100644 --- a/src/catalog/column_type.py +++ b/src/catalog/column_type.py @@ -16,7 +16,8 @@ class ColumnType(Enum): - INTEGER = 1 - FLOAT = 2 - STRING = 3 - NDARRAY = 4 + BOOLEAN = 1 + INTEGER = 2 + FLOAT = 3 + TEXT = 4 + NDARRAY = 5 diff --git a/src/catalog/utils.py b/src/catalog/utils.py index 34bf879430..eff8296f83 100644 --- a/src/catalog/utils.py +++ b/src/catalog/utils.py @@ -51,7 +51,7 @@ def get_petastorm_column(df_column): (), ScalarCodec(FloatType()), column_is_nullable) - elif column_type == ColumnType.STRING: + elif column_type == ColumnType.TEXT: petastorm_column = UnischemaField(column_name, np.string_, (), diff --git a/src/parser/evaql/evaql_lexer.g4 b/src/parser/evaql/evaql_lexer.g4 index 881f1c8314..361fecaa63 100644 --- a/src/parser/evaql/evaql_lexer.g4 +++ b/src/parser/evaql/evaql_lexer.g4 @@ -64,7 +64,6 @@ SET: 'SET'; SHUTDOWN: 'SHUTDOWN'; SOME: 'SOME'; TABLE: 'TABLE'; -TEXT: 'TEXT'; TRUE: 'TRUE'; UNIQUE: 'UNIQUE'; UNKNOWN: 'UNKNOWN'; @@ -92,10 +91,11 @@ ACTION_CLASSICATION: 'ACTION_CLASSICATION'; // DATA TYPE Keywords -SMALLINT: 'SMALLINT'; +BOOLEAN: 'BOOLEAN'; INTEGER: 'INTEGER'; FLOAT: 'FLOAT'; -VARCHAR: 'VARCHAR'; +TEXT: 'TEXT'; +NDARRAY: 'NDARRAY'; // Group function Keywords @@ -111,7 +111,6 @@ FCOUNT: 'FCOUNT'; // Common Keywords, but can be ID AUTO_INCREMENT: 'AUTO_INCREMENT'; -BOOLEAN: 'BOOLEAN'; COLUMNS: 'COLUMNS'; HELP: 'HELP'; TEMPTABLE: 'TEMPTABLE'; diff --git a/src/parser/evaql/evaql_parser.g4 b/src/parser/evaql/evaql_parser.g4 index e3ecec90c7..8ad8650dc8 100644 --- a/src/parser/evaql/evaql_parser.g4 +++ b/src/parser/evaql/evaql_parser.g4 @@ -337,13 +337,11 @@ constant // Data Types dataType - : TEXT - lengthOneDimension? #stringDataType - | INTEGER - lengthOneDimension? UNSIGNED? #dimensionDataType - | FLOAT - lengthTwoDimension? UNSIGNED? 
#dimensionDataType - | BOOLEAN #simpleDataType + : BOOLEAN #simpleDataType + | TEXT lengthOneDimension? #dimensionDataType + | INTEGER UNSIGNED? #integerDataType + | FLOAT lengthTwoDimension? UNSIGNED? #dimensionDataType + | NDARRAY lengthDimensionList #dimensionDataType ; lengthOneDimension @@ -354,6 +352,9 @@ lengthTwoDimension : '(' decimalLiteral ',' decimalLiteral ')' ; +lengthDimensionList + : '(' (decimalLiteral ',')* decimalLiteral ')' + ; // Common Lists diff --git a/src/parser/parser_visitor.py b/src/parser/parser_visitor.py index b1aa544a80..cdf0c28bc7 100644 --- a/src/parser/parser_visitor.py +++ b/src/parser/parser_visitor.py @@ -118,36 +118,96 @@ def visitCreateDefinitions( def visitColumnDeclaration( self, ctx: evaql_parser.ColumnDeclarationContext): - data_type = self.visit(ctx.columnDefinition()) + + data_type, dimensions = self.visit(ctx.columnDefinition()) column_name = self.visit(ctx.uid()) - column = DataframeColumn(column_name, data_type) + column = DataframeColumn(column_name, data_type, + array_dimensions=dimensions) return column def visitColumnDefinition(self, ctx: evaql_parser.ColumnDefinitionContext): - data_type = self.visit(ctx.dataType()) - return data_type + + data_type, dimensions = self.visit(ctx.dataType()) + return data_type, dimensions + + def visitSimpleDataType(self, ctx: evaql_parser.SimpleDataTypeContext): + + data_type = None + dimensions = [] + + if ctx.BOOLEAN() is not None: + data_type = ColumnType.BOOLEAN + + return data_type, dimensions + + def visitIntegerDataType(self, ctx: evaql_parser.IntegerDataTypeContext): + + data_type = None + dimensions = [] + + if ctx.INTEGER() is not None: + data_type = ColumnType.INTEGER + elif ctx.UNSIGNED() is not None: + data_type = ColumnType.INTEGER + + return data_type, dimensions def visitDimensionDataType( self, ctx: evaql_parser.DimensionDataTypeContext): + data_type = None + dimensions = [] - column_type = None if ctx.FLOAT() is not None: - column_type = ColumnType.FLOAT - elif ctx.INTEGER() is not None: - column_type = ColumnType.INTEGER - elif ctx.UNSIGNED() is not None: - column_type = ColumnType.INTEGER + data_type = ColumnType.FLOAT + dimensions = self.visit(ctx.lengthTwoDimension()) + elif ctx.TEXT() is not None: + data_type = ColumnType.TEXT + dimensions = self.visit(ctx.lengthOneDimension()) + elif ctx.NDARRAY() is not None: + data_type = ColumnType.NDARRAY + dimensions = self.visit(ctx.lengthDimensionList()) + + return data_type, dimensions + + def visitLengthOneDimension( + self, ctx: evaql_parser.LengthOneDimensionContext): + dimensions = [] + + if ctx.decimalLiteral() is not None: + dimensions = [self.visit(ctx.decimalLiteral())] + + return dimensions + + def visitLengthTwoDimension( + self, ctx: evaql_parser.LengthTwoDimensionContext): + first_decimal = self.visit(ctx.decimalLiteral(0)) + second_decimal = self.visit(ctx.decimalLiteral(1)) + + print(first_decimal, second_decimal) + dimensions = [first_decimal, second_decimal] + return dimensions + + def visitLengthDimensionList( + self, ctx: evaql_parser.LengthDimensionListContext): + dimensions = [] + dimension_index = 0 + for child in ctx.children: + decimal_literal = ctx.decimalLiteral(dimension_index) + if decimal_literal is not None: + decimal = self.visit(decimal_literal) + dimensions.append(decimal) + dimension_index = dimension_index + 1 - return column_type + return dimensions - def visitStringDataType(self, ctx: evaql_parser.StringDataTypeContext): + def visitDecimalLiteral(self, ctx: evaql_parser.DecimalLiteralContext): - 
From 20bfd6578791f1dea115a0305710ea538424910e Mon Sep 17 00:00:00 2001
From: jarulraj
Date: Wed, 5 Feb 2020 14:29:14 -0500
Subject: [PATCH 79/82] Checkpoint

---
 test/parser/test_parser.py | 2 --
 1 file changed, 2 deletions(-)

diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py
index 093ccc9d69..76f026f4ea 100644
--- a/test/parser/test_parser.py
+++ b/test/parser/test_parser.py
@@ -55,8 +55,6 @@ def test_single_statement_queries(self):
         parser = Parser()

         single_queries = []
-        single_queries.append(
-            "CREATE TABLE IF NOT EXISTS Persons (Frame_ID INTEGER);")
         single_queries.append("SELECT CLASS FROM TAIPAI;")
         single_queries.append("SELECT CLASS FROM TAIPAI WHERE CLASS = 'VAN';")
         single_queries.append("SELECT CLASS,REDNESS FROM TAIPAI \

From 40875acd63e124ddf5c75ac01641e91117148a8e Mon Sep 17 00:00:00 2001
From: jarulraj
Date: Thu, 6 Feb 2020 12:52:59 -0500
Subject: [PATCH 80/82] Fixed parser issue (due to TupleValueExpression)

---
 src/demo.py                              |  4 ++--
 src/expression/tuple_value_expression.py | 12 +++++-----
 src/parser/parser.py                     | 18 ++++++++------
 test/expression/test_aggregation.py      | 30 ++++++++++++------------
 test/parser/test_parser.py               |  2 ++
 5 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/src/demo.py b/src/demo.py
index 07edcef83a..3500e44d00 100644
--- a/src/demo.py
+++ b/src/demo.py
@@ -38,7 +38,6 @@ def default(self, query):
         else:
             try:
-
                 # Connect and Query from Eva
                 parser = EvaFrameQLParser()
                 eva_statement = parser.parse(query)
@@ -52,7 +51,8 @@
             input_video = []
             for filename in glob.glob('data/sample_video/*.jpg'):
                 im = Image.open(filename)
-                im_copy = im.copy()  # to handle 'too many open files' error
+                # to handle 'too many open files' error
+                im_copy = im.copy()
                 input_video.append(im_copy)
                 im.close()

diff --git a/src/expression/tuple_value_expression.py b/src/expression/tuple_value_expression.py
index 875e990eb2..5b9869f422 100644
--- a/src/expression/tuple_value_expression.py
+++ b/src/expression/tuple_value_expression.py
@@ -17,14 +17,15 @@

 class TupleValueExpression(AbstractExpression):
-    def __init__(self, col_id: int, col_name: str = None,
-                 table_name: str = None):
+    def __init__(self, col_name: str = None, table_name: str = None,
+                 col_idx: int = -1):
         super().__init__(ExpressionType.TUPLE_VALUE,
                          rtype=ExpressionReturnType.INVALID)
         self._col_name = col_name
         self._table_name = table_name
         self._table_metadata_id = None
-        self._col_metadata_id = col_id
+        self._col_metadata_id = None
+        self._col_idx = col_idx

     @property
     def table_metadata_id(self) -> int:
@@ -52,9 +53,8 @@ def col_name(self) -> str:

     # remove this once done with tuple class
     def evaluate(self, *args):
-        tuple1 = None
         if args is None:
             # error Handling
             pass
-        tuple1 = args[0]
-        return tuple1[(self._col_metadata_id)]
+        given_tuple = args[0]
+        return given_tuple[(self._col_idx)]
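The reworked constructor takes the positional column index as an explicit keyword argument, and `evaluate` now indexes the incoming tuple with `col_idx` instead of the catalog metadata id. A minimal usage sketch (the import path is assumed from the file layout above):

```python
from src.expression.tuple_value_expression import TupleValueExpression  # assumed path

expr = TupleValueExpression(col_name="CLASS", col_idx=0)
row = ["VAN", 0.95]
# evaluate(*args) reads args[0] and returns its col_idx-th element
assert expr.evaluate(row) == "VAN"
```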
diff --git a/src/parser/parser.py b/src/parser/parser.py
index 6ffca758d9..c76bc504d6 100644
--- a/src/parser/parser.py
+++ b/src/parser/parser.py
@@ -25,27 +25,30 @@

 class MyErrorListener(ErrorListener):
     # Reference
-    # https://stackoverflow.com/questions/33847547/
-    # antlr4-terminate-on-lexer-parser-error-python
+    # https://www.antlr.org/api/Java/org/antlr/v4/runtime/BaseErrorListener.html

     def __init__(self):
         super(MyErrorListener, self).__init__()

     def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
-        raise Exception("Oh no!!")
+        error_str = "ERROR: Syntax error - Line " + str(line) + ": Col " +\
+                    str(column) + " - " + str(msg)
+        raise Exception(error_str)

     def reportAmbiguity(self, recognizer, dfa, startIndex, stopIndex, exact,
                         ambigAlts, configs):
-        raise Exception("Oh no!!")
+        error_str = "ERROR: Ambiguity -" + str(configs)
+        raise Exception(error_str)

     def reportAttemptingFullContext(self, recognizer, dfa, startIndex,
                                     stopIndex, conflictingAlts, configs):
-        error_str = "ERROR: Attempting full context -" + str(configs)
+        error_str = "ERROR: Attempting Full Context -" + str(configs)
         raise Exception(error_str)

     def reportContextSensitivity(self, recognizer, dfa, startIndex, stopIndex,
                                  prediction, configs):
-        raise Exception("Oh no!!")
+        error_str = "ERROR: Context Sensitivity -" + str(configs)
+        raise Exception(error_str)


 class Parser(object):
@@ -70,7 +73,8 @@ def parse(self, query_string: str) -> list:
         stream = CommonTokenStream(lexer)

         parser = evaql_parser(stream)
-        parser._listeners = [self._error_listener]
+        # Attach error listener for debugging parser errors
+        # parser._listeners = [self._error_listener]

         tree = parser.root()
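With the custom listener now left detached (the assignment above is commented out), one alternative for debugging sessions is to attach it through ANTLR's public hooks rather than the private `_listeners` field. This is a sketch assuming the `antlr4` Python runtime, not part of the patch:

```python
# Inside Parser.parse(), after constructing the parser:
parser = evaql_parser(stream)
parser.removeErrorListeners()               # drop the default console listener
parser.addErrorListener(MyErrorListener())  # raise on syntax errors instead
```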
diff --git a/test/expression/test_aggregation.py b/test/expression/test_aggregation.py
index a0a5f186dd..7d3d3995cd 100644
--- a/test/expression/test_aggregation.py
+++ b/test/expression/test_aggregation.py
@@ -25,51 +25,51 @@ def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)

     def test_aggregation_sum(self):
-        columnName = TupleValueExpression(0)
+        columnName = TupleValueExpression(col_idx=0)
         aggr_expr = AggregationExpression(
             ExpressionType.AGGREGATION_SUM,
             None,
             columnName
         )
-        tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
-        self.assertEqual(6, aggr_expr.evaluate(tuple1, None))
+        tuples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
+        self.assertEqual(6, aggr_expr.evaluate(tuples, None))

     def test_aggregation_count(self):
-        columnName = TupleValueExpression(0)
+        columnName = TupleValueExpression(col_idx=0)
         aggr_expr = AggregationExpression(
             ExpressionType.AGGREGATION_COUNT,
             None,
             columnName
         )
-        tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
-        self.assertEqual(3, aggr_expr.evaluate(tuple1, None))
+        tuples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
+        self.assertEqual(3, aggr_expr.evaluate(tuples, None))

     def test_aggregation_avg(self):
-        columnName = TupleValueExpression(0)
+        columnName = TupleValueExpression(col_idx=0)
         aggr_expr = AggregationExpression(
             ExpressionType.AGGREGATION_AVG,
             None,
             columnName
         )
-        tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
-        self.assertEqual(2, aggr_expr.evaluate(tuple1, None))
+        tuples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
+        self.assertEqual(2, aggr_expr.evaluate(tuples, None))

     def test_aggregation_min(self):
-        columnName = TupleValueExpression(0)
+        columnName = TupleValueExpression(col_idx=0)
         aggr_expr = AggregationExpression(
             ExpressionType.AGGREGATION_MIN,
             None,
             columnName
         )
-        tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
-        self.assertEqual(1, aggr_expr.evaluate(tuple1, None))
+        tuples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
+        self.assertEqual(1, aggr_expr.evaluate(tuples, None))

     def test_aggregation_max(self):
-        columnName = TupleValueExpression(0)
+        columnName = TupleValueExpression(col_idx=0)
         aggr_expr = AggregationExpression(
             ExpressionType.AGGREGATION_MAX,
             None,
             columnName
         )
-        tuple1 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
-        self.assertEqual(3, aggr_expr.evaluate(tuple1, None))
\ No newline at end of file
+        tuples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
+        self.assertEqual(3, aggr_expr.evaluate(tuples, None))
\ No newline at end of file

diff --git a/test/parser/test_parser.py b/test/parser/test_parser.py
index 76f026f4ea..47941a807b 100644
--- a/test/parser/test_parser.py
+++ b/test/parser/test_parser.py
@@ -72,6 +72,8 @@
             self.assertIsInstance(
                 eva_statement_list[0], AbstractStatement)

+            print(eva_statement_list[0])
+
     def test_multiple_statement_queries(self):

         parser = Parser()

From b54e66509117d5ad30619c8118e02a2775838595 Mon Sep 17 00:00:00 2001
From: Joy Arulraj
Date: Thu, 6 Feb 2020 13:08:03 -0500
Subject: [PATCH 81/82] Update README.md

---
 README.md | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 1f44e3551c..83eed248c5 100644
--- a/README.md
+++ b/README.md
@@ -7,12 +7,7 @@ EVA is an end-to-end video analytics engine that allows users to query a databas

 ## Table of Contents
 * [Installation](#installation)
-* [Demos](#demos)
-* [Unit Tests](#unit-tests)
-* [Eva Core](#eva-core)
-* [Eva Storage](#eva-storage)
-* [Dataset](#dataset)
-
+* [Development](#development)

 ## Installation
@@ -44,7 +39,7 @@ conda activate eva
 git config core.hooksPath .githooks
 ```

-## Development / Contributing
+## Development

 We invite you to help us build the future of visual data management DBMSs.
@@ -106,7 +101,7 @@ Module location: *src/udfs*

 #### Loaders
 The loaders load the dataset with the attributes specified in the
 *Accelerating Machine Learning Inference with Probabilistic Predicates*
 by Yao et al.

-Module location: */loaders*
+Module location: *src/loaders*

 ## Status

From 88c1996966124df6396e78e1fbfef1fdc524d265 Mon Sep 17 00:00:00 2001
From: Joy Arulraj
Date: Thu, 6 Feb 2020 13:08:41 -0500
Subject: [PATCH 82/82] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 83eed248c5..7067861745 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@ EVA is an end-to-end video analytics engine that allows users to query a databas
 ## Table of Contents
 * [Installation](#installation)
 * [Development](#development)
+* [Architecture](#architecture)

 ## Installation
@@ -63,7 +64,7 @@
 We invite you to help us build the future of visual data management DBMSs.

 Please look up the [contributing guide](https://github.com/georgia-tech-db/eva/blob/master/CONTRIBUTING.md#development) for details.

-## EVA Architecture
+## Architecture

 The EVA visual data management system consists of four core components: