pyspark-session-2021-05-05-join.txt
PySpark documentation for the RDD join() function:
http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.join.html
$ ./bin/pyspark
Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Python version 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018 02:44:43)
Spark context Web UI available at http://10.0.0.93:4040
Spark context available as 'sc' (master = local[*], app id = local-1620269740798).
SparkSession available as 'spark'.
>>>
>>> x = spark.sparkContext.parallelize([("spark", 1), ("hadoop", 4)])
>>> x.collect()
[
('spark', 1),
('hadoop', 4)
]
>>>
>>> y = spark.sparkContext.parallelize([("spark", 2), ("hadoop", 5)])
>>> y.collect()
[
('spark', 2),
('hadoop', 5)
]
>>>
>>> joined = x.join(y)
>>> joined.collect()
[
('spark', (1, 2)),
('hadoop', (4, 5))
]
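
The joined values arrive as (left, right) tuples, so a natural follow-up is to
combine them with mapValues. The next lines are a sketch added for
illustration, not part of the captured session (sorted() gives a stable
ordering):

>>> # added example: sum the paired values per key
>>> sorted(joined.mapValues(lambda v: v[0] + v[1]).collect())
[('hadoop', 9), ('spark', 3)]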
>>>
>>>
>>> x = spark.sparkContext.parallelize([("a", 1), ("b", 4), ("c", 4)])
>>> x.collect()
[('a', 1), ('b', 4), ('c', 4)]
>>> y = spark.sparkContext.parallelize([("a", 2), ("a", 3), ("a", 7), ("d", 8)])
>>> y.collect()
[('a', 2), ('a', 3), ('a', 7), ('d', 8)]
>>>
>>> joined = x.join(y)
>>> joined.collect()
[('a', (1, 2)), ('a', (1, 3)), ('a', (1, 7))]
>>>
>>>
>>> joined.count()
3
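
join() is an inner join: 'b' and 'c' appear only in x, and 'd' only in y, so
all three keys are dropped. PySpark also provides leftOuterJoin,
rightOuterJoin, and fullOuterJoin to keep unmatched keys; the lines below are
an added sketch, not part of the captured session:

>>> # added example: unmatched keys survive with None on the missing side
>>> sorted(x.leftOuterJoin(y).collect())
[('a', (1, 2)), ('a', (1, 3)), ('a', (1, 7)), ('b', (4, None)), ('c', (4, None))]
>>> sorted(x.fullOuterJoin(y).collect())
[('a', (1, 2)), ('a', (1, 3)), ('a', (1, 7)), ('b', (4, None)), ('c', (4, None)), ('d', (None, 8))]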
>>> x = spark.sparkContext.parallelize([("a", 1), ("b", 4), ("b", 5), ("c", 4)])
>>> x.collect()
[('a', 1), ('b', 4), ('b', 5), ('c', 4)]
>>>
>>> y = spark.sparkContext.parallelize([("a", 2), ("a", 3), ("a", 7), ("b", 61), ("b", 71), ("d", 8)])
>>> y.collect()
[('a', 2), ('a', 3), ('a', 7), ('b', 61), ('b', 71), ('d', 8)]
>>> joined = x.join(y)
>>> joined.collect()
[
('b', (4, 61)),
('b', (4, 71)),
('b', (5, 61)),
('b', (5, 71)),
('a', (1, 2)),
('a', (1, 3)),
('a', (1, 7))
]
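
With duplicate keys on both sides, join() emits every (left, right)
combination per key: 'b' contributes 2 x 2 = 4 pairs and 'a' contributes
1 x 3 = 3 pairs. A quick check, added for illustration:

>>> # added example: 4 pairs for 'b' plus 3 pairs for 'a'
>>> joined.count()
7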
>>>
>>> # pyspark.RDD.cartesian
>>> # RDD.cartesian(other)
>>> # Return the Cartesian product of this RDD and another one,
>>> # that is, the RDD of all pairs of elements (a, b) where a is
>>> # in self and b is in other.
>>> # Example:
>>>
>>> rdd = spark.sparkContext.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]
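
cartesian() also works across two different RDDs. The following sketch is an
added example with hypothetical RDD names (nums, letters), not part of the
captured session:

>>> # added example: cross two distinct RDDs (names are hypothetical)
>>> nums = spark.sparkContext.parallelize([1, 2])
>>> letters = spark.sparkContext.parallelize(["a", "b"])
>>> sorted(nums.cartesian(letters).collect())
[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]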