[REVIEW] Bug ext create column #232

seunghwak · 2019-04-19T16:12:05Z

Closes #186, Closes #189.

…ck allocation, these objects are only used to pass the data encapsulated in Python cudf Series objects to C++ functions expecting (pointers to) C++ gdf_column objects, and sizeof(gdf_column) is not large enough to blow the stack, so there is no need to incur heap allocation overhead and risk a memory leak (if we forget to free).

…column_view (stack-allocation for gdf_column)

…using ReadMtxFile, fixed this.

…(no spacing between < and >), cudf is inconsistent in placing a space between > and the name of the variable to be cast, so left this part as is.

…, cudf's gdf_column_view does not properly initialize col_name to nullptr, so freeing col_name can result in freeing unallocated memory; this problem should be cleaned up once cudf finishes redesigning cudf::column.

…nter will break strict-aliasing rules [-Wstrict-aliasing])

…- 1 with num_vertices(), this is a better abstraction and less vulnerable to low level changes in class Graph

…d one.

…oved one.

…so, view_transposed_adj_list can better mirror view_adj_list (except for replacing adjList with transposedAdjList)

jwyles · 2019-04-19T21:19:05Z

python/cugraph/graph/c_graph.pyx

        else:
-            value = create_column(self.adj_list_value_col)
+            c_value_col = get_gdf_column_view(self.adj_list_value_col)


Is there a scoping issue here? You are creating an object on the stack inside of a code block and assigning a pointer to it which is used outside of the code block.

283 cdef gdf_column c_value_col

c_value_col is defined at line 283, and it is not used after

291 err = gdf_adj_list_view(<gdf_graph*> graph,
292                         &c_offset_col,
293                         &c_index_col,
294                         c_value_col_ptr)

Please note that the actual column array data is stored in self.adj_list_value_col (a Python object), and c_value_col is a temporary C++ wrapper used to pass the column data held by the Python object to a C++ function expecting a C++ data type.

It seems to me that c_value_col is still in scope when it is used in gdf_adj_list_view.
However, there is also a small consistency issue here as c_offset_col and c_index_col are accessed from the stack objects while a pointer to the stack object is created for the values.

The inconsistency exists because value is optional (it can be NULL). I can either use value_col_ptr or replicate the gdf_edge_list_view call for the two different cases (value_col is None, and otherwise). I chose to use value_col_ptr.
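The optional-pointer pattern discussed above can be sketched in plain Python with ctypes. This is only an illustration under stated assumptions: `Column`, `column_view`, `adj_list_view`, and `view` are hypothetical stand-ins, not the real gdf_column, get_gdf_column_view, or gdf_adj_list_view. The wrappers are declared at function scope, so they are still alive at the call, and the value column is passed by pointer so that it can be NULL (None) when no values exist.

```python
import ctypes


class Column(ctypes.Structure):
    """Hypothetical stand-in for gdf_column: a data address plus a length."""
    _fields_ = [("data", ctypes.c_void_p), ("size", ctypes.c_size_t)]


def column_view(buf):
    """Wrap an existing ctypes array without copying; buf still owns the data."""
    return Column(ctypes.addressof(buf), len(buf))


def adj_list_view(offsets, indices, values_ptr):
    """Stand-in for the C++ callee; values_ptr may be NULL (None here)."""
    n_values = values_ptr.contents.size if values_ptr else 0
    return offsets.size, indices.size, n_values


def view(offset_buf, index_buf, value_buf=None):
    # The wrappers live at function scope, so they outlive the if-block below.
    c_offset_col = column_view(offset_buf)
    c_index_col = column_view(index_buf)
    c_value_col_ptr = None  # optional argument: stays NULL when there are no values
    if value_buf is not None:
        c_value_col = column_view(value_buf)
        c_value_col_ptr = ctypes.pointer(c_value_col)
    return adj_list_view(c_offset_col, c_index_col, c_value_col_ptr)


offsets = (ctypes.c_int * 4)(0, 2, 2, 3)
indices = (ctypes.c_int * 3)(1, 2, 0)
values = (ctypes.c_float * 3)(0.5, 1.0, 2.0)
unweighted = view(offsets, indices)
weighted = view(offsets, indices, values)
```

A single call site handles both the weighted and the unweighted case, which is the trade-off described in the reply: one pointer variable instead of two duplicated calls.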

jwyles · 2019-04-19T21:22:43Z

python/cugraph/graph/c_graph.pyx

        err = gdf_add_adj_list(g)
        cudf.bindings.cudf_cpp.check_gdf_error(err)

-        col_size_off = g.adjList.offsets.size
-        col_size_ind = g.adjList.indices.size
+        offset_col_size = self.num_vertices() + 1


I think it's better to just take the size of the offsets column directly rather than calling self.num_vertices(); it's less complicated.

I may not agree with this. We'd better hide C++ data structure (gdf_graph) internals from the functions using gdf_graph. If we use num_vertices(), it does not matter whether we change gdf_graph internals or not as long as we have num_vertices(). If some change is necessary, we just need to update num_vertices(). We are about to update graph data structures to match NetworkX, and in this case, we would need to search the entire code base and replace every g.adjList.offsets.size - 1 (and g.transposedAdjList.offsets.size - 1 if g.adjList is not available and g.transposedAdjList is available) to match the new data structure. This will become more and more problematic as we add more graph functions; there will simply be more places we need to track.

See Hiding data (and code) in https://herbsutter.com/2013/08/12/gotw-94-solution-aaa-style-almost-always-auto/

Hiding data structure internals is a key principle in object oriented programming.

And cudf folks are working on replacing gdf_column with cudf::column to provide better abstraction (rapidsai/cudf#1443).

We need to do something similar for gdf_graph; and I think eventually most (if not all) gdf_graph member variables should become private and should not be directly accessed outside gdf_graph.

I still disagree. While you are right that it is good to hide the internals of the graph object, the issue is that we aren't looking for the number of vertices in the graph, but for the size of the offsets array. Using self.num_vertices() + 1 to get that is still relying on information on the graph object's internals, just doing it in a more roundabout way.

Thank you for your reply!!!

See "The array IA is of length m + 1." from https://en.wikipedia.org/wiki/Sparse_matrix (here, m is the number of vertices).

self.num_vertices() + 1 (or m + 1 if I use notation from the above link) is from the general knowledge about the CSR format, not from the graph structure internals. Hope this helps to address your concern.
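The m + 1 property can be checked with a small sketch that builds a CSR adjacency list from an edge list. `edges_to_csr` is a hypothetical helper written for this illustration; it is not part of cugraph or cudf.

```python
def edges_to_csr(num_vertices, edges):
    """Build CSR (offsets, indices) from a list of (source, destination) pairs."""
    # Count the out-degree of every vertex.
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1

    # Prefix sum: offsets[v] is where vertex v's neighbors start in indices.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]

    # Scatter destinations into their slots, keeping a write cursor per vertex.
    indices = [0] * len(edges)
    cursor = offsets[:-1]
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices


offsets, indices = edges_to_csr(3, [(0, 1), (0, 2), (2, 0)])
```

For any input, the offsets array has one entry per vertex plus a final entry that equals the number of edges, so its length depends only on the CSR format and the vertex count.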

While I don't think this should block the PR, I'm mostly concerned about consistency here. It is now using a mix of direct structure access and accessors. If self.num_vertices()+1 is used to set the size of the offset array, it is best to use a self.num_edges() for indices arrays too or keep the previous, consistent, solution imo.

Yes, I have the same concern. The thing is that we do not have a num_edges() function, and we are planning some major restructuring to match the NetworkX API (for the 0.8 release, I guess), so I was a bit hesitant to implement a num_edges() function that will be replaced in one month. I decided to defer this work to the 0.8 release, but if this inconsistency really bothers you (it somewhat bothers me, but I can endure it if it lasts only a short time), I can do the work in this pull request.

Actually, I think I'd better do this now.

I will rename num_vertices() to number_of_nodes() and num_edges() to number_of_edges() to follow NetworkX (https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.DiGraph.number_of_nodes.html and https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.number_of_edges.html).

This will not be too much work, and it is not going to be entirely replaced.

I can also live with that as long as these changes are tracked in 0.8 release ;)
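The accessor-based approach agreed on in this thread might look like the following minimal sketch. The method names number_of_nodes() and number_of_edges() come from the discussion; the `Graph` class, its constructor arguments, and the `_any_csr` helper are assumptions made for illustration and do not reflect the actual c_graph.pyx implementation.

```python
class Graph:
    """Hypothetical graph wrapper; attribute names are illustrative only."""

    def __init__(self, adj_list=None, transposed_adj_list=None):
        # Each argument is an (offsets, indices) pair in CSR layout, or None.
        self._adj_list = adj_list
        self._transposed_adj_list = transposed_adj_list

    def _any_csr(self):
        # Either representation describes the same vertices and edges.
        csr = self._adj_list
        if csr is None:
            csr = self._transposed_adj_list
        if csr is None:
            raise ValueError("graph has no adjacency list")
        return csr

    def number_of_nodes(self):
        offsets, _ = self._any_csr()
        return len(offsets) - 1

    def number_of_edges(self):
        _, indices = self._any_csr()
        return len(indices)


# Callers size their output columns through accessors only.
g = Graph(transposed_adj_list=([0, 1, 2, 3], [2, 0, 0]))
offset_col_size = g.number_of_nodes() + 1
index_col_size = g.number_of_edges()
```

With both sizes coming from accessors, callers no longer mix direct member access with method calls, and a later change to the internal layout only needs to touch the two accessor bodies.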

jwyles · 2019-04-19T21:24:43Z

python/cugraph/graph/c_graph.pyx

        err = gdf_add_transposed_adj_list(g)
        cudf.bindings.cudf_cpp.check_gdf_error(err)

-        off_size = g.transposedAdjList.offsets.size
-        ind_size = g.transposedAdjList.indices.size
+        offset_col_size = self.num_vertices() + 1


Same here, getting the size of the column directly is simpler.

I disagree for the same reason.

jwyles · 2019-04-19T21:26:59Z

python/cugraph/graph/c_graph.pyx

-        off_size = g.transposedAdjList.offsets.size
-        ind_size = g.transposedAdjList.indices.size
+        offset_col_size = self.num_vertices() + 1
+        inex_col_size = g.transposedAdjList.indices.size


I think you may have meant 'index_col_size' here.

Great catch!!! I made a fix.

Iroy30 · 2019-04-22T13:04:22Z

python/cugraph/jaccard/wjaccard_wrapper.pyx

+    cdef gdf_column c_weight_col
+    cdef gdf_column c_first_col
+    cdef gdf_column c_second_col
+    cdef gdf_column c_indices_col


Should this be c_index_col?

Thanks for finding this!!! I fixed it.

Iroy30

Looks good to me


seunghwak added 16 commits April 19, 2019 09:01

removed trailing spaces

97b9216

replaced create_column (heap-allocation for gdf_column) with get_gdf_column_view (stack-allocation for gdf_column)

ff98880

ReadMtxFile is renamed to read_mtx_file, but the comments were still using ReadMtxFile, fixed this.

b088414

updated indentation inside cython typecasting operator to match cudf (no spacing between < and >), cudf is inconsistent in placing a space between > and the name of the variable to be cast, so left this part as is.

971c1bd

updated comments on get_gdf_column_view

d5f37e6

fixed a bug (not properly freeing valid) in gdf_col_delete

5b37c84

removed tab space in algorithms.h

57daf68

fixed a warning in louvain_wrapper.pyx (dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing])

dea90dc

replaced adjList.offsets.size - 1 and transposedAdjList.offsets.size - 1 with num_vertices(), this is a better abstraction and less vulnerable to low level changes in class Graph

71a6bf7

there were two implementations of num_vertices in class Graph, removed one.

839bb3c

there were two implementations of delete_adj_list in class Graph, removed one.

85adcd8

changed variable names in view_adj_list and view_transposed_adj_list so view_transposed_adj_list can better mirror view_adj_list (except for replacing adjList with transposedAdjList)

ee30f6f

removed unnecessary imports

2b96d97

updated change log.

ee1f025

seunghwak changed the title ~~Bug ext create column~~ [REVIEW] Bug ext create column Apr 19, 2019

seunghwak marked this pull request as ready for review April 19, 2019 17:20

seunghwak mentioned this pull request Apr 19, 2019

[BUG] Graph.num_vertices is defined twice in c_graph.pyx #235

Closed

afender requested review from jwyles and Iroy30 April 19, 2019 20:27

jwyles reviewed Apr 19, 2019

fixed typo in variable name

ab837dd

Iroy30 reviewed Apr 22, 2019

BradReesWork requested review from jwyles, afender and Iroy30 April 22, 2019 14:18

BradReesWork added the 3 - Ready for Review label Apr 22, 2019

fixed a typo

65a6310

Iroy30 approved these changes Apr 22, 2019

afender approved these changes Apr 22, 2019

afender merged commit e008ba3 into rapidsai:branch-0.7 Apr 22, 2019

afender mentioned this pull request Apr 22, 2019

[WIP] edit tests for reading csv #233

Merged

seunghwak deleted the bug_ext_create_column branch April 22, 2019 17:40
