
Dev/gfql endpoint #615

Open · lmeyerov wants to merge 16 commits into master

Conversation

lmeyerov (Contributor) commented Nov 28, 2024:

Some very exciting things happening here -- maybe time to call pygraphistry 1.x.x ??

Remote dataset binding

Bind to remote (lazy):

g1 = graphistry.bind(dataset_id="abc123")

Decouple upload from plot(), and bind the upload:

g_uploaded = g_unuploaded.upload()
print(g_uploaded._dataset_id)
print(g_uploaded._nodes_file_id)
print(g_uploaded._edges_file_id)
print(g_uploaded._url)
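
A minimal sketch of the round trip this enables, for illustration: persist the id from an upload, then lazily re-bind in a later session instead of re-uploading (the variable names are illustrative):

import graphistry

dataset_id = g_uploaded._dataset_id  # from the upload above

# ... later, or in another session ...
g_later = graphistry.bind(dataset_id=dataset_id)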

Remote GFQL

# + auto-uploads if not already
g2 = g1.chain_remote([....], engine='gpu')

# no need to download results if you just care whether matched vs not
df_meta = g1.chain_remote_shape([...])
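
A concrete sketch of the two calls above, assuming chain_remote accepts the same operator list as local chain(); the node filters are illustrative:

from graphistry import n, e_forward

# Two-hop pattern run server-side on GPU (filters illustrative)
g2 = g1.chain_remote(
    [n({"type": "user"}), e_forward(), n({"flagged": True})],
    engine='gpu',
)

# Metadata-only variant: check how much matched without downloading results
df_meta = g1.chain_remote_shape([n({"type": "user"}), e_forward(), n()])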

Remote Python too!

json_obj = g1.python_remote("""
from graphistry import Plottable

def task(g: Plottable):
  return {'status': True}
""")

Notes

Changes

  • Now tracks dataset_id, nodes_file_id, edges_file_id, and url, including clearing them out whenever the nodes/edges get re-bound with new data

  • plot(render=...) expanded from bool to Union[bool, Literal["auto", "g", "ipython", "databricks", "browser"]] to make rendering behavior more predictable (sketch below)
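
A hedged sketch of the expanded render= values for some Plottable g, assuming the bool cases keep their old meaning:

g.plot(render=False)        # don't render; return the viz URL
g.plot(render="auto")       # pick the best target for the environment
g.plot(render="browser")    # force opening a local browser tab
g.plot(render="ipython")    # force inline IPython/Jupyter display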

Review thread on a docstring:

:param api_token: Optional JWT token. If not provided, refreshes JWT and uses that.
:type api_token: Optional[str]

:param dataset_id: Optional dataset_id. If not provided, will upload current data, store that dataset_id, and run GFQL against that.
Contributor:

comment references gfql here

Contributor:

Also could clarify that if a dataset_id exists on the Plottable it will be reused, as opposed to re-uploading every time python_remote is called with it omitted.

lmeyerov (Contributor, Author):

So this is a bit problematic in that this would trigger re-upload, and I'm not sure that's 'right':

g2 = g1....

# auto-upload 1
res1 = g2.python_remote(script1)

# auto-upload 2
res2 = g2.python_remote(script2)

A few ideas:

  • Inplace update of g2, breaking the purity of g3 = g2.xyz()
  • Some sort of global memoization trick, so even if g2._dataset_id is not mutated on first upload, we can detect that g2 did have a recent upload. Ex: some sort of global weak-reference lookup table of the last n g objects (see the sketch after this list)?
  • Don't worry about it, and encourage users to do an explicit g2 = g1.upload()...
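
A minimal sketch of the weak-reference idea, assuming Plottable instances are hashable and weak-referenceable; the cache and helper names here are hypothetical, not part of the PR:

import weakref

# Hypothetical module-level cache: maps a Plottable to its dataset_id
# without mutating it; entries vanish once the object is garbage-collected
_upload_cache: "weakref.WeakKeyDictionary" = weakref.WeakKeyDictionary()

def _remember_upload(g, dataset_id: str) -> None:
    _upload_cache[g] = dataset_id

def _lookup_upload(g):
    # Returns the recorded dataset_id, or None if g was never uploaded
    return _upload_cache.get(g)

One nice property: since re-binding nodes/edges returns a fresh object, a rebound g naturally misses the cache and triggers a new upload, which matches the clearing behavior described in the Changes above.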

mj3cheun (Contributor) commented Nov 29, 2024:

Didn't consider this... this can't happen in python_remote because that never modifies the Plottable, but chain_remote definitely does.

My opinion is that perhaps we should avoid self-mutation of the Plottable and instead return a "clone" (with identical metadata, etc.) of the original Plottable with the updated node and edge lists? This should also better mirror the behaviour of non-remote chain anyway (see the sketch below).

I don't have strong opinions on auto-uploading the clone.
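
A minimal sketch of that clone-and-return shape, assuming the usual immutable Plottable API where nodes()/edges() return copies; _fetch_remote_result is a hypothetical stand-in for the endpoint call:

from typing import Any, List, Tuple
import pandas as pd

def _fetch_remote_result(dataset_id: str, ops: List[Any], engine: str) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Hypothetical stand-in for the GFQL endpoint round trip
    raise NotImplementedError

def chain_remote_clone(g, ops: List[Any], engine: str = 'gpu'):
    nodes_df, edges_df = _fetch_remote_result(g._dataset_id, ops, engine)
    # Build a new Plottable carrying the results; g itself is left
    # unmodified, mirroring the behavior of local chain()
    return g.nodes(nodes_df).edges(edges_df)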

lmeyerov (Contributor, Author):

Does the python endpoint return a Plottable, JSON, or any? How would I know ahead of time, or how would the python client sniff?

mj3cheun (Contributor) commented Nov 29, 2024:

For now the python endpoint returns either a string or JSON. Which one it returns depends entirely on the return type of the function defined (by the user) in execute. The python endpoint code will detect the type of the returned value and use the appropriate response format.

You can read the MIME type of the response to tell which.
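
A minimal client-side sniffing sketch, assuming an HTTP response from the python endpoint; the URL and payload shape here are illustrative, not the real API:

import requests

resp = requests.post(
    "https://hub.graphistry.com/api/...",  # illustrative endpoint
    json={"execute": "..."},               # illustrative payload
)
content_type = resp.headers.get("Content-Type", "")

if content_type.startswith("application/json"):
    result = resp.json()  # the user's function returned a JSON-able value
else:
    result = resp.text    # the user's function returned a string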

lmeyerov (Contributor, Author) commented Nov 29, 2024:

Hmm, it might not be so bad to extend the python endpoint to reuse the gfql endpoint's machinery here, I'll take a look.

lmeyerov (Contributor, Author):

(mimetype is cool, that helps!)

lmeyerov (Contributor, Author) commented Nov 29, 2024:

I agree on avoiding mutation of the base obj, maybe we do some variants like:

# generic helper
_ : Any = g1.remote_python("...", output_mode=...)

# explicit mypy-friendly
g2 = g1.remote_python_g("...")
g2, g1_bound = g1.remote_python_g2("...")
o = g1.remote_python_json("...")
o, g1_bound = g1.remote_python_json2("...")

I'm tempted to not do the <xyz>2 variants above, and if someone wants to reuse uploads, they have to be explicit:

g1 = g0.upload(...)
# or g1 = graphistry.bind(dataset_id="...")

g2 = g1.remote_python_g(...)
