Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: update pgx to v4 #820

Merged
merged 2 commits into from
Jan 26, 2022
Merged

chore: update pgx to v4 #820

merged 2 commits into from
Jan 26, 2022

Conversation

iwpnd
Copy link
Member

@iwpnd iwpnd commented Dec 28, 2021

Hey, as discussed I took a look at the pgx upgrade you wanted to do. It involved some refactoring as you expected.

Some things that took me a while were:

  • pgx, pgtype and pgproto3 are now separate packages
  • hstore as an extension does not have static OID, so the OID has to be fetched after connect and then registered as a datatype. It's the first time I encounter something like this, so I'd love to get your opinion here.
  • the pgxpool is instantiated with a connection string. For now I'm creating the string from the config parameters to allow backward compatibility. This now lends itself to allow a connection string in the config.toml.
  • it might be worth rethinking the ConfigTLS() as most of the options are gone now, and fallback connections are handled differently

@iwpnd iwpnd requested review from ARolek and gdey as code owners December 28, 2021 14:23
@coveralls
Copy link

coveralls commented Dec 28, 2021

Pull Request Test Coverage Report for Build 2ad6bcf46-PR-820

  • 94 of 142 (66.2%) changed or added relevant lines in 4 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.06%) to 45.206%

Changes Missing Coverage Covered Lines Changed/Added Lines %
atlas/map.go 0 1 0.0%
provider/postgis/util.go 14 20 70.0%
server/handle_map_layer_zxy.go 1 7 14.29%
provider/postgis/postgis.go 79 114 69.3%
Files with Coverage Reduction New Missed Lines %
provider/postgis/postgis.go 4 56.11%
Totals Coverage Status
Change from base Build b55bca432: 0.06%
Covered Lines: 5484
Relevant Lines: 12131

💛 - Coveralls

@ARolek
Copy link
Member

ARolek commented Dec 28, 2021

Well, this is incredible! I will get a review done this week. Thank you for the contribution!!!

Copy link
Member

@gdey gdey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick visual scan through the code. Nothing major jumps out at me.

Thank you for this!

LGTM

Copy link
Member

@ARolek ARolek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I have done a pass on the code and left some minor comments inline. I will work on some local QA next.

👏

provider/postgis/postgis.go Outdated Show resolved Hide resolved
provider/postgis/postgis.go Outdated Show resolved Hide resolved
provider/postgis/util_internal_test.go Outdated Show resolved Hide resolved
@ARolek
Copy link
Member

ARolek commented Dec 29, 2021

@iwpnd can you rebase the v0.15.x branch onto this one? I updated our CI so it would build some test binaries on PRs. That way if someone wants to help with QA they don't need to build from source.

@iwpnd
Copy link
Member Author

iwpnd commented Dec 31, 2021

How do you do local QA @ARolek? Maybe something I can help you with?

@ARolek
Copy link
Member

ARolek commented Jan 4, 2022

@iwpnd apologies for the slow response here. I had to vacate my family from the Marshall Fire. Our home survived but over 1000 homes are gone. Truly a sad situation.

For local QA I'm planning on spinning up Postgres 13 and Postgres 14 with some datasets to make sure everything works as expected. Did you do any of this type of testing locally?

@iwpnd
Copy link
Member Author

iwpnd commented Jan 4, 2022

@iwpnd apologies for the slow response here. I had to vacate my family from the Marshall Fire. Our home survived but over 1000 homes are gone. Truly a sad situation.

Don't apologize for that. Glad to hear you and your family are okay.

For local QA I'm planning on spinning up Postgres 13 and Postgres 14 with some datasets to make sure everything works as expected. Did you do any of this type of testing locally?

Spun up a pg13/PostGIS3.1 instance with some of our data and it worked. Wouldn't call it testing yet though. No TLS connection or anything involved. When I'm back from vacation I thought about forking an image with the change and run it on our staging cluster with some more live data.

@ARolek
Copy link
Member

ARolek commented Jan 4, 2022

Sounds good! I will report my findings as well once I get back into the swing of things. I appreciate your contributions very much. Enjoy yoru trip!

@ARolek
Copy link
Member

ARolek commented Jan 7, 2022

I have successfully tested this PR against Postgres 14 using the Natural Earth dataset. Looking good! I still need to test the TLS support, but no issues encountered thus far.

@iwpnd
Copy link
Member Author

iwpnd commented Jan 7, 2022

I'll be forking the image today and will try to connect to our pg13.4/3.1 cluster using TLS. I'll keep you posted here.

update: no issues on the TLS connection itself, but I'm getting quite some amount of these:

2022-01-07 16:11:03 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled

@ARolek
Copy link
Member

ARolek commented Jan 7, 2022

@iwpnd I just saw your update and doing some additional testing I'm hitting the same issue. I found the fix. We need to change our error statements from:

return fmt.Errorf("error replacing layer tokens for layer (%v) SQL (%v): %v", layer, sql, err)

To

return fmt.Errorf("error replacing layer tokens for layer (%v) SQL (%v): %w", layer, sql, err)

Note the change of %v -> %w. This wraps the error check in atlas/map.go fails to identify the context.Canceled error:

if err != nil {
	switch {
	case errors.Is(err, context.Canceled):
		// Do nothing if we were cancelled.

	default:
		z, x, y := tile.ZXY()
		// TODO (arolek): should we return an error to the response or just log the error?
		// we can't just write to the response as the waitgroup is going to write to the response as well
		log.Printf("err fetching tile (z: %v, x: %v, y: %v) features: %v", z, x, y, err)
	}
	return
}

Can you please update the errors in the postgis package to use wrapping?

@iwpnd
Copy link
Member Author

iwpnd commented Jan 8, 2022

While this is not the error that is raised on my end, I see what you mean and will address it.

My error occurs in atlas/handle_map_layer_zxy.go and following your example this:

	if err != nil {
		switch err {
		case context.Canceled:
			// TODO: add debug logs
			return
		default:
			errMsg := fmt.Sprintf("error marshalling tile: %v", err)
			log.Error(errMsg)
			http.Error(w, errMsg, http.StatusInternalServerError)
			return
		}
	}

should actually be:

	if err != nil {
		switch {
		case errors.Is(err, context.Canceled):
			// TODO: add debug logs
			return
		default:
			errMsg := fmt.Sprintf("error marshalling tile: %v", err)
			log.Error(errMsg)
			http.Error(w, errMsg, http.StatusInternalServerError)
			return
		}
	}

right? This solved the issue.

Update 1:

Now I'm left with occaisonal:

2022-01-08 11:41:34 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=xyz database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled)

:|

Update 2:
PostgreSQL logs:

database          | 2022-01-08 11:20:15.496 UTC [676] ERROR:  canceling statement due to user request
database          | 2022-01-08 11:20:15.496 UTC [676] STATEMENT:  SELECT ((SELECT ST_AsMVT(q,'all',4096,'geom','gid') AS data FROM (    SELECT ST_AsMVTGeom(geom, ST_MakeEnvelope(13.18359374816483,52.268157368527206,13.359374998140362,52.37559917141592,4326)) as geom,gid,id,name,description,type,ref_id as refId,rootzone_id as rootZoneId,start_at as startAt,end_at as endAt,created_at as createdAt,updated_at as updatedAt FROM geometries WHERE type in ('business', 'policy') AND geom && ST_MakeEnvelope(13.18359374816483,52.268157368527206,13.359374998140362,52.37559917141592,4326)) AS q)) AS data

for those cases

Update 3:

Executing a query by not passing the ctx gets rid of the error, but I do not yet have the necessary understanding as to why and its implication.

rows, err := p.pool.Query(context.Background(), sql)

@ARolek
Copy link
Member

ARolek commented Jan 10, 2022

Executing a query by not passing the ctx gets rid of the error, but I do not yet have the necessary understanding as to why and its implication.

Can you log the error type so we can see what pgx is returning?

log.Printf("error is of type %T, err: %v", err, err)

We might just need to add an additional check if pgx is not wrapping the context.Canceled error correctly.

@iwpnd
Copy link
Member Author

iwpnd commented Jan 11, 2022

Hey @ARolek, sorry for the late reply, I'm back on the grind after vacation.

2022-01-11 09:22:13 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled) (*pgconn.connectError)

@gdey
Copy link
Member

gdey commented Jan 12, 2022

Hey @ARolek, sorry for the late reply, I'm back on the grind after vacation.

2022-01-11 09:22:13 [ERROR] handle_map_layer_zxy.go:169: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled) (*pgconn.connectError)

Looking at this you may need to unwrap the error a bit more. It looks like the connectError properly has an Unwrap method, so one of the errors returned by the error being cased here is not wrapping the context.Cancelled error correctly. I cannot dig further as DialFunc is dynamic to the configuration.

@ARolek
Copy link
Member

ARolek commented Jan 12, 2022

I also found this related issue: jackc/pgx#933. I think what @gdey is saying might be correct, we're going to need to unwrap the errors from the pgx client and check if they're context canceled errors to catch this. Seems not ideal. If we don't handle it, I think we're going to see a lot of these context canceled errors in the logs.

@ARolek
Copy link
Member

ARolek commented Jan 12, 2022

I made a local update to wrap the errors in the messages with %w instead of %v. I'm still seeing a bunch of error logs though:

map.go:330: err fetching tile (z: 7, x: 23, y: 47) features: error running layer (ne_10m_land) SQL (SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, featurecla, min_zoom FROM ne_10m_land WHERE wkb_geometry && ST_MakeEnvelope(-1.2841420750122068e+07,5.00448511519043e+06,-1.2518550742690431e+07,5.32735512262207e+06,3857)): failed to connect to `host=<redacted> user=<redacted> database=natural_earth`: dial error (dial tcp 192.168.5.5:5432: operation was canceled)

"Operation was canceled" does not seem to be a context.Canceled error though. I'm also seeing:

2022/01/12 15:20:02 map.go:330: err fetching tile (z: 9, x: 86, y: 189) features: error running layer (ne_10m_admin_1_label_points) SQL (SELECT ST_AsBinary(wkb_geometry) AS geometry, ogc_fid, name, scalerank FROM ne_10m_admin_1_label_points WHERE wkb_geometry && ST_MakeEnvelope(-1.3307380874483643e+07,5.164697126453858e+06,-1.3226663372625733e+07,5.245414628311767e+06,3857)): failed to connect to `host=<redacted> user=<redacted> database=natural_earth`: failed SASL auth (write tcp 192.168.2.92:60731-><redacted>:5432: i/o timeout)

That does not happen on startup, but I believe as the pool is allocating more connections this error is showing up for each new connection being added to the pool. I'm still investigating but want to share my notes.

@iwpnd
Copy link
Member Author

iwpnd commented Jan 13, 2022

jackc/pgx#1109

Apparently we need to check rows.Err() after rows.Next(), maybe a pointer. Because:

Not all errors can be detected when Query returns. You must read the results of the query and then check rows.Err() after that.

Regarding the TLS connection. There's been a major change in v4 as to how fallback connections and TLS is handled. As I noted in my initial comment, I only skip the verification in most cases.

@ARolek
Copy link
Member

ARolek commented Jan 13, 2022

Apparently we need to check rows.Err() after rows.Next(), maybe a pointer. Because:

I have reviewed the code and we are doing this already. It's usually the last statement in the method:

return rows.Err()

@iwpnd
Copy link
Member Author

iwpnd commented Jan 14, 2022

Good morning @ARolek! I took a look at your changes and they were not sufficient to quiet the errors. Both the connection and context canceled was spammed continuously:

2022-01-14 08:24:53 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled
2022-01-14 08:24:53 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: failed to connect to `host=localhost user=samson database=geometries`: dial error (dial tcp 127.0.0.1:5433: operation was canceled)
2022-01-14 08:24:57 [ERROR] handle_map_layer_zxy.go:168: error marshalling tile: timeout: context canceled

I took your advice and did two things. First I made sure that i handle the error in dbconfig.AfterConnect properly. Next up, I took a look at the switch statement in handle_map_layer_xyz where the error was originating and use the same logic that you did in atlas/map.go which quieted the errors entirely. Can you please take a look at it too?

Why exactly is the context canceled here anyways, and what does it entail?

@ARolek
Copy link
Member

ARolek commented Jan 14, 2022

Why exactly is the context canceled here anyways, and what does it entail?

Good question. So context Canceleing is really important in this codebase as it accounts for the situation when users pan the map before a tile has been fully fetched, encoded and returned to the client. Without proper context handling, we're processing requests that are no longer needed.

@gdey gdey self-requested a review January 14, 2022 18:53
Comment on lines +169 to +170
case strings.Contains(err.Error(), "operation was canceled"):
// do nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to get away from this if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gdey yeah, I know, but we can't in this instance. I left a code comment around this situation in map.go:

// the underlying net.Dial function is not properly reporting
// context.Cancel errors. Becuase of this, a string check on the error is performed.
// there's an open issue for this and it appears it will be fixed eventually
// but for now we have this check to avoid unnecessary logs
// https://github.com/golang/go/issues/36208
case strings.Contains(err.Error(), "operation was canceled"):

// is not constant, so we lookup the OID once per provider and store it.
// Extensions have to be registered for every new connection.

if !hstore.hasInit {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! I like this hasInit check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pat yourself on the back for that, it was your idea 😄

provider/postgis/postgis.go Outdated Show resolved Hide resolved
@@ -201,38 +207,43 @@ func BuildDBConfig(cs string) (*pgxpool.Config, error) {
"application_name": "tegola",
}

var hasInit bool
var hstore hstoreOID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we persist this value outside of the AfterConnect call, then it will only be set on the first connection. If I understand you correctly, the hstore registration needs to happen for every connection. If that's the case, then we should move this var to inside the AfterConnect function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your concern.
The idea is that I only query the database once (for the first connection) for the OID of hstore, and save the OID value per provider in the hstore var.
Now on the next connection the query is skipped, however the datatype is registered with the OID of the hstore var.
If I were to add a log line into the if !hstore.hasInit I'd only see it once, yet registering the datatype happens over and over again.

Copy link
Member

@gdey gdey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Updating Postgres driver to pgx v4
* refactored providers/postgis to use the pgx4 client.
Support for Postgres versions > 12 is now possible.

* provider/postgis: Properly wrap errors in messages by
moving from using %v -> %w when returning errors
in messages.

* Added error string check for context.Canceled.
The underlying net.Dial function is not properly
reporting context.Cancel errors. Becuase of this,
a string check on the error is performed. There's
an open issue for this and it appears it will be
fixed eventually but for now we have this check
to avoid unnecessary logs.

Related issue: golang/go#36208

* added ctxErr() check thewill check if the supplied
context has an error (i.e. context canceled) and if so,
return that error, else return the supplied error.
This is useful as not all of Go's stdlib has adopted
error wrapping so context.Canceled errors are not always
easy to capture.

closes #748
@ARolek ARolek merged commit 2d6046c into go-spatial:v0.15.x Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants