Add process which produces a JSON cache of all public feeds #67
This would be really great. We should consider sharding it into 10-25 chunks to enable parallel download and incremental build of the map. It's important to enable gzip compression to speed loading; JSON consisting mostly of base-10 numbers is incredibly compressible. protobuf might be better overall, but zlib+json is really easy and probably good enough. If server-side CPU is a limitation for serving gzip, we can cache already-gzipped versions like we do for plume binary serving in PlumePGH.
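A rough sketch of what the pre-gzipping could look like on the Node side (the output directory and file names here are hypothetical, not existing ESDR paths):

```typescript
import { gzipSync } from "zlib";
import { writeFileSync } from "fs";

// Hypothetical output location under ESDR's public directory.
const CACHE_DIR = "public/cache";

// Write both plain and pre-gzipped versions of a cache file, so the
// web server can serve the .gz file directly instead of compressing
// on every request (or proxying to Node).
function writeCacheFile(name: string, data: unknown): void {
  const json = JSON.stringify(data);
  writeFileSync(`${CACHE_DIR}/${name}.json`, json);
  // Level 9: spend the CPU once at build time to minimize transfer size.
  writeFileSync(`${CACHE_DIR}/${name}.json.gz`, gzipSync(json, { level: 9 }));
}
```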
On Fri, Mar 25, 2022 at 12:45 PM Chris Bartley wrote:
Sites like environmentaldata.org suffer from painfully slow load times as they try to load all of ESDR's public feeds. It might be nice to have a JSON cache of all public feeds which gets updated regularly (every 1-5 minutes?) and contains some essential, but minimal info about each feed.
Perhaps the following:
- id
- name
- latitude
- longitude
- lastUpload / maxTimeSecs
- exposure?
A query like this is a decent start:
select productId,
       id,
       name,
       latitude,
       longitude,
       UNIX_TIMESTAMP(lastUpload) as lastUploadSecs,
       maxTimeSecs,
       deviceId
from Feeds
where isPublic = 1
order by productId, deviceId, id desc;
Ideas to consider:
- Store the JSON under ESDR's public directory, maybe in some subdirectory denoting it as a cache.
- Multiple versions, with differing amounts of info.
- Abbreviated field names in the interest of file size, OR using some more compact format such as an array of arrays.
- Group by product ID?
- Sort by productId asc, deviceId asc, feedId desc...so that the most recent feed for a device comes first?
- Also generate separate JSON files per product?
Possible JSON format:
{
"version" : 1,
"fields" : ["id", "name", "latitude", "longitude", "lastUploadSecs", "maxTimeSecs", "deviceId"],
"feedsByProductId" : {
"1" : [
[26087, "West Mifflin ACHD", 40.363144, -79.864837, 1576762626, 1576686600, 26017],
[59665, "Pittsburgh ACHD", 40.4656, -79.9611, 1648222891, 1648218600, 56652]
],
"8" : [
[4268, "CREATE Lab Test", 40.44382296127876, -79.94647309184074, 1635191877, 1635189738.36, 4260],
[4277, "Red Room", 40.34107763959465, -80.069620013237, 1484140498, 1484084287, 4265]
],
"9" : [
[4245, "Wifi Speck 6", 40.443738, -79.946481, 1422565308, 1422565433, 4230],
[4313, "Outdoors", 40.50156314996969, -80.06125688552856, 1432395167, 1431359910, 4291]
]
}
}
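To make the array-of-arrays idea concrete, a consumer could rehydrate each row against the shared "fields" header to get ordinary objects back. This is just an illustrative sketch; the type and function names are made up, not ESDR API:

```typescript
// Shape of the compact cache file proposed above.
interface FeedCache {
  version: number;
  fields: string[];
  feedsByProductId: Record<string, (string | number)[][]>;
}

// Rebuild { field: value } objects by zipping the "fields" header
// against each compact row.
function rehydrate(cache: FeedCache): Record<string, Record<string, string | number>[]> {
  const result: Record<string, Record<string, string | number>[]> = {};
  for (const [productId, rows] of Object.entries(cache.feedsByProductId)) {
    result[productId] = rows.map((row) => {
      const feed: Record<string, string | number> = {};
      cache.fields.forEach((name, i) => (feed[name] = row[i]));
      return feed;
    });
  }
  return result;
}
```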
Compression is on by default with all ESDR responses, but yes. I like the idea of pre-gzipping. We could also pre-gzip and configure things so that they get served by Apache without having to proxy to Node.
Suggestions for sharding strategies are welcome and appreciated. My first thought was to shard by product ID (because some projects care about only one or a select few products), but it gets kinda dumb quickly...almost half of the products have fewer than 10 feeds, many with only 1. We currently have 113,645 public feeds, but 90% of them are PurpleAir (that will drop once the conversion to purpleair_v2 is done). Maybe settle on a fixed number of shards--or a fixed max number of feeds per shard (5K? 10K?)--along with a "table of contents" JSON which tells you which shard(s) to fetch for which product ID. PurpleAir would be sharded into multiple shards, whereas some other single shard would contain everything other than the top three products (a total of 4338 feeds). A sketch of the table-of-contents idea is below.
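For concreteness, here's one way the "table of contents" could look; the format, file names, and per-shard cap are all hypothetical:

```typescript
// Hypothetical table-of-contents format: maps each product ID to the
// shard file(s) containing its feeds. Big products (e.g. PurpleAir)
// span multiple shards; small products share a single catch-all shard.
interface CacheToc {
  version: number;
  maxFeedsPerShard: number; // e.g. 10000
  shardsByProductId: Record<string, string[]>;
}

// Resolve which shard files a client must fetch for a set of products.
function shardsToFetch(toc: CacheToc, productIds: number[]): string[] {
  const files = new Set<string>();
  for (const id of productIds) {
    for (const file of toc.shardsByProductId[String(id)] ?? []) {
      files.add(file);
    }
  }
  return [...files];
}
```

A client would fetch the TOC once, then pull only the shard files it needs, in parallel.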