Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API to enumerate databases #31

Closed
inexorabletash opened this issue Aug 10, 2015 · 51 comments
Closed

API to enumerate databases #31

inexorabletash opened this issue Aug 10, 2015 · 51 comments

Comments

@inexorabletash
Copy link
Member

Moved from: https://www.w3.org/Bugs/Public/show_bug.cgi?id=16137

Discussions:

http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1528.html
http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1537.html

Critical bit: how do we do this such that the answer isn't obsolete as soon as it's given, since there are no transactions over the set-of-databases.

@inexorabletash
Copy link
Member Author

FWIW, based on early feedback when IDB was shiny and new, Chrome implemented/shipped a prefixed API for this:

partial interface IDBFactory {
  IDBRequest webkitGetDatabaseNames();
};

Result is a DOMStringList (see #28) of the names, purely a snapshot at the time the request is processed.

@aliams
Copy link

aliams commented Jul 6, 2017

I think we should keep this around given that it's useful for IDB libraries and diagnostic purposes. I recommend that we pursue a standardized non-prefixed version of this API e.g. getDatabaseNames().

@Gilead
Copy link

Gilead commented Jul 7, 2017

I reckon no attempt needs to be made to ensure the results remain valid after the function call returns values. I would suggest leaving it up to the application author to deal with it.

That said, a separate function like hasDatabase(name) would be useful, with the proviso that other open/create requests for the same name would be queued until after the current code finishes dealing with IndexedDB (I'm not familiar with the implementation enough to suggest a specific mechanism for this).

@pwnall
Copy link

pwnall commented Jul 27, 2017

@Gilead The only mechanism I can think of would be to hold onto a lock for the database, which is what you get with indexedDB.open.

@aliams
Copy link

aliams commented Jul 27, 2017

It could be treated just like another transaction (such as versionchange, readwrite, readonly) in the sense that no other transactions can execute while getDatabaseNames is running.

@pwnall
Copy link

pwnall commented Jul 27, 2017

@aliams Sorry if I misunderstood your suggestion, but merely serializing the execution of getDatabaseNames doesn't seem sufficient to me. Another transaction could sneak in between the getDatabaseNames transaction and a subsequent indexedDB.open transaction. This results in an unintuitive behavior where indexedDB.open would error out when given a database returned by getDatabaseNames, because a delete transaction snuck between getDatabaseNames and indexedDB.open.

@aliams
Copy link

aliams commented Jul 27, 2017

That's a fair point @pwnall. Are you thinking that after getDatabaseNames is called, a subsequent indexedDB.open on a database would release the previously set lock? Is that to say that you could only call getDatabaseNames before any .open is called?

@pwnall
Copy link

pwnall commented Jul 27, 2017

@aliams Thank you for the clarifying question! I was (unclearly) trying to point out that an implementation of @Gilead's hasDatabase() suggestion would require the same locking as indexedDB.open, so applications might as well use indexedDB.open instead.

I think it makes sense to specify getDatabaseNames without any consistency guarantees. Specifically, I'd only guarantee that at a quiescent time (there is no IndexedDB connection to any database in the current origin), the result would be consistent with all the successful open / delete requests. Otherwise, the database list might contain databases that were already deleted and might miss databases that were created, since the last quiescent time. The latter condition aims formalize the idea that an implementation will do its best to make the result useful, but we could go for something weaker/stronger if it's easier to test.

@aliams
Copy link

aliams commented Jul 27, 2017

That makes sense to me! I agree that getDatabaseNames shouldn't need to have strong guarantees.

@jancona
Copy link

jancona commented Aug 12, 2017

This issue is for a Chrome extension that is an administrative interface for PouchDB (which is implemented on top of IndexedDB). The removal of webkitGetDatabaseNames() has broken it, because it needs to be able to find a database to administer.

Note that this use case doesn't require any consistency guarantees. FWIW, most relational databases I've worked with don't provide transactional guarantees for these sorts of functions.

@beidson
Copy link

beidson commented Aug 12, 2017

WebKit would be happy to implement getDatabaseNames() as long as there are no consistency guarantees.

@pwnall
Copy link

pwnall commented Aug 15, 2017

@aliams Can your comments above be interpreted to mean that Edge would be interested in implementing as well?

@bevis-tseng Any thoughts on Firefox interest in a standardized getDatabaseNames?

@aliams
Copy link

aliams commented Aug 16, 2017

@pwnall, yes!

@bevis-tseng
Copy link

Shall we at least has some guarantee on getDatabaseNames that an error will be fired or the request will be blocked if there is any versionChange transaction ongoing?
In addition, because there is no guarantee that there won't be another IDBOpenRequest coming right after a getDatabaseNames request which upgrades or deletes a database, it seems more helpful for the developers to validate the result if the versions are returned along with the names in getDatabaseNames.

@aliams
Copy link

aliams commented Aug 17, 2017

I think making it so that it would be blocked on version change transactions makes sense. Would it make sense to expose an array of IDBDatabase objects here? That way we can get the name, the version, and the object store names as well.

@pwnall
Copy link

pwnall commented Aug 17, 2017

It seems like we have 3 proposals.

  1. One extreme: the simplest solution for the use cases we've seen.
partial interface IDBFactory {
  IDBRequest getDatabaseNames();
}

The IDBRequest result here is a list of names (sequence<DOMString> or DOMStringList, I'm assuming?).

  1. The other extreme: full database objects.
partial interface IDBFactory {
  IDBRequest getDatabases();
}

The IDBRequest result here is a sequence<IDBDatabase> (whose connection is considered to be closed?)

  1. A middle ground: dictionaries with metadata. Would currently have the name, or the name + database version.
partial interface IDBFactory {
  IDBRequest getDatabasesInfo();
}

The IDBRequest result here is a sequence<IDBDatabaseInfo>, where:

dictionary IDBDatabaseInfo {
  DOMString name;
  unsigned long long version;
}

For the third alternative, given that the use cases we've seen only need names, we have the option to start out with a dictionary whose only attribute is name, and extend the dictionary later.

The first alternative seems to give more flexibility for implementations. For example, using my proposed weak guarantees, an implementation would be free to block on upgradeneeded, to return a list that doesn't include the database being upgraded, or to return a list that does return the database being upgraded. The other 2 versions would definitely need to block on upgradeneeded.

@aliams If I understand correctly, you prefer the 2nd alternative. What are your thoughts on the 3rd?

@beidson You mentioned you'd be willing to implement the first alternative. What are your thoughts on the newer alternatives? Would they add a significant amount of burden to the implementation work?

@bevis-tseng Do you prefer the 2nd or 3rd alternative?

I'll find time soon to prototype the alternatives and estimate the amount of work, but I'm willing to implement whichever version we agree on. Thank you very much for your thoughts!

(Also, I'm not attached in any way to the names in the 2nd and 3rd proposal, please don't be shy if you have other ideas!)

@beidson
Copy link

beidson commented Aug 17, 2017

I don’t think proposals 2-3 have to block on upgradeneeded any more than #1 does.

Thats what I mean about no consistency guarantee - this is just a snapshot in time, has no guarantee of still being true after the IDBRequest is resolved.

There’s no transaction for this, so the moment the results return they might be invalid by the time you try to use them.

And that’s what we’re fine with.

@bevis-tseng
Copy link

I'd prefer proposal#2.

Regarding "queue a task" in #1 , I think this IDBRequest is very special and is different from all the other IDBRequest which could only be blocked by other requests bound to a specific IDBDatabase connection.
Hence, it is reasonable that, in an implementation, there are multiple queues and each one is only bound to an IDBDatabase connection.
However, for this getDatabase IDBRequest, it's an request in per-origin scope instead of per IDBDatabase which breaks this assumption.
Hence, "queue a task" is not clear enough for this case.
We should at least ensure that there is no version change transaction blocking while running this getDatabase operation, don't we?

@beidson
Copy link

beidson commented Aug 21, 2017

We should at least ensure that there is no version change transaction blocking while running this getDatabase operation, don't we?

I don't think we do...?

I'm only familiar with WebKit's implementation and in that impl this doesn't concern me. I still don't see how others can have implemented things such that they do have a huge concern here.

Here's what we do.

No matter how many unique databases are open with an active request queue, we're still piping everything through only a handful of threads all in one process.

Handling this "getAllDatabases" request will hijack the database i/o thread for the period of time it takes to tally all of the results. During that period of time literally no other database operations will happen (version change or not).

It doesn't matter if any version change transactions are in progress or not. The "getAllDatabases" operation will read the non-committed truth from disk for all in-scope databases.

If you have a database at version 1 and it has an open transaction that will be upgrading it to version 2, it will definitely and deterministically be returned as version 1 in the getAllDatabases call.

Does anybody else implement things so differently that this isn't automatically the case, or at least wouldn't be easy to make happen...?

@beidson
Copy link

beidson commented Aug 21, 2017

An obvious difference in implementation I could see would be if web content processes all each hit the database files on disk themselves and used the application process as the gatekeeper for event dispatching to other connections, etc.

In that style of implementation there's a lot more variability in what happens "If a getAllDatabases call is made while a version change transaction is in progress"

But even in WebKit's implementation, the version change that was in progress before getAllDatabases was issued might complete before the getAllDatabases call actually executes.

It's racy, it's non-deterministic, and I can't think of a use-case of this API where that's not fine.

@bevis-tseng
Copy link

Is there more use case study before adding this API in a rush?
If a snapshot of all database names is what developers want, it is fine that this shouldn't be blocked.
However, if more operations on the returned objects are expected, than more guarantee is needed to support these operations.

@aliams
Copy link

aliams commented Aug 21, 2017

@nolanlawson, any thoughts here from a dev's perspective about the getDatabaseNames design?

@pwnall
Copy link

pwnall commented Aug 21, 2017

I'm starting to think that I made a mistake by reusing the IDBRequest mechanism, and the API would be better served by a Promise. I'd rather not invoke the algorithms related to queuing an IDB request, so we have complete freedom in specifying (the lack of) ordering guarantees.

@beidson -- On your question about why implementations would need to block, I can imagine an implementation where all the IDB metadata for an origin lives in a single table. versionchange would require a write lock on the table, and getAllDatabases would require a read lock on the table.

Can we agree to spec the method in such a way that an implementation that needs to block on versionchange (i.e. not respond until the versionchange completes) can do so, and an implementation that doesn't need to block is free to return inconsistent data? If I'm understanding correctly, it seems like Firefox would fall in the first category, and WebKit + Blink would fall in the second category.

I think we'll need some sort of guarantees, to be able to write WPT tests for this feature. My proposal (roughly the same as before), which I think is a subset of what WebKit offers, is:

  1. If getAllDatabases is called at a quiet time (when no transactions are active) for the origin, it will see the result of all transactions that occurred before the call.
  2. Otherwise, getAllDatabases will get a result that may reflect any subset of committed or uncommitted transactions since the last quiet time for the origin.

@bevis-tseng -- Regarding your comment about use cases and rushing, I'll share my thoughts, and end with a question.

I think we had webkitGetDatabaseNames in Blink and WebKit for a few years now. Based on the comments we got when removing it, developers were using it for debugging (list all databases) and for cleanup (remove all databases for the origin that don't match the expected state). Chrome's DevTools also uses the same API internally for IndexedDB inspection. For whatever it's worth, I don't recall receiving a single developer complaint about the lack of consistency guarantees.

Are there any use cases you'd like to explore, other than the two listed above? Is there any other information that make you comfortable reaching a consensus about this API?

Thank you very much for your comments, everyone! IndexedDB is subtle, and thinking about it can be draining. I am grateful for your replies!

@bevis-tseng
Copy link

If this API is only use for debugging or removing(it makes not much sense to open all the IDBDatabase obects at once), it seems over-designed to make too much guarantee for the consistency by returning all IDBDatabase objects.
(The inconsistency could be handled later when opening/delete an un-existed database.)
In addition, the Promise makes more sense here for GetDatabaseNames().
One minor question is that
Is the version of an IDBDatabase useful for debugging? We could include it if it's useful for debugging.

@beidson
Copy link

beidson commented Aug 21, 2017

On your question about why implementations would need to block, I can imagine an implementation where all the IDB metadata for an origin lives in a single table. versionchange would require a write lock on the table, and getAllDatabases would require a read lock on the table.

Then such implementations could not resolve the getAllDatabases() request until the version change transaction(s) are finished, but the other implementations could. That's fine.

Can we agree to spec the method in such a way that an implementation that needs to block on versionchange (i.e. not respond until the versionchange completes) can do so, and an implementation that doesn't need to block is free to return inconsistent data?

This is what I'm voting for.

As an aside, "inconsistent data" is being thrown around a lot here. I'm thinking the term is wrong for this. It's "racy data" in all implementations, in that whatever is returned by the getAllDatabases call might be partially (or fully) out of date by the time it is received, and even more out of date by the time it is used.

And, again, we're fine with that.

Any attempt to guarantee that the results of getAllDatabases() somehow lock all of the relevant databases from use by all other web contexts, therefore getting rid of the raciness... Is something that would be overtly complex, over engineered, and of dubious value.
(IMHO)

@beidson
Copy link

beidson commented Aug 22, 2017

I prefer getAllDatabases
It just reads better, it's more natural.
"getDatabaseInfo" is bizarre as it's singular.
And "getDatabaseInfos" is just terrible.

Also support a Promise over and IDBRequest.

I'm pretty sure:

partial interface IDBFactory {
    Promise<IDBDatabaseInfo> getDatabasesInfo();
}

Should be:

partial interface IDBFactory {
    Promise<Sequence<IDBDatabaseInfo>> getAllDatabases();
}

@pwnall
Copy link

pwnall commented Aug 22, 2017

Thank you very much for the IDL correction! Updated 😄

@mike-north
Copy link

I've never seen any use case for webkitGetDatabaseNames except for a browser plugin (e.g. the PouchDB Inspector). In terms of consistency guarantees I'd say it's not super important, since it's only needed for debugging.

One potential use case from LinkedIn
As we build out more PWA stuff, there's a need to build up a family of "kill switches", one of which is designed to address situations where corrupt/unsafe data could have been stored on user agents. Essentially we want some of the functionality described in the Clear-Site-Data proposal.

There's a very real possibility that teams may have widely varying database names, and it's helpful to be able to wipe out any potentially harmful data across ALL databases (and AFAIK there's no path forward without the ability to enumerate over db names).

@pwnall
Copy link

pwnall commented Aug 24, 2017

@mike-north Thank you very much for the feedback!

everyone: I intend to put together an explainer and a prototype implementation for Chrome, then ask TAG for a review. Please comment soon if you have any objections to the current API shape, which is:

partial interface IDBFactory {
  Promise<sequence<IDBDatabase>> getAllDatabases();
}

dictionary IDBDatabaseInfo {
  DOMString name;
  unsigned long long version;
}

@aliams
Copy link

aliams commented Aug 24, 2017

That looks good to me! I was just wondering if making it return a promise would seem inconsistent with how the rest of the spec returns IDBRequest objects. Any thoughts on that?

@pwnall
Copy link

pwnall commented Aug 24, 2017

@aliams I agree that returning a Promise is inconsistent with the rest of the IndexedDB API. I proposed the Promise approach to avoid the queueing machinery associated with IDBRequest. If we used IDBRequest we'd have a null source and transaction, and might induce incorrect expectations of ordering.

If I understand correctly, @beidson and @bevis-tseng are in favor of returning a Promise. I am happy to implement either approach, and slightly favor the usability of the Promise approach. Are you opposed to the Promise approach, or just noting the inconsistency?

Thank you for your thoughts!

@aliams
Copy link

aliams commented Aug 24, 2017

I think I would prefer being consistent here from a developer's expectation standpoint, but I wouldn't want us to block on it.

@beidson
Copy link

beidson commented Aug 24, 2017

Promises weren't around when IDB1 was done.
Promises are the future, and this API is currently in the future.

Developers have never had this API before, so introducing it will result in all new code, so there can be no previous assumption that it ever worked like an IDBRequest.

Additionally, IDBRequests are about just that - Indexed Database Requests. Their entire construction is based around a query to a unique instance of an Indexed Database, whereas this call is about querying the state of all indexed databases.

IDBRequest doesn't fit.

@aliams
Copy link

aliams commented Aug 24, 2017

Fair enough - are there any other objections with the proposal from @pwnall above?

@inexorabletash
Copy link
Member Author

SGTM.

Only thought is to simplify the name further to just indexedDB.databases() — analogous with caches.keys() — but that's probably a good thing for the TAG to weigh in on.

@beidson
Copy link

beidson commented Aug 25, 2017

That sgtm too

@lakano
Copy link

lakano commented Oct 24, 2017

Hello!

TLDR: This new feature will break some security design and privacy. To prevent this, I suggest to add an option to hide/unhide a database from your getAllDatabases() function.

Long story:

The publication of the API (8th Jan 2015) don't permit to list databases. Google Chrome have also removed their webkitGetDatabaseNames() to respect your initial API ( https://www.chromestatus.com/features/5725741740195840 ).

A example where it's possible to guess if a user is registered on some websites:

Imagine a family where the boy is gay and registered on a gay website with the family computer. If the website want to store user's private information in a IndexedDB, but secretly with a database name from a derived key of the login/password of the user. With your new feature, it's possible for the father with little JavaScript knowledge to list the databases and to discover that his son is registered on this gay website.

The PWA example

There is a big buzz around PWA, and if it's the future, this mean all websites will store more data inside IndexedDB to prevent offline problem. So, if the website is well designed, each user on a device will have a separate database with random names. With your new feature, it's possible for a JS Coder / XSS attack to connect to theses private database to steal theses information.

An example of security design:

WebCrypto API permit to generate keys, to prevent extraction and to store directly the CryptoKey inside IndexedDB. This is really useful, but this need to be stored inside a unguessable random database name. This random name is only known by the server and sent to the client if the login/password are correct. With your new feature, this break this security design and make WebCrypto sign/verify features useless.

Please, to not break security and privacy, could you add an option to hide/unhide a database from your getAllDatabases() function?

@beidson
Copy link

beidson commented Oct 24, 2017

Google Chrome have also removed their webkitGetDatabaseNames()

Their reason for removing it is quite clearly spelled out at that link and it has nothing to do with security/privacy.

Imagine a family where the boy is gay and registered on a gay website with the family computer. If the website want to store user's private information in a IndexedDB, but secretly with a database name from a derived key of the login/password of the user. With your new feature, it's possible for the father with little JavaScript knowledge to list the databases and to discover that his son is registered on this gay website.

This is pretty much FUD.

Let's say the son registers at "https://www.gaybook.com/" and it stores some information in a local IndexedDB.

Every browser already lets you natively list which websites have data stored on your local machine, and "https://www.gaybook.com/" would show up in that list already.

If this API is added, it doesn't really change the situation. The father now could know ahead of time to open a tab to "https://www.gaybook.com/" so he's executing javascript in the "https://www.gaybook.com/" origin and list all the databases... but he already had the native ability to list databases in the browser.

Any time an attacker has physical access to the machine you've already lost. The only protection for this concern is to use different user accounts, encrypted home directories, etc etc.

There is a big buzz around PWA, and if it's the future, this mean all websites will store more data inside IndexedDB to prevent offline problem. So, if the website is well designed, each user on a device will have a separate database with random names. With your new feature, it's possible for a JS Coder / XSS attack to connect to theses private database to steal theses information.

The bug in this scenario is the XSS attack itself, not the DB feature.

If the website is well enough designed to have truly unguessable names for each user's DB (probably not!), then it's probably also well enough designed to not have this XSS vulnerability.

In the presence of the XSS attack, the user's data is already vulnerable whether or not this convenience method exists.

WebCrypto API permit to generate keys, to prevent extraction and to store directly the CryptoKey inside IndexedDB. This is really useful, but this need to be stored inside a unguessable random database name. This random name is only known by the server and sent to the client if the login/password are correct. With your new feature, this break this security design and make WebCrypto sign/verify features useless.

Again, you are assuming an XSS vulnerability here. You are also assuming "security by obscurity" which should not be a technique used by any truly secure website.


Let's define for a moment that XSS-vulnerable web sites are broken.
Do you have any concerns for attacks that this API would truly enable on a non-broken website, and not simply make a little more convenient?

@pwnall
Copy link

pwnall commented Oct 24, 2017

@lakano Thank you for your perspective!

The threat model for IndexedDB (and other local storage primitives) is that each origin is a principal, so an origin has full access to all the data that it has written. None of the primitives is designed with the thought of having multiple principals in the same origin.

In your first example, the parent can open Dev Tools in any of the major browsers and see all the databases for an origin. The parent can probably look through the browser's history as well.

For your second example, it's really difficult (if not impossible) to design a Web application that supports having mutually distrusting users authenticated at the same time. Again, Dev Tools can most likely be used to extract information belonging to other users. Each user should have at least a separate browser profile. Ideally, users should be separated at the operating system level, so they get separate (optionally encrypted) home directories.

I hope this helps.

@lakano
Copy link

lakano commented Oct 24, 2017

@beidson You're right about XSS attacks, but you surely known there is a lot of websites that are not fully protected by XSS attacks. If we forget XSS attacks, there is still possibility for a JS developer to extract database manually because of this new feature, isn't it?

@lakano
Copy link

lakano commented Oct 24, 2017

@pwnall Ok, your right it's possible to see them from Dev Tools, good point! So it's not secure to store data inside IndexedDB, and PWA should not store any private information (this make them less useful BTW).

@lakano
Copy link

lakano commented Oct 24, 2017

So @beidson / @pwnall : You're right, the Dev Tools already permit to read the database content. Surely the solution is to store the data encrypted, and the CryptoKey wrapped. Thanks for your feedback !

@mike-north
Copy link

For your second example, it's really difficult (if not impossible) to design a Web application that supports having mutually distrusting users authenticated at the same time. Again, Dev Tools can most likely be used to extract information belonging to other users. Each user should have at least a separate browser profile. Ideally, users should be separated at the operating system level, so they get separate (optionally encrypted) home directories.

This brings to mind another use case we have at LinkedIn. If a user explicitly logs out, we would like the ability to reliably enumerate over and destroy all databases, getting rid of potentially private data. Of course we could attempt to "remember" all DB names and enumerate over that list, but that's a human process that could potentially fail.

@pwnall
Copy link

pwnall commented Oct 24, 2017

@mike-north This is definitely something that should be doable. You should (eventually) be able to use the Async Cookies API in a Service Worker and enumerate+clear the databases. You should also (eventually) be able to use an HTTP header to clear the origin's data.

@inexorabletash inexorabletash added this to the V3 milestone Feb 7, 2018
@inexorabletash
Copy link
Member Author

I tossed a spec PR up at #240 for this. Needs tests and an implementation.

@inexorabletash
Copy link
Member Author

Implementation in Chrome behind a flag

Test cases: https://github.com/web-platform-tests/wpt/blob/master/IndexedDB/get-databases.any.js

@asutherland
Copy link
Collaborator

Firefox Gecko implementation will happen on https://bugzilla.mozilla.org/show_bug.cgi?id=934640

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests