This lightweight tool helps you get a sense of your application's schema, as well as any outliers to that schema. It is particularly useful when you inherit a codebase with a data dump and want to quickly learn how the data is structured. It is also useful for finding rare keys.
This project is a mod of Variety that suits my needs. Specifically, I wanted more precision on the percentage of each type that a given field takes across a collection. You can find an example of the output below. I haven't checked whether everything (in particular, more complex requests) still works, though.
We'll make a collection:
db.users.insert({name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!"});
db.users.insert({name: "Dick", bio: "I swordfight.", birthday: new Date("1974/03/14")});
db.users.insert({name: "Harry", pets: "egret", birthday: new Date("1984/03/14")});
db.users.insert({name: "Geneviève", bio: "Ça va?"});
db.users.insert({name: "Jim", someBinData: new BinData(2,"1234")});
So, let's see what we've got here:
$ mongo test --eval "var collection = 'users'" variety.js
...
{
"_id" : {
"key" : "pets"
},
"value" : [
{
"totalOccurrences" : 1,
"type" : "String",
"percentContaining" : 50
},
{
"totalOccurrences" : 1,
"type" : "Array",
"percentContaining" : 50
}
],
"totalOccurrencesOfField" : 2,
"percentContaining" : 40
}
("test" is the database containing the collection we are analyzing.)
Looks like the field "pets" is present in only 40% of the documents. Of the documents that have it, half store it as an Array and half as a String. Be careful when a field does not have the same type everywhere.
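To make the two percentages concrete, here is a minimal sketch in plain JavaScript that works on an in-memory array instead of a Mongo collection. The `typeOf` and `summarizeKey` helpers are hypothetical names made up for illustration, not part of variety.js, and the type detection is deliberately simplified (the BinData value is replaced by a plain string).

```javascript
// Simplified type detection; an assumption for this sketch, not the tool's own logic.
function typeOf(value) {
  if (Array.isArray(value)) return "Array";
  if (value instanceof Date) return "Date";
  if (value === null) return "null";
  const t = typeof value;
  return t.charAt(0).toUpperCase() + t.slice(1); // "string" -> "String"
}

// Hypothetical helper: summarize one top-level key across a list of documents.
function summarizeKey(docs, key) {
  const counts = new Map();
  let total = 0;
  for (const doc of docs) {
    if (!(key in doc)) continue;
    total += 1;
    const type = typeOf(doc[key]);
    counts.set(type, (counts.get(type) || 0) + 1);
  }
  // Inner percentage: share of each type among documents that have the key.
  const value = [...counts].map(([type, n]) => ({
    totalOccurrences: n,
    type,
    percentContaining: (100 * n) / total,
  }));
  // Outer percentage: share of all documents that have the key at all.
  return {
    _id: { key },
    value,
    totalOccurrencesOfField: total,
    percentContaining: (100 * total) / docs.length,
  };
}

const users = [
  { name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!" },
  { name: "Dick", bio: "I swordfight.", birthday: new Date("1974-03-14") },
  { name: "Harry", pets: "egret", birthday: new Date("1984-03-14") },
  { name: "Geneviève", bio: "Ça va?" },
  { name: "Jim", someBinData: "1234" }, // plain string standing in for BinData
];

console.log(JSON.stringify(summarizeKey(users, "pets"), null, 2));
```

The inner `percentContaining` is relative to the documents that contain the key, while the outer one is relative to the whole collection, which is why 50 and 40 can appear side by side.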
Results are stored for future use in a varietyResults database.
(The rest of the Readme is not up to date since I modified a few things in the behavior of Variety)
To follow the progress of a long-running analysis, tail the MongoDB log: Mongo provides a "percent complete" measurement for you. These operations can take a long time on huge collections.
Perhaps you have a really large collection, and you can't wait a whole day for Variety's results.
Perhaps you want to ignore a collection's oldest documents, and only see what the structure of its documents has looked like lately.
One can apply a "limit" constraint, which analyzes only the newest documents in a collection (unless sorting), like so:
$ mongo test --eval "var collection = 'users', limit = 1" variety.js
Let's examine the results closely:
{ "_id" : { "key" : "_id" }, "value" : { "type" : "ObjectId" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "name" }, "value" : { "type" : "String" }, "totalOccurrences" : 5, "percentContaining" : 100 }
{ "_id" : { "key" : "someBinData" }, "value" : { "type" : "BinData" }, "totalOccurrences" : 1, "percentContaining" : 20 }
We are only examining the last document here ("limit = 1"). It belongs to Jim, and only contains the _id, name and someBinData fields. So it makes sense these are the only three keys.
But how can totalOccurrences still reach 5? "limit" specifies how many documents to search for keys. Then, the tool calculates totalOccurrences and percentContaining from all the collection's documents, even those outside the "limit". This tradeoff is meant to give the most bang for our buck when using "limit" to learn about a collection.
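The two-step behaviour of "limit" can be sketched in plain JavaScript over an in-memory array. This is an illustration under stated assumptions, not the code in variety.js: `analyzeWithLimit` is a hypothetical helper, the array is assumed to be in insertion order (oldest first), `limit` is assumed to be a positive integer, and the `_id` values are simple numbers standing in for ObjectIds.

```javascript
// Hypothetical helper mirroring the "limit" behaviour described above.
function analyzeWithLimit(docs, limit) {
  // Step 1: discover keys from only the newest `limit` documents.
  const keys = new Set();
  for (const doc of docs.slice(-limit)) {
    for (const key of Object.keys(doc)) keys.add(key);
  }
  // Step 2: count each discovered key across the WHOLE collection.
  return [...keys].map((key) => {
    const totalOccurrences = docs.filter((doc) => key in doc).length;
    return {
      key,
      totalOccurrences,
      percentContaining: (100 * totalOccurrences) / docs.length,
    };
  });
}

const usersWithIds = [
  { _id: 1, name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"] },
  { _id: 2, name: "Dick", bio: "I swordfight." },
  { _id: 3, name: "Harry", pets: "egret" },
  { _id: 4, name: "Geneviève", bio: "Ça va?" },
  { _id: 5, name: "Jim", someBinData: "1234" }, // newest document
];

console.log(analyzeWithLimit(usersWithIds, 1));
```

With a limit of 1, only the newest document's three keys are reported, but `_id` and `name` are still counted in all five documents.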
Perhaps you have a deeply nested object structure, and you don't want to see more than a few levels deep in the analysis.
One can apply a "maxDepth" constraint, which limits how deep Variety will recursively search to find new objects.
db.users.insert({name:"Walter", someNestedObject:{a:{b:{c:{d:{e:1}}}}}});
The default will traverse all the way to the bottom of that structure:
$ mongo test --eval "var collection = 'users'" variety.js
...
{ "_id" : { "key" : "someNestedObject" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a.b" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a.b.c" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a.b.c.d" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a.b.c.d.e" }, "value" : { "types" : [ "Number" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
$ mongo test --eval "var collection = 'users', maxDepth = 3" variety.js
...
{ "_id" : { "key" : "someNestedObject" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
{ "_id" : { "key" : "someNestedObject.a.b" }, "value" : { "types" : [ "Object" ] }, "totalOccurrences" : 1, "percentContaining" : 16.66666666666666 }
As you can see, variety only traversed three levels deep.
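A depth-limited traversal of this kind might look like the sketch below. `collectKeys` is a hypothetical helper written for illustration and is not taken from variety.js; it assumes the top-level keys count as depth 1 and that arrays and dates are not descended into.

```javascript
// Hypothetical helper: list dotted key paths, descending at most `maxDepth` levels.
function collectKeys(value, maxDepth = Infinity, prefix = "", depth = 1, out = []) {
  if (depth > maxDepth) return out;
  for (const [key, child] of Object.entries(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    out.push(path);
    const isPlainObject =
      child !== null &&
      typeof child === "object" &&
      !Array.isArray(child) &&
      !(child instanceof Date);
    // Recurse only into nested objects, one level deeper each time.
    if (isPlainObject) collectKeys(child, maxDepth, path, depth + 1, out);
  }
  return out;
}

const walterDoc = { name: "Walter", someNestedObject: { a: { b: { c: { d: { e: 1 } } } } } };

console.log(collectKeys(walterDoc));    // every path, down to someNestedObject.a.b.c.d.e
console.log(collectKeys(walterDoc, 3)); // stops at someNestedObject.a.b
```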
Perhaps you have a large collection, or you only care about some subset of the documents.
One can apply a "query" constraint, which takes a standard Mongo query object, to filter the set of documents before analysis.
$ mongo test --eval "var collection = 'users', query = {'caredAbout':true}" variety.js
Perhaps you want to analyze a subset of documents sorted in an order other than creation order, say, for example, sorted by when documents were updated.
One can apply a "sort" constraint, which analyzes documents in the specified order like so:
$ mongo test --eval "var collection = 'users', sort = { updated_at : -1 }" variety.js
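To see how "query", "sort" and "limit" could combine to pick the documents that get analyzed, here is a rough in-memory sketch. It assumes the options are applied in the order filter, then sort, then limit; that order, the `selectDocs` helper, and the sample documents are assumptions for illustration, and only equality queries on a single sort field are handled.

```javascript
// Hypothetical helper: choose which documents to analyze from an in-memory array.
function selectDocs(docs, { query = {}, sort = {}, limit = docs.length } = {}) {
  // Keep documents whose fields equal every value in the query (equality only).
  const matches = docs.filter((doc) =>
    Object.entries(query).every(([key, expected]) => doc[key] === expected)
  );
  // Sort by the first field in `sort`; 1 means ascending, -1 descending.
  const sortEntry = Object.entries(sort)[0];
  if (sortEntry) {
    const [field, direction] = sortEntry;
    matches.sort((a, b) => (a[field] - b[field]) * direction);
  }
  return matches.slice(0, limit);
}

const sampleDocs = [
  { name: "a", caredAbout: true, updated_at: new Date("2014-01-01") },
  { name: "b", caredAbout: false, updated_at: new Date("2014-03-01") },
  { name: "c", caredAbout: true, updated_at: new Date("2014-02-01") },
];

// The most recently updated document among those we care about.
console.log(selectDocs(sampleDocs, { query: { caredAbout: true }, sort: { updated_at: -1 }, limit: 1 }));
```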
First of all, your father is a great guy. Moving on...
A Mongo collection does not enforce a predefined schema like a relational database table. Still, documents in real-world collections nearly always have large sections for which the format of the data is the same. In other words, there is a schema to the majority of collections; it's just enforced by the application, rather than by the database system. And this schema is allowed to be a bit fuzzy, in the same way that a given table column might not be required in all rows, but to a much greater degree of flexibility. So we examine what percent of documents in the collection contain a key, and we get a feel for, among other things, how crucial that key is to the proper functioning of the application.
Variety has absolutely no dependencies, except MongoDB. It is written in 100% JavaScript. (mongod's "noscripting" option must not be set to true, and 'strict mode' must be disabled.)
Please report any bugs and feature requests on the GitHub issue tracker. I will read all reports!
I accept pull requests from forks, and I am very grateful for contributions from folks.
- Wes Freeman ([software/chess blog](http://wes.skeweredrook.com))
- James Cropcho (original creator of Variety) ([Twitter](https://twitter.com/Cropcho))
Additional special thanks to Gaëtan Voyer-Perraul ([@gatesvp](https://twitter.com/#!/@gatesvp)) and Kristina Chodorow ([@kchodorow](https://twitter.com/#!/kchodorow)) for answering other people's questions about how to do this on Stack Overflow, thereby providing me with the initial seed of code which grew into this tool.
Many thanks also to Kyle Banker ([@Hwaet](https://twitter.com/#!/hwaet)) for writing an unusually good book on MongoDB, which has taught me everything I know about it so far.
I have every reason to believe this tool will not corrupt your data or harm your computer. But if I were you, I would not use it in a production environment.
Released by Maypop Inc, © 2012-2014, under the [MIT License](http://www.opensource.org/licenses/MIT).