Google Datastore and the shift from an RDBMS

Posted by ben Sun, 13 Apr 2008 23:23:47 GMT

There are already so many random musings and theories about Google App Engine that I won’t bother adding my own, except to mention that Ian Bicking put together instructions for running Pylons on it. These also work fine with the latest Pylons 0.9.7 beta.

I got Beaker, the session and caching WSGI middleware that Pylons uses, running fine on Google now, using Google Datastore as the backend. Digging through the Datastore docs to work out the best way to implement it shed some light on the transition that any developer thinking about writing data-backed apps for GAE (Google App Engine) will need to tackle.

Some notes on terminology: Google has Entities, Kinds, and Properties. These correspond roughly to Rows, Tables, and Columns in RDBMS-speak. Kinds can also be called classes, because in the Python API, you create a class and inherit from the appropriate datastore class. Entities may also be referred to as instances, since performing a query returns a list of objects (instances).

Sessions and Datastore

First, regarding sessions. Beaker will now let a Pylons app use normal sessions on GAE. The real question is: should you?

The Google User API makes it trivial to get the currently logged-in user, and the Datastore comes with a property type (for use in a ‘table’) made specifically to hold a reference to a Google user account. So with just one short command, you can fetch the entity from the Datastore that corresponds to a given user, i.e.:

userpref = UserPrefs.all().filter('user =', users.get_current_user()).get()

The Datastore is blindingly fast for reads and queries, so there’s a compelling reason to ignore sessions altogether and just fetch the appropriate preferences or what-have-you. That leaves the usual reason people want something more, i.e. a session: “But wait, I want to stash other little things with the user when they run around my app!” Not a problem.

Google’s Datastore has an Expando class for entities that lets you dynamically add properties of various types. It’s like having an RDBMS where you can just add columns to each row on the fly. The dynamic_properties() entity method makes it easy, after pulling an object, to see which dynamic properties were already assigned.
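To make the idea concrete without needing the App Engine SDK, here is a minimal plain-Python sketch of the behaviour described above. The ToyExpando class, its declared tuple, and the sample values are made up for illustration; only the dynamic_properties() method name mirrors the real API. The real Expando persists to Datastore rather than to instance attributes, and the sorting here is my own addition so the result is predictable.

```python
class ToyExpando:
    """In-memory stand-in for a Datastore Expando entity (not the real API)."""

    # Properties declared up front, like fixed columns (hypothetical example).
    declared = ("user",)

    def __init__(self, **fixed):
        for name in self.declared:
            setattr(self, name, fixed.get(name))

    def dynamic_properties(self):
        """Return the names of attributes added after creation, sorted."""
        return sorted(
            name for name in vars(self)
            if name not in self.declared and not name.startswith("_")
        )


prefs = ToyExpando(user="ben@example.com")
prefs.last_page = "/wiki/FrontPage"   # added on the fly, no schema change
prefs.items_per_page = 25

print(prefs.dynamic_properties())
```

The point of the sketch is only that each instance can carry its own extra ‘columns’, and that you can ask an instance which extras it has.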

As far as I’m concerned, this pretty much removes the need for a session system. If you didn’t want to require user login, you could always make a little session ID yourself, keep it on the UserPrefs table as a separate property, and then query on that.
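A rough sketch of that home-made session ID, assuming a random token handed to the visitor in a cookie. The PREFS_BY_SESSION dict is a stand-in for querying a session-ID property on the UserPrefs kind, and the helper names are hypothetical. Nothing here touches Datastore.

```python
import secrets

# Stand-in for the UserPrefs kind: maps a session ID to that visitor's data.
# In Datastore this would be a query on a session-ID property instead.
PREFS_BY_SESSION = {}


def new_session_id():
    """Return an unguessable ID suitable for a cookie value."""
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def prefs_for(session_id=None):
    """Fetch the prefs for a session ID, creating a fresh record if needed."""
    if session_id is None or session_id not in PREFS_BY_SESSION:
        session_id = new_session_id()
        PREFS_BY_SESSION[session_id] = {}
    return session_id, PREFS_BY_SESSION[session_id]


sid, prefs = prefs_for()          # first visit: no cookie yet
prefs["theme"] = "dark"
same_sid, again = prefs_for(sid)  # later request presenting the cookie
```

An unknown or missing ID simply gets a new record, so a stale cookie never raises an error.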

Rethinking how you store/query/insert data

Going slowly through all the Datastore docs, and especially reading some of the performance information people were drumming up on the GAE mail list, brought up a number of issues with how people with RDBMS backgrounds approach Datastore. Many of the table layouts I saw pasted on the mail list were clearly written for how an RDBMS works, and sometimes needed significant work to adapt them to Datastore.

A little background might help explain this difference. Google Datastore is implemented on top of BigTable, which is described briefly in the paper as a “sparse, distributed, persistent multi-dimensional sorted map”. One of the other descriptions I heard in a talk on data storage techniques at FOO Camp from a Google developer was, “think of a BigTable table as a spreadsheet, except with pretty much as many columns as you want”.

This brings about a fairly big shift in thinking for the developer who grew up on an RDBMS. The fairly normalized organization of data written without regard to massively distributed data stores suddenly becomes a rather big problem. Consider a few of the ‘limitations’ of Datastore that will jump right out at you:

  • You cannot query across relations
  • You cannot retrieve more than 1000 rows in a query
  • Writes are much, much slower than you’re used to (a developer on the mail list said 50 inserts with 2 fields each almost ate up the 3 seconds allowed for a web request)
  • There are zero database functions available
  • There is no “GROUP BY…”, which doesn’t matter much if you read the prior bullet point
  • Transactions can only be wrapped around entities in the same entity group (i.e., the same section of the distributed database)
  • Referential integrity only sort of exists
  • No triggers, no views, no constraints
  • No GIS Polygon types, or anything beyond just a GeoPoint (Odd, considering that Google has so much mapping stuff)
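With no database functions and no GROUP BY, any aggregation has to happen in application code after the query returns. A minimal sketch of what that looks like, using plain dicts with made-up data in place of entities:

```python
from collections import Counter

# Plain dicts standing in for entities returned by a query (made-up data).
posts = [
    {"author": "ben", "title": "Datastore notes"},
    {"author": "ian", "title": "Pylons on GAE"},
    {"author": "ben", "title": "Beaker on GAE"},
]

# Rough equivalent of: SELECT author, COUNT(*) FROM posts GROUP BY author
posts_per_author = Counter(post["author"] for post in posts)

for author, count in sorted(posts_per_author.items()):
    print(author, count)
```

Keep in mind the 1000-row cap above: counting this way only sees what a single query is allowed to return.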

Then of course, a few of the new things that might leave you scratching your head, quite happy, or both:

  • Keys for an entity may have ancestors (ancestors aren’t relations, they’re different and have to do with Entity Groups, which determine what you can do in a transaction, wheeee!)
  • An Entity Group doesn’t have to all be of the same Kind; it’s more of an instruction to Datastore to keep these near each other when distributed
  • Keys can be made before the entity, just so you can make descendant entities of the key, then make the ancestor
  • The handy ListProperty, when used in a query, will let you use the conditional argument and apply it to every item in the list (sort of like an uber ‘IN (...)’ query, except it can also find all the data where a member in the list was <, >, or = to something else)
  • Making more Entity groups is a good idea when you frequently need a batch of “these few things” for a request, especially if you need to alter them all at once in a transaction
  • Normalizing is frequently bad since you can’t query across relations, and dynamic properties make it easy to heavily denormalize. If you do normalize some data and it’s for the same batch of ‘things you always need at once’, use Entity groups. Or use a ReferenceProperty if it’s merely something related you may occasionally hit.
  • The ReferenceProperty() does not have to refer to a known kind; if no kind is specified when declaring the ReferenceProperty, you can decide on the fly what datastore classes to reference
  • Many to Many relations aren’t what you think: now you could have a ListProperty() of keys (the list counterpart of a ReferenceProperty()), which may or may not all refer to instances of the same class
  • A query may return entities of different kinds, if querying for entities of a given ancestor
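The ListProperty behaviour in the list above is the one that surprised me most, so here is a plain-Python sketch of the matching rule as I understand it: an entity matches when at least one item in its list satisfies the condition. The filter_list_property helper and the sample data are hypothetical, and this only models the rule, not the Datastore query API.

```python
import operator

# Plain dicts standing in for entities with a list-valued property.
articles = [
    {"title": "Beaker on GAE", "tags": ["pylons", "gae"]},
    {"title": "Capistrano envy", "tags": ["deployment"]},
    {"title": "Sphinx", "tags": ["docs", "pylons"]},
]

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def filter_list_property(entities, prop, op, value):
    """Keep entities where at least one list item satisfies the condition."""
    compare = OPS[op]
    return [e for e in entities if any(compare(item, value) for item in e[prop])]


tagged = filter_list_property(articles, "tags", "=", "pylons")
print([a["title"] for a in tagged])
```

This is why a list of tags (or keys) can stand in for a join table: the membership test is the query.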

(There’s probably a bunch more as well, these were some of the obvious ones that jumped out at me)

The end result of this is that the standard way a developer writes out the table schema for an RDBMS should be dumped almost entirely when considering an app using Google Datastore. Storing data and using Google Datastore isn’t difficult, but it is a pretty hefty paradigm shift, especially if you’ve never left RDBMS-land. This is not a trivial change to make in approaching your data.

I rather enjoyed working with these new ways of tackling data, and the possibilities opened up by the ways it lets me store and refer to data go beyond the traditional RDBMS. In the short term though, I doubt I’ll be making any GAE apps until there’s an alternative implementation that’s production ready… I just can’t handle the lock-in.

And of course, please note any corrections or inaccuracies in the comments.

Where's the Capistrano knock-off for us Python web devs?

Posted by ben Thu, 10 Apr 2008 01:27:24 GMT

Rails, and Ruby in general, has had Capistrano for a while now to help with the task of deployment and automating builds for servers, and even clusters of servers. Where is something like this for Python?

Now, before people note that I could easily use Capistrano for my Python project, I should note that it is rather annoying having to install yet another language. On the other hand, given that I will likely only need to install it on my development machine (which running OSX already has Ruby… and gems), it doesn’t seem too horrible to just use Capistrano and be done with it.

However, Capistrano doesn’t quite manage Python eggs, and the task isn’t exactly trivial. zc.buildout, which I previously ranted about due to its odd docs, does the management pretty well. It even results in a rather consistent build experience no matter where it occurs. Two commands, and boom, the app is ready to go.

Unfortunately, life isn’t quite that easy. When something does go wrong with buildout, trying to track it down can be exceptionally hairy. Having a tool as ‘magical’ as I’ve heard some describe it carries its own penalties when things fail. Buildout also fails to automate the task of deploying the app itself to the other machine, which is still a manual process. It does manage eggs rather well, though it does some very odd mangling of sys.path in every script to accomplish this.

I don’t need something as full featured as Capistrano, but I’d love to see something that has no more requirements than I’m already depending on (Python), that can handle the task of easily automating deployment of a Python application – including ensuring all the proper versions of the eggs I want are used – on a remote *nix machine. I recall seeing a post (I think by Jeff Rush) a while back, on a system just like this that he unfortunately never released. Vellum also looks like it could be hacked further to do this task…

Is there some build/deployment tool that is just Python that I’ve missed? Something that will let me set up a script with some commands for deploying my app on another server and setting up the webapp (hopefully in a virtualenv) so it’s ready-to-run (and optionally restart it/migrate the db/etc :)?

MarkMail now indexing Pylons-discuss/devel

Posted by ben Wed, 09 Apr 2008 16:55:57 GMT

I’m thrilled to announce that MarkMail now indexes the Pylons-discuss and Pylons-devel mail lists. For those looking for a great way to search and browse the Pylons mail lists, the MarkMail interface is top-notch.

For those looking for detailed Pylons docs… there are some very exciting developments on the documentation front coming up shortly that will make you rather happy. :)

Sacrificing readability for automated doc tests

Posted by ben Fri, 04 Apr 2008 23:19:23 GMT

I’ve tried several times in the past to use zc.buildout, a fairly neat sounding package that automates the buildout process for a Python app. The promise of fairly easy to write recipes that can set up external processes like nginx, in addition to ensuring my webapp is put together with all the things it needs, sounded great.

It occurred to me that the docs definitely didn’t help at all. In fact, they’re noticeably bizarre unless you actually realize why they’re written the way they are. Here’s a sample of the zc.buildout docs about how to make a new buildout and bootstrapping.

You’ll notice that it almost looks like command line interactions of some sort are occurring, yet the author of the docs is clearly at an interactive Python prompt. Note that none of the commands shown there will work if you copy them into your Python interpreter, nor is there any indication what you would need to do to get such commands available. As a user trying to follow the docs, that leaves me wondering… am I supposed to be in a Python interpreter? What do these variables get expanded to so that I can do that at my shell prompt? Why can’t you just give me the damn command line I’m supposed to run so I can copy/paste???

Yes, it definitely got me a bit frustrated. I believe the only logical reason the docs were written in this bizarre fashion is so that they could be automatically doc-tested. It’s a shame that the result is docs that make me want to close the web page as soon as I stumble upon the ‘samples’, since there’s no way I can handle wading through the command line abstractions.

Doctests can be useful, but turning command line interactions into a Python interactive session is a massive readability issue. People know and recognize command line interactions, so let’s stick with them please.
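For contrast, here is the kind of thing doctests read well for: an ordinary Python function whose example really is something you would type at the interactive prompt. The slugify function is a made-up example, not anything from zc.buildout.

```python
import doctest


def slugify(title):
    """Turn a post title into a URL-friendly slug.

    >>> slugify("Pylons @PyCon 2008 Wrap-up")
    'pylons-pycon-2008-wrap-up'
    """
    # Replace anything that is not a letter or digit with a space, then
    # join the remaining words with hyphens.
    words = "".join(c if c.isalnum() else " " for c in title.lower()).split()
    return "-".join(words)


if __name__ == "__main__":
    # Runs every example found in this module's docstrings.
    doctest.testmod()
```

Here the example in the docstring is exactly what a reader would copy and paste, which is the property the buildout docs lose.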

Pylons @PyCon 2008 Wrap-up

Posted by ben Sat, 22 Mar 2008 05:22:21 GMT

Sprints!

Wow, they were great. A lot got done, quite a bit more than I was expecting. I’d like to give a big shout-out to the Pylons sprinters (in random order, sorry if I missed anyone):

  • Karen Lo
  • Mike Orr
  • David Montgomery
  • Mike Verdone
  • Wes Devauld
  • Ian Bicking
  • Phil Jenvey
  • and many more TG2 sprinters

WebHelpers saw some significant gains, including the addition of a literal object for safe HTML escaping, and default Mako auto-escaping in Pylons 0.9.7.
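For anyone wondering what a literal object buys you, here is a toy sketch of the idea: a string subclass marks markup that is already safe, and the escaping function leaves it alone while escaping everything else. The Literal class and escape function below are stand-ins written for this post using only the standard library. They are not the WebHelpers implementation.

```python
import html


class Literal(str):
    """Marks a string as already-safe HTML that must not be escaped again."""


def escape(value):
    """Escape plain strings; pass Literal instances through untouched."""
    if isinstance(value, Literal):
        return value
    return Literal(html.escape(str(value)))


user_input = "<script>alert('hi')</script>"
link = Literal('<a href="/wiki">wiki</a>')

print(escape(user_input))  # untrusted text comes out escaped
print(escape(link))        # trusted markup is left as written
```

Because escape() returns a Literal, escaping twice is harmless, which is what makes auto-escaping in templates workable.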

There will definitely be more Pylons/TG2 sprints in the future, in multiple locations around the country.

Tutorials

Ouch… setuptools came through, but unfortunately the network didn’t, nor did my first built egg of Pylons (I recently got a new laptop that was missing some things…). Mark Ramm and I did get the show under way, but the delay in starting up definitely affected how far the tutorial attendees made it in the basic Wiki project and proved frustrating for all.

I made it through all the Pylons information I had in the Mastering Pylons/TG2 section, but unfortunately had no time left to go into more detail on various aspects of it in a more hands-on style. I definitely won’t be doing any more tutorials without ensuring lots of USB thumb-drives are handy with virtualenv Pylons/TG2 ready to go.

Sessions and the Conference

They seemed ok, though several of them seemed very, “And this person is giving this talk… why?”. Note to session presenters: honesty is great, but I really do wonder why you’re presenting when your opening remarks are, “What I’m about to show you, anyone who’s used this already knows, so there’s nothing to see here.” Show us something with some zing!

The lightning talks, despite a first day of some awkward sponsored ones, were actually the highlight for me. I almost spit water laughing at the “Speech Recognition does war”, followed up by “His base of God would have I done now” (Originally intended to be ‘Speech recognition does work. oh god, what have I done now?’). Unfortunately Ian Bicking failed to give us an update on ZhangoPyloGears; I can only imagine it has fully attained consciousness and freed itself of his clutches.

Coming up…

Pylons 0.9.7 is close at hand, just waiting for some final bits of WebHelpers to drop into place and it should be ready to pop. And one last side-note, I think I finally found the documentation tool of my dreams… Sphinx!
