Tuesday, July 15, 2008

Google App Engine Followup

Following up my last rambling post with some GAE deficiencies that I have found.

Inability to communicate with external servers in meaningful ways

Oftentimes when building an application it is necessary to communicate with another server. A classic example would be an email server. While the client connects to CRUD email, the server needs to connect to other servers to deliver new email. Granted, you can use GAE's facilities to do this for you, but if you wanted to invent "Email2.0" and needed to directly poll servers or otherwise communicate with them, you can pretty much fuggetaboutit.

That being said, you can make HTTP/S connections (ports 80/443 only). However, these connections have to be made within the context of a user request to the server (youch!). Worse, you can't time out these connections, so you could potentially hang your user indefinitely without any ability to control this from the GAE side. So these just are not useful. I'm not sure why they have limited connections in this way. I'm guessing security concerns.

Cost of running routine database maintenance seems very high

I'm not a rocket scientist (OK, you didn't need that pointed out for you :P). It is very difficult for me to get an app up and running well. My preference is to build, tinker, build, tinker, etc. Some people like to start with requirements, move on to use cases, create a design doc and on until they have an app ready for the world. I prefer to start small, move it into production and continually add. That's just "the way I roll." That being said I find that the GFS/Big Table/Datastore on GAE doesn't fit well into the way I work. For example, API calls are supposed to be scarce. For the free/beta version you are limited to 2.5m API calls/day. That seems like a lot but I will often build some type of monitor utility that will continually scan my db and do things like massage rankings, update FK-relationships and various other tasks. In my apache+postgres/mysql world that isn't a big deal. When load is light I can fire off the maintenance apps and they can have their way with the db. No can do with GAE. And this is for 2 reasons:

  1. There is no way to run a cron job on GAE. Worse, even if you could fire off a simulated job by touching a URL, the engine will limit your processing time to what is reasonable for a typical user request. Probably < 1 min but definitely < 5 mins.
  2. Maintenance utils would burn through the API credits. Let's say I have a modest app with 100k records that I massage every hour or so with a ranking. 100k * 24h = 2.4m api calls per day and my users haven't even done anything yet.
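
A quick back-of-the-envelope check of that second point, as a rough sketch only. It assumes that massaging one record costs exactly one API call (that is what the numbers above imply, not something GAE documents), and the function name is made up for illustration.

```python
# Rough estimate of the daily API calls used by a periodic maintenance job.
# Assumption: touching one record costs exactly one API call.

DAILY_QUOTA = 2500000  # free/beta quota quoted above


def maintenance_calls_per_day(records, runs_per_day, calls_per_record=1):
    """Return the API calls a periodic full pass over the data uses per day."""
    return records * runs_per_day * calls_per_record


# 100k records, massaged once an hour.
used = maintenance_calls_per_day(records=100000, runs_per_day=24)
left = DAILY_QUOTA - used

print("maintenance calls/day:", used)
print("calls left for users:", left)
```

Under those assumptions the hourly job alone eats 2.4m of the 2.5m calls, leaving about 100k for actual users.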

There are some ways that I could work within the constraints of the app engine but for me, for now, it makes more sense to stick with LAMP on some rented hardware. I'll check back again in a few months and see if any of this gets addressed.

GWT notes

On the plus side I was able to get GWT working with GAE rather effortlessly. Pretty straightforward if you don't mind rolling your own RPCs. I think the RPCs out of the box are pretty bloated anyway. FWIW, I still find it strange that Google picked Java over Python for GWT. They should just buy out the Pyjamas guys and integrate it already.

Thursday, July 3, 2008

Google App Engine

I've been tinkering around with Google App Engine (GAE) and I'm pretty impressed. It's clearly designed to compete with Amazon's Elastic Compute Cloud (EC2) and hopefully we'll see even more scalable app containers come on the scene. I kind of wonder if BEA will get into the mix anytime soon.

First, the good.

Getting a hello-world app up and running on GAE is a breeze. You can download the SDK, follow their tutorial and have something going in just a few minutes. Even growing your data model is a dream. Just create a class extending one of their datastore base classes (e.g. db.Model), use or extend the ready-made datastore properties and that's it. No worrying about mappings, creating tables, etc. All of that is handled automagically. The way the data models work seems very much like Django (which apparently runs like a champ inside of GAE).

Most of the Python 2.5 library is available. A few things are missing (namely file/system-related libs) but I haven't really run into problems with that yet. I could see where it might be a pain in the ass later but hopefully I can just code around those problems or delegate out to EC2 or some other host. That would be ideal. I haven't really tested the interoperability yet but there appears to be at least a way to connect to another server via HTTP.

On to the bad.

I don't have a whole heck of a lot to add here (at least not yet). One thing that is a little difficult for me is the way relationships are handled. It's a little different from what I'm used to. It's sort of possible to model things like they would be done for an RDBMS but it definitely cuts across the grain. Even using the "officially-sanctioned" methods of dealing with relationships seems to be a bit problematic. For example, I would like to create something like a discussion board. But how would that be handled? I'm thinking along these lines:

Board has many Forums has many Threads has many Posts

That's a pretty straightforward model. But when I start calculating datastore calls (because you are limited here) I get something like this (guessing numbers):

  • select Board (1)
  • select Forum (10)
  • select Thread (25)
  • select Post (15)

So just a single pass for a single user to view the posts in a single thread would generate 1+10+25+15=51 datastore calls. Figure that the average user probably clicks on what, 5 or 10 threads? Let's just say 10. So that's 510 datastore calls burned per user for the discussion board. We have 2.5 million datastore API calls we can use per day, so 2.5 million / 510 = roughly 4,902 users we can support per day.
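
Here is the same arithmetic as a small script, in case you want to plug in your own guesses. The per-entity counts are just the guessed numbers from the list above, and the variable names are mine.

```python
# Back-of-the-envelope estimate of users supported per day by the quota.
# The per-entity counts are guesses, one datastore call per entity selected.

DAILY_QUOTA = 2500000

calls_per_view = {
    "board": 1,
    "forum": 10,
    "thread": 25,
    "post": 15,
}
threads_per_user = 10  # guess: an average user opens about 10 threads

calls_per_thread_view = sum(calls_per_view.values())
calls_per_user = calls_per_thread_view * threads_per_user
users_per_day = round(DAILY_QUOTA / calls_per_user)

print("calls per thread view:", calls_per_thread_view)
print("calls per user:", calls_per_user)
print("users per day:", users_per_day)
```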

That's plenty of room for a pet-project, just-my-buddies kind of application but if you're thinking bigger then you're going to fall down pretty fast. Of course when GAE goes live you can pay to go beyond 2.5m API calls but I think the best approach is to rethink the design so that fewer calls are required. Having some kind of custom-built cache that works well with the datastore (which I think is BigTable underneath) is probably the best bet. Let's consider an approach where we fetch all (10) forums in a single call, all (25) threads in a single call and all (15) posts in one more call. So now we're down to just 3 API calls per thread view (the initial board call would just fetch our first 10-forum block) * 10 interesting threads, so roughly 30 API calls per user. So now we could potentially serve 2.5m/30 = 83,333 customers. Now we're talking! Factor in memcache and you could see even more improvement.
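
To make the batching idea concrete, here is a toy sketch in plain Python. It does not use the GAE datastore API at all: CountingStore is a hypothetical stand-in that just counts lookups, and the key names are invented. The point is only to show why storing a whole block under one key drops the call count.

```python
# Toy illustration of batching: many small lookups vs. one block lookup.
# CountingStore is a made-up stand-in, NOT the GAE datastore API.


class CountingStore:
    """A dict wrapper that counts every get() call."""

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self._data[key]


names = ["Forum %d" % i for i in range(10)]

# Naive layout: one entry per forum, so listing them costs one call each.
naive = CountingStore({"forum:%d" % i: name for i, name in enumerate(names)})
forums_naive = [naive.get("forum:%d" % i) for i in range(10)]

# Batched layout: all ten forums stored together under a single key.
batched = CountingStore({"board:forums": names})
forums_batched = batched.get("board:forums")

print("naive calls:", naive.calls)
print("batched calls:", batched.calls)

# Same quota math as before with 3 calls per thread view and 10 threads.
users_per_day = 2500000 // (3 * 10)
print("users per day:", users_per_day)
```

The catch, of course, is that the block has to be rewritten whenever any forum in it changes, so this trades read calls for more work on writes.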

So back to "the bad". The reason this is bad is that I really haven't thought about caches like this before. So I find myself optimizing earlier than I should be out of fear of scalability problems. OK, so that's probably a limitation of me and not GAE.

Why Blog?

I've had several blogs over the years. I even had one before they were called blogs (we just called them "pages" then). So why-the-frig am I starting a new one? There are several reasons. I'll list a few:
  • Sometimes I just want to write a note to myself and a blog is a good place to do that.
  • Writing is a skill and I do not do enough of it lately. My hope is that I'll come here often and dump/organize/reconsider ideas and in the process improve my general as well as developer-related writing skills (yeah, good luck with that).
  • Why not? I mean it's so freaking easy to do these days. I set up this Blogger account in like zero minutes.

Another question I have for myself is, "who is my audience?" Just me, I'm thinking. No one is going to read the drivel I put in here. There are lots of places for devs to read other real devs (I suck, my skills are weak, such is life). Here are some I read:

I guess I read some others but I can't think of them at the moment. It's times like this I wish I had some switches in my brain where, you know, it would actually work. Sheesh.

So that's it. First post done. I'll add another in a few but this pretty much sums up the beginning of this here place.