storage node

Cloud Node Project

For this final project, you will implement one node in a storage cloud.  Your node will interoperate with your classmates' nodes to create a lightweight storage cloud.

In order for everyone's nodes to work together, each node must follow a strict protocol which is detailed here.  If you implement something incorrectly, you will confuse other nodes in the cloud.

I will have one reference node running at virtual40.cs.missouri.edu:8006 that you can examine to verify that your node works as specified.  Additionally, I will have a separate "Tracker" node at virtual40.cs.missouri.edu:12345 which will manage all other nodes in the cloud.  You are not required to implement a tracker, but your node must talk to mine.

Tracker Protocol

Wake Up Node

When your node first initializes, it must announce its presence to the Tracker by sending a POST request to http://tracker:port/wakeup with form-value "node" set to the URL of your node.  This tells the Tracker that your node is online and can be contacted at the specified URL.
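The Wake Up request can be sketched in Java as follows.  This is only a sketch under stated assumptions: it assumes "form-value" means a standard application/x-www-form-urlencoded body, it uses java.net.http.HttpClient (Java 11 or later), and the class name WakeUp, the parameter trackerRoot (the Tracker URL without a trailing slash), and the node URL http://myvm.example:8080/ are made up for illustration.  The spec does not say which status code the Tracker returns here, so the sketch simply hands it back to the caller.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the protocol itself.
class WakeUp {
    // Builds the form body "node=<URL-encoded node URL>".
    static String formBody(String nodeUrl) {
        return "node=" + URLEncoder.encode(nodeUrl, StandardCharsets.UTF_8);
    }

    // Builds the POST request to <trackerRoot>/wakeup.
    // Assumption: the Tracker expects a form-urlencoded body.
    static HttpRequest buildRequest(String trackerRoot, String nodeUrl) {
        return HttpRequest.newBuilder(URI.create(trackerRoot + "/wakeup"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(nodeUrl)))
                .build();
    }

    // Sends the announcement and returns whatever status code the Tracker sends back.
    static int announce(String trackerRoot, String nodeUrl) throws Exception {
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(buildRequest(trackerRoot, nodeUrl), HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }
}
```

A node would call something like WakeUp.announce("http://virtual40.cs.missouri.edu:12345", "http://myvm.example:8080/") once during startup, for example from a servlet's init method.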

Where Is Bucket

A client will sometimes need to query the Tracker to determine where a given bucket is stored.  This is done with a GET request to /whereis/<bucket>.  If the bucket exists, the response status code will be 301 "Moved Permanently" and the response header will contain a Location: field with the URL of the bucket; otherwise the status code will be 404.  Notice that if you point your browser at /whereis/<bucket>, the browser will automatically redirect to the bucket's actual location.
  • 301 Moved Permanently: the bucket exists at the specified Location
  • 404 Not Found: the bucket does not exist

Where Is Resource

If a client asks /whereis/<bucket>/<resource>, the Tracker will respond in the same way as Where Is Bucket, except the redirect will be to the location of the resource (if it exists).  The Tracker only knows about buckets, but can be used to find resources in this way.  Notice that if you point your browser at /whereis/<bucket>/<resource>, the browser will automatically redirect to the resource's actual location (if it exists).
  • 301 Moved Permanently: the resource exists at the specified Location
  • 404 Not Found: the resource does not exist
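A client that wants to inspect the Tracker's answer itself, rather than letting a browser follow the redirect, could look roughly like this.  It is a sketch, not a required part of your node: the class name WhereIs and the parameter trackerRoot (Tracker URL without a trailing slash) are assumptions, and path is either <bucket> or <bucket>/<resource>.  It relies on java.net.http.HttpClient not following redirects by default, so the 301 and its Location header remain visible.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

// Hypothetical client-side helper for the two Where Is lookups.
class WhereIs {
    // Interprets a Tracker reply: 301 plus Location means "found here", anything else "unknown".
    static Optional<String> resolve(int status, String locationHeader) {
        if (status == 301 && locationHeader != null) {
            return Optional.of(locationHeader);
        }
        return Optional.empty();
    }

    // Asks the Tracker where "<bucket>" or "<bucket>/<resource>" lives.
    static Optional<String> locate(String trackerRoot, String path) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(trackerRoot + "/whereis/" + path))
                .GET()
                .build();
        // The default client does not follow redirects, so the 301 is returned as-is.
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        return resolve(response.statusCode(),
                response.headers().firstValue("Location").orElse(null));
    }
}
```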

Mine Bucket

If your node creates a new bucket, it must tell the Tracker about it.  This is done with a POST request to /mine/<bucket> with form-value "node" pointing to your node URL.
  • 200 OK: the bucket was not owned before, or your node already owns it.  Either way, it now belongs to you.
  • 409 Conflict: the bucket is already owned by some other node; you cannot have it.

Ping Node

Your node must respond to pings from the Tracker so that it knows your node is still alive.  The pings will be HEAD requests to your server, so you shouldn't need to do anything (the servlet's default doHead will work).

Status

For testing purposes, the Tracker will list all the nodes currently online via a human-readable page at /status.

Node Access Protocol

Each node should implement a simple REST API to store and retrieve files, similar to Amazon's S3.

GET Root

If a client sends a GET request to your node but doesn't specify a bucket, your node should respond with a human-readable index of all buckets on the node, with links to each one.  The exact format doesn't matter here, since it is only for testing purposes.

In the rest of this document, "noderoot" means whatever URL your node registered with the Tracker.  It should look like: http://host:port/ or possibly http://host:port/Servlet or even http://host:port/Project/Servlet, etc.

GET Bucket

Clients may request buckets by sending a GET request to your server as follows: noderoot/<bucket>.  You should respond with a listing of all resources in the bucket, with links to each one.  Formatting doesn't matter; for testing only.

An example request:

GET http://virtual40.cs.missouri.edu:8006/RyBucket HTTP/1.1

and response:

<h1>RyBucket Contents</h1>
<ul><li><a href="/RyBucket/SomeFile">/SomeFile</a></li>
...

GET Resource

Clients request resources by sending a GET request to noderoot/<bucket>/<resource>.  Here, <resource> can be any URL-encoded string, including a long "path" like folder/subfolder/file.  The node should respond with the requested resource, if it exists. 
  • 200 OK: return this when the resource was found on this node
  • 404 Not Found: return this when the bucket is on this node, but the resource was not found in the bucket.
  • 307 Temporary Redirect: return this to redirect to the Tracker when the bucket is not on this node (see Node Redirection Mechanism below).
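Because a resource name may itself contain slashes, only the first path segment after noderoot names the bucket; everything after it is the resource.  One way to split the path is sketched below.  The class name NodePath is hypothetical, and the sketch assumes pathInfo is the part of the request path that follows noderoot (in a servlet, roughly what getPathInfo() returns).

```java
// Hypothetical helper that splits "/<bucket>/<resource...>" into its two parts.
class NodePath {
    final String bucket;    // null when no bucket was given (GET Root)
    final String resource;  // null when only a bucket was given (GET Bucket)

    NodePath(String bucket, String resource) {
        this.bucket = bucket;
        this.resource = resource;
    }

    static NodePath parse(String pathInfo) {
        if (pathInfo == null) {
            return new NodePath(null, null);
        }
        // Drop the leading slash, if any.
        String trimmed = pathInfo.startsWith("/") ? pathInfo.substring(1) : pathInfo;
        if (trimmed.isEmpty()) {
            return new NodePath(null, null);
        }
        int slash = trimmed.indexOf('/');
        if (slash < 0) {
            return new NodePath(trimmed, null);
        }
        // Everything after the first slash is the resource, including any further slashes.
        String rest = trimmed.substring(slash + 1);
        return new NodePath(trimmed.substring(0, slash), rest.isEmpty() ? null : rest);
    }
}
```

With this sketch, "/pail/folder/subfolder/file" yields bucket "pail" and resource "folder/subfolder/file", while "/pail" yields bucket "pail" and no resource.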

POST Bucket

Clients can create a bucket by POSTing it on the node:

POST /mybucket HTTP/1.1

If the bucket already exists on the node or elsewhere in the cloud, the node should not create it and should return the appropriate status code from the list below.  Notice that the body of the POST Bucket request should be empty, and will be ignored.

The POST Bucket handler should send a Mine Bucket request to the Tracker before trying to create a new bucket.
  • 200 OK: return this when a POST Bucket request tries to create a bucket that already exists on this node.  The bucket should not be changed.  In this case, you don't need to send a Mine Bucket request to the Tracker.
  • 201 Created: return this when the bucket does not exist on the node, the Tracker says 200 OK, and the bucket is created on the node.
  • 409 Conflict: return this when the bucket does not exist on the node but the Tracker says 409 Conflict.  This may happen when the bucket already exists on some other node.
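The three outcomes above can be sketched as one small decision function.  Everything named here is an assumption made for illustration: the class PostBucket, the localBuckets set standing in for your node's storage, and the mineBucket callback, which is expected to send the Mine Bucket request and return the Tracker's status code.  The spec does not say what to do if the Tracker answers with something other than 200 or 409; returning 500 in that case is this sketch's own choice.

```java
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical decision logic for POST Bucket.
class PostBucket {
    static int handle(String bucket, Set<String> localBuckets, ToIntFunction<String> mineBucket) {
        // Bucket already on this node: leave it unchanged and do not contact the Tracker.
        if (localBuckets.contains(bucket)) {
            return 200;
        }
        // Ask the Tracker for ownership before creating anything locally.
        int trackerStatus = mineBucket.applyAsInt(bucket);
        if (trackerStatus == 200) {
            localBuckets.add(bucket);
            return 201;
        }
        if (trackerStatus == 409) {
            return 409;
        }
        // Assumption: any other Tracker reply is reported as a server error.
        return 500;
    }
}
```

Passing the Tracker call in as a callback keeps the status logic easy to test without a running Tracker.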

PUT Resource

Clients can upload a file to a bucket by PUTing it in the bucket.  The bucket must exist on the node for this operation to succeed.
  • 200 OK: return this when the resource already exists and has been updated with new content.
  • 201 Created: return this when the resource did not exist before now.
  • 404 Not Found: return this when the bucket does not exist on this node.
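As a rough illustration of these status codes, here is a minimal in-memory store.  It is a sketch under assumptions, not a required design: the class name MemoryStore and its methods are made up, resources are held as byte arrays in nested hash maps, and the methods are synchronized on the assumption that several requests may arrive at once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory storage: bucket name -> (resource name -> content).
class MemoryStore {
    private final Map<String, Map<String, byte[]>> buckets = new HashMap<>();

    // Creates the bucket if it is not already present; existing contents are kept.
    synchronized void createBucket(String bucket) {
        buckets.putIfAbsent(bucket, new HashMap<>());
    }

    // Stores the content and returns the status code for the PUT response.
    synchronized int put(String bucket, String resource, byte[] content) {
        Map<String, byte[]> contents = buckets.get(bucket);
        if (contents == null) {
            return 404;  // bucket does not exist on this node
        }
        boolean existed = contents.put(resource, content) != null;
        return existed ? 200 : 201;  // updated vs. newly created
    }

    // Returns the stored content, or null if the bucket or resource is missing.
    synchronized byte[] get(String bucket, String resource) {
        Map<String, byte[]> contents = buckets.get(bucket);
        return contents == null ? null : contents.get(resource);
    }
}
```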

Node Redirection Mechanism

This part is confusing, so stay with me!  Nodes never transfer resources between themselves or with the Tracker.  You might expect a node to proxy requests through other nodes, but this would not be very scalable.  Instead, resources are only transferred between nodes and clients.  If a client requests a resource from a node that doesn't have it, the node will redirect the client to the correct node via the Tracker.

For example, say a client requests the contents of the bucket "pail" from virtual40.  If "pail" is not owned by virtual40, then the node will immediately respond with a 307 "Temporary Redirect" to the Tracker.  The redirect will be a Where Is Bucket request, which the Tracker will respond to with a second redirect to the node that owns the bucket.  This way, even if a client asks the wrong node for a bucket, the client will be redirected to the correct node.

Again, the node should not proxy the request to the Tracker or to other nodes.  There is no need for that.

To redirect a client, just return 307 and set the Location header value.

There are two cases where your node will need to redirect the client:

GET Bucket Redirect

If your node is asked to GET a bucket, one of two things can happen:
  • If the bucket is owned by the node, it should service the request itself (as described above) and return 200 OK.
  • If the bucket is not owned by the node, it should respond immediately with a 307 Temporary Redirect to the Tracker's Where Is Bucket mechanism.  The Tracker will in turn redirect the client to the node that owns the bucket, or will respond with 404 Not Found.

GET Resource Redirect

If your node is asked to GET a resource, one of two things can happen:
  • If the bucket is owned by the node, it should service the request itself (as described above) and return either 200 OK or 404 Not Found.
  • If the bucket is not owned by the node, it should respond immediately with a 307 Temporary Redirect to the Tracker's Where Is Resource mechanism.  The Tracker will in turn redirect the client to the node that owns the bucket, or respond with 404 Not Found.  The node that owns the bucket will then respond with the requested resource, or 404 Not Found.
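Both redirect cases reduce to the same decision, sketched below.  The names are assumptions for illustration only: GetDecision is a made-up class, store maps each locally owned bucket to the names of its resources, and trackerRoot is the Tracker URL without a trailing slash.  A null resource means the request was GET Bucket.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical result of deciding how to answer a GET: a status code and, for 307, a Location.
class GetDecision {
    final int status;
    final String location;  // only set when status is 307

    GetDecision(int status, String location) {
        this.status = status;
        this.location = location;
    }

    static GetDecision decide(Map<String, Set<String>> store, String trackerRoot,
                              String bucket, String resource) {
        Set<String> resources = store.get(bucket);
        if (resources == null) {
            // Bucket not owned here: point the client at the Tracker's Where Is mechanism.
            String path = resource == null ? bucket : bucket + "/" + resource;
            return new GetDecision(307, trackerRoot + "/whereis/" + path);
        }
        if (resource == null || resources.contains(resource)) {
            return new GetDecision(200, null);  // serve the listing or the resource locally
        }
        return new GetDecision(404, null);      // bucket is here, resource is not
    }
}
```

In a servlet, a 307 decision would typically be applied with response.setStatus(307) and response.setHeader("Location", decision.location).  Note that HttpServletResponse.sendRedirect issues a 302 rather than a 307, so it does not match this spec.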

Requirements

Implement a node.  You do not need to implement a Tracker, but your node will need to talk to mine.

Your node doesn't necessarily need to persist resources to a database.  It is up to you how you handle persistence.  You may choose to use a database or just a hash table in memory, for example.  If your node does not sync to a database, then it is possible it will lose all its buckets when it is restarted.  This is okay.  The protocol should handle this loss of information gracefully.

When your node is complete, keep it running on your virtual machine.  It will become part of the cloud along with other completed nodes.  To get full credit, make sure that:
  • the Tracker knows that your node is online (check /status to be sure).
  • you can access buckets on your node
  • you can access buckets on other nodes via the redirect mechanism
  • your node follows the spec exactly


Grading

To grade this project, we'll upload a web page to a bucket on your node.  The web page will request several buckets via your node: some buckets will exist on your node, some will exist elsewhere, some won't exist at all.  We'll load the web page in a browser and make sure your node handles these cases correctly.

Project Scope

When the cloud is complete, you will be able to deploy websites to the cloud as follows:
  • choose a node to store your website (in a real cloud, this would be the node closest to you geographically)
  • create a bucket to store your website's resources
  • put webpages in the bucket
  • now, anyone can access the website through your node (if they are geographically close to you) or via any other node in the cloud (if they are closer to some other node)
This is very similar to how real clouds like S3 work.  If you installed nodes on servers around the world, you'd have a real cloud!  To get really fancy, you can have DNS servers around the world resolve your domain to different nodes depending on the client's geographic location.  That way yourcloud.com resolves to the closest node, wherever you are in the world.

Future Work

I plan to teach a Cloud Computing course in a few semesters.  There are lots of features to add to this cloud, which I plan to require for the future course.  If you are interested in improving the cloud on your own, here are some ideas:
  • Have the storage node read and write resources to/from the local file system.  Install your node on each of your computers so you can access your files remotely.
  • Implement the rest of WebDAV so that your desktop computer can mount the cloud, list the contents of directories, drag-and-drop upload, etc.
  • Implement your own tracker which redirects requests to a "super-tracker" when it doesn't find a bucket (instead of responding with 404).  Make an "extended star" network of trackers, super-trackers, and super-super-trackers to reduce the load on any one server.
  • Add the ability to move and copy buckets between storage nodes.  If a node notices lots of requests for a bucket it doesn't own, it can temporarily copy or take ownership of it.
  • Add a third type of server: an edge-cache node.  Reorganize the cloud so that all storage nodes are in one big data center, and edge-caches are scattered around the world.  Make clients talk to the nearest edge-cache node, instead of talking directly to the storage nodes.  Have the edge-cache nodes proxy requests to the data center and keep local caches of recently-accessed buckets.
  • Combine the storage, tracker, and edge-cache into one monolithic node.  Create an ad-hoc mesh network of these nodes without requiring a central tracker.
  • Charge other saps to use your storage cloud.  Profit.