Here are some notes on a talk given October 5th, 2007 by Utkarsh Srivastava on PNUTS, a Platform for Nimble Universal Table Storage.
From the seminar description:
The PNUTS project is to build a data management service for providing back-end support to Yahoo!'s web applications. To obtain acceptable latency and throughput while operating at Yahoo!'s scale, PNUTS uses massive parallelism and distribution---data is partitioned and replicated over thousands of servers. At the same time, PNUTS provides clean abstractions for data access that hide all this system complexity from the application programmer...In contrast to traditional database solutions, PNUTS is a centrally hosted and managed data service. Such a shared shared service model frees applications from the burden of having to set up, maintain and scale their own data store, and also amortizes the operational cost across all of Yahoo!'s applications. ...
Utkarsh, who also works on PIG, started with an outline of hurdles a startup faces building applications at web-scale (millions of users, terrabyte+ of data). You'll need a replicated, persistent datastore, caching, messaging between all systems to manage coherency, etc. You'll then throw away your first implementation. "If you have funding left, THEN you can start to work on application logic." PNUTS wants to make it so deploying a web-scale application requires nought but "...three guys, a weekend, and some PHP." (This latter quote is apparently from Mr. Del.icio.us).
For a folksy intro on how most of the big web apps can make do w/ just basic insert, update, and delete, see the PNUTS home page. Utkarsh had his own spin using flickr and del.icio.us for illustration (not that these apps currently run on PNUTS).
- Goals: Low-latency, self-tuning, replicated data service with automated fail-over and recovery.
- This is not just a research project. Has active participation of yahoo infrastructure team.
- No plans to open-source.
- PNUTS is NOT a multi-media store. Its more for metadata. Data is expected to be located elsewhere. PNUTS hosts user prefs, pointers to data, etc.
- ACID gurantees are relaxed for the sake of performance. Some clients may get stale data.
The implemented basic relational operators do not allow for adhoc analysis and bulk processing; use PIG or hadoop instead.
- They have a PNUTS SQL-like language but its very basic: no support for joins, aggregations, etc.
- Two table types: An Hash Table (key/value with put, get, update, delete, etc.) and An Ordered Table -- ordered by keys -- with typed columns. Against the Order Table you can run all Hash Table operations plus 'limit (top K)', 'predicates', etc. Currently only Hash Table has been implemented. Even Order Tables are key/value; the value has within it 'columns'.
- A PNUTS install is made up of Regions: e.g. west-coast Region, east-coast Region, east-asian Region, etc. Data is replicated between Regions (e.g. east-coast region has an up-to-date replica of west-coast, etc.). Within a region, a Tablet Controller load balances tablets -- "horizontal table fragments" -- over many Tablet Servers. A tablet holds some subset of the rows of a table (PNUTS is row-based -- later there may be support for column-based: "PNUTS horizontally fragments tables. Later there may be support for vertical fragmentation"). Routers -- which are not the same thing as Tablet Controllers -- direct clients to the tablet server hosting the requested row. PNUTS makes use of a yahoo "Log Service" to which all edits are written first before they are committed to the pertinent Tablet Server. Rows have a Write Master who administers all of its updates. To keep write latency low, the Write Master for a row is located close to the client writing: e.g. rows for Japanese users will have their Write Master in the Japanese Region (Write responsibility can apparently migrate across Regions to move closer to the updating client). Recovery -- having a lapsed Region get up-to-date copying from another Region -- is probably the most complicated part of the system coordinating many Write Masters and reading the log-of-edits from Log Servers.
- Features: Test and set (write only if version is same as that specified, otherwise fail: i.e. row-level 'transaction'), "critical reads" (I will for sure see my edits), and "read latest" (You may see stale data).
- Talked of 'indexes'. Indexes are an optional secondary table lazily maintained by the system that is keyed using a column other than that of the key used in the primary column: Example was an ordered table that had fruits for keys and columns of price and description. Using the primary table you could do "all fruits between apple and tomato" but you couldn't do "all fruits priced between $2 and $5 dollars a pound". You can configure a secondary table keyed by price to do the latter.
- A separate API for bulk inserts, updates, etc. Rationale was that proceeding through bulk data in order means beating up Tablet Servers in series. Better to have two passes: one to look at data and then in the second do a parallel upload.
- Some support for views/joins maintained by the system. Folks are going to do this in applications anyways. Might as well have support within the system.