By Ellen Friedman, Twitter ID: @Ellen_Friedman
Co-organizer of Bay Area Apache Drill User Group
The Apache Drill project is building an innovative tool for ad hoc, interactive queries in the time scale of 100ms to 20 minutes on large, distributed data systems. Participants in the open source Apache Drill community recently came together to take a look at how Drill works now and what will be the next steps in the project.
The event was the November 4th meet-up of the Bay Area Apache Drill Users, the first entirely Drill-based meet-up group. This meeting was hosted at Cisco in San Jose, with MapR Technologies as co-host. A large group gathered on-site at Cisco’s conference facility, and almost as many participants joined remotely via WebEx. The event marked the recent first official release of the Apache Drill project.
Speakers included two members of the MapR Drill team in San Jose, Drill lead engineer Jacques Nadeau and developer Steven Phillips, along with Timothy Chen, a Drill contributor who lives in Seattle, where he is an engineer at Microsoft.
Tim Chen came down to San Jose earlier in the day before the meet-up to get together directly with other Drill developers and enjoy the unusual situation of being able to discuss the work in person.
Drill meet-up speakers Steven, Tim & Jacques
Apache Drill: A Look Forward
Jacques Nadeau kicked off the evening meet-up with a road map toward maturity at version 1.0 for the Drill project. He pointed out that Milestone 1 was achieved in late September when the Apache Software Foundation approved the first official code release. Progress toward Milestone 2 is now actively under way.
Does a project need to reach version 1.0 to be usable? The answer varies with the project, but generally an Apache project releases usable, work-in-progress versions before full maturity. An example is the Apache Mahout project, which is currently at version 0.8 and yet has been used successfully in production settings for over a year. Drill isn’t at that stage yet (functionality is being built in stages), but it’s becoming ready for early users to try it out and give feedback.
- Milestone 1: Initial functionality PASSED
JDBC, Distributed execution, Parquet and JSON readers
- Milestone 2: Architectural validation IN PROGRESS
Performance, total sort, node buffering, diagnostic tools and instrumentation, Parquet writer
- Milestone 3: Query Complete
TPC-H, Hive UDF, Hive read SerDe and HBase
- Milestone 4: User Feature Complete
Pushdown, optimization, complex and vectorized operators, Hive metastore, additional file formats
- Milestone 5: Production Quality
ODBC, additional optimizer rules, resource scheduling, stability
Apache Drill is an ambitious project designed to be more flexible, more wide ranging and extensible than many of the other tools being built to address similar issues. That’s a challenge, but one that is being met with some very promising initial work.
Lifetime of a Query in Apache Drill
The second speaker, Timothy Chen, presented the story of the lifetime of a Drill query, starting with SQL input and following the events to the distributed DrillBits on different nodes.
To follow what happens to a query, it’s helpful to understand that Drill, like Google’s Dremel project, relies on multi-level execution trees and leverages column-oriented storage. Whether the data is schemaless or not, you can think of each data tree abstractly as a JSON object. Each tree is composed of a key and a root node. (See the Dremel paper: http://research.google.com/pubs/pub36632.html)
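To make the idea of storing JSON-like trees by column more concrete, here is a minimal Python sketch. It is not Drill or Dremel code: the function names `flatten` and `to_columns` and the sample records are invented for illustration, and it deliberately omits the repetition and definition levels that the Dremel paper uses to reassemble records from columns.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Yield (column_path, value) pairs for every leaf of a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: descend and extend the dotted path.
            yield from flatten(value, path)
        elif isinstance(value, list):
            # Repeated field: every element lands in the same column.
            for item in value:
                if isinstance(item, dict):
                    yield from flatten(item, path)
                else:
                    yield path, item
        else:
            yield path, value


def to_columns(records):
    """Group leaf values by column path instead of by record."""
    columns = defaultdict(list)
    for record in records:
        for path, value in flatten(record):
            columns[path].append(value)
    return dict(columns)


# Hypothetical sample data; the second record has no "city" field.
records = [
    {"user": {"name": "ann", "city": "San Jose"}, "clicks": [3, 5]},
    {"user": {"name": "bob"}, "clicks": [7]},
]
columns = to_columns(records)
```

A query that touches only `user.name` would need to read just that one list, which is the basic appeal of a column-oriented layout for ad hoc analysis.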
Another important concept is the DrillBit: as Tim explained, a DrillBit is simply a worker process running on any particular node in the cluster. To tell the story of what happens to a user query as it is processed by Drill, Tim used an example system that included DrillBits on three nodes plus the coordinating services of ZooKeeper and Hazelcast.
Drill can accept full ANSI SQL:2003 queries, which are passed via Sqlline to Optiq, a library Drill uses for SQL parsing and planning according to a collection of planning rules. These rules come into play as the system builds a logical plan for the query. The logical plan describes the abstract, language-agnostic dataflow of the query. At this stage the logical plan works with primitive operators and does not focus on optimization.
Figure: a highly simplified view of the lifetime of a Drill query
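As a rough illustration of what "an abstract dataflow of primitive operators" can mean, the following Python sketch describes a query as a list of simple steps and runs it over in-memory rows. The operator names (`scan`, `filter`, `count_by`), the plan layout and the sample table are all assumptions made for this example; they are not Drill's actual logical plan format.

```python
from collections import Counter

# Hypothetical logical plan for:
#   SELECT city, COUNT(*) FROM visits WHERE clicks > 2 GROUP BY city
logical_plan = [
    {"op": "scan", "source": "visits"},
    {"op": "filter", "column": "clicks", "greater_than": 2},
    {"op": "count_by", "column": "city"},
]


def run(plan, tables):
    """Apply each operator in order; no optimization is attempted."""
    rows = None
    for step in plan:
        if step["op"] == "scan":
            rows = list(tables[step["source"]])
        elif step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] > step["greater_than"]]
        elif step["op"] == "count_by":
            counts = Counter(r[step["column"]] for r in rows)
            rows = [
                {step["column"]: key, "count": n}
                for key, n in sorted(counts.items())
            ]
        else:
            raise ValueError(f"unknown operator: {step['op']}")
    return rows


# Hypothetical sample table.
tables = {
    "visits": [
        {"city": "San Jose", "clicks": 5},
        {"city": "Seattle", "clicks": 1},
        {"city": "San Jose", "clicks": 3},
        {"city": "Seattle", "clicks": 4},
    ]
}
result = run(logical_plan, tables)
```

Because the plan is plain data rather than SQL text, any front end that can emit the same operator list could drive the same execution, which is the sense in which such a plan is language-agnostic.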
The next step is for the logical plan to be passed to and through the Foreman. A Foreman in Drill is the DrillBit that initially handles the query, effectively forming the root node of the multi-level execution tree. Any DrillBit potentially could serve as Foreman, but once the process is in motion, the Foreman will direct processing to appropriate additional DrillBits on other nodes, to maximize locality. A number of things happen at this stage, as the Foreman turns the logical plan into a physical plan for execution.
This is of course a very simplified summary of the detailed sequence described during the talk.
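The locality idea behind the Foreman's role can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function `assign_fragments`, the node and block names, and the least-loaded tie-breaking rule are all hypothetical, and the talk summary above does not describe how Drill actually makes these assignments.

```python
def assign_fragments(block_locations, drillbits, foreman):
    """Assign each data block to a DrillBit, preferring a node that holds it."""
    assignments = {}
    load = {bit: 0 for bit in drillbits}
    for block, nodes in sorted(block_locations.items()):
        # Prefer DrillBits running on a node that stores this block.
        local = [bit for bit in drillbits if bit in nodes]
        # Fall back to any DrillBit when no local one exists.
        candidates = local or drillbits
        # Among the candidates, pick the one with the least work so far.
        chosen = min(candidates, key=lambda bit: load[bit])
        assignments[block] = chosen
        load[chosen] += 1
    # The Foreman is the root of the execution tree; workers are the leaves.
    return {"root": foreman, "leaves": assignments}


# Hypothetical three-node cluster; "block-d" lives on a node with no DrillBit.
drillbits = ["node1", "node2", "node3"]
block_locations = {
    "block-a": {"node1", "node2"},
    "block-b": {"node2"},
    "block-c": {"node3"},
    "block-d": {"node4"},
}
plan = assign_fragments(block_locations, drillbits, foreman="node1")
```

In this sketch three of the four blocks are read by a DrillBit on the node that stores them, and only `block-d` has to be read remotely.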
Apache Drill Live Demo: Drill performs on distributed nodes
Steven Phillips closed the evening with an in-depth technical discussion of the current state of Drill milestone 1, followed by a live demo. His presentation included a particular focus on the physical operators now in place.
Steven explained that the Drill logical plan is designed to be as easy as possible for language implementers to use. Because the design aims at a high degree of flexibility, Drill does not constrain queries to a SQL-specific paradigm; it also supports complex data type operators such as collapse and expand.
In addition to his detailed discussion of the current features of the alpha release, Steven included a live demo of a query being processed by Drill. For simplicity in the presentation, Steven ran his query on a single machine, but one of the advances in the first milestone version is that distributed mode is possible. The ability to run on a distributed system is a large step forward for the project since this summer, when participants ran Drill queries on single machines during a Drill workshop at OSCON. At this stage of development, distributed mode is still somewhat cumbersome, requiring manual submission of a physical plan. To make this easier, Drill contributor Michael Hausenblas has put together a detailed description of how to do it: https://github.com/mhausenblas/apache-drill-sandbox/tree/master/M1
Code for the first milestone release of Drill can be found at the official project website. A link to the WebEx recording made available by Cisco for this meet-up is found below, along with a link to Tim Chen’s blog on his talk.
Apache Drill Community
One of the strengths of an open source project developed under the umbrella of the Apache Software Foundation is that the community grows as the code is developed. The resulting project reflects a collective effort both from developers and early users, who can provide valuable feedback to guide further design. Apache Drill is fortunate to have a strong and growing community as it passes its first milestone release.
One of the challenges for an Apache project, however, is how to keep diverse members of the project connected, especially when they are separated by geography and often by time zone. A live, real-time meet-up of Drill community members helps to build communication and connections and gives the project a boost. Thanks to all who made this meet-up possible.
Apache Drill is an open source project that welcomes your participation. You can find out more on the project website, by joining the Bay Area meet-up, or by following the project on Twitter.
Apache Drill Resources
Follow on Twitter: @ApacheDrill
Tim Chen blog on “Lifetime of a Query in Apache Drill Alpha”: http://bit.ly/1erl77n
Bay Area Apache Drill User Group: http://bit.ly/17ArvnP
Apache Drill official project website, which includes access to code for the first milestone release: http://bit.ly/YDkYEl