Apache Drill User

Keeping track of Apache Drill. From geeks, for geeks.


Chime In on Drill

How would you use Drill?

What questions or comments do you have about the design of Drill? 

What are your thoughts or suggestions for the Drill community?

The Bay Area Apache Drill User Group is going to meet in San Jose, California next Monday, 24 February, at 6pm, and wherever you may live, we want to hear from you.

Please tweet your ideas or comments using the hashtag #drilltalk by Monday evening Pacific time (you can follow Drill on Twitter as @ApacheDrill).  Or add a comment or question here.

We’ll select several comments and questions and feed them into the discussion on Monday night. Look for a video of the evening after the event to see if your input is included. The link to the meet-up site is

http://bit.ly/1gB2E6p

And to get you thinking about how you’d use Drill, I recently asked Michael Hausenblas (MapR Chief Data Engineer and Drill contributor) for his thoughts looking forward to what Drill will do:

“Apache Drill allows business analysts to query heterogeneous data sources at scale, in a time-efficient and familiar way.

  • Heterogeneous data sources … no matter if the data resides in existing relational databases (such as Oracle DB, MySQL, etc.), in a NoSQL database such as MongoDB, or is available as Apache Hadoop-native data, that is, in HDFS, MapR-FS or HBase, Apache Drill queries the data in situ. By querying the data where it sits, there is no ETL process required to move the data into a central location, as is usual in a data warehouse setting.
  • At scale … Drill works well for small-sized datasets (a few gigabytes) but also scales out to the terabyte and petabyte range, depending only on the number of machines available in a cluster (hence dictating the degree of parallelism at which a query can be executed).
  • Time-efficient … this means two things in the context of Drill:
  1. Because there is no ETL step involved, the data can be queried directly where it is located.
  2. Due to the way the query is executed (based on Google Dremel’s multi-level execution tree, in-memory, streaming operators, etc.), with Drill the response times are typically in the low seconds. This rapid response time is possible even on large datasets, which means it is well suited for low-latency application scenarios. Imagine someone sitting in front of a BI tool clicking on a button, expecting an answer immediately rather than the minutes or hours generally expected from MapReduce-based systems.
  • Familiar way … on the one hand this means that standard query interfaces such as full SQL are supported with Drill (no matter if the data resides in a strongly-typed data source such as an RDBMS or exists as JSON files in, say, HDFS), but also that ad-hoc queries are possible.”

 With those thoughts about Drill in mind, what are your ideas about how you’d use it?

Tweet your comments/questions with hashtag #drilltalk and @ApacheDrill to join the discussion on Monday. 

 


Congratulations New Apache Drill Committers & Mentor

by Ellen Friedman on Twitter as @Ellen_Friedman

As we welcome the new year, Apache Drill has two new committers: Timothy Chen and Julian Hyde.  Their hard work on behalf of Drill has earned the notice and gratitude of the project and community.

Tim Chen is an engineer at Microsoft in Seattle who recently spoke at the Bay Area Apache Drill User Group meet-up about his work related to the lifetime of a Drill query end-to-end. Tim’s presentation was part of the celebration of the first milestone release for Drill. Please see the earlier post here at the Drill User blog for details. You can find out more at Tim’s blog or follow him on Twitter @tnachen

Julian Hyde was an engineer at Pentaho who recently moved to Hortonworks. For Drill, Julian has worked on SQL support. Julian is also the lead developer of the Mondrian OLAP engine and the Optiq data platform, and is one of the authors of the Manning book Mondrian in Action: http://www.manning.com/back/ Julian will be one of the speakers at the next Bay Area Apache Drill User Group meet-up, planned for 24 Feb 2014. Stay tuned for details. You can follow Julian on Twitter @julianhyde

Drill is also fortunate to have the help of a new project mentor, Sebastian Schelter. Sebastian is a PhD student and research associate at TU Berlin, with expertise in machine learning, especially recommendation. Sebastian is active in the Apache Software Foundation as a PMC member and committer for the Apache Mahout project. Sebastian is on Twitter as @sscdotopen

And a Happy New Year for 2014 to you all!

Follow the Apache Drill community on Twitter @ApacheDrill

Check out the Apache Drill project website at http://bit.ly/YDkYEl


Apache Drill Query in Action: Drill User Group Event

By Ellen Friedman, Twitter ID: @Ellen_Friedman

Co-organizer of Bay Area Apache Drill User Group

The Apache Drill project is building an innovative tool for ad hoc, interactive queries on a time scale of 100 ms to 20 minutes on large, distributed data systems. Participants in the open source Apache Drill community recently came together to take a look at how Drill works now and what the next steps in the project will be.

The event was the November 4th meet-up of the Bay Area Apache Drill Users, the first entirely Drill-based meet-up group. This meeting was hosted at Cisco in San Jose, with MapR Technologies as co-host. A large group gathered on-site at Cisco’s conference facility, and almost as many participants joined remotely via WebEx. The event marked the recent first official release of the Apache Drill project.

Speakers included two members of the MapR Drill team in San Jose, Drill lead engineer Jacques Nadeau and developer Steven Phillips, along with Timothy Chen, a Drill contributor who lives in Seattle, where he is an engineer at Microsoft.

Tim Chen came down to San Jose earlier in the day before the meet-up to get together directly with other Drill developers and enjoy the unusual situation of being able to discuss the work in person.  

[Image: Drill meet-up speakers Steven, Tim & Jacques]

Apache Drill: A Look Forward

Jacques Nadeau kicked off the evening meet-up with a road map toward maturity at version 1.0 for the Drill project. He pointed out that Milestone 1 was achieved in late September when the Apache Software Foundation approved the first official code release. Progress toward Milestone 2 is actively under way now.

Does a project need to reach version 1.0 to be usable? The answer varies with the project, but generally an Apache project releases usable but work-in-progress versions before full maturity. An example is the Apache Mahout project, which is currently at version 0.8 and yet has been used successfully in production settings for over a year.  Drill isn’t at that stage now – functionality is built in stages – but it’s beginning to be ready for early users to try it out and give feedback.

  • Milestone 1: Initial functionality (PASSED): JDBC, distributed execution, Parquet and JSON readers
  • Milestone 2: Architectural validation (IN PROGRESS): performance, total sort, node buffering, diagnostic tools and instrumentation, Parquet writer
  • Milestone 3: Query Complete: TPC-H, Hive UDF, Hive read SerDe and HBase
  • Milestone 4: User Feature Complete: pushdown, optimization, complex and vectorized operators, Hive metastore, additional file formats
  • Milestone 5: Production Quality: ODBC, additional optimizer rules, resource scheduling, stability

Apache Drill is an ambitious project designed to be more flexible, more wide ranging and extensible than many of the other tools being built to address similar issues.  That’s a challenge, but one that is being met with some very promising initial work.

Lifetime of a Query in Apache Drill

The second speaker, Timothy Chen, presented the story of the lifetime of a Drill query, starting with SQL input and following the events to the distributed DrillBits on different nodes.

To follow what happens to a query, it’s helpful to understand that Drill, like Google’s Dremel project, relies on multi-level execution trees and leverages column-oriented storage. Whether schemaless or not, abstractly you can think of each data tree as a JSON object. Each tree is composed of a key and a root node. (See the Dremel paper: http://research.google.com/pubs/pub36632.html)
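To make the column-oriented idea a little more concrete, here is a minimal Python sketch. It is not Drill's or Dremel's actual encoding: the function names `flatten` and `to_columns` and the sample records are made up for illustration, and the repetition and definition levels that the Dremel paper uses for repeated and optional fields are deliberately left out. It simply turns nested JSON-like records into one list of values per field path.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def to_columns(records):
    """Store records column by column: one list of values per field path.

    Fields missing from a record are filled with None, so every column
    ends up with exactly one entry per record. (Hypothetical helper, a
    simplification of real columnar formats.)
    """
    columns = defaultdict(list)
    for row_index, record in enumerate(records):
        for path, value in flatten(record):
            column = columns[path]
            # Pad for earlier records that did not have this field.
            column.extend([None] * (row_index - len(column)))
            column.append(value)
    # Pad for trailing records that did not have a field.
    for column in columns.values():
        column.extend([None] * (len(records) - len(column)))
    return dict(columns)


# Illustrative records only; not taken from the talk.
records = [
    {"name": "alice", "address": {"city": "Berlin"}},
    {"name": "bob"},
    {"name": "carol", "address": {"city": "Seattle"}, "age": 41},
]
columns = to_columns(records)
```

With data laid out this way, a query that only needs `address.city` can read that single list and skip the other fields entirely, which is the main appeal of columnar storage for analytic queries.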

Another important concept is the DrillBit: as Tim explained, a DrillBit is simply a worker process running on any particular node in the cluster. To tell the story of what happens to a user query as it is processed by Drill, Tim used an example system that included DrillBits on three nodes plus the coordinating services of ZooKeeper and Hazelcast. 

Drill can accept full ANSI SQL:2003 queries, which in turn are passed via Sqlline to Optiq, a library Drill uses for SQL parsing and planning according to a collection of planning rules. These rules come into play as the system builds a logical plan for the query. The logical plan describes the abstract, language-agnostic dataflow of the query. It works with primitive operators and does not focus on optimization at this stage.
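As a rough illustration of what "an abstract dataflow built from primitive operators" can mean, here is a toy Python sketch. It is not Drill's logical plan format or Optiq's API: the `Scan`, `Filter` and `Project` classes and the `execute` function are hypothetical names invented for this example. The plan is a small tree of operators, and rows stream through it one at a time.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List

Row = dict


@dataclass
class Scan:
    """Leaf operator: reads rows from an in-memory source."""
    source: List[Row]


@dataclass
class Filter:
    """Keeps only the rows for which the predicate is true."""
    child: object
    predicate: Callable[[Row], bool]


@dataclass
class Project:
    """Keeps only the named fields of each row."""
    child: object
    fields: List[str]


def execute(node) -> Iterator[Row]:
    """Interpret a plan tree, streaming rows from the leaves to the root."""
    if isinstance(node, Scan):
        yield from node.source
    elif isinstance(node, Filter):
        for row in execute(node.child):
            if node.predicate(row):
                yield row
    elif isinstance(node, Project):
        for row in execute(node.child):
            yield {field: row[field] for field in node.fields}
    else:
        raise TypeError(f"unknown operator: {node!r}")


# Illustrative data; the plan corresponds roughly to
#   SELECT name FROM people WHERE age > 30
people = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 27},
    {"name": "carol", "age": 41},
]
plan = Project(Filter(Scan(people), lambda row: row["age"] > 30), ["name"])
result = list(execute(plan))
# result == [{"name": "alice"}, {"name": "carol"}]
```

Nothing in the plan says how or where the work is done, which is the point: the same tree could be produced from SQL or from another query language, and the decisions about execution are left to a later stage.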

[Figure: a highly simplified view of the lifetime of a Drill query]

The next step is for the logical plan to be passed to and through the Foreman. A Foreman in Drill is the DrillBit that initially handles the query, effectively forming the root node of the multi-level execution tree. Any DrillBit potentially could serve as Foreman, but once the process is in motion, the Foreman will direct processing to appropriate additional DrillBits on other nodes, to maximize locality. A number of things happen at this stage, as the Foreman turns the logical plan into a physical plan for execution.
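The locality idea can be sketched in a few lines of Python. This is only a toy, assuming a made-up greedy heuristic and a hypothetical function name `assign_fragments`; it is not how Drill's Foreman actually schedules work. Each block of data is handed to a DrillBit that already holds it when possible, and to the least-loaded DrillBit otherwise.

```python
def assign_fragments(block_locations, drillbits):
    """Map each data block to a DrillBit, preferring nodes that hold the block.

    block_locations: dict of block id -> list of node names storing that block
    drillbits: list of node names that run a DrillBit
    (Hypothetical greedy heuristic for illustration only.)
    """
    load = {node: 0 for node in drillbits}
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        # Nodes that both store the block and run a DrillBit.
        local = [host for host in hosts if host in load]
        # Fall back to any DrillBit if no local one exists.
        candidates = local or list(drillbits)
        # Pick the least-loaded candidate; break ties by name.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Illustrative three-node cluster, as in the example system from the talk;
# the block ids and placements are made up.
drillbits = ["node1", "node2", "node3"]
block_locations = {
    "b1": ["node1", "node2"],
    "b2": ["node1"],
    "b3": ["node3"],
    "b4": ["node9"],  # stored on a node without a DrillBit
}
assignment = assign_fragments(block_locations, drillbits)
# assignment == {"b1": "node1", "b2": "node1", "b3": "node3", "b4": "node2"}
```

A real planner has far more to weigh (parallelism, operator costs, data exchange between fragments), but the sketch shows why it matters that any DrillBit can do the work: the Foreman is free to send each piece to wherever the data already sits.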

This is of course a very simplified summary of the detailed sequence described during the talk.

Apache Drill Live Demo: Drill performs on distributed nodes

Steven Phillips closed the evening with an in-depth technical discussion of the current state of Drill milestone 1, followed by a live demo. His presentation included a particular focus on the physical operators now in place.


Steven explained that the Drill logical plan is designed to be as easy as possible for language implementers to use. With the design aimed at a high degree of flexibility, Drill does not constrain queries to a SQL-specific paradigm. Instead, it also supports complex data type operators such as collapse and expand.

In addition to his detailed discussion of the current features of the alpha release, Steven included a live demo of a query being processed by Drill. For simplicity in the presentation, Steven ran his query on a single machine, but one of the advances in the first milestone version is that distributed mode is possible.  This ability for Drill to run on a distributed system is a large step in the project since this summer when participants tried Drill queries on single machines during a Drill workshop at OSCON.  At this stage of development, distributed mode is still somewhat cumbersome, requiring manual submission of a physical plan. To make this easier, Drill contributor Michael Hausenblas has put together a detailed description of how to do it: https://github.com/mhausenblas/apache-drill-sandbox/tree/master/M1

Code for the first milestone release of Drill can be found at the official project website. A link to the WebEx recording made available by Cisco for this meet-up is found below, along with a link to Tim Chen’s blog post on his talk.

Apache Drill Community

One of the strengths of an open source project developed under the umbrella of the Apache Software Foundation is that the community grows as the code is developed. The resulting project reflects a collective effort from both developers and early users, who can provide valuable feedback to guide further design. Apache Drill is fortunate to have a strong and growing community as it passes its first milestone release.

One of the challenges for an Apache project, however, is how to keep diverse members of the project connected, especially when they are separated by geography and often by time zone. A live, real-time meet-up of Drill community members helps to build communication and connections and gives the project a boost. Thanks to all who made this meet-up possible.

Apache Drill is an open source project that welcomes your participation.  You can find out more on the project website, by joining the Bay Area meet-up,  or by following the project on Twitter.

Apache Drill Resources

Follow on Twitter: @ApacheDrill

WebEx recording of Nov 4, 2013 meet-up presentation, runs 1 hour 41 min: https://cisco.webex.com/ciscosales/lsr.php?AT=pb&SP=MC&rID=72775662&rKey=031c783655239fd8

Tim Chen blog on “Lifetime of a Query in Apache Drill Alpha”:  http://bit.ly/1erl77n

Bay Area Apache Drill User Group:  http://bit.ly/17ArvnP

Apache Drill official project web site includes access to code for 1st milestone release: http://bit.ly/YDkYEl


Post-M1 status

Since we released the M1 alpha version of Apache Drill, a lot has happened:

UPCOMING: The Bay Area Apache Drill User Group will have a meet-up this coming Monday, 4 Nov 2013, on Apache Drill: First Milestone Release.

Michael giving an Apache Drill talk at JAX London 2013


End of Summer Update for Apache Drill

It’s been a very active season for the Drill community as the project prepares for a first milestone release. And with the Drill demo on the website and participation in a hands-on workshop, the “user” part of this Apache Drill User site is beginning to live up to its name.

A sample of events in June – August include:

  • Article by Michael Hausenblas @mhausenblas and Jacques Nadeau @intjesus “Introduction to Apache Drill: Interactive Ad-Hoc Query for Large-scale Datasets”. Big Data. June 2013, 1(2): 100-104. doi:10.1089/big.2013.0011. http://bit.ly/15101Y7
  • Drill talks by @mhausenblas at Hive London and in Paris in June
  • Apache Drill project website redesigned to have a new look: http://incubator.apache.org/drill/
  • Interactive “How to Run Drill” demo added to the Apache Drill wiki: https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
  • Apache Drill hands-on workshop by Ted Dunning @ted_dunning and Jacques Nadeau @intjesus at OSCON in Portland, Oregon USA in July for ~40 participants. A blog post by Ellen Friedman @Ellen_Friedman reports on that Drill-via-Amazon-Cloud event and includes links to slides: http://bit.ly/18aS3Lk
  • Drill blog article by S. J. Vaughan-Nichols “Drilling into Big Data with Apache Drill” in Aug: http://bit.ly/1309MXA
  • Apache Drill project featured by panelist @tshiran in Aug for the “Hadoop + SQL” Hive Data Think Tank event in California Bay Area. A blog posting as a prelude to the event can be found here: http://bit.ly/1cvxn5D
  • New developers and non-code contributors are participating in the community
  • Discussion is getting started on the Apache Drill user mailing list: http://bit.ly/19modUt
  • The @ApacheDrill Twitter following grew significantly, to 437 followers.

And September is starting with more activity, including upcoming meetings of the Bay Area Apache Drill User Group: a talk by MapR engineer Steven Phillips in September and a still-to-be-scheduled talk by Tim Chen (Microsoft), most likely in late October or November. Stay tuned!

By Ellen Friedman @Ellen_Friedman and ellenf@apache.org.


May 2013 updates & heads-up

This month has been exciting so far and promises even more. Here are some highlights and announcements:

  • Drill talks
    • We had a very nice Hadoop get-together in Berlin on 8 May. Lots of questions and good (brain) food.
    • On 14 May I presented Drill at the London HUG and again, huge interest and great discussions. I think it has been recorded. Stay tuned.
    • Then, on 16 May I had a gig in Stuttgart where the slides are now also available.
    • Upcoming: this Friday, on 24 May I’ll give a Drill tutorial at the Cloud East in Cambridge, UK. I suppose there are still some tickets available.
  • Check out ROOT, a distributed query engine from CERN. These guys really rock.
  • Again, progress from Julian re operators.

And, as usual: don’t forget to join us at the weekly G+ hangouts at 9am PST / 5pm GMT / 18:00 CET to discuss progress and issues.


NoSQL matters 2013 in Cologne, Germany—lots of good discussions and great people around for the Apache Drill training day, thank you everybody involved and hope to ‘see’ you on the mailing list, on Twitter or F2F next time!



Status update April 2013

There are many things happening in parallel at the moment. We’re making great progress in terms of APIs and storage engines.

Also, there are a number of events where Drill is discussed. For example, I recently gave a status talk at the HUG Munich. If you want to get active in the development, be it via code or test data and test queries, consider joining us on our weekly Google+ hangout at 9am PT / 5pm UTC.

Here are the current work-in-progress items:

To get your daily news flash, consider following us on @ApacheDrill and subscribe to one of the mailing lists!