Top 7 Reasons Why You Should Choose Apache Zeppelin
Sooner or later, everyone turns to analytics for data. In large multiplayer games (and even single-player ones) you can hardly get anywhere without it. How many users prefer the new mode, where the weak points in monetization are, where game designers should look to make players more effective, and a million other things: everything gets counted. And all of this affects the decisions the developers make.
But companies introduce analytics in different ways: some buy third-party solutions (simple but inflexible), some build their own (slow and expensive), and some just have the programmers calculate a few basic metrics and leave it at that.
So let's talk about a tool that will be useful to everyone. Those who are just starting to build analytics will be able to put together a system from scratch with minimal effort, and companies with ready-made solutions will be able to upgrade their approach.
The tool is Apache Zeppelin, a multifunctional interactive shell that lets you run queries against various data sources, then process and visualize the results.
Its closest analog is Jupyter Notebook, but Zeppelin is geared more toward working with databases. It uses the concept of “interpreters”: plug-ins that provide a backend for a language and/or a database.
Like Jupyter, Zeppelin presents itself to the user as a collection of notebooks made up of paragraphs in which queries are written and executed. With the built-in visualizers, a notebook with a set of queries can easily be turned into a full-fledged data dashboard.
We deliberately will not cover installation and configuration: that is described in the documentation on the project site, and tutorials for different databases are easy to find online. The purpose of this article is to describe the user's side of things: interesting applications of the tool (including some non-obvious ones) and the advantages analysts can get from it, regardless of the solution they already use.
1. Omnivorous Zeppelin
Combining different data sources within a single dashboard is one of Zeppelin's key advantages. The standard distribution includes an impressive set of interpreters (for NoSQL and relational databases).
In practice, this gives the following:
- Most companies with databases and analytics systems already in operation can use it “out of the box” (as far as that phrase applies to an open source product, heh). Enthusiasts with more exotic databases can write an interpreter themselves; there is an article about this on the product website.
- Small companies can, if they wish, build their own analytics system from nothing but a database and Zeppelin as the interface.
- Talking to colleagues shows that data often comes from different sources and is stored in different databases, and some teams also use third-party analytics services. Analysts then face the task of making this menagerie “get along”. Zeppelin lets you choose an interpreter for each paragraph within one notebook, so you can show query results from different sources in one place.
2. Zeppelin + Python/R
Zeppelin is not only a web interface for various databases; it can also act as an interactive shell for running scripts. It includes interpreters for R and Python, so it can serve as an alternative to the familiar RStudio and Jupyter. Yes, it offers fewer features than specialized IDEs (for example, there is no autocompletion), but this is offset by the benefits discussed below.
Combined with Python, Zeppelin becomes many times more powerful: you can pull data from third-party service APIs (see the previous section), process data beyond what ordinary database queries allow, and automate these processes. Zeppelin supports refreshing dashboards on a cron schedule without extra effort (again, a quick look at how colleagues handle this shows that this seemingly trivial task sometimes gets solved in very convoluted ways). And the cherry on top: it has a built-in version control system, primitive but sufficient for most analyst tasks.
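To make the "database plus third-party API" idea concrete, here is a minimal sketch of what a Python paragraph might do. Everything in it is an assumption made for illustration: the `payments` table, the shape of the API payload (a day-to-installs mapping), and the helper names `fetch_json` and `revenue_per_install`. The demo uses an in-memory SQLite database and a canned payload instead of a live call.

```python
import json
import sqlite3
from urllib.request import urlopen


def fetch_json(url):
    """Download and decode a JSON document (defined for completeness, not called in the demo)."""
    with urlopen(url) as resp:
        return json.load(resp)


def revenue_per_install(db, installs_by_day):
    """Join daily revenue from the database with install counts from an API payload."""
    rows = db.execute(
        "SELECT day, SUM(amount) FROM payments GROUP BY day ORDER BY day"
    ).fetchall()
    result = []
    for day, revenue in rows:
        installs = installs_by_day.get(day)
        if installs:  # skip days with no (or zero) installs in the payload
            result.append((day, round(revenue / installs, 2)))
    return result


# Made-up demo data: hypothetical table and payload, not real figures.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (day TEXT, amount REAL)")
db.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 8.0)],
)
payload = json.loads('{"2024-01-01": 30, "2024-01-02": 20}')

report = revenue_per_install(db, payload)
print(report)
```

In a real notebook the payload would come from `fetch_json` pointed at your analytics service, and the paragraph could be re-run on a schedule so the dashboard stays fresh.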
At our company we actively use Python alongside Appleton (our internal analytics system) for complex data processing. So the idea of trying Zeppelin came up precisely in connection with our scripts: we saw the potential to simplify a number of routine tasks, such as visualizing results.
3. Visualization of everything in the world in one click
Zeppelin can render a paragraph's output with several basic visualizers that work like pivot charts: in the interface you select the fields to build the axes from and how the output values are aggregated. The resulting charts are clickable and make it easy to view the data in different slices.
This seemingly modest functionality covers up to 95% of analysts' visualization tasks. You can stop endlessly exporting data to Excel just to plot it, and even forget such scary words as matplotlib, bokeh, and ggplot2: script results also turn into charts in a couple of clicks.
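Script results become chartable once a paragraph prints them as a table. The sketch below assumes Zeppelin's `%table` display convention (output that starts with `%table`, cells separated by tabs, rows separated by newlines); the helper name `to_zeppelin_table` and the sample rows are made up for illustration.

```python
def to_zeppelin_table(header, rows):
    """Format rows as %table output: tab-separated cells, one row per line."""
    lines = ["\t".join(str(cell) for cell in header)]
    lines += ["\t".join(str(cell) for cell in row) for row in rows]
    return "%table " + "\n".join(lines)


# Hypothetical sample data, not real metrics.
table_text = to_zeppelin_table(
    ["mode", "players"],
    [("classic", 120), ("new", 340)],
)
print(table_text)
```

Printed from a Python paragraph, output like this should show up as a table with the usual chart-type buttons above it. If pandas is available, the Python interpreter's `z.show(df)` helper is the shorter route to the same result.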
However, for more complex visualizations you can recall those library names again: Zeppelin has built-in integration with the most popular graphics libraries for Python and R.
4. Collaboration and configuration of interfaces
Zeppelin can run locally and serve simply as an analytics tool, but if you deploy it on a server you can turn it into a corporate analytics service, with LDAP authentication and access settings if desired. Depending on your analytics needs, it can act as a set of dashboards for project metrics, a shared store of scripts and data exports, or, for example, a space for analysts to work together. A nice bonus: there is no need to exchange files or start a new document in Confluence; you can just send a link to the dashboard.
Here, flexible interface configuration and the ability to generate simple forms for entering values play an important role. Analysts, of course, feel comfortable looking at SQL queries and R code, but unprepared users can be stumped by them. So in a Zeppelin dashboard you can hide the code (which is a problem in Jupyter, for example), add fields for entering dates and other changing parameters, and hand the customer a neat and understandable form.
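As a rough sketch of such a form, the Python interpreter exposes a `z` object whose `z.input(name, default)` call, as far as its documentation describes it, renders a text field and returns the entered value. The fallback class `LocalForms`, the `payments` table, and the field names below are assumptions added so the example also runs outside Zeppelin.

```python
class LocalForms:
    """Stand-in for Zeppelin's `z` outside a notebook: returns each default."""

    def input(self, name, default=""):
        return default


try:
    forms = z  # injected by Zeppelin's Python interpreter inside a notebook
except NameError:
    forms = LocalForms()  # hypothetical fallback so the sketch runs anywhere


def report_query(forms):
    """Read the date range from form fields and build a parameterized query."""
    date_from = forms.input("date_from", "2024-01-01")
    date_to = forms.input("date_to", "2024-01-31")
    sql = (
        "SELECT day, SUM(amount) FROM payments "
        "WHERE day BETWEEN ? AND ? GROUP BY day"
    )
    return sql, (date_from, date_to)


sql, params = report_query(forms)
print(params)
```

Inside a notebook, the two `input` calls would appear as text boxes above the paragraph; with the code editor hidden, the user sees only the fields and the result. Passing the values as query parameters rather than pasting them into the SQL string keeps stray input from breaking the query.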
At our company many processes are tied to analytics, so different departments periodically need some specific data export, for example to recalculate the balance using fresh data. We wrote scripts for such things long ago, but someone still has to run them. Have you ever tried to teach Jupyter to 20 game designers? In the end we solved this elegantly by moving the scripts to Zeppelin, where a game designer, for example, can get the data they need by pressing a single button.
What matters here is that all interface preparation and tuning is done by the analysts themselves, without involving programmers (or, God forbid, UX designers).
5. Advantages of parallel processes
Zeppelin runs in several processes, which brings an interesting bonus: it lets you run a separate Python instance for each notebook and for each user. So, without any tricky setup, you can run several heavy processing scripts in parallel just by launching them in different notebooks, and keep working without waiting for them to finish. This works both with a local copy and with a server deployment, and in general you can offload some calculations from local machines by running them on the server.
6. Embedding paragraphs into sites
If you deploy Zeppelin on a server, you can get a link to any of your paragraphs (with query results or a chart) and publish it as an iframe on a site (this is very simple, and there is a tutorial on the project site). Analysts rarely need to publish results on external resources, but this can be very convenient for adding visualizations to internal services (such as Confluence). This way you can create reports with interactive forms and charts right in the text.
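A small helper can generate the embed snippet for a wiki page. This is a sketch under an assumption: the URL pattern (`#/notebook/<note id>/paragraph/<paragraph id>?asIframe`) mirrors what the paragraph's "Link this paragraph" menu item produces in the versions I have seen, so copy the real link from your own instance. The host, IDs, and function name are hypothetical.

```python
from html import escape


def paragraph_iframe(base_url, note_id, paragraph_id, width=600, height=400):
    """Build an <iframe> tag that points at a single published paragraph."""
    src = (
        f"{base_url.rstrip('/')}/#/notebook/{note_id}"
        f"/paragraph/{paragraph_id}?asIframe"
    )
    return (
        f'<iframe src="{escape(src, quote=True)}" '
        f'width="{width}" height="{height}"></iframe>'
    )


# Hypothetical host and IDs, for illustration only.
snippet = paragraph_iframe("http://zeppelin.example.com:8080/", "NOTE1", "PARA1")
print(snippet)
```

The resulting tag can be pasted into any page that allows raw HTML, provided the viewer's browser can reach the Zeppelin server and has access to the note.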
7. Self-describing reports
Markdown support lets you add paragraphs with formatted text to dashboards alongside charts and tables. As a result, you can build visual reports with descriptions, where the user immediately sees the data on a problem, views it in charts, and reads the analysts' interpretation of the results. Unlike Jupyter, which also supports Markdown, Zeppelin makes it much faster to create interactive forms and visualize results, and the outcome is neater and more accessible to the end user, which matters.
So this is a quick and visual alternative to the usual way analysts do research. Typically the work goes like this: analysts get the task of studying some aspect of the game. They prepare data, test hypotheses, visualize the confirming results, say, with charts, and write a report (for example, in Confluence). This is a sound but rather painstaking process. With Zeppelin, you can quickly sketch out a notebook with these data exports and scripts, immediately illustrate the results with charts, and describe your findings in the paragraphs that follow.
Of course, Zeppelin has its faults: it is not always stable (it is open source, after all), the web interface eats a lot of RAM, and some may miss the features of a full-fledged IDE. But there are already a number of interesting applications where it can be useful, so it definitely deserves analysts' attention (and, of course, the larger the community, the better it can become in the future).
For small companies it can become the main tool, since it lets you build a complete analytics system on top of a database. For larger companies with an already developed analytics toolkit, it is a useful addition that will not replace the main system but will offer several useful advantages.