Pig and Stratosphere Integration Alternatives
In this post I am going to present some alternative design choices concerning the actual implementation of the project, i.e. the integration of the Pig and Stratosphere systems.
The main goal is to have a working system, such that Pig Latin scripts can be executed on top of the Nephele execution engine. However, performance is an issue, and of course, we wouldn't like to end up with a system slower than the current implementation :S The very motivation of this project is to overcome the limitations of the existing system, by exploiting Stratosphere's features.
The architectures of the two systems are shown side by side in the next diagram:
The integration can be achieved in several ways and on different levels:
- Translate MapReduce programs into PACT programs
This is the naive and straightforward way of solving the given problem. PACT already supports Map and Reduce Input Contracts, which can be used to transform the Map-Reduce Plan one-to-one into a PACT Plan. The Logical and Physical Plans that are generated by Pig can be re-used without modification. Clearly, this solution wouldn't provide any gains compared to the existing implementation. In fact, it should be slower, since it adds one more layer to the system architecture. However, it is the simplest approach and it will be my starting point, in order to better understand the frameworks' internals =)
- Translate the Physical Plan into a PACT Plan
This is a more natural solution and corresponds to the approach that would have been taken if Pig had been designed with Nephele in mind as the execution engine, instead of Hadoop. It involves completely replacing the MapReduce Plan with a PACT Plan, which will be generated directly from the Physical Plan. This way, the additional Input Contracts, such as Match, Cross and CoGroup, could be used to compile common operations, like Joins. I hope and do expect this solution to be advantageous over the existing implementation. With this design, we should be able to exploit Stratosphere's advantages and reflect them as performance gains in certain classes of applications.
- Translate the MapReduce Plan (or even Physical Plan) into a Nephele Job
When any kind of abstraction is made, and this applies to high-level languages as well, there is always an overhead you have to pay in exchange for simplicity. The underlying system, whose details the user no longer needs to know, will be designed to make several decisions that often differ from those an experienced low-level programmer would make.
However, the abstraction only has value provided that the frustration imposed on the user by the slow-down in accomplishing their job is lower than the satisfaction they get from being able to accomplish it in a simpler way.
Hoping for a valuable abstraction!
Until next time, happy coding,
V.
The PACT programming model
PACT is Stratosphere's programming model. It consists of the so-called Parallelization Contracts which push the Map-Reduce idea one step further.
I was thinking of writing a post, explaining how PACT works, but the truth is that I wouldn't do any better than the already existing documentation at the Stratosphere project website.
So, I will only provide here some useful links:
- The Pact Programming Model: A high-level view of the programming model and in-detail presentation of the second-order functions available and the guarantees provided by the framework.
- Building a PACT Program: A guide to PACT programming, including everything you need to know before starting writing PACT programs.
- Example Jobs: Six example PACT programs of varying difficulty, starting from simple WordCount to more complex graph analysis algorithms.
- The PACT Compiler: A detailed overview of how the PACT Compiler is built and how it performs the transformation of PACT programs into Nephele DAGs.
If you are already familiar with MapReduce programming, you will also find this paper very helpful. It compares the two programming models and contains a series of examples of common data analysis tasks implemented in both models.
Keep on happy coding,
V.
The Nephele Execution Engine
Why choose Nephele over other engines?
One big advantage of Nephele is the high degree of parametrization it offers, which could lead to several optimizations. It is possible for the user to set the degree of data parallelism per task or explicitly specify the type of communication channels between nodes. More importantly, Nephele supports dynamic resource allocation. In contrast to MapReduce and Dryad, which are designed to work on static cluster environments, Nephele is capable of allocating resources from a Cloud environment depending on the workload.
Nephele defines a default strategy for setting up the execution of a job. However, there is a set of parameters that the user can tune in order to make execution more efficient. These parameters include the number of parallel subtasks, the number of subtasks per instance, how instances should be shared between tasks, the types of communication channels and the instance types that fulfill the hardware requirements of a specific job.
Nephele offers three types of communication channels that can be defined between tasks. A Network Channel establishes a TCP connection between two vertices and allows pipelined processing. This means that records emitted from one task can be consumed by the following task immediately, without being persistently stored. Tasks connected with this type of channel are allowed to reside in different instances. Network channels are the default type of communication channel chosen by Nephele if the user does not specify a type. Subtasks scheduled to run on the same instance can be connected by an In-Memory Channel. This is the most efficient type of communication and is performed using the instance’s main memory, also allowing data pipelining. The third type of communication is through File Channels. Tasks that are connected through this type of channel use the local file system to communicate. The output of the first task is written to an intermediate file, which then serves as the input of the second task.
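To keep the three channel types straight, here is a minimal sketch of the rules described above. It is plain Python and not Nephele's actual API: the names ChannelType, Subtask, choose_channel and is_pipelined are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChannelType(Enum):
    NETWORK = "network"      # TCP connection, pipelined, may span instances
    IN_MEMORY = "in-memory"  # main memory, pipelined, same instance only
    FILE = "file"            # intermediate file on the local file system


@dataclass(frozen=True)
class Subtask:
    name: str
    instance: str  # hypothetical id of the instance hosting this subtask


def choose_channel(producer: Subtask, consumer: Subtask,
                   requested: Optional[ChannelType] = None) -> ChannelType:
    """Return the channel type to use between two subtasks."""
    if requested is None:
        # Network channels are the default when the user specifies nothing.
        return ChannelType.NETWORK
    if requested is ChannelType.IN_MEMORY and producer.instance != consumer.instance:
        # In-memory channels only connect subtasks on the same instance.
        raise ValueError("in-memory channel requires both subtasks on one instance")
    return requested


def is_pipelined(channel: ChannelType) -> bool:
    """Network and in-memory channels pipeline records; file channels do not."""
    return channel is not ChannelType.FILE
```

For example, two subtasks on the same instance can ask for ChannelType.IN_MEMORY, while an unspecified channel between subtasks on different instances falls back to ChannelType.NETWORK.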
Pig's execution plans can also be represented as DAGs. The challenge now is to study how to convert Pig's plans into Nephele Job graphs. This would be an approach that would skip Stratosphere's programming model layer. It is not clear what implications such a decision could have performance-wise. On one hand, skipping one layer of execution could lead to performance gains. On the other hand, the PACT compiler is designed to perform several optimizations when translating PACT programs into Nephele DAGs, and these would be lost. It is my wish to implement and evaluate both alternatives. Let's hope I will have enough time for that!
Until next time, happy coding!
V.
First Dive into Pig Code
However, the engine itself also has a front end and a back end.
This PhysicalPlan is then passed to the MRCompiler. The MRCompiler lies under ExecutionEngine-> MapReduceLayer. This is the compiler that transforms the PhysicalPlan into a DAG of MapReduce operators. It uses a predecessor depth-first traversal of the PhysicalPlan to generate the compiled graph of operators. When compiling an operator, the compiler first tries to merge it into the existing MapReduce operators, in order to keep the number of generated jobs as small as possible. A new MapReduce operator is introduced only for blocking operators and splits. The two operators are then connected using a store-load combination. The output of the MRCompiler is an MROperPlan object. This corresponds to the Map-Reduce plan to be executed.
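To make the merging idea concrete, here is a heavily simplified sketch. It is not Pig's MRCompiler: it handles only a linear chain of operators (no DAG, no splits), operators are plain strings, the BLOCKING set is an illustrative assumption, and "STORE(tmp)" / "LOAD(tmp)" are placeholders for the store-load combination.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption for this sketch: operators that force a shuffle.
BLOCKING = {"GROUP", "COGROUP", "DISTINCT"}


@dataclass
class MRJob:
    map_ops: List[str] = field(default_factory=list)
    shuffle_op: Optional[str] = None  # the blocking operator, if any
    reduce_ops: List[str] = field(default_factory=list)


def compile_chain(operators: List[str]) -> List[MRJob]:
    """Greedily pack a linear chain of operators into as few jobs as possible."""
    jobs = [MRJob()]
    for op in operators:
        job = jobs[-1]
        if op in BLOCKING:
            if job.shuffle_op is not None:
                # The current job already has a reduce phase, so close it
                # with a store and start a new job that loads its output.
                job.reduce_ops.append("STORE(tmp)")
                job = MRJob(map_ops=["LOAD(tmp)"])
                jobs.append(job)
            job.shuffle_op = op
        elif job.shuffle_op is None:
            job.map_ops.append(op)      # still before the shuffle: map side
        else:
            job.reduce_ops.append(op)   # after the shuffle: reduce side
    return jobs
```

Under these assumptions, the chain LOAD, FOREACH, GROUP, FOREACH, DISTINCT, STORE ends up as two jobs: non-blocking operators are merged into the current job, and only the second blocking operator forces a new one.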
This plan is then optimized, for example by using the Combiner where possible or by combining jobs that scan the same input data.
The final set of MapReduce jobs is generated by the JobControlCompiler. This class lies under ExecutionEngine-> MapReduceLayer. It takes an MROperPlan and converts it into a JobControl object with the relevant dependency info maintained. The JobControl object is made up of Jobs, each of which has a JobConf. The conversion is done by the method compile(), which compiles all jobs that have no dependencies, removes them from the plan and returns a JobControl object. It must be called repeatedly with the same plan until the plan is exhausted.
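The "call compile() until the plan is exhausted" protocol can be sketched as follows. This is not the JobControlCompiler itself: the plan is modelled as a plain dictionary from job name to the set of jobs it depends on, and the function names are hypothetical. The sketch also drops dependencies as soon as a job is handed out, whereas a real launcher would wait for those jobs to finish first.

```python
from typing import Dict, List, Set


def compile_ready_jobs(plan: Dict[str, Set[str]]) -> List[str]:
    """Remove and return all jobs in the plan that have no dependencies."""
    ready = sorted(job for job, deps in plan.items() if not deps)
    for job in ready:
        del plan[job]
    # The remaining jobs no longer need to wait for the ones just handed out.
    for deps in plan.values():
        deps.difference_update(ready)
    return ready


def compile_all(plan: Dict[str, Set[str]]) -> List[List[str]]:
    """Call compile_ready_jobs on the same plan until it is exhausted."""
    waves = []
    while plan:
        ready = compile_ready_jobs(plan)
        if not ready:
            raise ValueError("plan has cyclic dependencies")
        waves.append(ready)
    return waves
```

Each call returns one "wave" of jobs that can run independently of each other; the caller keeps passing in the same, shrinking plan.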
The generated jobs are then submitted to Hadoop and monitored by the MapReduceLauncher.
In the back end, PigGenericMapReduce.Map, PigCombiner.Combine, and PigGenericMapReduce.Reduce each use the pipeline of physical operators constructed in the front end to load, process, and store data.
Pig 101
words = FOREACH wordInput GENERATE FLATTEN((TOKENIZE($0))) AS word;
grouped = GROUP words BY word;
result = FOREACH grouped GENERATE group AS key, COUNT(words) AS count;
STORE result INTO 'wordOutput';
Pig’s declarative nature makes it so much more intuitive to write data analysis programs :-)
You don’t have to think about keys and values and squeeze your head on how to fit your problem into a map and a reduce function!
Running this script locally on my machine, giving as input the text of the first section of this post, produces the following output:
I 1
a 2
In 1
be 2
is 1
of 1
on 2
to 2
Pig 3
and 1
but 1
how 1
not 1
the 4
…
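For contrast, here is roughly what the same word count looks like when the keys and values have to be spelled out by hand. It is a sketch in plain Python standing in for a Map-Reduce job, not Hadoop code: the function names are made up for illustration, and splitting on whitespace only approximates what TOKENIZE does.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))
```

The grouping step that Pig expresses in a single GROUP ... BY line is exactly the part you would otherwise have to encode through your choice of keys.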
System Overview
Here’s a simple diagram to show the system architecture:
Each node of the Logical Plan represents an operation of the script and the arrows connecting the nodes show how data flows from one step to the next. Each node in this plan is then translated into one or more nodes of the Physical Plan, depending on the complexity of the operation. In the end, nodes are grouped together to form Map and Reduce operations. In our example, FOREACH and Local Rearrange can be performed inside the Mapper, while the Package and the next FOREACH can be performed inside the Reducer. The Global Rearrange and LOAD/STORE operations are taken care of by the Hadoop framework automatically.
More generally, there are some rules to follow in order to convert a Physical Plan into a Map-Reduce Plan:
- Convert each (CO)GROUP into a Map-Reduce job
  - Map assigns keys based on the BY clause
  - Each FILTER and FOREACH between the LOAD and the (CO)GROUP is pushed into the map function
  - Commands between (CO)GROUP operations are pushed into the reduce function
  - Perform tagging in case of multiple input sets
- Each ORDER command is compiled into 2 Map-Reduce jobs
  - Job 1 samples the input to determine key distribution
  - Job 2 generates roughly equal-sized partitions and sorts
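The two-job ORDER strategy in the last rule can be sketched like this. It is an illustration of sample-based range partitioning in plain Python, not Pig's implementation: both function names, the sample size and the fixed random seed are assumptions, and the input is assumed to be a non-empty list.

```python
import random
from bisect import bisect_right
from typing import List


def sample_boundaries(keys: List[int], num_partitions: int,
                      sample_size: int, seed: int = 0) -> List[int]:
    """Job 1: sample the keys and pick cut points at even quantiles."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]


def partition_and_sort(keys: List[int], boundaries: List[int]) -> List[List[int]]:
    """Job 2: route each key to its range partition, then sort each partition."""
    partitions: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect_right(boundaries, key)].append(key)
    return [sorted(part) for part in partitions]
```

Because every key in one partition is smaller than every key in the next, concatenating the sorted partitions gives a globally sorted result, and the sampling step is what keeps the partitions roughly the same size.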
V.
References and links
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing.
- http://pig.apache.org/
- http://www.cloudera.com/videos/introduction-to-apache-pig
myThesisProject.init();
Christmas holidays are over and I’m back in town!
This will be the last semester of my MSc, during which I will be working on my thesis in collaboration with the Swedish Institute of Computer Science (SICS). I am very excited about the project and this is the first of a series of posts I intend to do, describing my progress and discoveries :-)
So what is this super-interesting project I’m going to work on?
Before getting to that, I will have to make a small introduction on two systems: Apache Pig and Stratosphere.
Pig is a platform for analyzing big data sets. It consists of a high-level declarative language, Pig Latin, and an execution engine that “translates” Pig scripts into Map-Reduce jobs.
Stratosphere is a data-processing framework, under research by TU Berlin. It provides a programming model for writing parallel data analysis applications and an execution engine, Nephele, able to execute dataflow graphs in parallel. You can think of it as an extension/generalization of Hadoop Map-Reduce and it also shares a lot of ideas with Dryad.
Although right now it is only possible to execute Pig scripts on top of Hadoop, Pig is designed to be modular and it should be straight-forward to deploy it on top of another execution engine. And this is exactly the initial idea of the project. Additionally, the current state of the project appears to have some limitations that make it about 1.5 times slower than native Map-Reduce at the moment. I believe that Stratosphere architecture has several features that could be exploited in order to improve performance.
I am currently in the phase of studying the Pig architecture and the existing Hadoop compiler implementation. (Oh the joy of endless Java code :p )
Soon, I will post here my first findings, so stay tuned!
Until then, sweet coding!
V.
Sorry, I can't accept your Greek ID!
Well well!
I've already been in Sweden for two months, and studying and living in Stockholm for a bit more than a month. A lot of interesting things have happened so far and I've always been in the middle of something that prevented me from posting here. But today I found myself in a situation I'd like to share.
It's the end of the month and I rushed this morning to pay my rent on time. I asked my neighbour where she pays hers and found out that I could pay at an exchange office near my place. So, I went there and when it was my turn I gave the invoice to the employee. I asked to pay with my credit card and she asked to see my ID *. So, I gave her my Greek ID and she started looking at it, making several weird faces while trying to understand what was written on it. Then she turned to me and said:
- I'm sorry, I cannot accept this. Don't you have a passport?
Well, I do, but I didn't have it with me at the moment. I tried to explain to her that this is an EU ID and that it is valid, but she wouldn't listen. What really struck me as odd though, was the fact that I was in a *bank* and I was trying to *pay* and the bank was *refusing my money*. I mean, it's a payment, who cares if it's me paying it or someone else? That's really the first time in my life I've come across such a situation. Anyway, I got quite pissed off and went to a tobacco place next door where I paid without any problem (apart from the fact that I had to pay in cash).
I don't blame the girl at the exchange office for not accepting my ID. In fact, I don't even know if she had the right to do so or not. In the beginning I got really angry but I quickly realised that I do understand her. The truth is that Greek IDs are really awful. Mostly written in Greek, with some fields also written in Latin characters. Easily confuses anyone. And to think that my ID is actually quite new, as I had to have it reissued just last year, after losing my wallet in Barcelona.
My previous ID had all the fields -- wait for it -- handwritten!
I've never had any serious problems with my Greek ID inside Europe before (UK airports aside). Usually people are just surprised by the looks of it but recognise it and accept it. They might ask where to find the date of birth or issue date, but that's about it.
The sad thing is that none of this would have happened if I already had my Swedish Personal Number. But I don't, even though I applied for it more than a month ago. There, at the tax office, I had a similar story to share, which I found funny at the time. Not having this number not only creates problems like the one I've just described but also prevents me from opening a bank account here and therefore receiving my salary! So, apart from not being able to pay, I can't get paid either!
I don't know why it takes so long to issue this number and to be honest I am quite disappointed by the Swedish public sector bureaucracy. When visiting the tax office I couldn't help but compare my experience to the one I had in Barcelona last year for the same purpose. Yes, I admit that the queue was much longer and that I had to wait for hours, then was sent to another office, then had to pay some fee and then come back and wait in the queue again, but at least my Spanish Personal Number was issued the same day!
I'm just hoping I get the Swedish Number soon and don't have any more similar adventures :)
Cheers,
V.
* (note: my credit card is PIN protected and I am never asked for ID whenever I perform transactions with it)
Little girl on the beach
Isn't it scary how life seems so easy sometimes?
Don't you feel lucky and don't you feel terrified?
Isn't it scary how easy it is to forget?
Doesn't it scare you?
The storm that's approaching...
iMpRessioNs fRoM pRiMaveRa SouND 2011
... and an unexpected ending
5 days, 14 stages, more than 200 artists, more than 250 concerts, more than 100,000 people!
These are the numbers but nothing can describe the experience!
Note: I'm not a music maker, critic or producer, so I wrote this post as a humble festival attendee :)
Event Organization
I can't imagine how much planning and coordination is needed to organize such an event and overall I think they did pretty well. There were problems of course, but none important enough to spoil the mood.
I'd note the following:
- Event venues
Poble Espanyol, totally fit for concerts, reminded me of Athens' Technopolis. Great sound quality and all the beauty of Spanish architecture around made the experience incomparable!
Parc del Forum. Sea breeze, sand, trees and all the things you can relate to Springtime aka Primavera! Big enough to host all these people and different stages. The stages were far enough apart to avoid interference between concurrent shows and close enough to move from one to the other on time. Also, surprisingly clean toilets, despite the number of people.
- Long queues
Both in Poble Espanyol and Parc del Forum, the audience suffered long queues to exchange their tickets for the bracelets. However, that was not the case for those who had chosen to buy their tickets through PayPal.
- Portal and charge system failure
Apart from the bracelet, everybody received an access card that was supposed to be used for accessing the festival venues, as well as buying drinks during the concerts. In order to load money onto your card, you were supposed to log in to the website portal and connect your credit card to this access card. Alternatively, you could load money onto the card at several kiosks around the festival area. That worked well during the first day at Poble Espanyol, but the next day the portal was down all day long and no purchases could be made with the card at the festival. That was kind of frustrating, but soon enough, special stands were set up to refund the money loaded onto the cards. Traditional cash-only transactions from then on!
- Food and Drinks
Variety of food choices, especially in Parc del Forum. Hot dogs, crepes, sandwiches, Chinese, hamburgers, pasta, salads, vegetarian. Beers and cocktails. Could have been cheaper though.
What I saw, What I liked and What I didn't
Day 1, Wednesday May 25th, Poble Espanyol
Echo & The Bunnymen: They played intensely, they played lively and the crowd loved them. They avoided slow songs and played what people knew and liked. I hadn't seen them before and they're not my style. But I can say they gave a great show and I respect them.
Caribou: Surpassed every expectation I had. Excellent performer, both him and the rest of his "crew". Loved their way of playing several songs without a pause. The crowd was excited and dancing non-stop! The first great show of the festival and surely a show to remember. The next morning, waiting for the bus to go to the university, I realised I was dancing alone at the bus stop to the rhythm of "Sun", which wouldn't leave my mind...
Day 2, Thursday May 26th, Parc del Forum
Grinderman: Huge Nick Cave, huge show, huge performance! Magic voice. Passion, anger, intensity. Cave is a man to worship and the crowd did exactly that. Getting off the stage often to touch his audience and be touched, screaming with rage, then singing gently like the man of your dreams. Nick Cave I bow to you!
Interpol: I'm not a fan but I like a few songs and Interpol is considered a "must-see" band by many people, so I decided to ignore Nick Cave's suggestion (who kept saying during his show that we have to go see Suicide next) and move to Llevant. I think I managed to stay about 20 minutes. Too formal, too fake, too slow, too predefined, if you know what I mean. Everything was in place and in total order. There was no feeling of freedom on stage. It seemed like they had even rehearsed their facial expressions. However, there were people that loved the show. As for me, I decided to go to Suicide for the remaining time, but Caribou was playing closer and grabbed my attention once more!
The Flaming Lips: What a show! "Come on! Come on! Come on!", excellent performer Wayne calling out to the crowd every now and then. Maybe it was more of a show than a concert. But it was absolutely one of the highlights of the festival.
Suuns: I guess I went because I was around at that moment. I can't even remember what they were playing. Totally boring.
Day 3, Friday May 27th, Parc del Forum
Explosions In The Sky: "Somos explosiones en el cielo" they said and the flight took off. The whole concert was a trip. Pure Magic. Great artists, thanked the audience of Barcelona for inviting them once again and you could feel they meant it. You could feel they were happy to be here. You could see they felt the music and that feeling was spreading in the air. My favourite of the festival.
Pulp: "Do you remember the first time?" First time I saw Pulp and I will certainly remember it! The most anticipated concert of the festival by many. They gave it all. Played what you were expecting them to play, interacted with the people. They made me dance even right after the "darkness" of Explosions and they made me admire them. I know I saw a historic concert but they didn't win my heart. I'm sorry.
Del Rey: What can I say. Astonished. Amazing double drum set, amazing rhythm, amazing audience. These guys have talent and it's obvious! The only show I attended where the crowd asked for "otra"! And they didn't let us down. They returned to the stage and played for 15 more minutes! They have a place in my heart :)
Day 4, Saturday May 28th, Parc del Forum
Einstürzende Neubauten: I didn't know them and was not interested in any other show at that moment. They described them to me as "minimalistic". Not sure what that means, but they caught my attention. Typically German but not something you've seen before. I'd describe them as "psycho". They convinced me. I will listen to them.
PJ Harvey: I know a lot of people will hate me but the impression I have in my mind is a common woman in white dress singing without passion. Disappointed.
Mogwai: One of my favourite bands and a show that I anticipated a lot. And they didn't let me down :) Although, saying "gracias, thank you very much", that and nothing more, after every single song, was kind of annoying I must admit.
The Black Angels: I arrived 10 min before the end of the Odd Future concert, which was right before, to find around 30 people on stage dancing, jumping, singing! When the Black Angels came on stage, they realised there were sound problems. Probably because of the previous mess. The problem was solved after about half an hour and the Black Angels appeared on stage to surpass every expectation of their impatient audience! The next day everybody was talking about their performance. I'd just like to point out that the sound quality for those standing on the right side of the stage was much worse than in the rest of the area. That was most probably a permanent problem of the Pitchfork stage, located right next to the sea without any cover on the right side of the stage.
DJ Coco: Who's that, right? He played 80s, he played 90s and the audience loved him. Actually it was like being in a typical disco of Barcelona. Nothing impressive but exactly what you need to end the party!
Day 5, Sunday May 29th, Apolo
...or at least that was the plan. Go to Apolo to see The Black Angels once again. But the queue was long and we decided to have a mojito first, so we headed for the closest square, the one in Raval.
The bar was empty. We ordered mojitos and waited. I knew there was a live flamenco show performed by locals, but I couldn't imagine what was about to follow.
Two guys with their guitars appeared and started playing. People started filling the bar. Some minutes later a blonde guy (totally not Spanish) was passing outside. He heard the music and got in. He was holding a trumpet. It was obvious that he didn't know the rest of the people playing. He sat down, felt the rhythm and joined them. Then I noticed an old man with a long grey beard. He put down his drink and left for a while. He came back with a pair of bongos and joined, too. After a while the trumpets became two, a girl started dancing, everybody clapping their hands.
This is a small video I managed to record:
As you may imagine, we never arrived at Apolo on time for The Black Angels.
But this was one of the best performances I saw these days
and one more reason to fall in love with Barcelona :)
V.
Not even one burned dumpster...
What a disgrace, what a shameful excuse for a demonstration, what an outrage!
Not to burn even one dumpster, one ATM, one immigrant, for heaven's sake!
Not to block a single road, not to stage a single occupation, not to beat up a single policeman?
Not one tear gas canister, not one Molotov cocktail, not even a charge with wooden beams!
Just sitting there peacefully and camping in the middle of the square?
I remember 3-4 years ago, when this whole story with the private universities and Article 16 had started.
I had gone to one of the few marches where no riots had broken out.
There were a great many of us, several thousand students. We had flooded the centre around Syntagma.
And we were happy about such a turnout, mainly of students not affiliated with any party.
Students who, like me, had taken the initiative on our own to support the demonstration.
And the march went peacefully, without incidents, without beatings, without the "known unknowns", without riot police, without tear gas.
And nobody ever heard about it. It wasn't mentioned on any major news broadcast, it wasn't written about in any major newspaper. Only those of us who were there remember it.
And the next day, a fellow student says to me: "And what did you expect? If there are no riots, if the cops don't get involved, if nobody gets beaten and if the centre doesn't burn, nobody is going to pay attention to us".
Well, it turns out he was wrong.
For 6 days now, Spaniards have been gathering in the squares of all the country's major cities and camping out in protest about tomorrow's elections, about the crisis, about whatever they see as crooked and ugly. And they are protesting peacefully.
Maybe, just maybe, it's time we learned a good lesson back home?
Tonight I will be there.
Not because I care about the elections in Spain, but because these people achieved what we have been trying to do for so many years.
And they achieved it so simply.
V.









