# Command-line Tools can be 235x Faster than your Hadoop Cluster

## Introduction

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB, containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands.

After reporting that processing the data with 7 c1.medium machines in the cluster took 26 minutes, Tom remarks:

> This is probably better than it would take to run serially on my machine but probably not as good as if I did some kind of clever multi-threaded application locally.

This is absolutely correct, although even serial processing may beat 26 minutes. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (a processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (a processing speed of about 1.14MB/sec).

Although Tom was doing the project for fun, people often use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that could be done faster with simpler tools and different techniques.

One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine. Even the concepts of Spouts, Bolts, and Sinks transfer to shell pipes and the commands between them. You can pretty easily construct a stream processing pipeline with basic commands that will have extremely good performance compared to many modern Big Data (tm) tools.

An additional point is the batch versus streaming approach to analysis. Tom mentions at the beginning of the piece that after loading 10,000 games and doing the analysis locally, he gets a bit short on memory. This is because all the game data is loaded into RAM for the analysis. However, considering the problem for a bit, it can be easily solved with streaming analysis that requires basically no memory at all. The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.

The first step in the pipeline is to get the data out of the PGN files. Since I had no idea what kind of format this was, I checked it out on Wikipedia. We are only interested in the results of the games, which have only 3 real outcomes: the 1-0 case means that white won, the 0-1 case means that black won, and the 1/2-1/2 case means the game was a draw. There is also a - case, meaning the game is ongoing or cannot be scored, but we ignore that for our purposes.

The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test.
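As a rough sketch of the streaming idea described above, a pipeline like the following reads the `Result` tag lines out of PGN files and tallies the three outcomes. The `*.pgn` glob and the exact `awk` aggregation here are illustrative assumptions, not the article's final pipeline:

```shell
# Stream PGN files through a pipe and count game results.
# Only the [Result "..."] lines matter; everything else is discarded
# early, so memory use stays constant regardless of data size.
cat *.pgn \
  | grep '^\[Result' \
  | awk '{
      if ($0 ~ /1-0/)      white++   # white won
      else if ($0 ~ /0-1/) black++   # black won
      else if ($0 ~ /1\/2/) draw++   # drawn game
    } END {
      print "white wins:", white
      print "black wins:", black
      print "draws:", draw
    }'
```

Each stage of the pipe runs as its own process, so the filtering in `grep` and the counting in `awk` proceed in parallel, which is exactly the Spout/Bolt/Sink analogy made above.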