A more flexible Hadoop cluster

I had an interesting scenario where a customer ran a Hadoop cluster based on Hortonworks.
Their multi-PB HDFS cluster had done a lot of great work for them, but with the amount of data they pushed into it daily, the hardware they had to keep buying to hold it all was no longer cost effective.
They run daily batch jobs rather than real-time analytics, which means they only need a certain amount of CPU and memory capacity to process the new data that comes in. But they keep the old data, because a longer history gives much better analytic results.

A standard Hadoop design recommends local disks in each server and multiple copies of every file, both to secure the data and to get better performance than a single SATA disk can deliver. But that recommendation comes from the days when Hadoop was designed to run on cheap commodity hardware; it is only now that Hadoop has become accepted and runs in large enterprises as a daily-used tool.
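For reference, the "multiple copies" in that standard design is just the HDFS replication factor, which defaults to 3 and can be inspected or changed per path (the paths below are examples, not the customer's actual data):

```shell
# Re-replicate one directory to 3 copies and wait for it to finish.
hdfs dfs -setrep -w 3 /data/events
# Show the replication factor of a single file.
hdfs dfs -stat 'replication=%r' /data/events/part-00000
```

Every extra replica multiplies the raw disk you need, which is exactly where the cost problem above comes from.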

In this situation, with data growing about 10% per month and a lot of new servers needed every quarter, the hardware ended up costing them more than what Hadoop gave them back.
The customer had to choose: either rethink how the cluster environment was built, or reduce the amount of data.

What we did was a really interesting solution. At the next quarterly hardware buy-in we bought fewer servers and installed IBM Spectrum Scale FPO on them; the customer could now save just one copy of each file instead of three.
Over the next few weeks we converted the cluster server by server from HDFS to Spectrum Scale, reducing the total space used by more than half, and the unnecessary quarterly investment in new servers could stop.
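A minimal sketch of the single-copy setup on the new servers, assuming the disks are described in a stanza file and the filesystem is called `hadoopfs` (both names are assumptions for illustration):

```shell
# Turn the local disks into Network Shared Disks (NSDs).
mmcrnsd -F /tmp/disks.stanza
# Create a filesystem that keeps one replica of data and metadata
# by default (-m/-r 1), with a maximum of two (-M/-R 2).
mmcrfs hadoopfs -F /tmp/disks.stanza -m 1 -M 2 -r 1 -R 2
# Mount it on all nodes in the cluster.
mmmount hadoopfs -a
```

Dropping from three copies to one is what freed more than half the space in total.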

To keep up the hardware security and data redundancy, we automatically sent a copy of all data to a much cheaper medium, tape, using IBM Spectrum Protect. That way they could protect the data much more efficiently than by keeping three copies on disk.
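One common pattern for this kind of tiering is a policy-driven migration to an external pool managed by Spectrum Protect. The sketch below is illustrative only: the pool name, helper-script path, age threshold, and filesystem name are all assumptions, not the customer's actual configuration.

```shell
# Write a simple ILM policy: an external 'tape' pool handled by an HSM
# helper script (path is an assumption), and a rule that migrates files
# not accessed for 90 days from disk to that pool.
cat > /tmp/tape.policy <<'EOF'
RULE EXTERNAL POOL 'tape' EXEC '/var/mmfs/etc/mmpolicyExec-hsm'
RULE 'cold-to-tape' MIGRATE FROM POOL 'system'
     TO POOL 'tape'
     WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS
EOF
# Apply the policy to the (assumed) filesystem.
mmapplypolicy hadoopfs -P /tmp/tape.policy
```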

With Spectrum Scale in their Hadoop cluster they could also run normal POSIX commands, well known to their IT staff who use Linux and Unix daily.
Thereby we could also use standard analytic tools to understand what data the customer actually uses.
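To make that concrete: on Spectrum Scale the Hadoop data is an ordinary mounted filesystem, so everyday POSIX tools work on it directly. In this sketch `$FS` stands in for the mount point (on the real cluster it would be something like `/gpfs/hadoop`, which is an assumption, not the customer's actual path):

```shell
# Stand-in directory for the mounted filesystem.
FS=$(mktemp -d)
mkdir -p "$FS/data/2016"
printf 'id,value\n1,42\n' > "$FS/data/2016/part-00000.csv"

# Plain POSIX tools, no 'hdfs dfs' wrapper needed:
find "$FS/data" -type f -name 'part-*'   # locate data files
du -sh "$FS/data"                        # disk usage with plain du
wc -l "$FS/data/2016/part-00000.csv"     # inspect a file directly
```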

One major concern the customer worried about when moving from HDFS with three copies to Spectrum Scale with one was the performance of the slow SATA disks installed in each server.
This is something that makes Spectrum Scale so interesting: it can use a much larger filesystem block size than any standard Linux filesystem. With a larger block size we get better performance out of the SATA drives than with the small blocks that HDFS spreads out over multiple servers.
And because Hadoop creates large files of data, that fits Spectrum Scale perfectly.
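Concretely, the block size is fixed when the filesystem is created, via the `-B` flag of `mmcrfs` (for example `-B 16M` for 16 MiB blocks), and can be checked afterwards; `hadoopfs` is an assumed device name:

```shell
# Show the block size of an (assumed) Spectrum Scale filesystem 'hadoopfs'.
# It was chosen at creation time, e.g. with: mmcrfs hadoopfs ... -B 16M
mmlsfs hadoopfs -B
```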



The story didn't end here.
With the next step in the development of Spectrum Scale FPO, you can now mix local and shared disks in one single namespace.
The next step in developing the environment is to replace all the 2U servers with 1U servers plus central storage; we will use either Spectrum Scale Server (scale-out storage) or standard SAN-based storage such as the IBM Storwize V7000.
Both solutions have their benefits, but with this new picture we can easily size the hardware to what actually fits their needs, instead of being forced to add hardware they don't need.

When this project is done, we should have reduced the amount of hardware to much less than before, and also gained better scalability and flexibility in both disk and CPU/memory for when they really need it.


I am going on vacation now.
I hope you all have a great summer; I will continue my blog during vacation, but not every week.
See you all again in August, hopefully.

