The following instructions assume that you have already installed Hadoop in your cluster and you have tested it using some examples. The following tests use the k-means clustering query kmeans.mrql.
First, you need to generate random ponts and store them in a HDFS file using the MRQL program points.mrql:
mrql -dist points.mrql 1000000 |
This will create 1M points and will store them in HDFS as the sequence file points.bin
. You can adjust this number to fit your cluster. Then, run the k-means clustering in MapReduce mode using:
mrql -dist -nodes 10 kmeans.mrql |
where -nodes
specifies the max number of MapReduce reducers.
To run the same query using Hama, you need to know the number of simultaneous BSP tasks that can run in parallel on your Hama cluster without a problem. For example, if you have 16 nodes with 4 cores each, you need to set -nodes
less than 64, eg 50. First, you need to generate random points and store them in a HDFS file (if you haven't done so for the MapReduce example):
mrql.bsp -dist -nodes 50 points.mrql 1000000 |
This will create 1M points and will store them in HDFS as the sequence file points.bin
. You can adjust this number to fit your cluster. Then, run the k-means clustering in BSP mode using:
mrql.bsp -dist -nodes 50 kmeans.mrql |
To run the same query using Spark, change the SPARK_MASTER and FS_DEFAULT_NAME in conf/mrql-conf.sh
to point to your Spark cluster URLs. Then, you need to generate random points and store them in a HDFS file (if you haven't done so for the other examples):
mrql.spark -dist -nodes 50 points.mrql 1000000 |
This will create 1M points and will store them in HDFS as the sequence file points.bin
. You can adjust this number to fit your cluster. Then, run the k-means clustering in BSP mode using:
mrql.spark -dist -nodes 50 kmeans.mrql |