asked 2019-03-22 10:21:55 -0500

I am evaluating our on-premises ETL setup against EMR. Our current flow is: OLTP (MSSQL) --> Sqoop (extraction) --> Spark computation (transformation) --> HiveQL (transformation/load) --> HDFS. It is hosted on an HDP distribution. Is there any benefit to moving it to EMR? How is EMR different from HDP? What does EMR manage for us? What about HA in EMR? Can we scale down to 0 nodes when we are not using it? Does a Spark job use ephemeral storage during processing? Is it fast enough for our Hive queries?
