Suketu Nayak's Blog: September 2018

Tuesday, September 18, 2018

Big Data Processing using E-MapReduce in Alibaba Cloud

In this article, we will discuss E-MapReduce, the Big Data processing service in Alibaba Cloud. E-MapReduce is a solution for Big Data processing on the Alibaba Cloud platform. It is built on Alibaba Cloud ECS and is based on open source Apache Hadoop clusters and Apache Spark, the in-memory processing engine. On E-MapReduce we can also run Apache Hive, Apache Pig and HBase to analyse and process big data on Alibaba Cloud. E-MapReduce also lets us import big data from, and export it to, many other public cloud data storage systems and database systems, and of course it is well connected with OSS and Cloud RDS.

E-MapReduce provides an integrated Big Data solution for managing clusters: selecting hosts, deploying the environment, building clusters, configuring clusters, configuring jobs, running jobs, managing clusters and monitoring performance. Because E-MapReduce takes care of the procurement, preparation, operation and maintenance of clusters, users can focus on the application and its logic. Big Data processing comes in different types, such as batch processing, real-time data processing and stream-oriented data processing. E-MapReduce offers flexible modes for these: we can select Hadoop services for daily statistics and batch processing, and Spark services for stream-oriented and real-time computation.

The central concept in E-MapReduce is the cluster. A cluster is basically a Spark or Hadoop cluster running on Alibaba Cloud ECS instances. Apache Hadoop, as we know, combines master and slave nodes: the NameNode and ResourceManager run on the master nodes, and the DataNode and NodeManager run on the slave nodes.

Image Ref: https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.p38356.b99.2.6c933d19BiCI36

In Alibaba Cloud, an E-MapReduce cluster is a set of multiple layers built on Alibaba Cloud ECS instances. The HDFS layer sits above the E-MapReduce Agent layer and provides the distributed file system. YARN handles resource management. The complete Spark core engine and other Spark libraries, HBase, Pig, Hive, Storm and notebooks such as Zeppelin are integrated on top of that, and the top layer is the E-MapReduce Web User Admin interface for configuration and management.

Image Ref: https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.p38356.b99.2.6c933d19BiCI36

So, Alibaba Cloud E-MapReduce clusters are capable of handling various scenarios such as offline big data processing, ad-hoc data analysis queries and online data processing services at massive scale. E-MapReduce is deeply integrated with other Alibaba Cloud services and offerings, so we can use them as input or output sources. E-MapReduce is also integrated with Resource Access and Permission management systems, so we can isolate team access using primary accounts and sub-accounts.

Tuesday, September 11, 2018

Serverless Data Lake Analytics service in Alibaba Cloud

Alibaba Cloud offers many DTPlus services, and among them Data Lake Analytics is much in demand in the industry. Alibaba Cloud Data Lake Analytics (DLA) is a serverless, interactive, cloud-native analytics service. It runs on Massively Parallel Processing (MPP) compute nodes that are fully managed by Alibaba Cloud, so there is nothing for the user to maintain, and it is available to enterprise users in Pay-As-You-Go mode. DLA offers cloud-native querying through a standard SQL interface, with SQL compatibility and comprehensive built-in functions. You can connect various data sources using JDBC and ODBC connectors. DLA can also integrate with BI products, which turns query results into Big Data insights and visualizations. This also helps customers migrate to the cloud at a low migration cost.

DLA can run complex analytics on data that comes from different sources and in different formats. We can analyse data stored in Alibaba Cloud Object Storage Service (OSS) or Table Store, or join the results from both to generate new insights. DLA is powered by a full MPP architecture (see Fig.) and provides vectorized execution optimization, operator-pipelined execution optimization, multi-tenancy resource allocation and priority scheduling.

Using Alibaba Cloud Data Lake Analytics we can analyse raw data in OSS such as logs, CSV, JSON and Avro files. We can run a query against a specific OSS folder, create tables, run searches and integrate BI tools as well. We can also query time series data, pipeline data, logs and post-ETL data stored in Table Store, either from a single Table Store table or joined across multiple tables. Joins across heterogeneous data sources are possible too: if we have data in both OSS and Table Store, we can JOIN the two sources in one query and turn the result into insights. Data is isolated, so it is visible only to the data owner. Once you activate Data Lake Analytics, the system grants your account access permissions to the database.
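A cross-source join of the kind described above could look like the following sketch. The schema names `oss_sales` and `ots_crm`, the table names and every column are hypothetical; they only illustrate one table backed by OSS files and one backed by Table Store being combined in a single standard SQL statement.

```sql
-- Hypothetical names throughout: oss_sales.orders is assumed to be an
-- external table over files in OSS, and ots_crm.customers a table
-- mapped from Table Store.
SELECT c.customer_name,
       COUNT(*)            AS order_count,
       SUM(o.order_amount) AS total_amount
FROM oss_sales.orders AS o
JOIN ots_crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_amount DESC
LIMIT 10;
```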

Alibaba Cloud Data Lake Analytics offers many types of built-in functions: aggregate functions (which ignore null values and return null when there is no input), binary functions and operators, bitwise functions, conversion functions (which cast numeric and character values to the required type), date and time functions and operators, JSON functions and operators, mathematical functions and operators, string functions and operators, and window functions. Every table we create in DLA must have a parent database schema, and the schema name must be unique within each of your Alibaba Cloud regions. Below is a sample table creation query, whose syntax closely resembles that of Hive:

CREATE EXTERNAL TABLE nation_text_string (
    N_NATIONKEY INT COMMENT 'column N_NATIONKEY',
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION 'oss://your-bucket/path/to/nation_text';
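Once the table exists, it can be queried with ordinary SQL. The following is a minimal sketch, assuming the files under the table's OSS location hold pipe-delimited nation rows that match the four declared columns:

```sql
-- Count the nations per region key in the external table defined above.
SELECT n_regionkey,
       COUNT(*) AS nation_count
FROM nation_text_string
GROUP BY n_regionkey
ORDER BY n_regionkey;
```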

Alibaba Cloud Data Lake Analytics is compatible with Hive's SerDe (Serialization and Deserialization) mechanism for data records, including data files in CSV, Parquet, ORC, RCFile, Avro and JSON formats. So whenever we create a table from a CSV file, we need to choose the appropriate SerDe on the basis of the contents of the CSV file.

For example:

CREATE EXTERNAL TABLE test_csv_opencsvserde (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
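Because every column in this table is declared as STRING, numeric or date comparisons need an explicit conversion at query time. The sketch below assumes that `longitude` and `latitude` hold decimal numbers and that `create_date` is written in `YYYY-MM-DD` form; neither is stated above, so the casts would need adjusting to the actual file contents.

```sql
-- Convert the STRING columns to typed values while querying.
-- The value formats are assumptions, not taken from the sample data.
SELECT id,
       name,
       CAST(longitude AS DOUBLE) AS longitude_deg,
       CAST(latitude  AS DOUBLE) AS latitude_deg
FROM test_csv_opencsvserde
WHERE CAST(create_date AS DATE) >= DATE '2018-01-01';
```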