Suketu Nayak's Blog

Tuesday, September 18, 2018

Big Data Processing using E-MapReduce in Alibaba Cloud

In this article, we discuss E-MapReduce, the Big Data processing service on Alibaba Cloud. E-MapReduce is built on Alibaba Cloud ECS and is based on open source Apache Hadoop clusters and Apache Spark, the in-memory processing engine. On E-MapReduce we can also run Apache Hive, Apache Pig and HBase workloads to analyse and process big data. E-MapReduce also lets us import big data from, and export it to, many other public cloud storage systems and database systems, and it is well connected with OSS and Cloud RDS.

E-MapReduce provides an integrated set of cluster management tools covering host selection, environment deployment, cluster building, cluster configuration, job configuration, job execution, cluster management and performance monitoring. Because E-MapReduce takes care of the procurement, preparation, operation and maintenance of clusters, users can focus on the application and its logic.

Big Data processing comes in different types, such as batch processing, real-time processing and stream-oriented processing. E-MapReduce offers flexible modes to match: we can select Hadoop services for daily statistics and batch processing, and Spark services for stream-oriented and real-time computation.

The central concept in E-MapReduce is the cluster. A cluster is basically a Hadoop or Spark cluster running on Alibaba Cloud ECS. As we know, Apache Hadoop combines master and slave nodes: the NameNode and ResourceManager are master nodes, and the DataNodes and NodeManagers are slave nodes.

Image Ref: https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.p38356.b99.2.6c933d19BiCI36

In Alibaba Cloud, an E-MapReduce cluster is a set of layers built on Alibaba Cloud ECS instances. Above the E-MapReduce Agent layer sits the HDFS layer, which provides the distributed file system. YARN handles resource management. On top of these, the complete Spark core engine and other Spark libraries, HBase, Pig, Hive, Storm and notebooks such as Zeppelin are integrated. The top layer is the E-MapReduce web admin interface for configuration and management.

Image Ref: https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.p38356.b99.2.6c933d19BiCI36

So Alibaba Cloud E-MapReduce clusters are capable of handling various scenarios, such as offline big data processing, ad hoc data analysis queries and online massive-scale data processing services. E-MapReduce is deeply integrated with other Alibaba Cloud services, so we can use them as input sources or output destinations. E-MapReduce is also integrated with Alibaba Cloud's resource access and permission management system, so we can isolate team access using primary accounts and sub-accounts.

Tuesday, September 11, 2018

Serverless Data Lake Analytics service in Alibaba Cloud

Alibaba Cloud offers many DTPlus services, and among them Data Lake Analytics (DLA) is in strong demand across the industry. Alibaba Cloud Data Lake Analytics is a serverless, interactive, cloud-native analytics service. It is fully managed by Alibaba Cloud and runs on Massively Parallel Processing (MPP) compute nodes, so it needs zero maintenance from the user, and it is available to enterprise users in Pay-As-You-Go mode.

Data Lake Analytics offers cloud-native querying through a standard SQL interface, with SQL compatibility and comprehensive built-in functions. You can connect various data sources using JDBC and ODBC connectors. DLA can also integrate with BI products, which helps turn query results into Big Data insights and visualizations. This also helps customers keep the cost of migrating to the cloud low.

DLA supports complex analytics on data that comes from different sources and in different formats. We can analyse data stored in Alibaba Cloud Object Storage Service (OSS) or Table Store, and we can also join results across them to generate new insights. DLA is powered by a full Massively Parallel Processing architecture (see Fig.) and provides vectorized execution optimization, operator pipelined execution optimization, multi-tenancy resource allocation and priority scheduling.

Using Data Lake Analytics we can analyse raw data in OSS, such as logs, CSV, JSON and Avro files. We can run queries against a specific OSS folder, create tables, query them and integrate BI tools as well. We can also query time series data, pipeline data, logs and post-ETL data stored in Table Store, either from a single Table Store table or by joining across multiple tables. DLA can also join across heterogeneous data sources: if we have data in both OSS and Table Store, we can run a JOIN query across the two sources and turn the result into insights. Data is isolated, so it is visible only to its owner. Once you activate Data Lake Analytics, the system grants your account access permissions to the database.

Alibaba Cloud Data Lake Analytics offers many types of built-in functions: aggregation functions (which ignore null values and return null when there is no input), binary functions and operators, bitwise functions, conversion functions (which cast numeric and character values to the required type), date and time functions and operators, JSON functions and operators, mathematical functions and operators, string functions and operators, and window functions. Every table we create in DLA must belong to a parent database schema, and the schema must be unique within each of your Alibaba Cloud regions. Below is a sample table creation query, whose syntax closely resembles a Hive query:

CREATE EXTERNAL TABLE nation_text_string (
    N_NATIONKEY INT COMMENT 'column N_NATIONKEY',
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION 'oss://your-bucket/path/to/nation_text';

Alibaba Cloud Data Lake Analytics is compatible with Hive's serialization and deserialization (SerDe) mechanism for data records, including data files in CSV, Parquet, ORC, RCFile, Avro and JSON formats. So whenever we create a table from a CSV file, we need to choose the appropriate SerDe based on the contents of the file.

For example:

CREATE EXTERNAL TABLE test_csv_opencsvserde (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
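To get a feel for what the three SerDe properties control, here is a small Python sketch that parses one CSV line using the same separator, quote and escape characters. It uses Python's standard csv module, not DLA or Hive, and the sample row is made up for illustration. OpenCSVSerde and Python's csv module may well differ in edge cases, so treat this only as an illustration of the idea.

```python
import csv
import io

# Hypothetical sample row shaped like the test_csv_opencsvserde table.
SAMPLE = '1,"Suketu","Hangzhou, China",2018-09-11,2018-09-11 10:00:00,120.15,30.28\n'


def parse_rows(text):
    """Parse CSV text with the same three characters as the SerDe properties."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",    # plays the role of separatorChar
        quotechar='"',    # plays the role of quoteChar
        escapechar="\\",  # plays the role of escapeChar
    )
    return [row for row in reader]


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

The quoted location keeps its embedded comma as part of a single field, and every field comes back as a string, which lines up with the table above declaring every column as STRING.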

Tuesday, August 7, 2018

Learn Hybrid Connectivity options available in Alibaba Cloud Networking Services


In this article, we will discuss the hybrid connectivity options available in Alibaba Cloud Networking Services. Enterprises are increasingly moving towards hybrid connectivity to connect their own on-premises Internet Data Center (IDC) to a cloud provider's data center or Virtual Private Cloud. Because they do not want to migrate 100% of their workload to the cloud, they prefer cloud providers, such as Alibaba Cloud, that offer different types of hybrid connectivity. Alibaba Cloud has several networking services for this, including Express Connect, Cloud Enterprise Network and VPN Gateway, so enterprises can choose hybrid connectivity options to suit their requirements and budget.

Express Connect is an Alibaba Cloud networking service that directly connects two Virtual Private Clouds over private intranet connectivity. The two VPCs can be in the same region or in different regions, and can belong to the same Alibaba Cloud account or to different accounts. Besides connecting two VPCs, Express Connect can also connect a VPC with an on-premises IDC.

To connect two VPCs, we create a Route Interface on the VRouter of each VPC, and Express Connect carries the traffic over Alibaba Cloud's own backbone transmission network. A Route Interface is a virtual device that provides the communication channel and control for the connection. One VPC acts as the connection initiator and the other as the connection receiver.

To connect a VPC with an on-premises IDC, we need a Physical Connection, which works at the physical layer. A Physical Connection is a private network circuit established between an Alibaba Cloud access point and the connectivity device in your on-premises IDC. For this we contact a private network carrier, who provides a leased line that connects our on-premises IDC to the Alibaba Cloud access point. We then create a Virtual Border Router (VBR) to connect the on-premises IDC to the Alibaba Cloud VPC and form a hybrid cloud environment. The Virtual Border Router is a service that maps the leased line to a VSwitch so that it can be accessed, and it also works as a Border Gateway Protocol (BGP) router between our on-premises equipment and the VPC in the cloud.

Because Express Connect is a private network connectivity option that does not use the public internet, it is reliable and secure. The Express Connect service offers three types of connectivity specifications: Small (10 Mbps to 40 Mbps), Middle (100 Mbps to 900 Mbps) and Large (1 Gbps to 4.5 Gbps).

Another service is Cloud Enterprise Network (CEN). CEN lets us build a large global hybrid cloud network that connects our VPCs across regions worldwide together with our on-premises data centers, and it is highly scalable, reliable and secure. CEN has three components:

1. CEN Instance - To connect our networks globally, we first create a CEN instance and attach networks to it.

2. Networks (including VPCs and VBRs) - These are the networks we attach to the CEN instance so that they can communicate with each other across the globe.

3. Bandwidth - This component is required only for cross-region communication, for which we need to specify the interconnected areas.

Another popular hybrid connectivity option in Alibaba Cloud is VPN Gateway. It provides Site-to-Site and Point-to-Site connectivity over the internet, using an encrypted tunnel between a VPC and an on-premises data center, or between a VPC and a remote employee's computer or laptop. Alibaba Cloud provides an IPSec tunnel for Site-to-Site connectivity and a Secure Sockets Layer (SSL) tunnel for Point-to-Site connectivity. Note that VPN Gateway does not provide internet access services. Using VPN Gateway we can set up site-to-site and multi-site connectivity, VPC-to-VPC connections, and point-to-site connectivity to remote laptops, phones and desktops. We can also combine IPSec and SSL VPN connections, and build multinational intranet connections using VPN Gateway together with Express Connect. Each VPN Gateway supports 10 IPSec connections and 1 SSL server, which can have 50 clients.

Monday, August 6, 2018

Server Load Balancer in Networking Services of Alibaba Cloud


In this article we look at Server Load Balancer (SLB), part of the Networking Services on Alibaba Cloud. Every public cloud provider offers a load balancer service, but Alibaba Cloud Server Load Balancer has many additional features for enterprise-level load balancing requirements. Server Load Balancer is a traffic distribution service that distributes incoming traffic across your ECS instances. Depending on whether the system assigns the SLB instance a public IP or a private IP, it serves internet-facing or private intranet traffic, so both kinds of load balancing are available. We configure forwarding rules that determine how incoming traffic is distributed.

Alibaba Cloud SLB extends the service capability of applications and improves their availability. By applying virtual service addresses, it turns the available ECS instances into a high-performance, highly available application service pool, and it distributes incoming requests to the ECS instances in the backend server pool according to the configured forwarding rules. SLB checks the health of the backend ECS instances, and if an instance reaches the unhealthy threshold it is automatically isolated, which eliminates single points of failure. In addition, Alibaba Cloud SLB has an integrated 5 Gbps DDoS attack resistance service to protect your application services on ECS instances.

To use SLB we must create at least one listener and add at least two ECS instances to the backend pool. A listener checks the health of the ECS instances and forwards requests to them. We may have multiple listeners as well. Alibaba Cloud SLB provides both Layer 4 (transport layer, TCP and UDP) and Layer 7 (application layer, HTTP and HTTPS) load balancing services. Layer 4 SLB uses the open source software LVS (Linux Virtual Server) with Keepalived, and Layer 7 SLB uses Tengine (an Nginx-based web server project).

Alibaba Cloud SLB's Health Check feature automatically blocks abnormal ECS instances and resumes forwarding requests to them when they become normal again. When configuring it, we set two thresholds that decide whether an ECS instance is normal: the unhealthy threshold and the healthy threshold. Alibaba Cloud SLB also supports Session Persistence, in which we set listener rules to forward requests from the same client to the same ECS instance for the lifetime of that client's session.
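The two-threshold idea can be sketched in a few lines of Python. This is only a simplified model, assuming each threshold counts consecutive probe results; the class name, the default values and the probe sequence are hypothetical and are not taken from the SLB API.

```python
class HealthTracker:
    """Tracks one backend instance using healthy/unhealthy thresholds."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True   # assume the instance starts in rotation
        self._streak = 0      # consecutive probes that contradict the current state

    def record(self, probe_ok):
        """Feed one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # the probe agrees with the current state
            return self.healthy
        self._streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy  # flip state once the threshold is reached
            self._streak = 0
        return self.healthy


tracker = HealthTracker(healthy_threshold=2, unhealthy_threshold=3)
probes = [True, False, False, False, True, True]
states = [tracker.record(p) for p in probes]
print(states)  # [True, True, True, False, False, True]
```

In this run the instance is taken out of rotation only after three failed probes in a row, and it is brought back only after two successful probes in a row, so a single stray probe result does not change its state.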

Alibaba Cloud SLB has three routing methods that we can configure to distribute load across the ECS instances in the backend pool:

1. Round Robin

2. Weighted Round Robin

3. Weighted Least Connections

In Round Robin, client requests are distributed sequentially across the ECS instances in the backend pool. In Weighted Round Robin, we can assign a weight to each ECS instance, such as 70-30 or 60-40, so that instances with a higher weight receive a larger share of the incoming requests. The third method, Weighted Least Connections, takes the number of connections on each instance into account along with the weights: if two instances have the same weight, SLB directs new connections to the instance with fewer connections.

Another important SLB capability is URL-based routing, with which SLB forwards requests to backend instances based on the request URL. We can also deploy Alibaba Cloud SLB across multiple zones of a region, so if one zone behaves abnormally, SLB automatically redirects all traffic to another zone that is normal. For security, we can also add an IP address whitelist to our SLB to control who can access it.
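The three routing methods described above (round robin, weighted round robin and weighted least connections) can be illustrated with a short Python sketch. This is not how SLB implements them internally; the instance names ecs-a and ecs-b, the weights and the connection counts are hypothetical, and the weighted round robin here uses the simplest expanded-list form rather than a smoother interleaving.

```python
import itertools


def round_robin(backends):
    """Yield backends in a fixed repeating order."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Yield backends in proportion to their (positive integer) weights."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def weighted_least_connections(weights, active):
    """Pick the backend with the lowest active-connections-to-weight ratio."""
    return min(weights, key=lambda name: active[name] / weights[name])


# Round robin: requests alternate between the two instances.
rr = round_robin(["ecs-a", "ecs-b"])
rr_picks = [next(rr) for _ in range(4)]

# Weighted round robin with a 70-30 split, expressed as weights 7 and 3.
wrr = weighted_round_robin({"ecs-a": 7, "ecs-b": 3})
wrr_picks = [next(wrr) for _ in range(10)]

# Equal weights: the instance with fewer active connections wins.
wlc_pick = weighted_least_connections(
    {"ecs-a": 5, "ecs-b": 5},
    {"ecs-a": 12, "ecs-b": 4},
)

print(rr_picks)
print(wrr_picks.count("ecs-a"), wrr_picks.count("ecs-b"))
print(wlc_pick)
```

With equal weights the choice falls to the instance holding fewer connections, which is the behaviour the paragraph above describes. With unequal weights, dividing by the weight lets a heavier instance carry proportionally more connections before it stops being preferred.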

In the Layer 7 (application layer) Server Load Balancer we have a centralized certificate management service for HTTPS listeners, so there is no need to upload certificates to the ECS instances in our backend pool. This also keeps the CPU overhead of encryption and decryption off the ECS instances. We also have a peak bandwidth feature, with which we can cap bandwidth per listener based on the type of application service the backend pool provides.

One more feature of Alibaba Cloud Server Load Balancer is cross-region disaster tolerance. We can deploy Server Load Balancer instances in different regions, add ECS instances from different zones of each region to them, and combine this with a DNS service. DNS resolves the domain name to the IP addresses of the Server Load Balancer instances in the different regions. If one region becomes unavailable, we can stop domain name resolution for that region, so user access is not affected.
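The failover step can be pictured as filtering the DNS answers by region availability. The sketch below is a toy model only: the region names and IP addresses are made up (the addresses come from ranges reserved for documentation), and a real setup would make this change through the DNS service rather than in application code.

```python
def active_records(slb_by_region, region_up):
    """Return the SLB addresses that DNS should keep resolving to."""
    return {
        region: ip
        for region, ip in slb_by_region.items()
        if region_up.get(region, False)  # unknown regions are treated as down
    }


# Hypothetical SLB instances, one per region.
slb_by_region = {"region-1": "203.0.113.10", "region-2": "198.51.100.20"}

all_up = active_records(slb_by_region, {"region-1": True, "region-2": True})
after_outage = active_records(slb_by_region, {"region-1": False, "region-2": True})

print(all_up)
print(after_outage)
```

After the simulated outage, only the SLB address in the surviving region is left in the answer set, so clients keep reaching a working load balancer.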

We also need to keep the default limits in mind when using Alibaba Cloud Server Load Balancer. By default we can have 60 SLB instances, although the Alibaba Cloud team may raise this limit if we submit a ticket. The listener limit is 50, and we can add or remove at most 20 backend instances at a time.

Virtual Private Cloud and its Isolation in Alibaba Cloud


In this article we talk about Virtual Private Cloud (VPC), a private network provided by Alibaba Cloud that is logically isolated from other virtual networks. As administrators we have full control over our VPC. We can specify its IP address range using Classless Inter-Domain Routing (CIDR) notation (classful IP addressing is not supported). We can provision ECS instances, RDS (Relational Database Service) instances and Server Load Balancer instances in our own VPC. Alibaba Cloud also has hybrid connectivity options to connect two or more VPCs, or to connect a VPC to an on-premises network.

A VPC has two more important components: 1. VRouter 2. VSwitch

When we create a VPC in Alibaba Cloud, it automatically creates a VRouter with a route table. The VRouter works like a hub connecting the zones of our VPC, and it also works as a gateway connecting our VPC with other networks. The other VPC component is the VSwitch, with which we create subnets (a subnet is a segment of the VPC address space with a smaller address range). If you want two separate subnets in your VPC, you need to create two VSwitches, and these VSwitches are internally connected. That means you can deploy your application web servers in one subnet and your data server instances in another. Both VSwitches are connected to the VRouter, which routes requests to the appropriate VSwitch depending on whether you want to reach the web server subnet or the data server subnet.

As mentioned above, a VPC follows the CIDR addressing scheme, so the IPv4 address space is available for defining the VPC address range. Suppose we specify the CIDR block 192.168.0.0/16 for our VPC. An IPv4 address has 32 bits, 8 bits for each of the four octets (192, 168, 0 and 0), and the number after the slash, /16, is the CIDR prefix length. It means the first 16 bits (192.168) are the network address and the last 16 bits are the host address, so the 192.168.0.0/16 VPC can accommodate about 2^16 = 65,536 instances. If we want to segment the VPC further into subnets using VSwitches, we can give the first VSwitch the CIDR block 192.168.0.0/24. Here the first 24 bits are the network address and the last 8 bits are the host address, so the first subnet can accommodate approximately 2^8 = 256 instances. If we want another subnet for data server instances, we can create VSwitch2 with 192.168.1.0/24, which again accommodates approximately 256 data server instances in another subnet of the same VPC.
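The CIDR arithmetic above can be checked with Python's standard ipaddress module. This is plain address math rather than an Alibaba Cloud API call, and the variable names are chosen only for this illustration. The counts are raw address counts, which is why the text says "approximately": a cloud network typically reserves a few addresses in each subnet.

```python
import ipaddress

# The VPC range and the two VSwitch ranges used in the text.
vpc = ipaddress.ip_network("192.168.0.0/16")
vswitch1 = ipaddress.ip_network("192.168.0.0/24")
vswitch2 = ipaddress.ip_network("192.168.1.0/24")

# Host bits = 32 - prefix length, so the address count is 2 ** host_bits.
vpc_host_bits = 32 - vpc.prefixlen
print(vpc_host_bits, vpc.num_addresses)   # 16 65536
print(vswitch1.num_addresses)             # 256

# Both VSwitch ranges fall inside the VPC range and do not overlap each other.
print(vswitch1.subnet_of(vpc), vswitch2.subnet_of(vpc))  # True True
print(vswitch1.overlaps(vswitch2))                       # False

# Number of /24 subnets that fit in the /16 VPC range.
subnet_count = len(list(vpc.subnets(new_prefix=24)))
print(subnet_count)  # 256
```

So the /16 range leaves room for 256 separate /24 subnets, of which the two VSwitches in the example use the first two.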

Another important concept in Alibaba Cloud VPC is its use of mainstream tunneling technology: each VPC has a unique tunnel ID. Every data packet that travels between instances in a VPC carries that VPC's tunnel ID, encapsulated in the packet header. As a result, ECS instances in two separate VPCs cannot communicate with each other unless hybrid connectivity is set up between the two VPCs.
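A toy model can show why a tunnel ID in the packet header keeps VPCs apart even when they use the same private addresses. The class names, tunnel ID values and IP addresses below are all hypothetical, and the real encapsulation format is not described in this article, so this is only a sketch of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    ip: str
    tunnel_id: int  # tunnel ID of the VPC this instance belongs to


@dataclass(frozen=True)
class Packet:
    tunnel_id: int  # copied from the sender's VPC at encapsulation time
    src_ip: str
    dst_ip: str
    payload: str


def encapsulate(sender, dst_ip, payload):
    """Wrap the payload in a packet stamped with the sender's tunnel ID."""
    return Packet(sender.tunnel_id, sender.ip, dst_ip, payload)


def deliver(packet, receiver):
    """Deliver only if both the destination address and the tunnel ID match."""
    return packet.dst_ip == receiver.ip and packet.tunnel_id == receiver.tunnel_id


# Two instances in VPC A, and one in VPC B that happens to reuse the same IP.
web = Instance("web", "192.168.0.10", tunnel_id=1001)
db_same_vpc = Instance("db", "192.168.1.10", tunnel_id=1001)
db_other_vpc = Instance("other-db", "192.168.1.10", tunnel_id=2002)

packet = encapsulate(web, "192.168.1.10", "SELECT 1")
print(deliver(packet, db_same_vpc))   # True
print(deliver(packet, db_other_vpc))  # False
```

Both database instances have the same private IP address, but only the one that shares the sender's tunnel ID accepts the packet.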

Another logical component of a VPC is the Controller. The Controller uses a self-developed protocol to deliver the forwarding table to the VPC gateway and the VSwitches, which completes the key configuration path. In a VPC the data path and the configuration path are separate, and both have redundant disaster recovery, which improves the availability of VPCs in Alibaba Cloud.

Alibaba Cloud provides strong security isolation, because the cloud servers of different users belong to different Virtual Private Clouds, and different VPCs have unique tunnel IDs. Cloud servers use VSwitches to communicate with each other, and cloud servers in different subnets of the same Virtual Private Cloud communicate through the VRouter. The intranets of different VPCs are also completely isolated and can only be connected through external IP mapping. The third layer of isolation is the Security Group firewall on each Alibaba Cloud instance, which controls inbound and outbound network access. One more advantage of Alibaba Cloud VPCs is that software VPNs and leased line connections are supported as connectivity options.