Big data technology debate: Pig vs. Hive

Pig and Hive have become essential tools for large-scale data processing in enterprises, with the obvious advantage of eliminating the need to write complex MapReduce code. As important parts of the Hadoop ecosystem, both components provide an abstraction layer over the core MapReduce implementation. Hive was originally designed to offer a SQL-like user experience and to ease the transition from an RDBMS. Pig takes a more procedural approach, helping users express data operations step by step without having to write MapReduce.

This article will compare Pig with Hive through examples and code.

Hive's inherent advantages

Apache Hive is an extremely powerful big data component whose main strength lies in data consolidation and retrieval. Hive performs especially well on data that already has an associated schema. In addition, the Hive metastore can partition data according to user-specified conditions, which further improves retrieval speed. However, when a single query touches a large number of partitions, beware of the following issues that Hive may cause:

1) As the number of partitions in a query grows, the number of paths associated with them grows with it. Suppose a query targets a table with 10,000 top-level partitions, each of which contains further nested partitions. When Hive translates the query into a MapReduce job, it tries to set the paths of all those partitions in the job configuration at once, so the number of partitions directly affects the size of the job itself. Because the default jobconf size limit is 5 MB, exceeding it throws a runtime error such as "java.io.IOException: Exceeded max jobconf size: limit: 5242880".

2) Bulk partition registration (e.g. 10000 x 100000 partitions) via "MSCK REPAIR TABLE table_name" is also constrained by the Hadoop heap size and the GC overhead limit. Exceeding these limits leads to errors or to a crash with a StackOverflowError like the one shown below:

Exception in thread "main" java.lang.StackOverflowError
    at org.datanucleus.query.expression.ExpressionCompiler.isOperator (ExpressionCompiler.)
    at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression (ExpressionCompiler.)
    at org.datanucleus.query.expression.ExpressionCompiler.compileExpression (ExpressionCompiler.)
    at org.datanucleus.query.expression.ExpressionCompiler.compileOrAndExpression (ExpressionCompiler.)
    at org.datanucleus.query.expression.ExpressionCompiler.compileExpression (ExpressionCompiler.)

3) More complex multi-level operations, such as queries that access many partitions, have limits of their own. Large queries may fail because the Hive compiler consults the metastore for semantic verification. The Hive metastore is essentially a SQL schema store, so very large queries can raise an error such as "com.mysql.jdbc.PacketTooBigException: Packet for query is too large".
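To get a feel for the first issue above, the following is a rough back-of-the-envelope sketch in Python. It only adds up the bytes of the partition paths themselves; a real job configuration holds many other properties and is serialized differently, so treat the result as a lower bound. The base directory, the partition keys (dt, region), and the figure of 20 nested partitions per top-level partition are all hypothetical.

```python
# Limit quoted in the "Exceeded max jobconf size" error message (5 MB).
JOBCONF_LIMIT_BYTES = 5 * 1024 * 1024


def estimate_path_bytes(base_dir, top_level, nested_per_top):
    """Rough size of a separator-delimited list of all partition paths."""
    total = 0
    for i in range(top_level):
        for j in range(nested_per_top):
            # Hypothetical layout: partitions stored as key=value directories.
            path = f"{base_dir}/dt={i:05d}/region={j:03d}"
            total += len(path.encode("utf-8")) + 1  # +1 for the separator
    return total


# 10,000 top-level partitions, each with an assumed 20 nested partitions.
size = estimate_path_bytes("hdfs://namenode/warehouse/events", 10_000, 20)
print(size, "bytes of paths; over the limit:", size > JOBCONF_LIMIT_BYTES)
```

Even with short paths, a few hundred thousand partitions are enough to push the path list alone past the limit under these assumptions.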

Clearly, the properties involved, including jobconf size, Hadoop heap size, and packet size, can all be configured. To avoid these problems, however, it is better to design the data semantics more carefully than to keep changing the configuration.

Hive's strength is that it is designed around a data model that sits on top of HDFS. It handles a large amount of data in each partition well, but it is poorly suited to using a large number of partitions that each hold a small amount of data. After all, partitions exist to speed up queries on specific data without having to operate on the entire data set. Keeping the number of partitions down minimizes overhead and maximizes cluster resource utilization.
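The idea that partitions let a query skip most of the data set can be illustrated with a small Python sketch. This is not Hive's implementation, only a toy model of partition pruning over key=value directory paths; the table location, the partition keys, and the helper names parse_partition and prune are made up for the example.

```python
def parse_partition(path):
    """Extract key=value pairs from a Hive-style partition path."""
    spec = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            spec[key] = value
    return spec


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical partition directories for an "events" table.
paths = [
    "/warehouse/events/dt=2015-01-01/country=us",
    "/warehouse/events/dt=2015-01-01/country=de",
    "/warehouse/events/dt=2015-01-02/country=us",
]

# Only directories for the requested date would need to be read.
selected = prune(paths, dt="2015-01-01")
print(selected)
```

The benefit comes from reading fewer directories, which is why a modest number of well-filled partitions tends to serve queries better than a huge number of tiny ones.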

When to use Pig
