w3hello.com logo
Home PHP C# C++ Android Java Javascript Python IOS SQL HTML videos Categories
Hive - dynamic partitions: Long loading times with a lot of partitions when updating table

During this slow phase, Hive takes the files it built for each partition and moves it from a temporary directory to a permanent directory. You can see this in the "explain extended" called a Move Operator.

So for each partition it's one move and an update to the metastore. I don't use EMR but I presume this act of moving files to S3 has high latency for each file it needs to move.

What's not clear from what you wrote is whether you're doing a full load each time you run. For example why do you have a 2013-03-05 partition? Are you getting new log data that contains this old date? If this data is already in your logs table you should modify your insert statement like

SELECT fields
FROM new_logs
WHERE dt > 'date of last run';

This way you'll only get a few buckets and only a few files to move. It's still wasteful to scan all this extra data from new_logs but you can solve that by partitioning new_logs.

© Copyright 2018 w3hello.com Publishing Limited. All rights reserved.