|Hive - dynamic partitions: Long loading times with a lot of partitions when updating table|
During this slow phase, Hive takes the files it built for each partition
and moves them from a temporary directory to their permanent directory. You
can see this in the output of "explain extended" as a Move Operator.
So for each partition there is one move plus an update to the metastore. I
don't use EMR, but I presume this act of moving files to S3 has high
latency for each file it needs to move.
What's not clear from what you wrote is whether you're doing a full load
each time you run. For example, why do you have a 2013-03-05 partition? Are
you getting new log data that contains this old date? If this data is
already in your logs table, you should modify your insert statement with
something like:
WHERE dt > 'date of last run';
This way you'll only get a few buckets and only a few files.
|Creating more partitions than reducers|
(a) No. You can have any number of reducers based on your needs.
Partitioning just decides which set of key/value pairs will go to which
reducer; it doesn't decide how many reducers will be generated. But if
there is a situation where you want to set the number of reducers yourself,
you can do that through the Job API (e.g. job.setNumReduceTasks(n)).
(b) This is actually what happens. Based on the availability of slots, a
set of reducers is initiated to process all the input fed to them. If all
the reducers have finished and some data is still left unprocessed, a
second batch of reducers will start and finish the rest of the data. All of
your data will eventually get processed irrespective of the number of
partitions. Just make sure your partitioning logic is correct.
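The relationship between partitions and reducers can be sketched in Python (a toy model of a hash partitioner, not Hadoop's actual Java implementation):

```python
def get_partition(key, num_reducers):
    """Toy hash partitioner: map a key to one of the reducers.
    It chooses AMONG reducers; it never changes how many exist."""
    return hash(key) % num_reducers

# However many distinct keys there are, they are spread over
# exactly num_reducers buckets.
assigned = {get_partition(k, 4) for k in ("a", "b", "c", "d", "e")}
```

The number of reducers itself is fixed by the job configuration, independent of how many distinct keys the partitioner sees.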
|Iterator over all partitions into k groups?|
This works, although it is probably super inefficient (I sort each
candidate so that duplicate groupings can be skipped):
def clusters(l, K):
    if l:
        prev = None
        for t in clusters(l[1:], K):
            tup = sorted(t)
            if tup != prev:
                prev = tup
                for i in xrange(K):
                    yield tup[:i] + [[l[0]] + tup[i]] + tup[i+1:]
    else:
        yield [[] for _ in xrange(K)]
It also returns empty clusters, so you would probably want to wrap this in
order to get only the non-empty ones:
def neclusters(l, K):
for c in clusters(l, K):
if all(x for x in c): yield c
Counting just to check:
def kamongn(n, k):
    res = 1
    for x in xrange(n-k, n):
        res *= x + 1
    for x in xrange(k):
        res /= x + 1
    return res
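As an independent cross-check (my own sketch, separate from the generator above), you can enumerate partitions of a list into exactly k non-empty groups directly and compare the count against the Stirling numbers of the second kind, which count exactly such partitions:

```python
def partitions_into_k(l, k):
    """Yield every partition of list l into exactly k non-empty groups."""
    if k <= 0 or len(l) < k:
        return
    if len(l) == k:
        yield [[x] for x in l]
        return
    head, rest = l[0], l[1:]
    # head forms a group of its own...
    for p in partitions_into_k(rest, k - 1):
        yield [[head]] + p
    # ...or joins one of the k groups of a partition of the rest.
    for p in partitions_into_k(rest, k):
        for i in range(k):
            yield p[:i] + [[head] + p[i]] + p[i + 1:]

def stirling2(n, k):
    """Stirling number of the second kind, S(n, k)."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```

The two branches of the enumerator mirror the Stirling recurrence S(n, k) = S(n-1, k-1) + k * S(n-1, k), so the counts must agree.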
|Sum across partitions with window functions|
SELECT ts, a, b, c
     , COALESCE(max(a) OVER (PARTITION BY grp_a), 0)
     + COALESCE(max(b) OVER (PARTITION BY grp_b), 0)
     + COALESCE(max(c) OVER (PARTITION BY grp_c), 0) AS special_sum
FROM  (
   SELECT *
        , count(a) OVER w AS grp_a
        , count(b) OVER w AS grp_b
        , count(c) OVER w AS grp_c
   FROM   tbl
   WINDOW w AS (ORDER BY ts)
   ) sub
ORDER BY ts;
First, put actual values and following NULL values in a group with the
aggregate window function count(): it does not increment with NULL values.
Then take max() from every group, arriving at what you are looking for. At
this point you could just as well use min() or sum(), since there is only
one non-null value per group.
COALESCE() catches NULL values if the overall first value in time is NULL.
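The carry-forward logic the query implements can be sketched in Python over hypothetical rows, with None standing in for SQL NULL:

```python
def special_sums(rows):
    """For each row (a, b, c) in ts order, sum the most recent
    non-None value of each column, defaulting to 0 (the COALESCE)."""
    last = [0, 0, 0]           # COALESCE(..., 0) for leading NULLs
    out = []
    for row in rows:
        for i, v in enumerate(row):
            if v is not None:  # count() only increments here,
                last[i] = v    # starting a new group in the SQL version
        out.append(sum(last))
    return out
```

For example, `special_sums([(1, None, None), (None, 2, None), (None, None, 4)])` returns `[1, 3, 7]`.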
|Generate numeric partitions|
Use the "partitions" package:
library(partitions)
parts(4)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    4    3    2    2    1
# [2,]    0    1    2    1    1
# [3,]    0    0    0    1    1
# [4,]    0    0    0    0    1
|Advanced partitions query|
I believe you want this:
GROUP BY infopath_form_id
That will give you the average number of minutes between the first and last
entry for each InfoPath_form_id.
Explanation of functions used:
MIN() returns the earliest date
MAX() returns the latest date
DATEDIFF() returns the difference between two dates in a given unit
(Minutes in this example)
COUNT() returns the number of rows per grouping item (i.e. per
InfoPath_form_id)
So simply divide the total minutes elapsed by one less than the number of
records, giving you the average number of minutes between events.
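The same calculation in Python, on made-up timestamps (the column names above are from the question and are not reproduced here):

```python
from datetime import datetime

def avg_minutes_between(timestamps):
    """DATEDIFF(minute, MIN(ts), MAX(ts)) / (COUNT(*) - 1), in Python:
    total elapsed minutes divided by one less than the event count."""
    ts = sorted(timestamps)
    span_minutes = (ts[-1] - ts[0]).total_seconds() / 60
    return span_minutes / (len(ts) - 1)

# Three events spanning 30 minutes -> two gaps -> 15 minutes on average.
events = [datetime(2013, 5, 1, 9, 0),
          datetime(2013, 5, 1, 9, 10),
          datetime(2013, 5, 1, 9, 30)]
```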
|wrong partitions with matlab's cvpartition|
Are you using the stratified form of cross-validation that cvpartition
supports?
Use the second syntax described in the documentation page, i.e. c =
cvpartition(group,'kfold',k) rather than c = cvpartition(n,'kfold',k). Here
group is a vector (or categorical array, cell array of strings etc) of
class labels, and will stratify the selection of observations into folds
rather than just splitting everything randomly into groups.
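For intuition, here is what stratification does, sketched in plain Python (not MATLAB, and not cvpartition's actual algorithm): each class's observations are dealt out across the folds so every fold keeps roughly the same class proportions.

```python
from collections import defaultdict
import random

def stratified_folds(labels, k, seed=0):
    """Assign observation indices to k folds, class by class, so the
    class proportions in each fold mirror the whole data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)          # random within a class...
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)  # ...but dealt evenly across folds
    return folds
```

With labels `['a']*6 + ['b']*3` and k=3, every fold gets two 'a's and one 'b', whereas a purely random split could put all three 'b's into one fold.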
|What is the "number of partitions" and "range" of an array?|
A partition in a sort is basically a section of the list based upon a pivot
point. For example , using the quick sort algorithm to sort the following:
First Pass Second Pass
3 3 1
8 1 3
5 <- Pivot 5--------- 5
1 8 7
7 7 8
In the first pass, there are two partitions, based on whether the numbers
are less than or greater than the pivot 5.
The range is the difference between the largest and smallest values, so in
this example it is 7 (8 - 1).
So the line you are questioning works out as (log here is base 10):
(2 * log10(7)) > 2 == use HeapSort
1.691 > 2 == false, so HeapSort is not used
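A minimal sketch of one partition step in Python (illustrative only; quicksort normally partitions in place):

```python
def partition(values, pivot):
    """Split values into the two partitions around a pivot
    (plus the pivot itself)."""
    less    = [v for v in values if v < pivot]
    equal   = [v for v in values if v == pivot]
    greater = [v for v in values if v > pivot]
    return less, equal, greater

data = [3, 8, 5, 1, 7]
less, equal, greater = partition(data, 5)   # ([3, 1], [5], [8, 7])
value_range = max(data) - min(data)         # 8 - 1 = 7
```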
|Difference between / and /mnt/upgrade if mounted on different partitions|
No. /mnt/upgrade is NOT part of mtdblock03.
/ and /mnt/upgrade are both mount points in a virtual filesystem, which is
only a virtual map to the underlying physical media (NAND flash in your
case).
Look at it this way:
1. When the system boots, using the kernel bootargs rootfs=,
the entire filesystem / is mounted.
At this point in time, mtdblock03 (pointed to by ubi0) is mounted at /.
Anything written anywhere under / ends up in mtdblock03.
2. Either manually or using init scripts,
mtdblock06 (pointed to by ubi1) is mounted at /mnt/upgrade.
Now anything written under / EXCEPT under /mnt/upgrade ends up in
mtdblock03, and anything written under /mnt/upgrade ends up in mtdblock06.
As long as the second mount is not unmounted (using umount), this mapping
remains in effect.
|Can we have partitions within partition in a Hive table?|
Hive supports multiple levels of partitioning, but keep in mind that having
more than a single level of partitioning in Hive is almost never a good
idea. HDFS is really optimized for manipulating large files, ~100MB and
larger. Each partition of a Hive table is an HDFS directory, and there are
normally multiple files in each of these directories. You really should be
closing in on a petabyte of data before multiple levels of partitioning in
a Hive table becomes a sensible choice.
What problem are you trying to solve? I'm sure we can find a sensible
solution for it.
|Why do partitions require nested selects?|
It seems to be the same rule as in any query: column aliases aren't visible
to the WHERE clause, because WHERE is evaluated before the SELECT list.
This will also fail:
SELECT id AS newid
WHERE newid=1; -- must use "id" in WHERE clause
|Python Integer Partitioning with given k partitions|
def part(n, k):
    def _part(n, k, pre):
        if n <= 0:
            return []
        if k == 1:
            if n <= pre:
                return [[n]]
            return []
        ret = []
        for i in range(min(pre, n), 0, -1):
            ret += [[i] + sub for sub in _part(n-i, k-1, i)]
        return ret
    return _part(n, k, n)
>>> part(5, 1)
[[5]]
>>> part(5, 2)
[[4, 1], [3, 2]]
>>> part(5, 3)
[[3, 1, 1], [2, 2, 1]]
>>> part(5, 4)
[[2, 1, 1, 1]]
>>> part(5, 5)
[[1, 1, 1, 1, 1]]
>>> part(6, 3)
[[4, 1, 1], [3, 2, 1], [2, 2, 2]]
def part(n, k):
cache = [[[None] * n for j in xrange(k)] for i in xrange(n)]
def wrapper(n, k, pre):
|Wrong calculation for daily partitions|
Ok, I created the table, inserted some data and ran some of your queries
and you've got something wrong with your substring:
SQL> CREATE TABLE "MO_USAGEDATA" (
2 "REQUESTDTS" TIMESTAMP (9) NOT NULL ENABLE
4 partition by range ("REQUESTDTS") INTERVAL(NUMTODSINTERVAL(1,'DAY'))
5 (partition PART_MINVALUE values less than(TIMESTAMP '2012-06-18
SQL> INSERT INTO MO_USAGEDATA
2 (SELECT SYSDATE + ROWNUM FROM dual CONNECT BY LEVEL <= 30);
30 rows inserted
SQL> SELECT high_value, INTERVAL
2 FROM all_tab_partitions
3 WHERE table_name = 'MO_USAGEDATA'
4 AND table_owner = USER
5 ORDER BY PARTITION_POSITION;
|[Qt][Linux] List drive or partitions|
You need to use platform-specific code. And, please, read the docs! For
example, QDir::drives():
Returns a list of the root directories on this system.
On Windows this returns a list of QFileInfo objects containing "C:/",
"D:/", etc. On other operating systems, it returns a list containing just
one root directory (i.e. "/").
|Data Modeling with Kafka? Topics and Partitions|
When structuring your data for Kafka, it really depends on how it's meant
to be consumed.
In my mind, a topic is a grouping of messages of a similar type that will
be consumed by the same type of consumer, so in the example above I would
just have a single topic; if you decide to push some other kind of data
through Kafka, you can add a new topic for that later.
Topics are registered in ZooKeeper which means that you might run into
issues if trying to add too many of them, e.g. the case where you have a
million users and have decided to create a topic per user.
Partitions, on the other hand, are a way to parallelize the consumption of
messages, and the total number of partitions in a broker cluster needs to
be at least the same as the number of consumers in a consumer group to make
sure every consumer in the group receives messages.
|Hive : Insert overwrite multiple partitions|
Hive supports dynamic partitioning, so you can build a query where the
partition is just one of the source fields.
INSERT OVERWRITE TABLE dst partition (dt)
SELECT col0, col1, ... coln, dt from src where ...
The where clause can specify which values of dt you want to overwrite.
Just include the partition field (dt in this case) last in the list from
the source, you can even do SELECT *, dt if the dt field is already part of
the source or even SELECT *,my_udf(dt) as dt, etc
By default, Hive wants at least one of the partitions specified to be
static, but you can allow all of them to be dynamic by switching to
nonstrict mode; for the above query, set the following before running it:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
|What are horizontal and vertical partitions in database and what is the difference?|
Not a complete answer to the question, but it does answer what is asked in
the question title. The general meaning of horizontal and vertical database
partitioning:
Horizontal partitioning involves putting different rows into different
tables. Perhaps customers with ZIP codes less than 50000 are stored in
CustomersEast, while customers with ZIP codes greater than or equal to
50000 are stored in CustomersWest. The two partition tables are then
CustomersEast and CustomersWest, while a view with a union might be created
over both of them to provide a complete view of all customers.
Vertical partitioning involves creating tables with fewer columns and using
additional tables to store the remaining columns. Normalization also
involves this splitting of columns across tables, but vertical partitioning
goes beyond that and partitions columns even when they are already
normalized.
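The horizontal example above can be sketched as a simple routing function (table names and the 50000 cutoff are the ones from the text):

```python
def table_for_customer(zip_code):
    """Pick the horizontal partition for a customer row by ZIP code."""
    return "CustomersEast" if zip_code < 50000 else "CustomersWest"
```

For example, `table_for_customer(10001)` returns `"CustomersEast"` and `table_for_customer(90210)` returns `"CustomersWest"`; a view with a UNION over both tables would still present all customers together.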
|How can i partition a MySql table for use with 90 day rotating partitions?|
Actually, the problem is that you can't define a PRIMARY or UNIQUE key on a
partitioned table unless that key includes all the columns used by the
partitioning function.
One possible "fix" would be to remove the "PRIMARY" keyword from the key
definition, making it a plain, non-unique index.
The problem is that MySQL has to enforce uniqueness when you declare a key
to be UNIQUE or PRIMARY. And in order to enforce that, MySQL needs to be
able to check whether the key value already exists. Instead of checking
every partition, MySQL uses the partitioning function to determine the
partition where a particular key would be found.
|Installing M2Crypto inside of virtualenv without installing swig to the system|
So in the end I got this to work by letting buildout handle downloading and
installing swig and M2Crypto and then just moving the built M2Crypto and
EGG-INFO directories from where buildout put them to where virtualenv
wanted them. This might not be the optimal solution, but hey, it worked.
|Installing SQL objects in C# - issue when installing CLR assembly and function scripts|
I have now fixed this problem. The issue turned out to be that I was
reusing the same SQLCommand later on to execute a stored procedure (I had
given the command a SQLTransaction so I could roll back on failure) and had
forgotten to remove the parameters that were added when I ran the next text
command. It was therefore running the first script file and then failing on
the subsequent one.
|Installing setuptools prior to installing python dependencies on Mac|
Your command attempts to install files into system directories, so you must
execute it as root or with sudo:
sudo sh setuptools-0.6c11-py2.7.egg
You can see this from the error:
[Errno 13] Permission denied
If you don't have permissions, you can try installing into your user's
Library folder, or look at virtualenv. (See the part about installing it
locally.)
|Processing performance hit in SSAS with 2000+ partitions in 2008 R2|
Partitions are generally used to increase the performance, not to decrease
performance, but you're right that if you have too many, then you will take
a performance hit. It looks like you want to know how to find out how many
partitions is too many.
I'm going to assume that the processing time you are talking about is the
time to process the cube, not the time to query the cube.
The general idea of partitions is that you only have to process a small
subset of the partitions when you are reprocessing the cube. This makes
them a huge performance enhancement. If you are processing a large number
of partitions, then the overhead of processing each individual partition
becomes non-negligible. The point at which this happens depends on a number
of factors, in particular how much per-partition overhead (metadata and job
management) the processing incurs.
|Removing partitions from the cube in SQL Server Management Studio|
To view the Partitions Manager dialog box, in SQL Server Data Tools, click
the Table menu, and then click Partitions.
To delete a partition:
In Partition Manager, in the Table listbox, verify or select the table that
contains the partition you want to delete.
In the Partitions list, select the partition you want to delete, and then
click Delete.
Hope this helps
|Batch File to detect active Partitions/Drives|
Give this a go - and if it works, parse out the "Fixed" drives:
for /f "tokens=1,*" %%a in ('fsutil fsinfo drives ^| find ":"') do (
    for %%c in (%%b) do fsutil fsinfo drivetype %%c
)
|IIS and COM+ Partitions: Failed to create ASP Application XXX due to invalid or missing COM Partition ID|
After a lot of research on this subject I have found the solution: do not
use the PartitionId from IIS and do not enable partitions from IIS either.
Leave them to default values.
The solution to this is the following: Each partition should be assigned as
default partition for one user and each IIS Application (and each App pool)
should run on the same users that the default partitions use.
So basically if you have two IIS Applications named: web1 and web2, and two
app pools: app1 and app2, two users user1 and user2 and two partitions:
part1 and part2.
web1 should run under user1, and app1 (the application pool for web1)
should also run under user1. Then, in Component Services, user1 should have
part1 as its default partition. When web1 looks up a COM+ component, it
will find it in part1; web2, running under user2, will likewise use part2.
|What is the difference between installing an app via homebrew or installing it "normal"?|
Homebrew (like MacPorts) is a package manager. It allows you to manage
packages (update, delete, etc.). Most importantly, Homebrew will compile
the application on your platform, which matters especially for ported
software. Homebrew will give you greater and more fine-grained control over
what you install, where, and what compilation options you want to use. But
this comes at the cost of a bit more complexity and the need to know your
way around the command line.
Downloading a binary and putting it in the Applications folder is easier by
far and usually works fine. If you're not a developer and don't need to
manage many different tools, then I'd recommend sticking with binary
downloads. If you're a developer, however, you will most likely not get
around a package manager if you need to manage many libraries and tools.
|Checking and Installing .net 4.0 before while installing my forms app in SharpSetup WIX|
You can check if the .NET framework is installed by linking to the
NetFxExtension with light. Just add a PropertyRef to the one you want. You
can find a list of those properties here.
Say you want to make sure .NET framework 4.0 Full is present before
installing your software, you'd add this somewhere in your source code:
<PropertyRef Id="NETFRAMEWORK40FULL" />
<Condition Message=".NET Framework 4.0 Full is not installed.">
  Installed OR NETFRAMEWORK40FULL
</Condition>
When running the MSI, the LaunchConditions action will run and check if the
NETFRAMEWORK40FULL property is set. If it is, the installation continues,
if not, the installation fails.
However, if you want to install the .NET Framework beforehand, you'll need
two WiX projects: one for your basic MSI, and one for a bootstrapper (a
Burn bundle) that chains the framework installer before your MSI.
|Any way to compute statistics on a hive table for all partitions with a single analyze command?|
According to the Hive manual, if you do not specify any partition specs,
statistics are gathered for the entire table:
When the user issues that command, he may or may not specify the partition
specs. If the user doesn't specify any partition specs, statistics are
gathered for the table as well as all the partitions (if any).
|FileStream will not open Win32 devices such as disk partitions and tape drives. (DotNetZip)|
MergeDirectories("Sample 1.zip", "Sample 2.zip", "Merged.zip");
private void MergeDirectories(string filePath1, string filePath2,
                              string mergedName)
{
    string workspace = Environment.CurrentDirectory;
filePath1 = Path.Combine(workspace, filePath1);
filePath2 = Path.Combine(workspace, filePath2);
mergedName = Path.Combine(workspace, mergedName);
DirectoryInfo zip1 = OpenAndExtract(filePath1);
DirectoryInfo zip2 = OpenAndExtract(filePath2);
string merged = Path.GetTempFileName();
using (ZipFile z = new ZipFile())
|Dividing an array into partitions NOT evenly sized, given the points where each partition should start or end, in python|
If I understand you, you need something like that
>>> a = range(20)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> i = [[1, 5], [5, 8], [8, 20]]
>>> [a[x:y] for x, y in i]
[[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
or, as Jon Clements suggested in comments:
>>> [a[slice(*s)] for s in i]
[[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
|Rails installing mysql - Error installing mysql2: ERROR: Failed to build gem native extension|
I've been so annoyed by the same problem, and finally succeeded in
installing mysql2. Kudos to odiszapc@github; it appears that no other
solution I found via Google works for me.
Copied and pasted from here, so no credit to me.
gem uninstall mysql2
Download the latest MySQL connector.
Extract it to C:\connector-6.0.2
gem install mysql2 --platform=ruby --
Additional info on my setup:
ruby 1.9.3p392 (2013-02-22) [i386-mingw32]
MySQL Server 5.6
Even if you successfully installed mysql2, you may still need some work
|Installation of ubuntu 12.04: win7 not detected, partitions not detected|
You need to run the Ubuntu GRUB bootloader; follow the instructions on the
linked page.
When I installed Ubuntu, I lost the partitions on the drive; in the end I
reverted to having two drives, one with Windows 7 and the other with
Ubuntu. I had to recover the drive with partition-recovery software
(TestDisk). Skipping the quick scan saves time, as you will probably want
to do a full scan anyway.
|Print all unique integer partitions given an integer as input|
I would approach it this way:
First, generalize the problem. You can define a function
printPartitions(int target, int maxValue, string suffix)
with the specification:
Print all integer partitions of target, followed by suffix, such that
each value in the partition is at most maxValue
Note that there is always at least one solution (provided both target and
maxValue are positive), namely all 1s.
You can use this method recursively. So let's first think about the base
case:
printPartitions(0, maxValue, suffix)
should simply print suffix.
If target is not 0, you have two options: either use maxValue or not (if
maxValue > target, there is only one option: don't use it). If you don't
use it, you should lower maxValue by 1.
if (maxValue <= target)
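A list-returning Python variant of the recursion just described (the answer's own sketch is pseudocode; this rendering and the function name are mine):

```python
def partitions(target, max_value):
    """All partitions of target into parts of size at most max_value."""
    if target == 0:
        return [[]]                      # base case: "print suffix"
    out = []
    if max_value > 1:                    # option 1: stop using max_value
        out += partitions(target, max_value - 1)
    if max_value <= target:              # option 2: use max_value once more
        out += [p + [max_value]
                for p in partitions(target - max_value, max_value)]
    return out
```

`partitions(5, 5)` returns all seven partitions of 5, from `[1, 1, 1, 1, 1]` up to `[5]`.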
|Installing/using SDL for Qt|
Compiling with SDL and g++ cannot find -lSDLmain etc
Undefined reference to WinMain@16 when using SDL
I am sure one of those is also applicable to your question.
|Installing pip on Mac OS X|
You can install it through Homebrew on OS X. Why would you install Python
with Homebrew?
The version of Python that ships with OS X is great for learning, but
it's not good for development. The version shipped with OS X may be
out of date from the official current Python release, which is
considered the stable production version. (source)
Homebrew is something of a package manager for OS X. Find more details on
the Homebrew page. Once Homebrew is installed, run the following to
install the latest Python, Pip & Setuptools:
brew install python
|Installing MSI vb.net with MsiSetExternalUI|
Take a look at Windows Installer XML's (Wix) Deployment Tools Framework
(DTF) MSI interop library (Microsoft.Deployment.WindowsInstaller.dll ) It
has all the pieces needed to invoke an installation and provide an external
UI handler to receive the ProgressBar update messages that you can then
route to your VB.Net UI.
See the following topic and subtopics for more information:
Monitoring an Installation Using MsiSetExternalUI
The examples are in C++ using MSI Win32 functions and the DTF interop
library encapsulates all of this with classes. The DTF help file tells you
which classes and methods map to which Win32 functions.
|Installing m2e plugin - RAD 7.5.5|
As far as I can tell from here, RAD 7.5.5 is based on an old version of
Eclipse. m2e is unlikely to work, but its predecessor m2eclipse might.
Hopefully you will be able to find it as explained here.
However m2e evolved a lot in the last few years and I'd suggest that you
switch to a more recent version of Eclipse, if you can. The latest versions
actually have m2e directly integrated.
|Error installing PIL on Mac OS 10.8.4|
I somewhat remember this exact problem. Have you installed the Xcode
command line tools? That cured my headaches.
You can find it here.. https://developer.apple.com/xcode/
From Xcode's Preferences menu, install the Command Line Tools
Related questions: "gcc-4.2 failed with exit status 1" and "I can't
install 'pip install pil' in Osx".
|Why isn't Nokogiri installing?|
You need to specify nokogiri to use the system libraries instead, so it
doesn't try to build them itself.
NOKOGIRI_USE_SYSTEM_LIBRARIES=1 bundle install
Answer found here: Error installing nokogiri 1.6.0 on mac (libxml2).
|Installing Pywin32 without easy_install|
Instead of using easy_install, you can try the Windows binaries for
Pywin32 that are available on Christoph Gohlke's website.