How to Get MAX from Each Group in PIG
Can you try it like this? Include Application_Year in the third statement as well, GROUP by application year, sort the bags by count in descending order, take the top element, and print the group name and count: q3_Count_Reasons_Yearwise = FOREACH q3_Group_Reason_Year GENERATE q3_ReasonYearWise.Application_Year as my_application_year, group as me, COUNT(q3_ReasonYearWise.(Application_Year, loan_purpose)) as tot; At the end, order the result by the count and keep only the top record for each year.

Categories : Hadoop

Hadoop MapReduce: Is it possible to only use a fraction of the input data as the input to a MR job?
It's not possible, because for a MapReduce job we simply specify the input. One thing we can do is write a condition in the mapper: if the key is between the min and max values, process the key-value pair and emit output to the reducer; otherwise, do nothing. Even in this case the map phase still processes all of the input, but the reduce phase only works on the key range we have specified. A minimal sketch of such a filtering mapper is shown below.
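For illustration, here is a minimal sketch of such a filtering mapper, assuming LongWritable keys and Text values; the configuration property names range.min and range.max are hypothetical and just stand in for wherever the bounds come from.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RangeFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private long minKey;
    private long maxKey;

    @Override
    protected void setup(Context context) {
        // Read the key range from the job configuration; defaults are placeholders.
        minKey = context.getConfiguration().getLong("range.min", Long.MIN_VALUE);
        maxKey = context.getConfiguration().getLong("range.max", Long.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Only emit records whose key falls inside the configured range;
        // everything outside the range is silently dropped.
        if (key.get() >= minKey && key.get() <= maxKey) {
            context.write(key, value);
        }
    }
}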

Categories : Hadoop

Hadoop 2.x - [:50070/dfshealth.html] not working
The following Apache JIRA is the one you are asking about: https://issues.apache.org/jira/browse/HDFS-5334 (Implement dfshealth.jsp in HTML pages). If you look at that JIRA, you can see that the fix version is 2.3.0, which means Hadoop 2.2.0 doesn't have the new UI.

Categories : Hadoop

Spark - Joining 2 PairRDD elements
rdd.join(otherRdd) performs an inner join between the two RDDs. To use it, you need to transform both RDDs into PairRDDs keyed by the common attribute you will be joining on. Something like this (example, untested): val rddAKeyed = rddA.keyBy{ case (k, v) => key(v) } val rddBKeyed = rddB.keyBy{ case (k, v) => key(v) } val joined = rddAKeyed.join(rddBKeyed).map{ case (k, (json1, json2)) => (json1, json2) }

Categories : Hadoop

Cloudera/Hive - Can't access tables after hostname change
Never mind, I found the answer. You can confirm that Hive/Impala is looking in the wrong location by executing describe formatted [tablename]; the output will contain a line like: Location: hdfs://[oldhostname]:8020/user/hive/warehouse/sample_07 Then you can change the "Location" property using: ALTER TABLE sample_07 SET LOCATION "hdfs://[newhostname]:8020/user/hive/warehouse/sample_07";

Categories : Hadoop

Apache Drill support for all ANSI SQL 2003 queries
Drill supports more than just HDFS: it can read from S3, the local Linux filesystem, NTFS, and MapR-FS in addition to HDFS. Currently Drill doesn't allow single-record updates, but I think that functionality is on the roadmap.

Categories : Hadoop

I have 10 files containing CSV and TSV data. How can I process both the CSV and TSV data using MapReduce in Apache Hadoop?
Here's the code to replace the comma delimiter with a pipe and combine all the lists for the same surname into one: package my.reader; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapredu
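The original listing is cut off above, so here is only a rough, hypothetical sketch of the approach it describes (split each line on either comma or tab, key by the surname field, and join the remaining fields with a pipe in the reducer); the assumption that the surname is the first field is mine.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DelimiterNormalizer {

    public static class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on comma (CSV) or tab (TSV), whichever the line uses.
            String[] fields = value.toString().split("[,\t]");
            if (fields.length > 1) {
                // Key by surname (assumed to be the first field) and pass the
                // remaining fields along pipe-separated.
                String rest = String.join("|", Arrays.copyOfRange(fields, 1, fields.length));
                context.write(new Text(fields[0]), new Text(rest));
            }
        }
    }

    public static class CombineReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Combine all the lists for the same surname into one pipe-delimited string.
            StringBuilder combined = new StringBuilder();
            for (Text v : values) {
                if (combined.length() > 0) {
                    combined.append('|');
                }
                combined.append(v.toString());
            }
            context.write(key, new Text(combined.toString()));
        }
    }
}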

Categories : Hadoop

Phoenix Error while trying to create table
Check whether SYSTEM.CATALOG exists under hdfs://.../hbase/data/default/. If there is nothing there, try running bin/hbase clean --cleanZk. Before you use that command you must stop the HBase Master and RegionServers, but keep ZooKeeper alive.

Categories : Hadoop

WritableComparable object is not serializable
As I can see from the MongoRecordWriter source code, it does not support an arbitrary WritableComparable object as a key. You can use one of these classes as a key: BSONWritable, BSONObject, Text, UTF8, or simple wrappers like IntWritable. I also think you can use a Serializable object as a key. So I can suggest two workarounds, the first being to make your MongoKey serializable (implement Serializable, and implement writeObject and readObject); a sketch of that follows.
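A minimal sketch of that first workaround, assuming a hypothetical MongoKey with a single String field; the existing WritableComparable methods are omitted for brevity, and writeObject/readObject simply delegate to the default serialization so there is a place to customize the format later.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class MongoKey implements Serializable {

    private static final long serialVersionUID = 1L;

    private String id;

    public MongoKey() {
    }

    public MongoKey(String id) {
        this.id = id;
    }

    // Serialization hooks used by java.io.Serializable.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
    }
}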

Categories : Hadoop

How do I log to file in Scalding?
Any logging framework will do. You can obviously also use println() - it will appear in your job's stdout log file in the job history of your hadoop cluster (in hdfs mode) or in your console (in local mode). Also consider defining a trap with the addTrap() method for catching erroneous records.

Categories : Hadoop

Hadoop-2.4.1 custom partitioner to balance reducers
In short, you should not do it. First, you have no control over where the mappers and reducers are executed on the cluster, so even if the complete output of a single mapper goes to a single reducer, there is a high probability that they will be on different hosts and the data will be transferred over the network. Second, to make the reducer process the whole output of the mapper, you first

Categories : Hadoop

Hive to vertica data export with Unix named pipe
We are doing this, not with a named pipe (mkfifo) but with a standard anonymous shell pipe: hive -e "select whatever FROM wherever" | dd bs=1M | /opt/vertica/bin/vsql -U $V_USERNAME -w $V_PASSWORD -h $HOST $DB -c "COPY schema.table FROM LOCAL STDIN DELIMITER E' ' NULL 'NULL' DIRECT" This works perfectly fine for us. Note the 'dd' between hive and vsql; it is mandatory to have it working properly.

Categories : Hadoop

Limitation in Deploying Hadoop in windows platform
Yes, it's possible. From Hadoop 2.x onwards you can install the full Hadoop framework on Windows. Follow these steps for installing Hadoop 2.x on Windows: Apache hadoop2 install on windows (and this is a helpful blog for installing Hadoop on Windows). For installing the Hortonworks Sandbox on Windows, follow these steps: http://hortonworks.com/products/hortonworks-sandbox/

Categories : Hadoop

comparing data with last 5 versions of feed data in C* using datastax,hadoop,hive
This sounds like a good use case for either traditional MapReduce or Spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heatmap seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about Spark Streaming; a traditional batch job run a few times a day is fine.

Categories : Hadoop

Passing objects to MapReduce from a driver
This is related to the problem of side data distribution. There are two approaches to side data distribution: 1) the Distributed Cache and 2) the Configuration. Since you have objects to be shared, we can use the Configuration class. This discussion relies on the Configuration class to make an object available across the cluster, accessible to all Mappers and/or Reducers. The approach here is to serialize the object into the job Configuration in the driver and deserialize it again in the Mapper or Reducer, as sketched below.
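A minimal sketch of that approach, assuming the shared object implements java.io.Serializable; the property name side.data.object used below is arbitrary, and Base64 is just one convenient way to store the serialized bytes as a String.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;

public class SideDataUtil {

    // Driver side: serialize the object and store it as a String property in the job Configuration.
    public static void putObject(Configuration conf, String key, Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        conf.set(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Mapper/Reducer side (typically called from setup()): decode the property back into an object.
    public static Object getObject(Configuration conf, String key) throws IOException, ClassNotFoundException {
        byte[] data = Base64.getDecoder().decode(conf.get(key));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}

For example, the driver would call SideDataUtil.putObject(job.getConfiguration(), "side.data.object", myObject) before submission, and a Mapper's setup() would call SideDataUtil.getObject(context.getConfiguration(), "side.data.object") to get it back.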

Categories : Hadoop

Hadoop mapreduce.job.reduces in Generic Option Syntax?
The syntax looks correct; I have tested it against 2.5 YARN MR2 with the following and it works: hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output Most probably the problem is that your Driver class doesn't implement the Tool interface and run through ToolRunner, which works in coordination with GenericOptionsParser to parse generic command line arguments. Here is an example of how to implement it.
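The original example is cut off above; a standard sketch of a Tool-based driver looks roughly like this (the class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options picked up by GenericOptionsParser,
        // such as -Dmapreduce.job.reduces=5.
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyDriver.class);
        // Set mapper, reducer and output key/value classes here for a real job.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}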

Categories : Hadoop

copyFromLocal: unexpected URISyntaxException
Remove the space from the directory name and it will work. The steps are as follows: 1. Rename the directory to remove the space, changing "Event ordering" to "Eventordering". 2. Now run the following command: hadoop fs -copyFromLocal /home/hduser/Pictures/Eventordering/* input/

Categories : Hadoop

Is MapR a substitute for MapReduce
MapR is a commercial distribution of Apache Hadoop with HDFS replaced by MapR-FS. Essentially it is the same Hadoop with the same MapReduce jobs running on top of it, covered with tons of marketing that causes the confusion and questions like yours. Here's a diagram of the components in their distribution: https://www.mapr.com/products/mapr-distribution-including-apache-hadoop

Categories : Hadoop

Hadoop Based Automation
Oozie is a perfect candidate for automating jobs in Hadoop. This workflow scheduler has various actions, one of which is the shell action. You can use a shell script to invoke whichever tool you want and create a shell action in Oozie to invoke that script. The shell action can take arguments, and you can pass them during workflow execution. I have used the actions in Oozie to execute a shell script

Categories : Hadoop

How to get XmlInputParser to work with self-closing XML tags?
This is a somewhat ugly hack. Change the START_TAG_KEY and END_TAG_KEY as below: config.set(XmlInputFormat.START_TAG_KEY, "<row"); config.set(XmlInputFormat.END_TAG_KEY, "/>"); The "keys" are being used like delimiters, and accept any string, rather than just XML tags. Not a "clean" solution, and may stop working on future implementations, but it gets the work done now. Note: I figured

Categories : Hadoop

No FileSystem for scheme: hdfs
Try adding hadoop-hdfs as a compile-scoped dependency: <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-hdfs</artifactId> <version>${org.apache.hadoop.version}</version> </dependency>

Categories : Hadoop

How to import table data stored in Hive in my MapReduce job?
Hive was developed to minimize the need to write MapReduce programs. You can do the processing with Hive queries; internally they are converted into MapReduce jobs. However, if you want to access the Hive data directly, you can. Hive is not a database: all the data is stored under the warehouse directory in a readable format, so you can give that full path as the input to your MapReduce program, as sketched below. Have you tried a sample MapReduce program?
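As an illustration, here is a minimal sketch of pointing a job's input at a Hive-managed table's warehouse directory; the path below assumes the default warehouse location and a hypothetical table named my_table whose files are plain delimited text.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HiveWarehouseInput {

    public static void configure(Job job) throws Exception {
        // Read the table's underlying files directly as text.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/hive/warehouse/my_table"));
    }
}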

Categories : Hadoop

What is Keyword Context in Hadoop programming world?
The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output. Applications can use the Context to report progress, to set application-level status messages, to update Counters, to indicate they are alive, and to get values that are stored in the job configuration across the map/reduce phases. A short example is shown below.
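A minimal sketch of using the Context inside a mapper, covering output, a counter, a configuration value, and a status message; the counter group, counter name, and property name below are placeholders.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextDemoMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Read a value stored in the job configuration.
        String marker = context.getConfiguration().get("demo.marker", "");

        // Update an application-level counter.
        context.getCounter("DemoGroup", "RecordsSeen").increment(1);

        // Set a status message (which also signals that the task is alive).
        context.setStatus("processing " + marker);

        // Emit output.
        context.write(value, ONE);
    }
}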

Categories : Hadoop

hadoop-2.2.0 compilation failing on Mac OS X 64-bit
I got the same error on Windows 8. Here's what I did: ensure javah is copied to the path that's specified as JAVA_HOME, and tools.jar should be copied to the corresponding lib folder of JAVA_HOME. For example, if JAVA_HOME is c:..jre, tools.jar should be copied to c:..jrelib. This resolved the error. Hope this helps.

Categories : Hadoop

Differences between MapReduce and Yarn
You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run a

Categories : Hadoop

namespace image and edit log
Please can anyone explain to me what the edit log is? What is the role of this log file? Initially, when the NameNode first starts up, the fsimage file will itself be empty. Whenever the NameNode receives a create/update/delete request, that request is first recorded to the edits file for durability; once persisted in the edits file, an in-memory update is also made. Because all read requests are served

Categories : Hadoop

sampling of records inside group by throwing error
You can't use the SAMPLE command inside a nested block; this is not supported in Pig. Only a few operations (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY) are allowed in a nested block, so you have to use SAMPLE outside of the nested block. The other problem is that you are loading your input data with the default delimiter, i.e. tab, but your input data is delimited with spaces, so you need to specify the space delimiter when loading.

Categories : Hadoop

Oozie Hive Action stuck in PREP state and the job stuck in RUNNING state
How many containers does your cluster setup have? When you execute the workflow, one container will be occupied by Oozie (it continues to hold that container until the query executes) and the rest will be used to execute the actual job.

Categories : Hadoop

Compression codec detection in Hadoop from the command line
If you are asking what codec is being used by MapReduce for intermediate map output and/or final output, you can check Hadoop's configuration file, typically located at <HADOOP_HOME>/etc/hadoop/mapred-site.xml. I am not, however, aware of a way to check directly from the command line. The setting for intermediate map output compression should look something like this: <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property>

Categories : Hadoop

How to store streaming data in cassandra
I think a single Cassandra node can handle 1000 logs per second without bulk loading if your schema is good; it also depends on the size of each log. Or you could use Cassandra's COPY FROM CSV command. For this you need to create a table first. Here's an example from the DataStax website: CREATE TABLE airplanes ( name text PRIMARY KEY, manufacturer text, year int, mach float ); COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';

Categories : Hadoop

mapreduce - Intermediate key and output
"As many as you wish" is true, there is no restriction on the number of output pairs, (as long as there is enough space of course). The type of output keys and the type of output values is predefined in the main method (in the Driver class of the old API), as well as the type of map output keys and values. Those are set in the following way: conf.setOutputKeyClass(VIntWritable.class); //just an

Categories : Hadoop

HortonWorks hadoop data security and encryption tools
The Hortonworks Data Platform (HDP) supports Apache Knox, which is a REST gateway that provides perimeter security in the form of authentication and access control. Here is a great SlideShare presentation that describes how Hortonworks works with Knox. Additionally, the Hortonworks Data Platform version 2.2 brings support for Apache Ranger, which is a policy-based security framework for defining and managing authorization policies across the Hadoop stack.

Categories : Hadoop

Writing data into flume and then to HDFS
As I see it, you would actually need to set up an Avro RPC source if you want to connect with NettyAvroRpcClient. A sample config would be as follows: # Define an Avro source called AvroSource on SpoolAgent and tell it # to bind to 0.0.0.0:41414. Connect it to channel MemChannel. agentMe.sources.AvroSource.channels = MemChannel agentMe.sources.AvroSource.type = avro agentMe.sources.AvroSource.bind = 0.0.0.0 agentMe.sources.AvroSource.port = 41414

Categories : Hadoop

Getting the fileName in the mapper hadoop
You could use the Mapper's setup() method to get the filename, as setup() is guaranteed to run only once, before map() is first called, like this: public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> { String filename; @Override protected void setup(Context context) throws IOException, InterruptedException { FileSplit fsFileSplit = (FileSplit) context.getInputSplit(); filename = fsFileSplit.getPath().getName(); } }

Categories : Hadoop

Hadoop: reducer not getting invoked
After a long time debugging the problem, I found that the issue was with the overridden reduce method: I had written public void reducer instead of public void reduce. Note that the method must be named reduce, not reducer.
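As a sketch of the correct signature (the key/value types here are arbitrary examples), adding @Override makes the compiler reject a misnamed method such as reducer, because it would not actually override Reducer.reduce():

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all values for the key and emit a single total.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}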

Categories : Hadoop

Apache Hue or Apache Ambari - how to install and configure them manually
Officially, Ambari does not support installation on an existing cluster; when you install it, you have to remove the previous Hadoop components. As mentioned in the Ambari FAQ under "Installing a new cluster on top of an existing cluster": when installing a Hadoop cluster via Ambari on hosts that already have Hadoop bits installed (including an existing cluster deployed via Ambari), perform the following: Stop

Categories : Hadoop

What is the -file argument for AWS EMR
Short answer: -files is not an EMR flag; rather, it is a way to add files to the Distributed Cache. Long version: Hadoop uses something called GenericOptionsParser to parse generic command line options. When you use Python to write your mapper or reducer, it means Hadoop is running the job through the Streaming API. So, when you are running a streaming job you have

Categories : Hadoop

Combining AWS EMR output
If the mapper output part files themselves are small, you could try using hadoop fs -getmerge to merge them onto the local filesystem: hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE] And then put the merged file back to S3: hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/ For the above commands to work, you should have the S3 credential properties (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey) set in core-site.xml.

Categories : Hadoop

custom partitioner to send single key to multiple reducers?
I was in a similar situation once. What I did was something like this: int numberOfReduceCalls = 5; IntWritable outKey = new IntWritable(); Random random = new Random(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { // use a random integer within a limit outKey.set( random.nextInt(numberOfReduceCalls) ); context.write(outKey, value); }

Categories : Hadoop



