/r/hadoop


Topics about the Apache Hadoop distributed computing platform and its related ecosystem - Pig, Hive, Cassandra, ZooKeeper


Related subreddits:

/r/Database

/r/Datamining

/r/BigData

/r/CloudComputing

/r/Nosql

/r/MongoDB

/r/DataSets

/r/Intel

/r/hadoop

7,354 Subscribers

2

AWS Snowmobile

With AWS Snowmobile being retired, what do people think are the best methods for uploading PB+ scale Hadoop datasets into the cloud?
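
For scale, the main online alternative people benchmark against is DistCp straight into object storage. A hedged sketch, assuming the S3A connector (hadoop-aws) is on the classpath and credentials are configured; the bucket and paths are illustrative:

hadoop distcp -m 100 -bandwidth 50 hdfs://nnode:8020/warehouse s3a://my-target-bucket/warehouse

-m caps the number of parallel map tasks and -bandwidth caps MB/s per map, so at PB scale the wall-clock time is mostly a function of the uplink rather than the cluster.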

0 Comments
2024/04/26
14:45 UTC

1

Kerberos-related error (I think) on datanodes while cluster is running

So I am playing around, trying to create a properly kerberized Hadoop installation. I have a namenode, a secondary namenode, and 3 datanodes, and I thought I had gotten it to work. It does, kind of. I have kinit'ed all my keytabs, and the cluster starts up. I have compiled jsvc, which starts the datanodes as root and then drops them down to the hdfs account. I can see Hadoop running on all 5 VMs - all good, or so I thought.

Looking in the logs on a datanode, I see this error while the cluster runs (for like half an hour, until I stop it):

2024-04-24 16:14:14,376 WARN org.apache.hadoop.ipc.Client: Couldn't setup connection for dn/doop3.myDomain.tld@HADOOP.KERB to nnode.myDomain.tld/192.168.0.160:8020 org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed

First off, I get the same error on all 3 datanodes. I checked that there is an actual connection, with ncat on the datanode like this: 'nc nnode.myDomain.tld 8020', and I connect fine.

So obviously I worry that my Kerberos is not working. But the nodes will not start up if the keytab file is not working. So in order to start, the namenode and the datanodes each do a Kerberos login, and that works. And then it stops working(?)

My keytabs look like the ones in the Hadoop documentation: [https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS](https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS)

On my namenode (OK, I regret having the hdfs/ principal in there, but it's not referenced, so w/e):

klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab

Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)

And here on my datanode:

klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab

Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)

On the datanode and the namenode, checking the principals with kinit -t as mentioned in this article [https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429](https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429) gives no error, and as I said, the nodes start, so the initial Kerberos checks are accepted.
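
One check that can narrow a GSS initiate failed down further: from a datanode, acquire the dn credential and then explicitly request a service ticket for the namenode's nn/ principal. A minimal sketch, assuming MIT Kerberos client tools and the keytab paths and principals shown above:

kinit -kt /opt/hadoop/etc/hadoop/hdfs.keytab dn/doop3.myDomain.tld@HADOOP.KERB
kvno nn/nnode.myDomain.tld@HADOOP.KERB   # should print a key version number, not an error
klist                                    # the nn/ service ticket should now be in the cache

If kinit succeeds but kvno fails, the problem sits on the KDC/DNS side (a classic cause is reverse DNS not matching the hostname in the principal) rather than in the Hadoop configuration.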

Reading the error again, I can't understand what it *actually* tells me. The cluster seems to stay running until I shut it down; I have had it running for about half an hour before stopping it.

I thought of perhaps adding the credentials from all 5 VMs into one keytab and just kinit'ing all of it on all of them, but that doesn't seem reasonable.

This error comes up many times in Google searches, but nothing I find matches my scenario or fixes my issue.

hdfs-site.xml and core-site.xml for the 2 kinds of nodes are shown here, instead of making the post even longer: [https://pastebin.com/QLT6GqVd](https://pastebin.com/QLT6GqVd)

Any clue as to what the error expects me to look into is much appreciated. I have tried to follow Hadoop's Kerberos documentation, and it is the base of my setup, if that matters.

0 Comments
2024/04/24
22:57 UTC

0

Apache Ranger UserSync Configuration HELP!!

I am trying to configure Apache Ranger UserSync with Unix, and I am stuck at this point:

After I execute this: sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ ./setup.sh

Then this error pops up:

teka@t3:/usr/local/ranger-usersync$ sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 ./setup.sh

[sudo] password for teka:

INFO: moving [/etc/ranger/usersync/conf/java_home.sh] to [/etc/ranger/usersync/conf/.java_home.sh.28032024144333] .......

Direct Key not found:SYNC_GROUP_USER_MAP_SYNC_ENABLED

Direct Key not found:hadoop_conf

Direct Key not found:ranger_base_dir

Direct Key not found:USERSYNC_PID_DIR_PATH

Direct Key not found:rangerUsersync_password

Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper

at org.apache.ranger.credentialapi.CredentialReader.getDecryptedString(CredentialReader.java:39)

at org.apache.ranger.credentialapi.buildks.createCredential(buildks.java:87)

at org.apache.ranger.credentialapi.buildks.main(buildks.java:41)

Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper

at java.net.URLClassLoader.findClass(URLClassLoader.java:387)

at java.lang.ClassLoader.loadClass(ClassLoader.java:418)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)

at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

... 3 more

ERROR: Unable update the JCKSFile(/etc/ranger/usersync/conf/rangerusersync.jceks) for aliasName (usersync.ssl.key.password)

Can anyone help me with that?

Tools I am using:

Host Device: MacBook m1

Guest Device: Ubuntu 20.04 LTS

Apache Ranger: 2.4 (Build from source code)
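
On the stack trace itself: com.ctc.wstx.io.InputBootstrapper lives in the Woodstox jar (woodstox-core), which hadoop-common depends on, so the NoClassDefFoundError suggests that jar never made it onto the setup classpath. A hedged check, assuming the jar is in the local Maven cache left by the source build - the paths below are illustrative:

find ~/.m2 -name 'woodstox-core*.jar'                                 # locate the jar the build used
sudo cp <path-to-woodstox-core.jar> /usr/local/ranger-usersync/lib/   # put it where setup.sh looks
# stax2-api may be needed alongside it; then re-run setup.sh

With the class resolvable, the credential-store step that produced the JCKSFile error should get further.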

0 Comments
2024/03/28
14:54 UTC

4

Need Guidance, 4th semester Data Science Student

Hey everyone,

I'm currently in my 4th semester of data science, and while I've covered a fair bit of ground in terms of programming languages like C++ and Python (with a focus on numpy, pandas, and basic machine learning), I'm finding myself hitting a roadblock when it comes to diving deeper into big data concepts.

In my current semester, I'm taking a course on the fundamentals of Big Data. Unfortunately, the faculty at my university isn't providing the level of instruction I need to fully grasp the concepts. We're tackling algorithms like LSH, PageRank, and delving into Hadoop (primarily mapreduce for now), but I'm struggling to translate this knowledge into practical coding skills. For instance, I'm having difficulty writing code for mappers and reducers in Hadoop, and I feel lost when it comes to utilizing clusters and master-slave nodes effectively.

To add to the challenge, we've been tasked with building a search engine using mapreduce in Hadoop, which requires understanding concepts like IDF, TF, and more – all of which we're expected to learn on our own within a tight deadline of 10 days.
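
Since TF/IDF is the core of the assignment, here is a minimal sketch of the term-frequency step as a Hadoop Streaming job in Python. Everything here is illustrative rather than from the assignment - the file name, the record format, and the trick of using the input file name as a document id:

#!/usr/bin/env python3
# tf.py - term frequency as a Hadoop Streaming job.
# Run as "tf.py map" for the mapper and "tf.py reduce" for the reducer.
import os
import re
import sys

def mapper():
    # Streaming exposes job config in environment variables with dots
    # replaced by underscores; the input file name is a crude document id.
    doc = os.environ.get("mapreduce_map_input_file", "unknown")
    for line in sys.stdin:
        for word in re.findall(r"[a-z]+", line.lower()):
            print(f"{word}\t{doc}\t1")

def reducer():
    # With both fields treated as the key (see the -D option below), equal
    # (word, doc) pairs arrive on adjacent lines and can be summed in one pass.
    prev, total = None, 0
    for line in sys.stdin:
        word, doc, count = line.rstrip("\n").split("\t")
        if (word, doc) != prev and prev is not None:
            print(f"{prev[0]}\t{prev[1]}\t{total}")
            total = 0
        prev = (word, doc)
        total += int(count)
    if prev is not None:
        print(f"{prev[0]}\t{prev[1]}\t{total}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()

Run it with something like: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -D stream.num.map.output.key.fields=2 -input /corpus -output /tf -mapper "python3 tf.py map" -reducer "python3 tf.py reduce" -file tf.py. IDF is a second, similar job over this output (count distinct documents per word), and the inverted index is a third; breaking the search engine into small jobs like that is what makes a 10-day deadline survivable.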

I'm reaching out to seek guidance on how to navigate this situation. How can I set myself on a path to learn big data in a more effective manner, considering my time constraints? My goal is to be able to land an internship or entry-level position in the data science market within the next 6-12 months.

Additionally, any tips on approaching this specific assignment would be immensely helpful. How should I go about tackling the task of building a search engine within the given timeframe, given my current level of understanding and the resources available?

Any guidance, advice, or resources you can offer would be greatly appreciated. Thank you in advance for your help!

0 Comments
2024/03/21
20:18 UTC

0

Hive Shell Issues

0 Comments
2024/03/19
08:25 UTC

1

Hadoop Installation

0 Comments
2024/03/17
13:13 UTC

1

Help with setup on Mac

Hi guys, I have been trying to run Apache Hadoop (3.3.1) on my M1 Pro machine, and I keep getting this error: "Cannot set priority of namenode process XXXXX". I understand that macOS is not allowing the background process to be invoked. Is there any possible fix for this, guys?

0 Comments
2024/03/16
16:45 UTC

2

Namenode Big Heap

Hi guys,

Long Story short, running a big hadoop cluster, lots of files.

Currently the namenode has 20 GB of heap, almost full the whole time, with some long garbage-collection cycles freeing up little to no memory.

Is there anybody who is running namenodes with 24 or 32 GB of heap?

Is there any particular tuning needed?

Regards
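
For what it's worth, large NameNode heaps are usually configured in hadoop-env.sh along these lines - a hedged sketch with purely illustrative numbers (Hadoop 3 variable name; Hadoop 2 used HADOOP_NAMENODE_OPTS):

export HDFS_NAMENODE_OPTS="-Xms32g -Xmx32g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

Pinning -Xms to -Xmx avoids resize stalls, and G1 generally copes better than the older collectors at 24-32 GB. The more important question is whether the 20 GB is live data (namespace memory scales with file and block counts, so many millions of small files simply need a bigger heap) or garbage the collector can't keep up with; GC logs will tell you which.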

5 Comments
2024/03/14
14:55 UTC

1

[Hiring] Big Data Engineer with Spark (located in Poland)

Scalac | Big Data Engineer (with Spark) | Poland | Gdańsk or remote | Full time | 20 000 to 24 000 PLN net/month on B2B (or equivalent in USD/EUR)

Who are we looking for?
We are looking for a Big Data Engineer with Spark who will be working on an external project in the credit risk domain. You should have expertise in the following technologies:
- At least 4 years of experience with Scala and Spark
- Excellent understanding of Hadoop
- Jenkins, HQL (Hive Queries), Oozie, Shell scripting, GIT, Splunk

As a Big Data Engineer, you will:
- Work on an external project and develop an application that is based on the Hadoop platform.
- Work with an international team of specialists.
- Design and implement database systems.
- Implement business logic based on the established requirements.
- Ensure the high quality of the delivered software code.
- Independently make decisions, even in high-risk situations.

Apply here: https://scalac.io/careers/senior-bigdata-engineer/

2 Comments
2024/03/12
09:03 UTC

1

Is there a way to access Hadoop via Eclipse?

As the title suggests, I am new to Hadoop and my instructor gave me a task to access it via Eclipse; it's something called accessing it via the Java API. I've searched so many videos, but most of them are wordcount problems and don't solve my problem. Any suggestions?

1 Comment
2024/03/12
00:46 UTC

2

Cirata for Hadoop Migration

My company is exploring Cirata for a 5 PB data migration to Azure. The technology (centered on the Paxos algorithm) seems very impressive for large, unstructured datasets, but I'm not sure. Does anyone have any experience using them, and any thoughts they would be willing to share?

Thanks in advance.

1 Comment
2024/02/23
14:00 UTC

6

Onprem HDFS alternatives for 10s of petabytes?

So I see lots of people dumping on Hadoop in general in this sub, but I feel a lot of the criticism is really directed at YARN. I am wondering if that is also true for HDFS. Are there any on-prem storage alternatives that can scale to, say, 50 PB or more? Is there anything else that has equal or better performance and lower disk usage, with equal or better resiliency, especially factoring in HDFS erasure coding at roughly 1.5x size on disk? Just curious what others are doing for storing large amounts of semi-structured data in 2024. Specifically, I'm dealing with a wide variety of data ranging from a few kilobytes to gigabytes per record.

10 Comments
2024/01/27
05:18 UTC

1

HIVE HELP NEEDED!!!

Hi guys, it's my first time using Hive and I just set it up using a Udemy course guideline. I got an error saying schemaTool failed due to a Hive exception.

Error: Syntax error: Encountered "statement_timeout" at line 1, column 5. (state=42X01,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

Can someone help me with this? I followed these Stack Overflow links to troubleshoot too, and they did not work, even after removing the metastore files and re-initialising the schema.

Please help, and thank you for your time and patience. Your friendly neighborhood big data noob!!!
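
For reference, the re-initialisation step people usually mean, assuming the default embedded Derby metastore and that you run this from the directory Hive was started in (wiping metastore_db deletes everything in the metastore):

rm -rf metastore_db                              # remove the stale embedded Derby database
schematool -dbType derby -initSchema --verbose   # re-create the schema; --verbose shows the failing SQL

The --verbose run should point at the exact schema-script line Derby rejects ("statement_timeout" in your paste), which is more actionable than the summary error.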

5 Comments
2024/01/26
13:00 UTC

2

Big Companies: Java Hadoop or Hadoop streaming

Hello all, I was wondering, from your experience in the industry: at big companies (big in terms of market leadership, not only size), is the Java approach to writing MapReduce jobs more popular, or the Hadoop Streaming approach? It would be very interesting to know whether I need to brush up my Java skills or can stick with the Python streaming approach in order to present myself as a capable Hadoop MapReduce practitioner.
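
For anyone weighing the two: the difference is only in how the job is wired, since Streaming runs any executable that reads lines on stdin and writes key<TAB>value lines on stdout. A hedged sketch using stock Unix tools as mapper and reducer (the jar path varies by install):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /in -output /out -mapper /bin/cat -reducer "/usr/bin/wc -l"

The trade-off usually cited: native Java jobs skip the stdin/stdout serialization hop and get the full API (custom writables, combiners, counters), while Streaming keeps the edit-run loop in Python.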

9 Comments
2024/01/17
02:15 UTC

1

Hadoop for GIS?

I’ve been a construction surveyor for 17 years and know CAD, desktop GIS, some programming, point clouds and photogrammetry.

I wonder if a database with Hadoop could be useful to learn.

1 Comment
2024/01/05
12:41 UTC

1

how to use this program?

So, my teacher gave us an activity to use Hadoop, but he never really taught us how to use it, and I can't find any tutorial on how to do it. Can someone here help me? I don't even know how to start the program. The activity is the following: As you noted, this unit does not have self-correction activities. A more practical activity is proposed, considering that you already have the Hadoop platform installed, as well as Mahout; therefore, you will be able to carry out the experiments proposed here, where a Reuters text base is available.

The idea of the activity is for you to run the k-means algorithm using one of the folders with the texts, and to analyze the result of the algorithm. Observe the clusters generated, and whether the subjects are in fact related to each other. If you want to use other text bases, the sequence of commands should work.

Below are the example and the sequence of commands used (base: Reuters C50train):

hadoop fs -copyFromLocal C50/ /

./mahout seqdirectory -i /C50/C50train -o /seqreuters -xm sequential

./mahout seq2sparse -i /seqreuters -o /train-sparse

./mahout kmeans -i /train-sparse/tfidf-vectors/ -c /kmeans-train-clusters -o /train-clusters-final -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 10 -ow

./mahout clusterdump -d /train-sparse/dictionary.file-0 -dt sequencefile -i /train-clusters-final/clusters-10-final -n 10 -b 100 -o ~/saida_clusters.txt -p /train-clusters-final/clustered-points
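
A quick gloss on what each step does - my reading of the standard Mahout workflow, so treat it as hedged: copyFromLocal puts the C50 corpus into HDFS; seqdirectory packs the raw text files into Hadoop SequenceFiles; seq2sparse tokenizes them and builds sparse TF-IDF vectors; kmeans clusters the tfidf-vectors with k=10 clusters (-k 10) and at most 10 iterations (-x 10); clusterdump writes the top terms of each cluster to ~/saida_clusters.txt. So "running the program" amounts to executing these five commands in order and then reading saida_clusters.txt to judge whether the clusters are thematically coherent.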

5 Comments
2023/12/08
14:31 UTC

1

My datanode doesn't seem to run and I can't browse files as well

This is the message I get when I run the Hadoop datanode. The OS is macOS Sonoma.

STARTUP_MSG: Starting DataNode

STARTUP_MSG: host = Sonals-MacBook-Air.local/127.0.0.1

STARTUP_MSG: args = []

STARTUP_MSG: version = 3.3.6

STARTUP_MSG: build = https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c; compiled by 'ubuntu' on 2023-06-18T08:22Z

STARTUP_MSG: java = 21.0.1

************************************************************/

2023-11-30 21:50:23,326 INFO datanode.DataNode: registered UNIX signal handlers for [TERM, HUP, INT]

2023-11-30 21:50:23,611 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2023-11-30 21:50:23,740 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/tmp/hadoop-sonalpunchihewa/dfs/data

2023-11-30 21:50:23,853 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties

2023-11-30 21:50:24,009 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).

2023-11-30 21:50:24,009 INFO impl.MetricsSystemImpl: DataNode metrics system started

2023-11-30 21:50:24,211 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling

2023-11-30 21:50:24,233 INFO datanode.BlockScanner: Initialized block scanner with targetBytesPerSec 1048576

2023-11-30 21:50:24,237 INFO datanode.DataNode: Configured hostname is localhost

2023-11-30 21:50:24,238 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling

2023-11-30 21:50:24,242 INFO datanode.DataNode: Starting DataNode with maxLockedMemory = 0

2023-11-30 21:50:24,278 INFO datanode.DataNode: Opened streaming server at /0.0.0.0:9866

2023-11-30 21:50:24,279 INFO datanode.DataNode: Balancing bandwidth is 104857600 bytes/s

2023-11-30 21:50:24,279 INFO datanode.DataNode: Number threads for balancing is 100

2023-11-30 21:50:24,319 INFO util.log: Logging initialized @2069ms to org.eclipse.jetty.util.log.Slf4jLog

2023-11-30 21:50:24,418 WARN server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets. Reason: Could not read signature secret file: /Users/sonalpunchihewa/hadoop-http-auth-signature-secret

2023-11-30 21:50:24,423 INFO http.HttpRequestLog: Http request log for http.requests.datanode is not defined

2023-11-30 21:50:24,439 INFO http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)

2023-11-30 21:50:24,442 INFO http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode

2023-11-30 21:50:24,442 INFO http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static

2023-11-30 21:50:24,442 INFO http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs

2023-11-30 21:50:24,477 INFO http.HttpServer2: Jetty bound to port 62237

2023-11-30 21:50:24,479 INFO server.Server: jetty-9.4.51.v20230217; built: 2023-02-17T08:19:37.309Z; git: b45c405e4544384de066f814ed42ae3dceacdd49; jvm 21.0.1+12-LTS-29

2023-11-30 21:50:24,503 INFO server.session: DefaultSessionIdManager workerName=node0

2023-11-30 21:50:24,503 INFO server.session: No SessionScavenger set, using defaults

2023-11-30 21:50:24,505 INFO server.session: node0 Scavenging every 660000ms

2023-11-30 21:50:24,522 INFO handler.ContextHandler: Started o.e.j.s.ServletContextHandler@548e76f1{logs,/logs,file:///usr/local/var/hadoop/,AVAILABLE}

2023-11-30 21:50:24,523 INFO handler.ContextHandler: Started o.e.j.s.ServletContextHandler@1ee4730{static,/static,file:///usr/local/Cellar/hadoop/3.3.6/libexec/share/hadoop/hdfs/webapps/static/,AVAILABLE}

2023-11-30 21:50:24,622 INFO handler.ContextHandler: Started o.e.j.w.WebAppContext@737edcfa{datanode,/,file:///usr/local/Cellar/hadoop/3.3.6/libexec/share/hadoop/hdfs/webapps/datanode/,AVAILABLE}{file:/usr/local/Cellar/hadoop/3.3.6/libexec/share/hadoop/hdfs/webapps/datanode}

2023-11-30 21:50:24,633 INFO server.AbstractConnector: Started ServerConnector@5a021cb9{HTTP/1.1, (http/1.1)}{localhost:62237}

2023-11-30 21:50:24,633 INFO server.Server: Started @2383ms

2023-11-30 21:50:24,738 WARN web.DatanodeHttpServer: Got null for restCsrfPreventionFilter - will not do any filtering.

2023-11-30 21:50:24,842 INFO web.DatanodeHttpServer: Listening HTTP traffic on /0.0.0.0:9864

2023-11-30 21:50:24,848 INFO datanode.DataNode: dnUserName = sonalpunchihewa

2023-11-30 21:50:24,848 INFO datanode.DataNode: supergroup = supergroup

2023-11-30 21:50:24,849 INFO util.JvmPauseMonitor: Starting JVM pause monitor

2023-11-30 21:50:24,893 INFO ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 1000, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, ipcBackoff: false.

2023-11-30 21:50:24,916 INFO ipc.Server: Listener at 0.0.0.0:9867

2023-11-30 21:50:24,917 INFO ipc.Server: Starting Socket Reader #1 for port 9867

2023-11-30 21:50:25,129 INFO datanode.DataNode: Opened IPC server at /0.0.0.0:9867

2023-11-30 21:50:25,168 INFO datanode.DataNode: Refresh request received for nameservices: null

2023-11-30 21:50:25,179 INFO datanode.DataNode: Starting BPOfferServices for nameservices: <default>

2023-11-30 21:50:25,187 INFO datanode.DataNode: Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000 starting to offer service

2023-11-30 21:50:25,194 INFO ipc.Server: IPC Server Responder: starting

2023-11-30 21:50:25,195 INFO ipc.Server: IPC Server listener on 9867: starting

2023-11-30 21:50:25,307 INFO datanode.DataNode: Acknowledging ACTIVE Namenode during handshakeBlock pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000

2023-11-30 21:50:25,310 INFO common.Storage: Using 1 threads to upgrade data directories (dfs.datanode.parallel.volumes.load.threads.num=1, dataDirs=1)

2023-11-30 21:50:25,319 INFO common.Storage: Lock on /tmp/hadoop-sonalpunchihewa/dfs/data/in_use.lock acquired by nodename 26063@Sonals-MacBook-Air.local

2023-11-30 21:50:25,323 WARN common.Storage: Failed to add storage directory [DISK]file:/tmp/hadoop-sonalpunchihewa/dfs/data

java.io.IOException: Incompatible clusterIDs in /private/tmp/hadoop-sonalpunchihewa/dfs/data: namenode clusterID = CID-97bdde6d-31e0-4ea9-bfd2-237aa6eac8fc; datanode clusterID = CID-3e1e75f3-f00d-4a85-acdb-fd8cccf4e363

at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:746)

at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:296)

at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:409)

at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:389)

at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:561)

at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:2059)

at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1995)

at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:394)

at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:312)

at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:891)

at java.base/java.lang.Thread.run(Thread.java:1583)

2023-11-30 21:50:25,326 ERROR datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid 2b6d373f-e587-4c49-8564-6339b7b939e2) service to localhost/127.0.0.1:9000. Exiting.

java.io.IOException: All specified directories have failed to load.

at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:562)

at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:2059)

at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1995)

at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:394)

at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:312)

at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:891)

at java.base/java.lang.Thread.run(Thread.java:1583)

2023-11-30 21:50:25,326 WARN datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid 2b6d373f-e587-4c49-8564-6339b7b939e2) service to localhost/127.0.0.1:9000

2023-11-30 21:50:25,326 INFO datanode.DataNode: Removed Block pool <registering> (Datanode Uuid 2b6d373f-e587-4c49-8564-6339b7b939e2)

2023-11-30 21:50:27,328 WARN datanode.DataNode: Exiting Datanode

2023-11-30 21:50:27,335 INFO datanode.DataNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down DataNode at Sonals-MacBook-Air.local/127.0.0.1
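
The decisive line in this log is the java.io.IOException: Incompatible clusterIDs - the datanode's storage directory was formatted under a different namenode instance, which typically happens after re-running hdfs namenode -format while old datanode data is still on disk. A hedged fix for a disposable single-node setup (this deletes all HDFS block data; the path is taken from the log above):

rm -rf /tmp/hadoop-sonalpunchihewa/dfs/data   # wipe the stale datanode storage directory
hdfs --daemon start datanode                  # on restart the datanode adopts the namenode's clusterID

On a cluster whose data matters, the alternative is to edit the clusterID field in dfs/data/current/VERSION to match the namenode's instead of deleting anything.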

0 Comments
2023/11/30
16:23 UTC

1

BFS MapReduce

Hey everyone, I am starting out with MapReduce and I'm stuck trying to figure out the best way to program an iterative BFS using MapReduce. Can someone please help me figure this out?
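
For what it's worth, here is a minimal sketch of one BFS level as a Hadoop Streaming job in Python. The record format is my own convention, not a standard one: node<TAB>dist<TAB>comma-separated-neighbors, with dist = -1 meaning "not reached yet":

#!/usr/bin/env python3
# bfs_round.py - one BFS level per MapReduce job.
# Run as "bfs_round.py map" for the mapper and "bfs_round.py reduce" for the reducer.
import sys

def mapper():
    for line in sys.stdin:
        node, dist, adj = line.rstrip("\n").split("\t")
        dist = int(dist)
        print(f"{node}\t{dist}\t{adj}")        # pass the node's own record through
        if dist >= 0:                          # reached nodes relax their neighbors
            for nbr in filter(None, adj.split(",")):
                print(f"{nbr}\t{dist + 1}\t")  # tentative distance, empty adjacency

def reducer():
    cur, best, adj = None, -1, ""
    def flush():
        if cur is not None:
            print(f"{cur}\t{best}\t{adj}")
    for line in sys.stdin:
        node, dist, a = line.rstrip("\n").split("\t")
        if node != cur:
            flush()
            cur, best, adj = node, -1, ""
        if a:
            adj = a                            # keep the real adjacency list
        d = int(dist)
        if d >= 0 and (best < 0 or d < best):
            best = d                           # minimum tentative distance wins
    flush()

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()

The "iterative" part lives in the driver: re-run the job with each round's output as the next round's input until no distance changes. One job per BFS level, so the number of rounds equals the graph's depth from the source.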

0 Comments
2023/11/29
22:38 UTC

1

Simulating a cluster on a single machine using Docker

Hi all,

I'm working on Apache Hadoop for my Master's thesis. I don't have access to a real cluster of computers to test on, so I've decided to simulate a cluster on a single computer, leveraging Docker containers for that.
I just have one doubt: how do the containers communicate among themselves? I've seen that some passwordless SSH is required, but some Docker Hadoop examples don't configure anything related to SSH, while in other places passwordless SSH is configured...

I don't understand the role passwordless SSH plays in a Hadoop cluster. Also, I've seen in the Hadoop documentation that the nodes communicate via TCP, I guess.

Thanks in advance!

3 Comments
2023/11/29
17:54 UTC

1

Heap size

How do I find the heap size of the NameNode?
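
A few hedged ways to check, assuming shell access to the namenode host and standard JDK tools:

jps | grep NameNode        # find the NameNode's pid
jcmd <pid> GC.heap_info    # current and maximum heap (JDK 9+; on JDK 8, jmap -heap <pid>)

The configured value comes from HADOOP_HEAPSIZE_MAX or HDFS_NAMENODE_OPTS in hadoop-env.sh, and the NameNode web UI overview page also reports heap used versus total.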

0 Comments
2023/11/29
01:11 UTC

1

Really basic surface level question (possibly stupid)

I am trying to understand the relationship between Apache Hadoop and Avro - if you need to exchange data between Hadoop components, why wouldn’t you use Avro? What are the pros and cons of using it, and what are the alternatives?

Any insight is appreciated.

0 Comments
2023/11/28
20:28 UTC

1

Oozie - Auto kill running workflow after some time

Hi!

I have a workflow defined using Oozie. It runs on a schedule every day and takes a couple of hours to run. Sometimes, it gets "stuck" in a RUNNING status. I'd like to make sure that when we reach the next schedule, the RUNNING workflow gets killed so that a new one can be provisioned and started.

It could also be that after X hours, any RUNNING workflows are killed/failed. I can't find a way to achieve this. Any ideas?
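
One blunt workaround, assuming the Oozie CLI is available: a cron job just before each schedule that lists RUNNING workflows and kills the stale one. The server URL and name filter below are illustrative:

oozie jobs -jobtype wf -filter "status=RUNNING;name=my-workflow" -oozie http://oozie-host:11000/oozie
oozie job -kill <stale-workflow-id> -oozie http://oozie-host:11000/oozie

Not elegant, but it caps the damage while you work out why the workflow hangs in the first place.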

2 Comments
2023/11/27
16:12 UTC

2

Help needed with Hadoop MapReduce Job

Apologies in advance if any of the below is poorly explained, I am a Hadoop novice and have very little overall programming experience.

For a college assignment, I have installed Hadoop on my Mac. I installed Hadoop (v3.3.6) using HomeBrew. I am running Hadoop inside Terminal on my Mac.

The install was successful and Hadoop is configured (after a month of struggling), I am now trying to set up a single node Hadoop cluster and run a small WordCount MapReduce job in standard mode, using an example jar file that comes with Hadoop (hadoop-streaming-3.3.6.jar).

When I run the MapReduce job, I check the status using the ResourceManager web UI (accessed through http://localhost:8088/). The job has been accepted but moves no further than that. I have tried checking the log files, but the log files relating to 'YARN ResourceManager' and 'YARN NodeManager' don't appear to be generated.

Does anyone have any suggestions on what I could try, to troubleshoot why the MapReduce job is not running (just staying in the ACCEPTED state) and why the YARN log files are not being generated?
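
For what it's worth, a job parked in ACCEPTED usually means YARN cannot find a live NodeManager (or enough configured memory/vcores) to launch the ApplicationMaster. Some hedged first checks, assuming a standard Hadoop 3 install:

yarn node -list            # should show at least one RUNNING NodeManager
yarn application -list     # confirms the job is still in ACCEPTED
hdfs dfsadmin -report      # sanity-check that HDFS itself is up

If yarn node -list comes back empty, the NodeManager never started, which would also explain why its log file was never written.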

If it is needed, the specs of my Mac are:
2 GHz Quad-Core Intel Core i5
16 GB 3733 MHz LPDDR4X
14.1.1 (23B81)

Thanks in advance!

0 Comments
2023/11/14
21:56 UTC

1

Yarn application has already ended

I am trying to install Spark with Hadoop on WSL, but I keep getting this error after executing spark-shell.

I am new to Hadoop and couldn't find many resources. What am I missing? How can I access the YARN application logs?

ERROR YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
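
On the log question: assuming log aggregation is enabled, the standard way to pull them is with the application id that spark-shell prints when it submits to YARN (the id below is a placeholder):

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX

If aggregation is off, the container logs sit under the NodeManager's local log directory instead. Either way, the logs usually name the real cause; on laptop/WSL setups it is commonly YARN containers configured with too little memory.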

1 Comment
2023/11/10
12:08 UTC

1

The Live Nodes number is 0 and org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint error

I have set up a Hadoop cluster across 4 virtual machines, consisting of 1 Namenode and 3 Datanodes (with the Namenode also serving as the Secondary Namenode). However, currently, we are facing an issue where the number of Live Nodes in our Hadoop cluster is showing as 0. Upon reviewing the logs, it appears that there is an error message indicating 'org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint,' as shown in the screenshot below. What could be the potential reasons for this situation, and how can we resolve this problem to ensure the cluster functions correctly?

https://preview.redd.it/capew2hqkdsb1.jpg?width=1348&format=pjpg&auto=webp&s=fc958df3c793c06368673aea1d8d23a9cc6b1361

https://preview.redd.it/3jmw83hqkdsb1.jpg?width=1232&format=pjpg&auto=webp&s=20487c797fb55d425048029d7cf26dba3c51b8c5

2 Comments
2023/10/05
12:11 UTC

0

Could not find or load main class

0 Comments
2023/10/03
08:36 UTC

0

Hortonworks Hadoop with VMWare

So I downloaded the Hortonworks file to use with VMware. I downloaded VMware and loaded the file.

Now what's next?

0 Comments
2023/09/23
09:26 UTC
