
7,395 Subscribers

[Meetup] Functional World #12 | How to handle things in your project without DevOps around?

Just one week to go! Join Functional World event - this time it's open to everyone - no matter what language you code in ;) Join us for a hands-on session where you’ll learn how to keep things smooth and fast, even when there’s no DevOps around. And leading the way is Darek, our DevOps pro from Scalac :) You can read more or sign up here if you want to be updated ;) You can also go directly on YouTube.

See you on November 19th at 6 PM CET!

1 Comment

2024/11/12
09:20 UTC

How to use EKS pod identity with hive metastore

Hi all, I am running Hive Metastore inside an EKS pod. The pod uses EKS Pod Identity to get access to S3. I verified that the container can reach S3 by running aws s3 ls, but the Hive Metastore fails when it tries to access S3. The error lists all the credentials providers it tried, and it looks like EKS Pod Identity is not supported. Has anyone faced this issue before? Thanks!
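For anyone debugging the same thing, it may help to first confirm which credential mechanism the pod actually exposes. The sketch below is only an illustration: `detect_credential_source` is a hypothetical helper, and it assumes that EKS Pod Identity injects `AWS_CONTAINER_CREDENTIALS_FULL_URI` together with `AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE`, while IAM Roles for Service Accounts injects `AWS_ROLE_ARN` together with `AWS_WEB_IDENTITY_TOKEN_FILE`. Whether the metastore's S3 connector can use the container endpoint depends on the AWS SDK bundled with it, which the post does not show.

```python
import os
from typing import Mapping, Optional


def detect_credential_source(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess which AWS credential mechanism the environment advertises."""
    env = os.environ if env is None else env
    # EKS Pod Identity: container credentials endpoint plus a token file.
    if env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI") and env.get(
        "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE"
    ):
        return "eks-pod-identity"
    # IRSA: role ARN plus a projected web identity token.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "irsa"
    # Plain static keys.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static-keys"
    return "unknown"
```

Calling `detect_credential_source()` inside the metastore container and comparing the result with the provider names listed in the error should show whether the provider chain simply never looks at the container endpoint.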

0 Comments

2024/10/18
11:58 UTC

How to go about testing a new Hadoop cluster

0 Comments

2024/10/15
19:37 UTC

How to go about testing a new Hadoop cluster

0 Comments

2024/10/15
19:36 UTC

Hive import is not compatible with importing into AVRO format.

I'm trying to import a MySQL database into a Hive database with Sqoop as an Avro data file, but I'm getting an error saying that Hive import is not compatible with importing into AVRO format.

This is my command:
sqoop import --connect jdbc:mysql://localhost:3306/election_tunisie_2024 --connection-manager org.apache.sqoop.manager.MySQLManager --username root --password cloudera --table candidats --hive-database 4ds6 --hive-import --hive-table candidats --hive-drop-import-delims --m 1 --as-avrodatafile

Output:

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.

Please set $ACCUMULO_HOME to the root of your Accumulo installation.

24/10/14 16:18:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.12.0

24/10/14 16:18:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

24/10/14 16:18:15 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override

24/10/14 16:18:15 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.

Hive import is not compatible with importing into AVRO format.
thanks
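The message comes from Sqoop's own option check: this Sqoop version refuses `--hive-import` together with `--as-avrodatafile`. One workaround, sketched here under assumptions, is to import the Avro files into an HDFS directory first and then define an external Hive table over that directory. The helper names, the target directory, and the schema location passed in are hypothetical; the code only builds the command line and the DDL text and runs nothing.

```python
from typing import List


def build_avro_import(jdbc_url: str, table: str, target_dir: str,
                      username: str, mappers: int = 1) -> List[str]:
    """Sqoop arguments for a plain HDFS import as Avro (no Hive options)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                      # prompt for the password instead of --password
        "--table", table,
        "--target-dir", target_dir,
        "--as-avrodatafile",
        "-m", str(mappers),
    ]


def external_table_ddl(database: str, table: str,
                       location: str, schema_url: str) -> str:
    """Hive DDL for an external Avro table over the imported directory."""
    return (
        f"CREATE EXTERNAL TABLE `{database}`.`{table}`\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('avro.schema.url'='{schema_url}');"
    )
```

For the command in the post this would mean something like `build_avro_import("jdbc:mysql://localhost:3306/election_tunisie_2024", "candidats", "/user/cloudera/candidats_avro", "root")`, followed by running the generated DDL in Hive with the `.avsc` schema file uploaded to a location of your choice.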

1 Comment

2024/10/14
16:23 UTC

Need advice on what database to implement for a big retail company.

Hello, We want to set up and deploy a Hadoop ecosystem for a large retail company. However, we are not sure which technologies to use. Should we choose Cassandra, Hive, or Spark as the database?

Our requirements are as follows: It needs to be fast, real-time, and high-performance. We currently have 20 TB of data. I am open to suggestions.

5 Comments

2024/09/26
13:45 UTC

Help Needed: Hadoop Installation Error in Docker Environment

Hi r/hadoop,

I'm learning Big Data and related software, following this tutorial: Realtime Socket Streaming with Apache Spark | End to End Data Engineering Project. I'm trying to set up Hadoop using Docker, but I'm encountering an error during installation:

Error: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

Here's my setup:

I'm using a Docker-compose.yml file to set up multiple services including namenode, datanode, resourcemanager, nodemanager, and Spark master/worker.
In my Docker-compose.yml, I've set the HADOOP_HOME environment variable for each Hadoop service:
environment:
  HADOOP_HOME: /opt/hadoop
  PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH

I'm using the apache/hadoop:3 image for Hadoop services and bitnami/spark:latest for Spark services.
I've created a custom Dockerfile.spark that extends from apache/hadoop:latest and bitnami/spark:latest, and installs Python requirements.

Despite setting HADOOP_HOME in the Docker-compose.yml, I'm still getting the error about HADOOP_HOME being unset.

Has anyone encountered this issue before? Any suggestions on how to properly set HADOOP_HOME in a Docker environment or what might be causing this error?

docker-compose.yml

version: '3'
services:
  namenode:
    image: apache/hadoop:3
    hostname: namenode
    command: [ "hdfs", "namenode" ]
    ports:
      - 9870:9870
    env_file:
      - ./config2
    environment:
      ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  datanode:
    image: apache/hadoop:3
    command: [ "hdfs", "datanode" ]
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  resourcemanager:
    image: apache/hadoop:3
    hostname: resourcemanager
    command: [ "yarn", "resourcemanager" ]
    ports:
      - 8088:8088
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./test.sh:/opt/test.sh
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  nodemanager:
    image: apache/hadoop:3
    command: [ "yarn", "nodemanager" ]
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  spark-master:
    container_name: spark-master
    hostname: spark-master
    build:
      context: .
      dockerfile: Dockerfile.spark
    command: bin/spark-class org.apache.spark.deploy.master.Master
    volumes:
      - ./config:/opt/bitnami/spark/config
      - ./jobs:/opt/bitnami/spark/jobs
      - ./datasets:/opt/bitnami/spark/datasets
      - ./requirements.txt:/requirements.txt
    ports:
      - "9090:8080"
      - "7077:7077"
    networks:
      - code-with-yu

  spark-worker: &worker
    container_name: spark-worker
    hostname: spark-worker
    build:
      context: .
      dockerfile: Dockerfile.spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    volumes:
      - ./config:/opt/bitnami/spark/config
      - ./jobs:/opt/bitnami/spark/jobs
      - ./datasets:/opt/bitnami/spark/datasets
      - ./requirements.txt:/requirements.txt
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_MASTER_URL: spark://spark-master:7077
    networks:
      - code-with-yu

  
#  spark-worker-2:
  #    <<: *worker
  #
  #  spark-worker-3:
  #    <<: *worker
  #
  #  spark-worker-4:
  #    <<: *worker
  
networks:
    code-with-yu:

Thanks in advance for any help!
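One thing that stands out in the file above: `HADOOP_HOME` is set on the four `apache/hadoop` services but not on `spark-master` or `spark-worker`, so if the exception is raised inside a Spark container, that would be consistent with the message. Below is a small sketch for listing the services that lack a variable. It assumes the compose file has already been parsed into a dict (loading the YAML is left out), and `services_missing_env` is a hypothetical helper, not part of any Docker tooling.

```python
from typing import Any, Dict, List


def services_missing_env(compose: Dict[str, Any], var: str) -> List[str]:
    """Return the names of services whose `environment` does not define `var`."""
    missing = []
    for name, spec in (compose.get("services") or {}).items():
        env = (spec or {}).get("environment") or {}
        if isinstance(env, list):
            # List form: ["KEY=value", ...]
            keys = {item.split("=", 1)[0] for item in env}
        else:
            # Mapping form: {KEY: value, ...}
            keys = set(env)
        if var not in keys:
            missing.append(name)
    return sorted(missing)
```

Note that this only looks at the `environment` key; a variable could still arrive through `env_file` or be baked into the image, so treat the result as a hint about where to look.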

1 Comment

2024/07/23
16:38 UTC

ERROR in Hadoop (pls help)

When I entered hdfs namenode -format in the command prompt, it responded with Error: Could not find or load main class . What should I do?

0 Comments

2024/07/11
09:30 UTC

I think we're doing cloud architecture management wrong and blueprints might help.

Hey Reddit, I'm Rohit, the co-founder and CTO of Facets.

Most of us know construction blueprints - the plans that coordinate various aspects of building construction. They are comprehensive guides, detailing every aspect of a building from electrical systems to plumbing. They ensure all teams work in harmony, preventing chaos like accidentally installing a sink in the bedroom.

Similar to that...

We regularly deal with a variety of services, components, and configurations spread across complex systems that need to work together.

And without a unified view, it is easy for things to get messy:

Configuration drift
Repetition of work
Difficulty onboarding new team members
The classic "it works on my machine" problem

A "cloud blueprint" could theoretically solve these issues. Here's what it might look like:

A live, constantly updated view of your entire architecture
Detailed mapping of all services, components, and their interdependencies
A single source of truth for both Dev and Ops teams
A tool for easily replicating environments or spinning up new ones

If we implement it right, this system could help declare your architecture once and then use that declaration to launch new environments on any cloud without repeating everything.

It becomes a single source of truth, ensuring consistency across different instances and providing a clear overview of the entire architecture.

Of course, implementing such a system would come with challenges. How do you handle rapid changes in cloud environments? What about differences between cloud providers? How do you balance detail with usability?

This thought led me and my co-founders to create Facets. We were facing the same challenges at our day jobs and it became frustrating enough for us to write a solution from scratch.

You can create a comprehensive cloud blueprint that automatically adapts to changes, works across different cloud providers, and strikes a balance between detail and usability.

This video explains the concept of blueprints better than I might have.

I'm curious to hear your thoughts. Do you see this being useful to your cloud infra management? Or have you created a different method for solving this problem at your org?

0 Comments

2024/06/28
11:58 UTC

A Novel Fault-Tolerant, Scalable, and Secure Distributed Database Architecture

In my PhD thesis, I have designed a novel distributed database architecture named "Parallel Committees." This architecture addresses some of the same challenges as NoSQL databases, particularly in terms of scalability and security, but it also aims to provide stronger consistency.

The thesis explores the limitations of classic consensus mechanisms such as Paxos, Raft, or PBFT, which, despite offering strong and strict consistency, suffer from low scalability due to their high time and message complexity. As a result, many systems adopt eventual consistency to achieve higher performance, though at the cost of strong consistency.
In contrast, the Parallel Committees architecture employs classic fault-tolerant consensus mechanisms to ensure strong consistency while achieving very high transactional throughput, even in large-scale networks. This architecture offers an alternative to the trade-offs typically seen in NoSQL databases.

Additionally, my dissertation includes comparisons between the Parallel Committees architecture and various distributed databases and data replication systems, including Apache Cassandra, Amazon DynamoDB, Google Bigtable, Google Spanner, and ScyllaDB.

I have prepared a video presentation outlining the proposed distributed database architecture, which you can access via the following YouTube link:

https://www.youtube.com/watch?v=EhBHfQILX1o

A narrated PowerPoint presentation is also available on ResearchGate at the following link:

https://www.researchgate.net/publication/381187113_Narrated_PowerPoint_presentation_of_the_PhD_thesis

My dissertation can be accessed on ResearchGate via the following link: Ph.D. Dissertation

If needed, I can provide more detailed explanations of the problem and the proposed solution.

I would greatly appreciate feedback and comments on the distributed database architecture proposed in my PhD dissertation. Your insights and opinions are invaluable, so please feel free to share them without hesitation.

2 Comments

2024/06/08
15:04 UTC

Hadoop prerequisites

Should I learn Java and Linux before starting with Hadoop?

2 Comments

2024/05/13
13:21 UTC

AWS Snowmobile

With AWS Snowmobile being retired, what do people think are the best methods for uploading PB+ scale Hadoop datasets into the cloud?

0 Comments

2024/04/26
14:45 UTC

kerberos -I think- related error on datanodes while cluster is running

So I am playing around, trying to create a proper Kerberized Hadoop installation. I have a namenode, a secondary namenode, and 3 datanodes, and I thought I had got it to work. It does, kind of. I have kinit'ed all my keytabs, and the cluster starts up. I have compiled jsvc, which starts the datanodes as root and then drops down to the hdfs account. I can see Hadoop running on all 5 VMs, so everything looked good, or so I thought.

Looking in the logs on a datanode, I see this error while the cluster runs for about half an hour, until I stop it:

2024-04-24 16:14:14,376 WARN org.apache.hadoop.ipc.Client: Couldn't setup connection for dn/doop3.myDomain.tld@HADOOP.KERB to nnode.myDomain.tld/192.168.0.160:8020 org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed

First off, I get the same error on all 3 datanodes, and I checked that there is an actual connection with ncat on the datanode, like this: 'nc nnode.myDomain.tld 8020', and it connects fine.

So obviously I worry that my Kerberos is not working. But the nodes will not start up if the keytab file is not working. So in order to start, the namenode and the datanode each do a Kerberos login, and that works. And then it stops working(?)

My keytabs look like the ones in the Hadoop documentation: [https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS](https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS)

On my namenode (ok, I regret having the hdfs/-principal in there, but not referenced so w/e):

klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab

Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)

And here on my datanode:

klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab

Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
   4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)

On the datanode and namenode, checking the principals with kinit -t, as mentioned in this article [https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429](https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429), gives no error, and as I said, the nodes start, so the initial Kerberos checks are accepted.

Again reading the error, I can't understand what it *actually* tells me. The cluster seems to continue to stay running until I shut it down. I have had it running for like half an hour, before I stopped it.

I thought of perhaps adding the credentials from all 5 VMs into keytab and just kinit all of it on all of them, but it doesn't seem reasonable.

This error is mentioned many times in google searches but nothing I find matches my scenario or fixes my issue.

hdfs-site.xml and core-site.xml on the 2 nodes are shown here, instead of making the post even longer: [https://pastebin.com/QLT6GqVd](https://pastebin.com/QLT6GqVd)

Any clues on what the error expects me to look into are much appreciated. I have tried following Hadoop's Kerberos documentation, which is the basis of my setup, if that matters.

0 Comments

2024/04/24
22:57 UTC

Apache Ranger UserSync Configuration HELP!!

I am trying to configure Apache Ranger UserSync with Unix, and I am stuck at this point:

After I execute this: sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ ./setup.sh

Then this error pops up:

teka@t3:/usr/local/ranger-usersync$ sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 ./setup.sh

[sudo] password for teka:

INFO: moving [/etc/ranger/usersync/conf/java_home.sh] to [/etc/ranger/usersync/conf/.java_home.sh.28032024144333] .......

Direct Key not found:SYNC_GROUP_USER_MAP_SYNC_ENABLED

Direct Key not found:hadoop_conf

Direct Key not found:ranger_base_dir

Direct Key not found:USERSYNC_PID_DIR_PATH

Direct Key not found:rangerUsersync_password

Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper

at org.apache.ranger.credentialapi.CredentialReader.getDecryptedString(CredentialReader.java:39)

at org.apache.ranger.credentialapi.buildks.createCredential(buildks.java:87)

at org.apache.ranger.credentialapi.buildks.main(buildks.java:41)

Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper

at java.net.URLClassLoader.findClass(URLClassLoader.java:387)

at java.lang.ClassLoader.loadClass(ClassLoader.java:418)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)

at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

... 3 more

ERROR: Unable update the JCKSFile(/etc/ranger/usersync/conf/rangerusersync.jceks) for aliasName (usersync.ssl.key.password)

Can anyone help me with that?

Tools I am using:

Host Device: MacBook M1

Guest Device: Ubuntu 20.04 LTS

Apache Ranger: 2.4 (built from source code)

0 Comments

2024/03/28
14:54 UTC

Need Guidance, 4th semester Data Science Student

Hey everyone,

I'm currently in my 4th semester of data science, and while I've covered a fair bit of ground in terms of programming languages like C++ and Python (with a focus on numpy, pandas, and basic machine learning), I'm finding myself hitting a roadblock when it comes to diving deeper into big data concepts.

In my current semester, I'm taking a course on the fundamentals of Big Data. Unfortunately, the faculty at my university isn't providing the level of instruction I need to fully grasp the concepts. We're tackling algorithms like LSH, PageRank, and delving into Hadoop (primarily mapreduce for now), but I'm struggling to translate this knowledge into practical coding skills. For instance, I'm having difficulty writing code for mappers and reducers in Hadoop, and I feel lost when it comes to utilizing clusters and master-slave nodes effectively.

To add to the challenge, we've been tasked with building a search engine using mapreduce in Hadoop, which requires understanding concepts like IDF, TF, and more – all of which we're expected to learn on our own within a tight deadline of 10 days.

I'm reaching out to seek guidance on how to navigate this situation. How can I set myself on a path to learn big data in a more effective manner, considering my time constraints? My goal is to be able to land an internship or entry-level position in the data science market within the next 6-12 months.

Additionally, any tips on approaching this specific assignment would be immensely helpful. How should I go about tackling the task of building a search engine within the given timeframe, given my current level of understanding and the resources available?

Any guidance, advice, or resources you can offer would be greatly appreciated. Thank you in advance for your help!
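To make the TF and IDF part of the assignment concrete, here is a minimal in-memory sketch of the map and reduce steps. It is not a Hadoop job; it only mimics the phases with plain Python so the data flow is visible. The function names are made up for illustration, and the weighting (term count divided by document length, times the natural log of N over document frequency) is one common variant, assumed here rather than taken from the course.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Key = Tuple[str, str]  # (term, doc_id)


def map_terms(doc_id: str, text: str) -> Iterator[Tuple[Key, int]]:
    """Mapper: emit ((term, doc_id), 1) for every whitespace-separated token."""
    for token in text.lower().split():
        yield (token, doc_id), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reducer: sum the emitted 1s per (term, doc_id) key."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def tf_idf(docs: Dict[str, str]) -> Dict[Key, float]:
    """Combine the phases: term frequency per document times log(N / df)."""
    pairs = [p for doc_id, text in docs.items() for p in map_terms(doc_id, text)]
    counts = reduce_counts(pairs)

    doc_len: Counter = Counter()          # total tokens per document
    doc_freq = defaultdict(set)           # documents containing each term
    for (term, doc_id), n in counts.items():
        doc_len[doc_id] += n
        doc_freq[term].add(doc_id)

    n_docs = len(docs)
    return {
        (term, doc_id): (n / doc_len[doc_id]) * math.log(n_docs / len(doc_freq[term]))
        for (term, doc_id), n in counts.items()
    }
```

A term that appears in every document scores zero, which is the point of the IDF factor: for a search engine, query terms are matched against these scores and documents are ranked by the sum.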

0 Comments

2024/03/21
20:18 UTC

Hive Shell Issues

0 Comments

2024/03/19
08:25 UTC

Hadoop Installation

0 Comments

2024/03/17
13:13 UTC

Help with setup on Mac

Hi guys, I have been trying to run Apache Hadoop (3.3.1) on my M1 Pro machine and I keep getting the error "Cannot set priority of namenode process XXXXX". As I understand it, macOS is not allowing the background process to be invoked. Is there any possible fix for this?

0 Comments

2024/03/16
16:45 UTC

Namenode Big Heap

Hi guys,

Long story short: I'm running a big Hadoop cluster with lots of files.

Currently the namenode has 20 GB of heap, which is almost full the whole time, with some long garbage collection cycles that free up little to no memory.

Is there anybody running namenodes with 24 or 32 GB of heap?

Is there any particular tuning needed?

Regards
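For a rough sanity check on sizing, a commonly quoted rule of thumb (an assumption here, not something from this thread) is about 1 GB of NameNode heap per million blocks. The sketch below just does that arithmetic; the function names and the default ratio are illustrative, and real usage depends on file count, directory depth, and JVM settings.

```python
def estimate_namenode_heap_gb(blocks: int, gb_per_million_blocks: float = 1.0) -> float:
    """Rule-of-thumb heap estimate; the default ratio is an assumption."""
    return blocks / 1_000_000 * gb_per_million_blocks


def blocks_for_heap(heap_gb: float, gb_per_million_blocks: float = 1.0) -> int:
    """Inverse: roughly how many blocks a given heap is expected to cover."""
    return int(heap_gb / gb_per_million_blocks * 1_000_000)
```

Comparing the block count shown in the NameNode web UI with `blocks_for_heap(20)` gives a first idea of whether 20 GB is simply too small or whether something else, such as a very large number of small files, is inflating the heap.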

5 Comments

2024/03/14
14:55 UTC

[Hiring] Big Data Engineer with Spark (located in Poland)

Who are we looking for?
We are looking for a Big Data Engineer with Spark who will be working on an external project in the credit risk domain. You should have expertise in the following technologies:
- At least 4 years of experience with Scala and Spark
- Excellent understanding of Hadoop
- Jenkins, HQL (Hive Queries), Oozie, Shell scripting, GIT, Splunk

As a Big Data Engineer, you will:
- Work on an external project and develop an application that is based on the Hadoop platform.
- Work with an international team of specialists.
- Design and implement database systems.
- Implement business logic based on the established requirements.
- Ensure the high quality of the delivered software code.
- Independently make decisions, even in high-risk situations.

Apply here: https://scalac.io/careers/senior-bigdata-engineer/

2 Comments

2024/03/12
09:03 UTC

Is there a way to access hadoop via eclipse

As the title suggests, I am new to Hadoop, and my instructor gave me a task to access it via Eclipse, that is, to access it through the Java API. I've searched through many videos, but most of them cover word count problems and don't solve my problem. Any suggestions?

1 Comment

2024/03/12
00:46 UTC

Cirata for Hadoop Migration

My company is exploring Cirata for a 5 PB data migration to Azure. The technology (centered on the Paxos algorithm) seems very impressive for large, unstructured datasets, but I'm not sure. Does anyone have any experience using them and any thoughts they would be willing to share?

Thanks in advance.

1 Comment

2024/02/23
14:00 UTC

Onprem HDFS alternatives for 10s of petabytes?

So I see lots of people dumping on Hadoop in general in this sub, but I feel a lot of the criticism is really aimed at YARN. I am wondering if that is also true for HDFS. Are there any on-prem storage alternatives that can scale to, say, 50 PB or more? Is there anything else that has equal or better performance and lower disk usage with equal or better resiliency, especially factoring in HDFS erasure coding with roughly 1.5x size on disk? Just curious what others are doing for storing large amounts of semi-structured data in 2024. Specifically, I'm dealing with a wide variety of data ranging from a few kilobytes to gigabytes per record.
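For reference, the 1.5x figure follows from a Reed-Solomon layout with 6 data units and 3 parity units, compared with 3x for triple replication. A quick sketch of the arithmetic (function names are just for illustration):

```python
def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte for a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


def raw_capacity_pb(logical_pb: float, overhead: float) -> float:
    """Raw disk needed for a given logical data size at a given overhead."""
    return logical_pb * overhead
```

With these, 50 PB of logical data needs `raw_capacity_pb(50, ec_overhead(6, 3))`, which is 75 PB of raw disk, versus 150 PB at 3x replication, so any alternative has to beat the 75 PB figure to win on disk usage.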

12 Comments

2024/01/27
05:18 UTC

HIVE HELP NEEDED !!!

Hi guys, it's my first time using Hive, and I just set it up following a Udemy course guideline. I got this error saying that schematool failed due to a Hive exception.

Error: Syntax error: Encountered "statement_timeout" at line 1, column 5. (state=42X01,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

Can someone help me with this? I also followed some Stack Overflow troubleshooting links, and they did not work, even after removing the metastore file and re-initialising it.

Please help. Thank you for your time and patience. Your friendly neighborhood big data noob!!!

5 Comments

2024/01/26
13:00 UTC

Big Companies: Java Hadoop or Hadoop streaming

Hello all, I was wondering, from your experience in the industry: at big companies (in terms of market leadership, not only size), is the Java approach to writing MapReduce jobs more popular, or the Hadoop Streaming approach? It would be very useful to know whether I need to brush up my Java skills or can stick with the Python streaming approach in order to promote myself as a capable Hadoop MapReduce practitioner.
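For context on what the streaming approach looks like in Python: the mapper and reducer are ordinary programs that exchange tab-separated key/value lines, and the reducer sees its input sorted by key. The sketch below shows that contract for a word count, written as functions over iterables of lines so it can be tried without a cluster; the function names are illustrative.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for each word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally, `reducer(sorted(mapper(lines)))` stands in for the shuffle and sort that the framework performs between the two stages; in a real job each function would read `sys.stdin` and print its output.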

9 Comments

2024/01/17
02:15 UTC

Hadoop for GIS?

I've been a construction surveyor for 17 years and know CAD, desktop GIS, some programming, point clouds, and photogrammetry.

I wonder if databases with Hadoop would be useful to learn.

1 Comment

2024/01/05
12:41 UTC

how to use this program?

So, my teacher gave us an activity that uses Hadoop, but he never really taught us how to use it, and I can't find any tutorial on how to do it. Can someone here help me? I don't even know how to start the program. The activity is the following:

As you noted, this unit does not have self-correction activities. A more practical activity is proposed, considering that you already have the Hadoop platform installed, as well as Mahout. Therefore, you will be able to carry out the experiments proposed here, where a Reuters text base is available.

The idea of the activity is for you to run the kmeans algorithm using one of the folders with the texts, and analyze the result of the algorithm. Observe the clusters generated, and whether the subjects are in fact related to each other. If you want to use other text bases, the sequence of commands should work.

Below is the example and sequence of commands used: Base Reuters C50train

hadoop fs -copyFromLocal C50/ /

./mahout seqdirectory -i /C50/C50train -o /seqreuters -xm sequential

./mahout seq2sparse -i /seqreuters -o /train-sparse

./mahout kmeans -i /train-sparse/tfidf-vectors/ -c /kmeans-train-clusters -o /train-clusters-final -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 10 -ow

./mahout clusterdump -d /train-sparse/dictionary.file-0 -dt sequencefile -i /train-clusters-final/clusters-10-final -n 10 -b 100 -o ~/saida_clusters.txt -p /train-clusters-final/clustered-points
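To get a feel for what the kmeans step in that sequence does with the `-k` (number of clusters), `-x` (maximum iterations), and `-dm` (distance measure) options, here is a toy version in plain Python. It is only a sketch of the algorithm, not Mahout's implementation: the starting centroids are passed in by hand (an assumption made here to keep the example deterministic), and the points stand in for the TF-IDF vectors produced by seq2sparse.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, the measure named by -dm in the commands above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Vector], centroids: List[Vector],
           max_iter: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Assign points to the nearest centroid, recompute means, repeat."""
    current = [list(c) for c in centroids]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(current)), key=lambda i: euclidean(p, current[i]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for i, old in enumerate(current):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster where it was
        if updated == current:
            break  # converged before reaching max_iter
        current = updated
    return current, assignment
```

The clusterdump output in the activity is the text-corpus equivalent of the returned centroids: for each cluster it lists the highest-weighted terms, which is what you inspect to judge whether the grouped documents share a subject.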

5 Comments

2023/12/08
14:31 UTC