/r/hadoop
Topics about Apache's Hadoop cloud platform and its related ecosystem - Pig, Hive, Cassandra, ZooKeeper
Hi all, I am running Hive Metastore inside an EKS pod. The pod uses EKS Pod Identity to get access to S3. I confirmed that the container has access to S3 by running aws s3 ls, but Hive Metastore fails when it tries to access S3. The error lists all the credentials providers it tried, and it looks like EKS Pod Identity is not supported. Has anyone faced this issue before? Thanks!
I'm trying to import a MySQL database into a Hive database with Sqoop as an Avro data file, but I'm getting an error saying that Hive import is not compatible with the Avro format.
This is my command:
sqoop import --connect jdbc:mysql://localhost:3306/election_tunisie_2024 --connection-manager org.apache.sqoop.manager.MySQLManager --username root --password cloudera --table candidats --hive-database 4ds6 --hive-import --hive-table candidats --hive-drop-import-delims --m 1 --as-avrodatafile
Output:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
24/10/14 16:18:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.12.0
24/10/14 16:18:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
24/10/14 16:18:15 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
24/10/14 16:18:15 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
Hive import is not compatible with importing into AVRO format.
Thanks
Hello, We want to set up and deploy a Hadoop ecosystem for a large retail company. However, we are not sure which technologies to use. Should we choose Cassandra, Hive, or Spark as the database?
Our requirements are as follows: It needs to be fast, real-time, and high-performance. We currently have 20 TB of data. I am open to suggestions.
How to use Hadoop???
Honestly this is a stupid question but I can't find any help on YouTube and blogs.
I installed Hadoop and set up the environment on Windows 11 along with the JDK. But what now? I don't understand how to work with it or how to install the virtual machine, and I can't really find any good resources. I even tried Coursera and Udemy to see if they have something. Can someone please help me with it???
Hi r/hadoop,
I'm learning Big Data and related software, following this tutorial: Realtime Socket Streaming with Apache Spark | End to End Data Engineering Project. I'm trying to set up Hadoop using Docker, but I'm encountering an error during installation:
Error: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
Here's my setup:
I'm using a Docker-compose.yml file to set up multiple services including namenode, datanode, resourcemanager, nodemanager, and Spark master/worker.
In my Docker-compose.yml, I've set the HADOOP_HOME environment variable for each Hadoop service:
environment:
  HADOOP_HOME: /opt/hadoop
  PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
I'm using the apache/hadoop:3 image for Hadoop services and bitnami/spark:latest for Spark services.
I've created a custom Dockerfile.spark that extends from apache/hadoop:latest and bitnami/spark:latest, and installs Python requirements.
Despite setting HADOOP_HOME in the Docker-compose.yml, I'm still getting the error about HADOOP_HOME being unset.
Has anyone encountered this issue before? Any suggestions on how to properly set HADOOP_HOME in a Docker environment or what might be causing this error?
docker-compose.yml
version: '3'
services:
  namenode:
    image: apache/hadoop:3
    hostname: namenode
    command: [ "hdfs", "namenode" ]
    ports:
      - 9870:9870
    env_file:
      - ./config2
    environment:
      ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  datanode:
    image: apache/hadoop:3
    command: [ "hdfs", "datanode" ]
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  resourcemanager:
    image: apache/hadoop:3
    hostname: resourcemanager
    command: [ "yarn", "resourcemanager" ]
    ports:
      - 8088:8088
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./test.sh:/opt/test.sh
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  nodemanager:
    image: apache/hadoop:3
    command: [ "yarn", "nodemanager" ]
    env_file:
      - ./config2
    environment:
      HADOOP_HOME: /opt/hadoop
      PATH: /opt/hadoop/bin:/opt/hadoop/sbin:$PATH
    volumes:
      - ./hadoop-entrypoint.sh:/hadoop-entrypoint.sh
    entrypoint: ["/hadoop-entrypoint.sh"]
  spark-master:
    container_name: spark-master
    hostname: spark-master
    build:
      context: .
      dockerfile: Dockerfile.spark
    command: bin/spark-class org.apache.spark.deploy.master.Master
    volumes:
      - ./config:/opt/bitnami/spark/config
      - ./jobs:/opt/bitnami/spark/jobs
      - ./datasets:/opt/bitnami/spark/datasets
      - ./requirements.txt:/requirements.txt
    ports:
      - "9090:8080"
      - "7077:7077"
    networks:
      - code-with-yu
  spark-worker: &worker
    container_name: spark-worker
    hostname: spark-worker
    build:
      context: .
      dockerfile: Dockerfile.spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    volumes:
      - ./config:/opt/bitnami/spark/config
      - ./jobs:/opt/bitnami/spark/jobs
      - ./datasets:/opt/bitnami/spark/datasets
      - ./requirements.txt:/requirements.txt
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_MASTER_URL: spark://spark-master:7077
    networks:
      - code-with-yu
  # spark-worker-2:
  #   <<: *worker
  #
  # spark-worker-3:
  #   <<: *worker
  #
  # spark-worker-4:
  #   <<: *worker
networks:
  code-with-yu:
Thanks in advance for any help!
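One way to narrow this down is to check which services actually receive HADOOP_HOME. In the compose file as posted, the variable is set on the four Hadoop services but not on spark-master or spark-worker, so if the exception is raised inside a Spark container, that would be one plausible cause. The sketch below only illustrates the check: the `services` dictionary is a hand-copied, trimmed stand-in for the parsed compose file (PATH entries omitted), and `services_missing_var` is a hypothetical helper, not part of Docker or Hadoop.

```python
# Hand-copied, trimmed view of the compose file: service name -> environment.
# Assumption: only the environment keys matter for this check.
services = {
    "namenode": {
        "ENSURE_NAMENODE_DIR": "/tmp/hadoop-root/dfs/name",
        "HADOOP_HOME": "/opt/hadoop",
    },
    "datanode": {"HADOOP_HOME": "/opt/hadoop"},
    "resourcemanager": {"HADOOP_HOME": "/opt/hadoop"},
    "nodemanager": {"HADOOP_HOME": "/opt/hadoop"},
    "spark-master": {},  # no environment section in the posted file
    "spark-worker": {
        "SPARK_MODE": "worker",
        "SPARK_WORKER_CORES": "2",
        "SPARK_WORKER_MEMORY": "1g",
        "SPARK_MASTER_URL": "spark://spark-master:7077",
    },
}


def services_missing_var(services, var):
    """Return the sorted names of services whose environment lacks `var`."""
    return sorted(name for name, env in services.items() if var not in env)


missing = services_missing_var(services, "HADOOP_HOME")
print(missing)  # ['spark-master', 'spark-worker']
```

If the Spark containers turn out to be the ones raising the error, setting the variable in their environment section (or in Dockerfile.spark) would be the thing to try next; whether /opt/hadoop exists inside that image is something to verify rather than assume.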
When I entered hdfs namenode -format in the command prompt, it responded with Error: Could not find or load main class . What should I do?
Hey Reddit, I'm Rohit, the co-founder and CTO of Facets.
Most of us know construction blueprints - the plans that coordinate various aspects of building construction. They are comprehensive guides, detailing every aspect of a building from electrical systems to plumbing. They ensure all teams work in harmony, preventing chaos like accidentally installing a sink in the bedroom.
Similar to that...
We regularly deal with a variety of services, components, and configurations spread across complex systems that need to work together.
And without a unified view, it is easy for things to get messy:
A "cloud blueprint" could theoretically solve these issues. Here's what it might look like:
If we implement it right, this system could help declare your architecture once and then use that declaration to launch new environments on any cloud without repeating everything.
It becomes a single source of truth, ensuring consistency across different instances and providing a clear overview of the entire architecture.
Of course, implementing such a system would come with challenges. How do you handle rapid changes in cloud environments? What about differences between cloud providers? How do you balance detail with usability?
This thought led me and my co-founders to create Facets. We were facing the same challenges at our day jobs and it became frustrating enough for us to write a solution from scratch.
You can create a comprehensive cloud blueprint that automatically adapts to changes, works across different cloud providers, and strikes a balance between detail and usability.
This video explains the concept of blueprints better than I might have.
I'm curious to hear your thoughts. Do you see this being useful to your cloud infra management? Or have you created a different method for solving this problem at your org?
In my PhD thesis, I have designed a novel distributed database architecture named "Parallel Committees." This architecture addresses some of the same challenges as NoSQL databases, particularly in terms of scalability and security, but it also aims to provide stronger consistency.
The thesis explores the limitations of classic consensus mechanisms such as Paxos, Raft, or PBFT, which, despite offering strong and strict consistency, suffer from low scalability due to their high time and message complexity. As a result, many systems adopt eventual consistency to achieve higher performance, though at the cost of strong consistency.
In contrast, the Parallel Committees architecture employs classic fault-tolerant consensus mechanisms to ensure strong consistency while achieving very high transactional throughput, even in large-scale networks. This architecture offers an alternative to the trade-offs typically seen in NoSQL databases.
Additionally, my dissertation includes comparisons between the Parallel Committees architecture and various distributed databases and data replication systems, including Apache Cassandra, Amazon DynamoDB, Google Bigtable, Google Spanner, and ScyllaDB.
I have prepared a video presentation outlining the proposed distributed database architecture, which you can access via the following YouTube link:
https://www.youtube.com/watch?v=EhBHfQILX1o
A narrated PowerPoint presentation is also available on ResearchGate at the following link:
My dissertation can be accessed on ResearchGate via the following link: Ph.D. Dissertation
If needed, I can provide more detailed explanations of the problem and the proposed solution.
I would greatly appreciate feedback and comments on the distributed database architecture proposed in my PhD dissertation. Your insights and opinions are invaluable, so please feel free to share them without hesitation.
Should I learn Java and Linux before starting with Hadoop?
With AWS Snowmobile being retired, what do people think are the best methods for uploading PB+ scale Hadoop datasets into the cloud?
So I am playing around, trying to create a proper Kerberized Hadoop installation. I have a namenode, a secondary node, and 3 datanodes, and I thought I had got it to work. It does, kind of. I have kinit'ed all my keytabs, and the cluster starts up. I have compiled jsvc, which starts the datanodes as root and then drops down to the hdfs account. I can see Hadoop running on all 5 VMs, so things looked good, or so I thought.
Looking in the logs on a datanode, I see this error while the cluster runs, for about half an hour, until I stop it:
2024-04-24 16:14:14,376 WARN org.apache.hadoop.ipc.Client: Couldn't setup connection for dn/doop3.myDomain.tld@HADOOP.KERB to nnode.myDomain.tld/192.168.0.160:8020 org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed
First off, I get the same error on all 3 datanodes, and I checked that there is an actual connection with ncat on the datanode, like this: 'nc nnode.myDomain.tld 8020', and I connect fine.
So obviously I worry that my Kerberos setup is not working. But the nodes will not start up if the keytab file is not working. So in order to start, the namenode and the datanode do a Kerberos login, and it works. And then it stops working(?)
My keytabs looks like the Hadoop documentation: [https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS](https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-common/SecureMode.html#HDFS)
On my namenode (ok, I regret having the hdfs/-principal in there, but not referenced so w/e):
klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab
Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
3 04/22/2024 15:29:09 host/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
3 04/22/2024 15:29:09 nn/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
2 04/22/2024 15:29:09 hdfs/nnode.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
And here on my datanode:
klist -etk /opt/hadoop/etc/hadoop/hdfs.keytab
Keytab name: FILE:/opt/hadoop/etc/hadoop/hdfs.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
3 04/22/2024 14:06:03 dn/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha384-192)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha256-128)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes256-cts-hmac-sha1-96)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (aes128-cts-hmac-sha1-96)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia256-cts-cmac)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (camellia128-cts-cmac)
4 04/22/2024 14:06:03 host/doop3.myDomain.tld@HADOOP.KERB (DEPRECATED:arcfour-hmac)
On the datanode and namenode, checking the principals with kinit -t as mentioned in this article [https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429](https://community.cloudera.com/t5/Support-Questions/Cloudera-Kerberos-GSS-initiate-failed/m-p/65429) gives no error, and as I said, the nodes start, so the initial Kerberos checks are accepted.
Again reading the error, I can't understand what it *actually* tells me. The cluster seems to continue to stay running until I shut it down. I have had it running for like half an hour, before I stopped it.
I thought of perhaps adding the credentials from all 5 VMs into keytab and just kinit all of it on all of them, but it doesn't seem reasonable.
This error is mentioned many times in google searches but nothing I find matches my scenario or fixes my issue.
hdfs-site.xml and core-site.xml on the 2 nodes are shown here, instead of making the post even longer: [https://pastebin.com/QLT6GqVd](https://pastebin.com/QLT6GqVd)
Any clues about what the error expects me to look into are much appreciated. I have tried to follow Hadoop's Kerberos documentation, which is the basis of my setup, if that matters.
I am trying to configure Apache Ranger usersync with Unix, and I am stuck at this point.
After I run this: sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ ./setup.sh
this error pops up:
teka@t3:/usr/local/ranger-usersync$ sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 ./setup.sh
[sudo] password for teka:
INFO: moving [/etc/ranger/usersync/conf/java_home.sh] to [/etc/ranger/usersync/conf/.java_home.sh.28032024144333] .......
Direct Key not found:SYNC_GROUP_USER_MAP_SYNC_ENABLED
Direct Key not found:hadoop_conf
Direct Key not found:ranger_base_dir
Direct Key not found:USERSYNC_PID_DIR_PATH
Direct Key not found:rangerUsersync_password
Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper
at org.apache.ranger.credentialapi.CredentialReader.getDecryptedString(CredentialReader.java:39)
at org.apache.ranger.credentialapi.buildks.createCredential(buildks.java:87)
at org.apache.ranger.credentialapi.buildks.main(buildks.java:41)
Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 3 more
ERROR: Unable update the JCKSFile(/etc/ranger/usersync/conf/rangerusersync.jceks) for aliasName (usersync.ssl.key.password)
Can anyone help me with that?
Tools I am using:
Host device: MacBook M1
Guest device: Ubuntu 20.04 LTS
Apache Ranger: 2.4 (built from source)
Hey everyone,
I'm currently in my 4th semester of data science, and while I've covered a fair bit of ground in terms of programming languages like C++ and Python (with a focus on numpy, pandas, and basic machine learning), I'm finding myself hitting a roadblock when it comes to diving deeper into big data concepts.
In my current semester, I'm taking a course on the fundamentals of Big Data. Unfortunately, the faculty at my university isn't providing the level of instruction I need to fully grasp the concepts. We're tackling algorithms like LSH, PageRank, and delving into Hadoop (primarily mapreduce for now), but I'm struggling to translate this knowledge into practical coding skills. For instance, I'm having difficulty writing code for mappers and reducers in Hadoop, and I feel lost when it comes to utilizing clusters and master-slave nodes effectively.
To add to the challenge, we've been tasked with building a search engine using mapreduce in Hadoop, which requires understanding concepts like IDF, TF, and more – all of which we're expected to learn on our own within a tight deadline of 10 days.
I'm reaching out to seek guidance on how to navigate this situation. How can I set myself on a path to learn big data in a more effective manner, considering my time constraints? My goal is to be able to land an internship or entry-level position in the data science market within the next 6-12 months.
Additionally, any tips on approaching this specific assignment would be immensely helpful. How should I go about tackling the task of building a search engine within the given timeframe, given my current level of understanding and the resources available?
Any guidance, advice, or resources you can offer would be greatly appreciated. Thank you in advance for your help!
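For the search-engine assignment, the core of TF-IDF can be sketched as a chain of map and reduce steps. The example below simulates them in plain Python on a made-up three-document corpus. The function names, the corpus, and the particular weighting (raw term count for TF, idf = log(N / df)) are assumptions for illustration, and other TF-IDF variants exist. On a real cluster each stage would be its own MapReduce job or Hadoop Streaming script.

```python
import math
from collections import defaultdict

# Made-up corpus: document id -> text.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "mapreduce processes data on hadoop",
    "d3": "search engines rank documents",
}


def map_tf(doc_id, text):
    """Map step: emit ((term, doc_id), 1) for every token in a document."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def reduce_sum(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Stage 1: term frequency per (term, document).
tf = reduce_sum(
    pair for doc_id, text in docs.items() for pair in map_tf(doc_id, text)
)

# Stage 2: document frequency per term; each (term, doc) key counts once.
df = reduce_sum((term, 1) for (term, _doc_id) in tf)

# Stage 3: combine into TF-IDF weights.
n_docs = len(docs)
tfidf = {
    (term, doc_id): count * math.log(n_docs / df[term])
    for (term, doc_id), count in tf.items()
}


def search(query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for (indexed_term, doc_id), weight in tfidf.items():
            if indexed_term == term:
                scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(search("hadoop data"))
```

The scan inside `search` is linear in the index size, which is fine for a toy; a real implementation would build an inverted index keyed by term.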
Hi guys, I have been trying to run Apache Hadoop (3.3.1) on my M1 Pro machine and I keep getting the error "Cannot set priority of namenode process XXXXX". My understanding is that macOS is not allowing the background process to be invoked. Is there any possible fix for this?
Hi guys,
Long story short: we are running a big Hadoop cluster with lots of files.
Currently the NameNode has 20 GB of heap, which is almost full the whole time, with some long garbage-collection cycles freeing up little to no memory.
Is anybody running NameNodes with 24 or 32 GB of heap?
Is any particular tuning needed?
Regards
Scalac | Big Data Engineer (with Spark) | Poland | Gdańsk or remote | Full time | 20 000 to 24 000 PLN net/month on B2B (or equivalent in USD/EUR)
Who are we looking for?
We are looking for a Big Data Engineer with Spark who will be working on an external project in the credit risk domain. You should have expertise in the following technologies:
- At least 4 years of experience with Scala and Spark
- Excellent understanding of Hadoop
- Jenkins, HQL (Hive Queries), Oozie, Shell scripting, GIT, Splunk
As a Big Data Engineer, you will:
- Work on an external project and develop an application that is based on the Hadoop platform.
- Work with an international team of specialists.
- Design and implement database systems.
- Implement business logic based on the established requirements.
- Ensure the high quality of the delivered software code.
- Independently make decisions, even in high-risk situations.
Apply here: https://scalac.io/careers/senior-bigdata-engineer/
As the title suggests, I am new to Hadoop, and my instructor gave me a task to access it via Eclipse, something called accessing it via the Java API. I've searched so many videos, but most of them cover word-count problems and don't solve my problem. Any suggestions?
My company is exploring Cirata for a 5 PB data migration to Azure. The technology (centered on the Paxos algorithm) seems very impressive for large, unstructured datasets, but I'm not sure. Does anyone have any experience using them and any thoughts they would be willing to share?
Thanks in advance.
So I see lots of people dumping on Hadoop in general in this sub, but I feel a lot of the criticism is really directed at YARN. I am wondering if that is also true for HDFS. Are there any on-prem storage alternatives that can scale to, say, 50 PB or more? Is there anything else that has equal or better performance and lower disk usage with equal or better resiliency, especially factoring in HDFS erasure coding with roughly 1.5x size on disk? Just curious what others are doing for storing large amounts of semi-structured data in 2024. Specifically, I'm dealing with a wide variety of data ranging from a few kilobytes to gigabytes per record.
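The roughly 1.5x figure follows from the Reed-Solomon layout: with 6 data units and 3 parity units (the RS-6-3 scheme that HDFS ships as a built-in erasure coding policy), raw usage is (6 + 3) / 6 = 1.5 times the logical size, versus 3.0 for three-way replication. A small arithmetic sketch, with the 50 PB figure taken from the question and the helper names invented for illustration:

```python
def ec_overhead(data_units, parity_units):
    """Raw-to-logical storage ratio for a (data, parity) erasure coding layout."""
    return (data_units + parity_units) / data_units


def replication_overhead(replicas):
    """Raw-to-logical storage ratio for plain N-way replication."""
    return float(replicas)


logical_pb = 50  # logical data size from the question, in PB

rs_6_3 = ec_overhead(6, 3)
triple = replication_overhead(3)

print(f"RS(6,3):        {rs_6_3:.2f}x -> {logical_pb * rs_6_3:.0f} PB raw")
print(f"3x replication: {triple:.2f}x -> {logical_pb * triple:.0f} PB raw")
```

Under these assumptions, 50 PB of logical data needs about 75 PB of raw disk with RS(6,3) and about 150 PB with three replicas, so any alternative would need to be compared against the 1.5x number, not the 3x one.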
Hi guys, it's my first time using Hive and I just set it up following a Udemy course guideline. I got an error saying that schemaTool failed due to a Hive exception.
Error: Syntax error: Encountered "statement_timeout" at line 1, column 5. (state=42X01,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
Can someone help me with this? I also followed some Stack Overflow links to troubleshoot, and they did not work, even after removing the metastore file and re-initialising it.
Please help. Thank you for your time and patience. Your friendly neighborhood big data noob!!!
Hello all, I was wondering, from your experience in the industry: at big companies (in terms of market leadership, not only size), is the Java approach to writing MapReduce jobs more popular, or the Hadoop Streaming approach? I would like to know whether I need to brush up my Java skills or can stick with the Python streaming approach in order to promote myself as a capable Hadoop MapReduce practitioner.
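For context, the Hadoop Streaming approach means the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with Hadoop sorting the mapper output by key in between. A minimal word-count sketch in Python, simulated locally: the function names and sample input are made up, and in a real job the two functions would live in separate scripts passed to the streaming jar through its -mapper and -reducer options.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a streaming mapper would print."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; relies on the input being sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _word, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the job: map, then sort (the shuffle phase), then reduce.
sample = ["Hadoop streaming with Python", "Java or Python for Hadoop"]
shuffled = sorted(mapper(sample))
result = list(reducer(shuffled))
print("\n".join(result))
```

The same logic in Java would use the Mapper and Reducer classes from the MapReduce API; the streaming version trades some performance and type safety for being language-agnostic.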
I have been a construction surveyor for 17 years and know CAD, desktop GIS, some programming, point clouds, and photogrammetry.
I wonder whether databases with Hadoop would be useful to learn.
So, my teacher gave us an activity that uses Hadoop, but he never really taught us how to use it, and I can't find any tutorial on how to do it. Can someone here help me? I don't even know how to start the program. The activity is the following: As you noted, this unit does not have self-correction activities. Instead, a more practical activity is proposed. Assuming you already have the Hadoop platform installed, as well as Mahout, you will be able to carry out the experiments proposed here, for which a Reuters text base is available.
The idea of the activity is for you to run the kmeans algorithm using one of the folders with the texts, and analyze the result of the algorithm. Observe the clusters generated, and whether the subjects are in fact related to each other. If you want to use other text bases, the sequence of commands should work.
Below is the example and sequence of commands used: Base Reuters C50train
hadoop fs -copyFromLocal C50/ /
./mahout seqdirectory -i /C50/C50train -o /seqreuters -xm sequential
./mahout seq2sparse -i /seqreuters -o /train-sparse
./mahout kmeans -i /train-sparse/tfidf-vectors/ -c /kmeans-train-clusters -o /train-clusters-final -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 10 -ow
./mahout clusterdump -d /train-sparse/dictionary.file-0 -dt sequencefile -i /train-clusters-final/clusters-10-final -n 10 -b 100 -o ~/saida_clusters.txt -p /train-clusters-final/clustered-points