2,802 Subscribers

What to choose: Cassandra (especially on JDK 21) or ScyllaDB with Golang

I want to build a massive real estate listing portal. I'm considering which database to use, Cassandra or ScyllaDB, with Golang for the back end. I need a highly available, low-latency, high-performance database.

Has anyone tested these, or does anyone have reliable data on access times and the concurrent workload these databases can handle in their latest versions? I'm specifically thinking about Cassandra running on JDK 21.

What I like about Cassandra:

New or planned features
Open source

What I don't like about Cassandra:

Garbage collection and the issues it causes
Not fully utilizing the power of the latest servers, unlike ScyllaDB

What I like about ScyllaDB:

Optimal hardware utilization – for example, a 3-node cluster can already be an extremely powerful database.
Impressive access times and the ability to handle large concurrent workloads
Lower monitoring/maintenance demands (more automation)
The charybdis package provides helpers for low-code integration with ScyllaDB (GOLANG)

What I don't like about ScyllaDB:

Change in strategy, licensing, and the end of the open-source version
Lack of certain features available in Cassandra

Is there an alternative to the charybdis package (ScyllaDB-Golang helper) for Cassandra?

Does anyone have reliable info or tests on how these two perform? There is very little information available, and what exists is not very reliable (based on older versions, etc.) to prove that one is better :)

11 Comments

2025/01/18
13:05 UTC

Need guest access invite to ASF Slack workspace

Hi All,

I have started looking into contributing to open source Apache Cassandra, and to get started I need access to the Slack channel and the Jira dashboard.

I don't have an apache.org email address, so the only other way to get access to the Slack channels is as a Single-Channel Guest, and for that an existing user needs to send an invite. Can someone please send an ASF Slack workspace invite so I can get started? My email address: pawanshaiitd@gmail.com. Once done, I will update here.

Thanks

2 Comments

2025/01/13
09:31 UTC

Understanding Cassandra codebase & architecture

I am a Java developer with most of my experience in framework-based applications. I want to dip my toes into open source and understand the architecture and codebase of Cassandra. But when I start, it seems like a huge task, and there is so much of the code I don't understand (possibly because I have no exposure to low-level programming). What path would veteran Cassandra contributors and developers suggest I take?

4 Comments

2024/12/20
14:14 UTC

Select by objectId and delete by age

Getting frustrated! I want a Cassandra table keyed by objectId, but we also want to delete the old entries, so there's a day number (imjd) as well. How can I make a table which will allow both of these:

SELECT * FROM table WHERE objectId=1234567

and

DELETE FROM table WHERE imjd < 60000

I have tried many different variations but no success.
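One possible direction, sketched here under assumptions: a CQL DELETE has to name the partition key, so a bare `imjd < 60000` delete does not fit a table partitioned by objectId. An alternative is to keep objectId as the partition key and let old rows expire through a per-row TTL. The sketch below assumes imjd is an integer Modified Julian Date (MJD 40587 is 1970-01-01); the table name, the `payload` column, and the 30-day retention are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

MJD_UNIX_EPOCH = 40587      # MJD of 1970-01-01 (assumes imjd is an integer MJD)
SECONDS_PER_DAY = 86_400


def ttl_seconds(imjd: int, retention_days: int, now: datetime) -> Optional[int]:
    """Seconds until a row from day `imjd` should expire, or None if it already has.

    None matters because `USING TTL 0` means "never expire" in CQL,
    so an already-expired row should be skipped rather than inserted.
    """
    expires_at = (imjd + retention_days - MJD_UNIX_EPOCH) * SECONDS_PER_DAY
    remaining = int(expires_at - now.timestamp())
    return remaining if remaining > 0 else None


def insert_statement(table: str, ttl: int) -> str:
    # Hypothetical table and column names, for illustration only.
    return (
        f"INSERT INTO {table} (objectId, imjd, payload) "
        f"VALUES (?, ?, ?) USING TTL {ttl}"
    )


now = datetime(2023, 2, 25, tzinfo=timezone.utc)   # this date is MJD 60000
ttl = ttl_seconds(imjd=60000, retention_days=30, now=now)
if ttl is not None:
    print(insert_statement("objects", ttl))
```

With this layout, `SELECT * FROM table WHERE objectId=1234567` stays a single-partition read, and no range delete is needed at all.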

7 Comments

2024/12/09
11:59 UTC

I just upgraded my Datastax DSE/Cassandra single node to a cluster, here's how

Hey folks! Following up from my single cassandra/Datastax DSE node setup, here's how I created a two-node cluster.

What I'm Working With:

Two Datastax DSE (Cassandra) nodes running on Ubuntu 24.10 VMs
DSE installed under 'home/user/node1' and 'home/user/node2' for the two nodes

Here's the step-by-step:

1. First, Stop Everything

Stop Cassandra on both nodes:

$ node1/bin/nodetool stopdaemon

2. Clean Slate

Remove old data from both nodes:

sudo rm -rf /var/lib/cassandra/*

3. The Important Part - cassandra.yaml Config 🔑

Find your cassandra.yaml file (mine was at 'node1/resources/cassandra/conf/cassandra.yaml')
Here's what you need to change:

A. Set the same cluster name on both nodes

cluster_name: 'YourClusterName'

B. Seed Provider Setup (this is crucial!)

seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.47.128"    # Use Node 1's IP here

Pro tip: Make sure Node 2 also points to Node 1's IP in its seeds config!

C. Network Settings

For Node 1:

listen_address: 192.168.47.128
rpc_address: 192.168.47.128

For Node 2:

listen_address: 192.168.47.129 
rpc_address: 192.168.47.129

4. Open Firewall Ports

$ sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
$ sudo iptables -A INPUT -p tcp --dport 9042 -j ACCEPT

5. Fire It Up!

6. Check If It Worked

$ bin/nodetool status

You should see something like this:

Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address         Load        Tokens  Owns    Host ID                               Rack
UN  192.168.47.128  123.59 KiB  1       100.0%  2f3f874e-74d1-435d-b124-61f249f54d97  rack1
UN  192.168.47.129  204.15 KiB  1       100.0%  76b05689-845c-43e5-9606-50a06d68df14  rack1

Bonus: Checking Data Distribution

Want to see how your data is spread across nodes? Try this in CQL shell:

cqlsh:killrvideo> SELECT token(tag), tag FROM videos_by_tag;

You can also check which node owns what data:

$ node/bin/nodetool getendpoints keyspacename tablename 'partition_key_value'

# Example:
$ node/bin/nodetool getendpoints killrvideo videos_by_tag 'cassandra'

That's it! Let me know if you run into any issues or have questions! 🚀

5 Comments

2024/11/15
04:18 UTC

I just created a Datastax DSE/Cassandra test node on VM, here's how

I'm using a Mac M2 Pro, so the basic setup is VMware Fusion Pro + Ubuntu Server 24.10 for ARM + Datastax DSE (Cassandra 4)

Part 1 – PREPARATION (not mentioned in the official docs, but essential to inexperienced users, LIKE ME)

Download VMware Fusion Pro – Now it’s free for personal use!

https://blogs.vmware.com/teamfusion/2024/05/fusion-pro-now-available-free-for-personal-use.html

Download DSE6.9 from Datastax website, it is a bin.tar.gz file

https://www.datastax.com/products/datastax-enterprise

Download a Linux ISO for the VM setup. Be aware that for a Mac with a non-x86 chip, you have to download an ARM architecture ISO. For my test, I downloaded the Ubuntu Server 24.10 image from

https://ubuntu.com/download/server/arm

Create an Ubuntu VM from the ISO image. The recommended configuration for a single-node DSE is 2 cores, 8 GB RAM, and a 20 GB drive; the DSE installation file itself is around 2 GB.

SCP the locally downloaded DSE installation file to the VM, e.g.

user@MacBook-user% scp dse-6.9.3-bin.tar.gz user@IP:/home/username

Part 2 - INSTALLATION

Once the file is transmitted, we can install test DSE following the official Doc steps.

https://docs.datastax.com/en/dse/6.9/installing/basic-install.html

After the installation, enter the “dse-version/bin” directory to check the DSE node's running status with the “./nodetool status” or “./dsetool status” command.

Before running “./cqlsh” to start the querying fun, note that DSE 6.9.3 currently supports only Python 3.8 to 3.11, while the default Python packaged with Ubuntu 24.10 is Python 3.12. You have to install an earlier Python version and point the cqlsh Python interpreter environment variable at it. In my case, the command line is:

export CQLSH_PYTHON=python3.11

Start “./cqlsh” from the installation directory, if you can see "cqlsh>" prompt, that means you're all set!

0 Comments

2024/11/12
02:31 UTC

Why does my read operation go to SSTable when updated data is in Memtable?

I have data in the format of (id, data), such as (1, "someDataS").

Initially, when I insert data, it is stored in the Memtable, and reads pull directly from the Memtable.

After more data is inserted, it flushes to the SSTable. At this point, reads start retrieving the data from the SSTable, which makes sense.

However, I’m confused about what happens after updating older data that is already in the SSTable.

For example, if I update a data item that is currently in the SSTable, I expect the Memtable to hold the new version, while the older version remains in the SSTable. But when I perform a read after this update, it still checks the SSTable, even though a newer version should be in the Memtable.

Question: Why doesn’t the read operation return the updated data directly from the Memtable, where the latest version is stored? Is there a reason it still checks the SSTable?

I used the query tracing feature to debug it. It led me to believe the relevant code is in the following file: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java

More specifically, the "queryMemtableAndSSTablesInTimestampOrder" method. To me, it looks like it always checks the SSTable.
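One way to picture why a read may still consult the SSTable: an update writes only the columns it touches, and write timestamps can be supplied by the client, so the Memtable alone cannot prove it holds the newest cell for every column. The sketch below is a simplified model of a per-column merge by timestamp, not Cassandra's actual read path; the `owner` column and the timestamps are made up for illustration.

```python
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, str]      # (write timestamp, value)
Row = Dict[str, Cell]       # column name -> cell


def merge_row(sources: Iterable[Row]) -> Row:
    """Keep, for each column, the cell with the highest timestamp across all sources."""
    merged: Row = {}
    for row in sources:
        for column, cell in row.items():
            if column not in merged or cell[0] > merged[column][0]:
                merged[column] = cell
    return merged


# Older, flushed version of the row (hypothetical second column `owner`).
sstable: Row = {"data": (100, "someDataS"), "owner": (100, "alice")}
# The later UPDATE touched only `data`, so the Memtable holds just that cell.
memtable: Row = {"data": (200, "someDataU")}

result = merge_row([memtable, sstable])
print(result)
```

In this model the new value of `data` comes from the Memtable, while `owner` can only be answered from the SSTable, which is why both sources are merged.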

7 Comments

2024/10/30
18:00 UTC

Concerned - Ideal Data size ratio to expanding nodes?

I currently have two Apache Cassandra nodes running on EC2, each with 300 GB of RAM and 120 TB of storage, with about 40 TB of free space left on each. My admin team hasn't raised any concerns about maintaining the current node sizes or expanding to improve performance, but I'm wondering if there's a general guideline or recommendation for how many nodes a Cassandra cluster should have and what the ideal node size might be for my setup? NOTE: the data is read and populated by Geomesa and is using geospatial queries. Should I be looking into adding more nodes or adjusting the current configuration? Any advice or best practices would be appreciated!

3 Comments

2024/10/29
15:28 UTC

Preparing for the DataStax Architect Certification Exam in Apache Cassandra

I'm gearing up to take the DataStax Architect Certification for Apache Cassandra exam this Friday. I've already completed and passed the Developer and Admin exams with a comfortable margin (Reference by certfun.com). I found those exams to be quite straightforward and passed them without much trouble.

Has anyone here taken the Architect exam? How would you rate its difficulty compared to the other exams? Also, does anyone know the passing criteria? Is there a specific reason why the exam is allocated 90 minutes?

EDIT: I passed! The Architect exam was definitely more challenging than the Admin or Developer exams. While the passing criteria isn't officially published, I suspect it’s higher than 70%—I’m thinking around 75%.

2 Comments

2024/10/26
09:13 UTC

Are Apache Cassandra's and Datastax Cassandra's SAI implementations the same?

I am currently benchmarking the storage-attached index (SAI) released in Apache Cassandra version 5. It does not come anywhere near Datastax Cassandra's SAI.

Can someone please confirm if both implementations are the same??

TIA!

2 Comments

2024/10/16
15:45 UTC

Need help for a tutorial, pleaseee

I am a Computer Science student and I had to choose for my license between MongoDB and Apache Cassandra, and you already know what I have chosen. I have managed to set up a local Cassandra node using the prerequisites from the documentation, but I can't get the PHP driver to work.

What I am looking for: a tutorial on Udemy (or any other platform) that covers Cassandra + connecting to a backend using PHP + some front end (optional), as I already know HTML + CSS + JS.

Thank you very much guys! 🖤

2 Comments

2024/10/14
09:53 UTC

Cassandra or Scylladb

We have a use case requiring a wide-column database with multi-datacenter support, high availability, and low-latency performance. I’m trying to determine whether Apache Cassandra or ScyllaDB is a better fit. While I’m aware that Apache Cassandra has a more extensive user base with proven stability, ScyllaDB promises lower latency and potentially reduced costs.

Given that both databases support our architecture needs, I would like to know if you’ve had experience with both and, based on that, which one you would recommend.

36 Comments

2024/10/10
15:41 UTC

DSE DBA certification exam

Does anyone have experience with the DataStax Enterprise (DSE) Administration Certification exam? If so, how was your experience, and how hard was the exam? I’m also curious about the exam format—how is it taken, and what types of questions are asked? Any details on the difficulty level and preparation tips would be really helpful. Thanks!

4 Comments

2024/10/03
08:17 UTC

Survey on data formats [responses welcome]

I'm currently conducting a survey to collect insights into user expectations regarding comparing various data formats. Your expertise in the field would be incredibly valuable to this research.

The survey should take no more than 10 minutes to complete. You can access it here: https://forms.gle/K9AR6gbyjCNCk4FL6

I would greatly appreciate your response!

0 Comments

2024/09/18
19:03 UTC

RPM Packages for Cassandra

Hello Everyone,

I am trying to install Cassandra on RHEL 8 using RPM packages, but I couldn't find the packages anywhere.

If possible, please share links to download RPM packages of Cassandra.

1 Comment

2024/09/16
06:33 UTC

Cassandra configurations for read heavy workload

I have a Cassandra cluster with 3 nodes and a replication factor of 3. My workload is read-heavy with comparatively few writes. Can I set write consistency to ALL and read consistency to ONE to achieve nominal consistency and availability? My understanding is that reads would then return the latest version of the data with lower latency. If I'm wrong somewhere, how can I configure the cluster (even by adding nodes) to get high throughput with low latency?
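The trade-off in this question can be checked with the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the replicas required for the write plus the replicas required for the read exceed the replication factor. The sketch below is a small calculator for that rule; the function names are illustrative and not part of any driver.

```python
def replicas_required(level: str, rf: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    levels = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[level]


def reads_see_latest_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when the write and read replica sets must overlap (W + R > RF)."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


def tolerated_failures(level: str, rf: int) -> int:
    """Replicas that can be down while requests at this level still succeed."""
    return rf - replicas_required(level, rf)


RF = 3
for write_cl, read_cl in [("ALL", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ONE")]:
    print(
        write_cl, read_cl,
        "overlap:", reads_see_latest_write(write_cl, read_cl, RF),
        "write tolerates:", tolerated_failures(write_cl, RF),
        "read tolerates:", tolerated_failures(read_cl, RF),
    )
```

Under these assumptions, ALL/ONE does give overlapping reads with RF 3, but writes tolerate no replica being down, whereas QUORUM/QUORUM keeps the overlap and tolerates one failed replica on both paths.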

5 Comments

2024/08/29
12:21 UTC

Cassandra latency configuration.

I have a Cassandra cluster with 3 nodes and a replication factor of 3. I have a scenario where 3 parallel update requests with the same timestamp for a row arrive at the cluster's coordinator node, and each of them could cause a conflict when I read the row after updating. How can I handle this situation? Please suggest configurations that can be tuned for this purpose.
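For context, Cassandra resolves conflicting cells by last-write-wins on the write timestamp. As far as I understand the tie-breaking, when timestamps are equal a delete beats a live value, and otherwise the byte-wise greater value wins. The sketch below is a simplified model of that rule (it ignores TTLs and other details) and is not the project's actual reconcile code.

```python
from functools import reduce
from itertools import permutations
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    timestamp: int             # write timestamp in microseconds
    value: Optional[bytes]     # None models a tombstone (delete)


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick the winning cell: newest timestamp, then tombstone, then greater value."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    if (a.value is None) != (b.value is None):
        return a if a.value is None else b
    if a.value is None:
        return a               # both are tombstones
    return a if a.value >= b.value else b


# Three parallel updates carrying the same timestamp.
updates = [Cell(1_000, b"x"), Cell(1_000, b"z"), Cell(1_000, b"y")]

# Whatever order they arrive in, the same cell wins.
winners = {reduce(reconcile, order) for order in permutations(updates)}
print(winners)
```

In this model the outcome is deterministic, so replicas converge, but the winner is chosen by value rather than by arrival order. If the application needs a specific update to win, giving each write a distinct timestamp or using lightweight transactions (conditional `IF` updates) are the usual options to look at.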

11 Comments

2024/08/29
06:18 UTC

How Cassandra outperforms MySQL

I have a Cassandra cluster with a single DC and 3 nodes, compared against a MySQL architecture with 1 master and 2 followers. I expect around 10M reads in 3 hours and 3M writes and updates in 3 hours (with a replication factor). I have no complex queries and a single primary key. What configuration can I use in my cluster to improve performance and latency?
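Before tuning anything, it may help to turn those totals into average rates. The sketch below only does that arithmetic; the replication factor of 3 is an assumption (the post does not state one), and real traffic will have peaks above the average.

```python
WINDOW_SECONDS = 3 * 60 * 60        # the 3-hour window from the post
READS = 10_000_000
WRITES = 3_000_000
ASSUMED_RF = 3                      # assumption: not stated in the post


def per_second(operations: int, window: int = WINDOW_SECONDS) -> float:
    """Average operations per second over the window."""
    return operations / window


reads_per_second = per_second(READS)
writes_per_second = per_second(WRITES)
# Each client write is applied on every replica, so replica-level writes scale with RF.
replica_writes_per_second = writes_per_second * ASSUMED_RF

print(f"reads/s:          {reads_per_second:.0f}")
print(f"writes/s:         {writes_per_second:.0f}")
print(f"replica writes/s: {replica_writes_per_second:.0f}")
```

That works out to roughly 926 reads and 278 writes per second on average across the cluster.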

10 Comments

2024/08/29
04:09 UTC

5.0 Webinar

Hey folks. I'm part of the C* project, and I'm hosting a 1 hour webinar + 30 min of Q&A on Thursday morning, 9am PDT, to show off new features coming in 5.0. I'll be covering:

New storage engine improvements: SAI, Trie Memtables, new BTI format w/ Trie indexes, vector search, new Unified Compaction Strategy
Security improvements: Dynamic Data Masking, CIDR authorizer
Improved operator control over what users can do with guardrails

I hope to see you there! Link to sign up is here: https://streamyard.com/watch/i8hUyrMzKEQ9

0 Comments

2024/08/20
14:19 UTC

Row level isolation guarantees

I know that Cassandra guarantees a row level isolation in a single replica, which means that other requests see either the entire update to that row applied or none. But does this guarantee that there are no dirty writes and dirty reads (in the scope of that row in that replica)?

6 Comments

2024/08/14
14:03 UTC

Question regarding first time Cassandra deployment

Hi All,

Want to learn Cassandra a bit by implementing my own deployment on my home server. I've currently got an HP MiniDesk G3 with 32GB ram, 2TB SSD storage, 12TB HDD (6x 2TB WDGreen) storage running Proxmox. My plan was to use this as my "database" for the other components in the server. (Few more HP Minis running a few services - nothing crazy)

Now, the ultimate goal of this is to learn how to deploy Cassandra at scale - given... that is kind of what it does. I'm less concerned with actual HA, than I am simulated HA given my hardware constraints. Let me know if the below sounds crazy.

Was thinking of spinning up 3x LXC Cassandra nodes on the one machine, and provisioning each one of them a 2TB HDD. (Potentially splitting up partitions of the 2TB SSD for the write log... but, need to get through the basics here) That would allow me to not have to RAID10 across the rest for replication, and then can offload snapshots to Azure or something to make sure whatever data I generate I don't lose.

I do have 3 other HP Minis (8GB Ram, 500GB NVMe) but - believe the overhead of running Ceph to get the HDD storage to the other nodes would be too much for the small cluster + Cassandra on three separate pieces of hardware.

Was thinking if I tune the heap size and let them fight over cores I'd be ok? (4x cores per i5-6500 in each machine)

Am I nuts? Anything you'd do differently? Thanks in advance!

-Mousse

6 Comments

2024/08/13
20:12 UTC

Read repairs and read consistency levels

We can read the following note in the documentation:

In read repair, Cassandra sends a digest request to each replica not directly involved in the read. Cassandra compares all replicas and writes the most recent version to any replica node that does not have it. If the query's consistency level is above ONE, Cassandra performs this process on all replica nodes in the foreground before the data is returned to the client. Read repair repairs any node queried by the read. This means that for a consistency level of ONE, no data is repaired because no comparison takes place. For QUORUM, only the nodes that the query touches are repaired, not all nodes.

If I understand it right, there are three cases of how a read repair can be carried out:

ONE/LOCAL_ONE - no read repairs at all
QUORUM/LOCAL_QUORUM - read repairs only for replicas that are part of the read query (but it may happen that all replicas are repaired due to read_repair_chance?)
ALL - all replicas are repaired

Does it work that way?
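The cases described in the quoted note can be modelled roughly as follows. This is a toy model of foreground read repair that only touches the replicas contacted by the read; it is not Cassandra's implementation, and the replica names and versions are made up. Note also that, as far as I know, the `read_repair_chance` table option was removed in Cassandra 4.0, so the probabilistic background case only applies to older versions.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, str]     # (write timestamp, value)


def read_with_repair(replicas: Dict[str, Version], contacted: List[str]) -> Version:
    """Return the newest version among the contacted replicas and repair only those."""
    newest = max(replicas[name] for name in contacted)
    for name in contacted:
        if replicas[name] != newest:
            replicas[name] = newest      # repair write to a stale contacted replica
    return newest


# RF = 3; only replica "a" has received the latest write.
replicas: Dict[str, Version] = {"a": (2, "new"), "b": (1, "old"), "c": (1, "old")}

# Consistency ONE: a single replica is contacted, nothing to compare, nothing repaired.
one_result = read_with_repair(replicas, ["c"])

# QUORUM: two replicas are contacted; "b" is repaired, "c" is left untouched.
quorum_result = read_with_repair(replicas, ["a", "b"])

print(one_result, quorum_result, replicas)
```

In this model a read at ALL would contact all three replicas and therefore repair all of them.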

4 Comments

2024/08/13
12:52 UTC

Cassdio: Cassandra Web Console

Cassdio is a tool designed to make database operations simpler and more efficient. With minimal setup, it supports connections to various databases and facilitates easy data processing and query execution. Cassdio offers clean code and an intuitive interface, making it accessible for both beginners and experts. For more information, visit the GitHub page.

#cassandra #webconsole #hakdang

3 Comments

2024/08/05
15:39 UTC

Throttle Medusa in local storage mode

I'm looking at Medusa to do our backups. Is it possible to throttle disk IO during backup when using the local storage mode? I have only seen options for S3 bucket throttling.

1 Comment

2024/07/29
16:00 UTC

Tool to create Cassandra labs environments in AWS using easy-cass-lab

Hey folks, I wanted to share a tool, easy-cass-lab, I've worked on for a while now that makes it easy to quickly spin up clusters in AWS. These are the same tools I've used for years as a consultant and Cassandra committer to find bugs, do performance analysis, and test C* features. Quickest way to get started is using homebrew.

https://rustyrazorblade.com/post/2024/easy-cass-lab-homebrew/

Project repo is here: https://github.com/rustyrazorblade/easy-cass-lab

Looking forward to hearing any feedback!

0 Comments

2024/07/19
19:30 UTC

What do you host on?

I'm currently working on an interface for Cassandra using ImGui with C++, to make Cassandra data easier to visualize and give better access to your database. I'm worried, though, about how most users of this database host or deploy it. I'm working on making the app use some information from Datastax, which means the user would have to submit their clientID, secret, and secure connection bundle, all provided by Datastax. I've also been trying to implement a way to connect to the DB from Docker, but nothing I've tried so far has really worked.

1 Comment

2024/06/11
20:20 UTC

A Novel Fault-Tolerant, Scalable, and Secure Distributed Database Architecture

In my PhD thesis, I have designed a novel distributed database architecture named "Parallel Committees." This architecture addresses some of the same challenges as NoSQL databases, particularly in terms of scalability and security, but it also aims to provide stronger consistency.

The thesis explores the limitations of classic consensus mechanisms such as Paxos, Raft, or PBFT, which, despite offering strong and strict consistency, suffer from low scalability due to their high time and message complexity. As a result, many systems adopt eventual consistency to achieve higher performance, though at the cost of strong consistency.
In contrast, the Parallel Committees architecture employs classic fault-tolerant consensus mechanisms to ensure strong consistency while achieving very high transactional throughput, even in large-scale networks. This architecture offers an alternative to the trade-offs typically seen in NoSQL databases.

Additionally, my dissertation includes comparisons between the Parallel Committees architecture and various distributed databases and data replication systems, including Apache Cassandra, Amazon DynamoDB, Google Bigtable, Google Spanner, and ScyllaDB.

I have prepared a video presentation outlining the proposed distributed database architecture, which you can access via the following YouTube link:

https://www.youtube.com/watch?v=EhBHfQILX1o

A narrated PowerPoint presentation is also available on ResearchGate at the following link:

https://www.researchgate.net/publication/381187113_Narrated_PowerPoint_presentation_of_the_PhD_thesis

My dissertation can be accessed on Researchgate via the following link: Ph.D. Dissertation

If needed, I can provide more detailed explanations of the problem and the proposed solution.

I would greatly appreciate feedback and comments on the distributed database architecture proposed in my PhD dissertation. Your insights and opinions are invaluable, so please feel free to share them without hesitation.

1 Comment

2024/06/09
15:24 UTC

Cassandra spark job getting stuck

We have 10-15 Spark jobs that take data from one source and push it to Cassandra, and a 15-node cluster with 32 cores and 90 GB of memory per node. We create this cluster on demand, and once Cassandra is up with all the nodes, we try to insert the data with the Spark jobs. Sometimes a job gets stuck during execution. All these Cassandra nodes are running on GKE. We face this issue frequently: it works sometimes, but most of the time it gets stuck at the last step.

5 Comments

2024/05/27
15:07 UTC

Has anyone run into this error while working with medusa-cassandra? (Please guide me.) The issue comes up when I run the medusa backup --backup-name=b11 --mode=full command

(myenv) [root@e2e-19-193 ~]# medusa backup --backup-name=b11 --mode=full
[2024-05-09 17:44:11,990] INFO: Resolving ip address
[2024-05-09 17:44:12,000] INFO: ip address to resolve 43.252.90.193
[2024-05-09 17:44:12,004] INFO: Registered backup id b11
[2024-05-09 17:44:12,005] INFO: Monitoring provider is noop
[2024-05-09 17:44:12,025] INFO: Found credentials in shared credentials file: /etc/medusa/medusa-minio-credentials
[2024-05-09 17:44:13,368] INFO: Starting backup using Stagger: None Mode: full Name: b11
[2024-05-09 17:44:13,368] INFO: Updated from existing status: -1 to new status: 0 for backup id: b11
[2024-05-09 17:44:13,369] INFO: Saving tokenmap and schema
[2024-05-09 17:44:13,758] INFO: Resolving ip address 172.16.231.75
[2024-05-09 17:44:13,758] INFO: ip address to resolve 172.16.231.75
[2024-05-09 17:44:13,762] INFO: Resolving ip address 172.16.231.63
[2024-05-09 17:44:13,763] INFO: ip address to resolve 172.16.231.63
[2024-05-09 17:44:13,767] INFO: Resolving ip address 172.16.231.72
[2024-05-09 17:44:13,767] INFO: ip address to resolve 172.16.231.72
[2024-05-09 17:44:13,770] INFO: Resolving ip address 172.16.231.75
[2024-05-09 17:44:13,770] INFO: ip address to resolve 172.16.231.75
[2024-05-09 17:52:34,499] ERROR: Issue occurred inside handle_backup Name: b11 Error: <LibcloudError in <class 'libcloud.storage.drivers.s3.S3StorageDriver'> 'Unknown error. Status code: 501'>
[2024-05-09 17:52:34,500] INFO: Updated from existing status: 0 to new status: 2 for backup id: b11
[2024-05-09 17:52:34,500] ERROR: Error occurred during backup: <LibcloudError in <class 'libcloud.storage.drivers.s3.S3StorageDriver'> 'Unknown error. Status code: 501'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_node.py", line 199, in handle_backup
    enable_md5_checks_flag, backup_name, config, monitoring)
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_node.py", line 231, in start_backup
    node_backup.schema = schema
  File "/usr/local/lib/python3.6/site-packages/medusa/storage/node_backup.py", line 137, in schema
    self._storage.storage_driver.upload_blob_from_string(self.schema_path, schema)
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/site-packages/medusa/storage/abstract_storage.py", line 68, in upload_blob_from_string
    headers=headers,
  File "/usr/local/lib/python3.6/site-packages/libcloud/storage/drivers/s3.py", line 753, in upload_object_via_stream
    storage_class=ex_storage_class)
  File "/usr/local/lib/python3.6/site-packages/libcloud/storage/drivers/s3.py", line 989, in _put_object_multipart
    headers=headers)
  File "/usr/local/lib/python3.6/site-packages/libcloud/storage/drivers/s3.py", line 573, in _initiate_multipart
    headers=headers, params=params)
  File "/usr/local/lib/python3.6/site-packages/libcloud/common/base.py", line 655, in request
    response = responseCls(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/libcloud/common/base.py", line 166, in __init__
    message=self.parse_error(),
  File "/usr/local/lib/python3.6/site-packages/libcloud/storage/drivers/s3.py", line 148, in parse_error
    driver=S3StorageDriver)
libcloud.common.types.LibcloudError: <LibcloudError in <class 'libcloud.storage.drivers.s3.S3StorageDriver'> 'Unknown error. Status code: 501'>

u/medusa u/cassandra u/dbaas u/nosql u/coloumdatabase u/distributeddatabase

2 Comments

2024/05/09
17:59 UTC