/r/bigdata


For all big data gurus everywhere, from hedge funds (quant finance) to biotech (drug discovery) to social media (Twitter), to discuss the latest trends, topics, career opportunities, and tricks of the trade!

Rules: No advertising; don't blatantly link to your own product(s). Posts must be relevant to big data technologies or discussions.

Related subreddits:

r/datascience

r/bigdatajobs

r/machinelearning

r/datagangsta


55,088 Subscribers

1

Survey on the Role of Artificial Intelligence and Big Data in Enhancing Cancer Treatment

Hello everyone, I am currently writing my dissertation on Big Data and AI. Below is the questionnaire I prepared for my primary research.

All responses will remain anonymous.

  1. Background Information

• What is your professional background? (Options: Healthcare, IT, Data Science, Education, Other)

• How familiar are you with AI and big data applications in healthcare? (Scale: Not familiar - Extremely familiar)

  2. Perceptions of AI and Big Data in Healthcare

• In your opinion, what are the most promising applications of AI and big data in healthcare?

• How do you think AI and big data can improve cancer tumor detection and treatment?

  3. Challenges and Barriers

• What do you see as the biggest challenges or barriers to implementing AI and big data solutions in healthcare settings?

• How concerned are you about privacy and security issues related to using AI and big data in healthcare? (Scale: Not concerned - Extremely concerned)

  4. Effectiveness and Outcomes

• Can you provide examples (if any) from your experience or knowledge where AI and big data have significantly improved healthcare outcomes?

• How effective do you believe AI is in personalizing cancer treatment compared to traditional methods?

  5. Future Trends

• What future developments in AI and big data do you anticipate will have the most impact on healthcare in the next 5-10 years?

• What role do you think cloud computing will play in the future of AI and big data in healthcare?

  6. Personal Insights

• What advice would you give to healthcare organizations looking to integrate AI and big data into their operations?

• What skills do you think are essential for professionals working at the intersection of AI, big data, and healthcare?

  7. Open-Ended Response

• Is there anything else you would like to add about the role of AI and big data in healthcare that has not been covered in this questionnaire?

Thank you for your time!

0 Comments
2024/04/29
02:33 UTC

2

I recorded a Python PySpark Big Data Course and uploaded it on YouTube

Hello everyone, I uploaded a PySpark course to my YouTube channel. I tried to cover a wide range of topics, including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), the DataFrame and Dataset APIs, data cleaning and preprocessing, exploratory data analysis, data transformation and manipulation, group by and window functions, user-defined functions, and machine learning with Spark MLlib. I'm leaving the link in this post, have a great day!

https://www.youtube.com/watch?v=jWZ9K1agm5Y&list=PLTsu3dft3CWiow7L7WrCd27ohlra_5PGH&index=9&t=1s

0 Comments
2024/04/28
13:52 UTC

0

Open Source SQL Databases - OLTP and OLAP Options

Are you leveraging open source SQL databases in your projects?

Check out the article here to see the options out there: https://www.datacoves.com/post/open-source-databases

Why consider Open Source SQL Databases? 🌐

  • Cost-Effectiveness: Dramatically reduce your system's total cost of ownership.
  • Flexibility and Customization: Tailor database software to meet your specific requirements.
  • Robust Community Support: Benefit from rapid updates and a wealth of community-driven enhancements.

Share your experiences or ask questions about integrating these technologies into your tech stack.

0 Comments
2024/04/24
19:02 UTC

2

Data Project - Personal Finance

0 Comments
2024/04/23
17:55 UTC

7

WAL is a broken strategy?

Hi,

I'm studying big data systems a bit.

I've come across this article from 2019, written by the founder of VictoriaMetrics, which argues that the write-ahead log (WAL) is a broken and inefficient strategy. In short, he says: flush to disk every second in an SSTable format (of your choice), and use background compaction to slowly merge the results into decent-sized blocks. He notes two systems using this strategy: VictoriaMetrics and ClickHouse.

Would love to hear some expert Big Data take on this.
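The flush-and-compact strategy the article describes can be pictured with a toy in-memory sketch (my own illustration, not VictoriaMetrics or ClickHouse code): writes buffer in a memtable with no WAL, small sorted runs are flushed periodically, and compaction merges them into one larger run.

```python
class ToyLSM:
    def __init__(self, flush_threshold=4, compact_at=3):
        self.memtable = {}              # in-memory write buffer (no WAL)
        self.runs = []                  # flushed sorted runs, oldest first
        self.flush_threshold = flush_threshold
        self.compact_at = compact_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # "Flush every second": persist the memtable as one sorted run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable.clear()
        if len(self.runs) >= self.compact_at:
            self.compact()

    def compact(self):
        # Background compaction: merge small runs into one big sorted run;
        # later runs win when keys collide.
        merged = {}
        for run in self.runs:
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run first
            d = dict(run)
            if key in d:
                return d[key]
        return None
```

The trade-off the article's critics usually raise is durability: anything still in the memtable at crash time is lost, which is exactly the window a WAL exists to cover.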

0 Comments
2024/04/23
17:17 UTC

1

Strategies for Handling Missing Values in Data Analysis

As data scientists and analysts delve into the intricate world of data, they often encounter a common challenge: gaps in the data. Values can go missing for several reasons, for instance human error, sensor failure, or lapses in data collection. Handling missing values correctly is critical, because if they are not treated properly they can seriously degrade machine learning models and statistical estimates. Click here to read more >>

0 Comments
2024/04/23
14:48 UTC

1

How can I share BigQuery reports with non-technical folks?

Want to easily share BigQuery insights with your external clients, partners, or vendors?

If complex BI tools or clunky CSV exports are your current solutions, it's time for an upgrade! Softr now integrates with BigQuery, allowing you to connect to your BigQuery database and create dedicated dashboards and reports, without coding or complex analytics tools.

Here’s what you can do:

  • Data portals: Create intuitive, customized dashboards directly within Softr, so third parties and non-technical team members don't need to master complex analytics software.
  • Secure access control: Fine-tune permissions to determine exactly what data each external user can see.

Transform the way you share your BigQuery insights.

0 Comments
2024/04/23
11:20 UTC

3

Strategies for Handling Missing Values in Data Analysis

As data scientists and analysts delve into the intricate world of data, they often encounter a common challenge: gaps in the data. Values can go missing for several reasons, for instance human error, sensor failure, or lapses in data collection. Handling missing values correctly is critical, because if they are not treated properly they can seriously degrade machine learning models and statistical estimates. This article covers the skills and methodologies data scientists need to manage missing data effectively. Click here to read more >>
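For context, the most common strategies such articles cover can be sketched in a few lines of pandas (the column names and data here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor": [1.0, np.nan, 3.0, np.nan, 5.0],
    "label":  ["a", "b", None, "b", "a"],
})

# Strategy 1: drop rows with any missing value
dropped = df.dropna()

# Strategy 2: mean imputation for a numeric column
mean_filled = df.assign(sensor=df["sensor"].fillna(df["sensor"].mean()))

# Strategy 3: forward fill (carry the last observed value forward)
ffilled = df.ffill()
```

Which strategy is appropriate depends on why the data is missing; mean imputation, for example, shrinks the column's variance, which can bias downstream models.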

0 Comments
2024/04/23
04:26 UTC


1

ClickHouse Performance Master Class – Tools and Techniques to Speed up any ClickHouse App Webinar

We'll discuss tools to evaluate performance, including ClickHouse system tables and EXPLAIN. We'll demonstrate how to evaluate and improve performance for common query use cases, ranging from MergeTree data on block storage to Parquet files in data lakes. Join our webinar to become a master at diagnosing query bottlenecks and curing them quickly. https://hubs.la/Q02t2dtG0

0 Comments
2024/04/22
13:11 UTC

2

Data Integration Unlocked: From Silos to Strategy for Competitive Success

0 Comments
2024/04/22
10:13 UTC


1

Seeking Data Sets of 2023 Headlines from Major Publications

0 Comments
2024/04/21
19:41 UTC

2

The Future of Healthcare: How AI is Revolutionizing Medical Diagnostics

Hey everyone, stumbled upon this fascinating article discussing the urgent need for AI integration in healthcare diagnostics. In today's rapidly evolving world, it's crucial for the healthcare sector to adapt, and this piece dives deep into why AI is the way forward.

Check it out: The Integration of AI in Healthcare: Enhancing Diagnostic Accuracy and Patient Outcomes

From highlighting the burden of diagnostic errors to exploring the promise of AI in addressing these challenges, this article offers a comprehensive overview. It delves into real-world examples, showcasing how AI is already making a tangible difference in patient outcomes.

What's particularly intriguing is the discussion on upcoming innovations in AI and the skills healthcare professionals need to develop to thrive in this AI-integrated environment.

Definitely worth a read for anyone interested in the intersection of technology and healthcare! Let's spark some discussions on how AI is shaping the future of medicine.

0 Comments
2024/04/21
09:52 UTC

3

Reporting system for microservices

Hi, we are trying to implement a reporting system for our microservices: our goal is to build a business intelligence service that correlates data between multiple services.

Right now, for legacy services, there is an ETL service that reads data (via SQL queries) from source databases and stores it in a data warehouse, where it is enriched and prepared for the end user.

For microservices, and in general for everything that is not legacy, we want to avoid this approach, because multiple kinds of databases are involved (e.g. PostgreSQL and MongoDB) and our ETL service needs to read a high volume of data every day, including data that has not changed (very slow and inefficient).

Because the people on the "data team" (who manage ETL jobs and business intelligence) are not on the dev teams, every time a dev team decides to change something (e.g. schema, database engine, etc.), our ETL service stops working, and this requires a lot of coordination and sharing of low-level implementation details.

We want the same level of backwards compatibility and abstraction that we have for service-to-service interaction (REST APIs), but for data, delegating to each dev team the maintenance of that backwards-compatibility layer (its contract with the data team), also because direct access to source databases and implementation details is an anti-pattern for microservices.

A first test was made using Debezium to stream changes from the source databases to Kafka and then to S3 (using Iceberg as the table format) in a kind of data lake, with Trino as the query engine. This approach seems very experimental and difficult to maintain and operate (e.g. what happens with a huge amount of inserted/updated data?). In addition, it is not clear how to maintain the data backwards-compatibility/abstraction layer: one possible approach could be to delegate it to the dev teams by allowing them to create views on the data lake.

Any ideas/suggestions?
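One way to picture the "data contract" idea from the post is a dev-team-owned translation from the internal (freely changeable) schema to a stable schema the data team consumes. A hedged Python sketch, with entirely hypothetical field names (`order_id`, `total_eur` are not from the post):

```python
# Stable contract the data team depends on; the dev team may change its
# internal schema freely as long as this mapping keeps producing it.
STABLE_CONTRACT = {"order_id": str, "amount_cents": int}

def to_contract(internal_row: dict) -> dict:
    # Internal field names may change over time; only this function
    # (owned by the dev team) has to be updated when they do.
    row = {
        "order_id": str(internal_row["id"]),
        "amount_cents": int(round(internal_row["total_eur"] * 100)),
    }
    # Validate against the contract before publishing downstream.
    for field, typ in STABLE_CONTRACT.items():
        assert isinstance(row[field], typ), f"contract violation: {field}"
    return row
```

In practice this mapping could live as a view on the data lake (as the post suggests) or as a transform in the CDC pipeline; the key point is ownership, not where it runs.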

2 Comments
2024/04/20
09:06 UTC

1

adapt() gives error while using Normalization Layer in Sequential Models?

While using the Normalization layer in a Sequential model, I get an UnboundLocalError when calling adapt():

normalizer = Normalization()

normalizer.adapt(X_train)

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[198], line 2
      1 normalizer = Normalization()
----> 2 normalizer.adapt(X_train)

File /usr/local/lib/python3.10/site-packages/keras/src/layers/preprocessing/normalization.py:228, in Normalization.adapt(self, data)
    225     input_shape = tuple(data.element_spec.shape)
    227 if not self.built:
--> 228     self.build(input_shape)
    229 else:
    230     for d in self._keep_axis:

UnboundLocalError: local variable 'input_shape' referenced before assignment
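For reference, adapt() just computes per-feature statistics from the data. A pure-NumPy sketch of the equivalent computation (an illustration, not the Keras implementation; one plausible cause of the error above is passing adapt() something that is not array-like, such as a plain Python object whose shape can't be inferred):

```python
import numpy as np

def adapt_normalizer(X):
    # Coercing to an ndarray up front avoids shape-inference surprises.
    X = np.asarray(X, dtype="float64")
    mean = X.mean(axis=0)   # per-feature mean
    var = X.var(axis=0)     # per-feature variance
    return mean, var

def normalize(X, mean, var, eps=1e-7):
    # Forward pass of a normalization layer: (x - mean) / sqrt(var)
    return (np.asarray(X, dtype="float64") - mean) / np.sqrt(var + eps)
```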
0 Comments
2024/04/19
16:50 UTC

1

Apache kudu build

I'm working on a big data architecture for real-time analytics with Kudu and Impala. I've been trying to install Apache Kudu following the documentation step by step from the official website, but the build always exits with errors.

  • Yes, I tried the quickstart and it works great, but I need the full version.

Has anyone faced the same problem with the build, and how did you solve it? Note: I'm using Ubuntu machines with 32 GB of memory, 8 cores, and more than enough SSD disk space.

0 Comments
2024/04/19
11:14 UTC

2

The Role of Smart Maritime IoT Solutions in Enhancing Maritime Safety

0 Comments
2024/04/19
06:19 UTC

1

Artificial Intelligence in the Business World [Tecnología E3]

0 Comments
2024/04/18
00:42 UTC

1

Building Customizable Database Software and Apps with Blaze No-Code Platform

A cloud database is a collection of data, or information, specially organized for rapid search, retrieval, and management via the internet. The guide below shows how, with the Blaze no-code platform, you can host your database without writing code and store your data in one centralized place so you can easily access and update it: Online Database - Blaze.Tech

0 Comments
2024/04/17
16:34 UTC

1

Flink SQL—Misconfiguration, Misunderstanding, and Mishaps

0 Comments
2024/04/17
14:22 UTC

0

Help me pick a laptop for Data engineering/Big data work

I am planning to buy a laptop and am confused about which one to pick. I'm looking for high performance on a budget under 40k. Thanks in advance!

20 Comments
2024/04/14
05:00 UTC
