/r/bigdata

For all big data gurus everywhere, from hedge funds (quant finance) to biotech (drug discovery) to social media (Twitter), to discuss the latest trends, topics, career opportunities, and tricks of the trade!

Rules: No advertising, don't blatantly link to your own product(s). Posts must be relevant to big data technologies or discussions.

Related subreddits:

r/datascience

r/bigdatajobs

r/machinelearning

r/datagangsta

54,845 Subscribers

1

Help me pick a laptop for Data engineering/Big data work

I am planning to buy a laptop and am confused about which one to pick. I need high performance on a budget under 40k. Thanks in advance!

11 Comments
2024/04/14
05:00 UTC

1

How can I derive associations between player positions?

I have a CSV of football data about goals, where each goal has a scorer, a GCA1 (the player who gave the assist), and a GCA2 (the player who passed to the assister).

I want to discover patterns of player positions that lead to a goal, i.e. the buildups to a goal.

Example: an RB passed to a CAM, who assisted a goal scored by an ST; or a CB passed to an RW, who assisted a goal scored by an LW.

I want to find the most frequent buildups. Think of it as finding frequent itemsets for a supermarket to drive discount decisions, except my goal is to learn which buildups are most common and to build coaching plans that strengthen the relationships between the players in those buildups.

I was thinking of using the Apriori algorithm or FP-Growth. I tried ChatGPT, but it didn't help much: I'm getting only one association, between FW players and no one, which amounts to saying that forwards score solo, and that is definitely not consistent with my dataset. Gemini is the most awful AI out there. Seriously, my grandma can do better: I gave it a prompt and rephrased it 3 times, and it still gave me 'Rephrase your prompt and try again'.

Does anyone know how I can do this, or whether there is a better way? I'm still a junior data scientist and still learning, so I would gladly appreciate any feedback or advice.
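One possible starting point, sketched under assumptions: each goal already has a fixed, ordered chain (GCA2, then GCA1, then scorer), so counting the ordered position triples directly may be simpler than Apriori or FP-Growth, which treat items as unordered sets. The column names `gca2_pos`, `gca1_pos`, and `scorer_pos` are hypothetical; the post does not say how positions are stored, so a player-to-position lookup may be needed first. The sample rows are made up.

```python
import csv
import io
from collections import Counter

def count_buildups(rows, cols=("gca2_pos", "gca1_pos", "scorer_pos")):
    """Count ordered position chains, skipping goals with a missing link."""
    counts = Counter()
    for row in rows:
        chain = tuple((row.get(c) or "").strip() for c in cols)
        if all(chain):  # ignore goals without a recorded pass or assist
            counts[chain] += 1
    return counts

def with_support(counts):
    """Return (chain, count, share of counted goals), most frequent first."""
    total = sum(counts.values())
    return [(chain, n, n / total) for chain, n in counts.most_common()]

# Tiny made-up sample standing in for the real CSV file (hypothetical columns).
sample = io.StringIO(
    "gca2_pos,gca1_pos,scorer_pos\n"
    "RB,CAM,ST\n"
    "CB,RW,LW\n"
    "RB,CAM,ST\n"
    ",CAM,ST\n"
)
buildups = count_buildups(csv.DictReader(sample))
for chain, n, share in with_support(buildups):
    print(" -> ".join(chain), n, f"{share:.0%}")
```

If many goals have an empty GCA1 or GCA2, it may be worth checking how those rows are encoded before running an itemset miner, since single-item transactions could produce the "forwards scoring solo" result described above.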

1 Comment
2024/04/13
19:06 UTC

2

Example Data Pipeline with Prefect, Delta Lake, and Dask

I’m an OSS developer (primarily working on Dask) and lately I’ve been talking to users about how they’re using Dask for ETL-style production workflows and this inspired me to make something myself. I wanted a simple example that met the following criteria:

- **Run locally (optionally)**. Should be easy to try out locally and easily scalable.
- **Scalable to cloud**. I didn’t want to think hard about cloud deployment.
- **Python forward**. I wanted to use tools familiar to Python users, not an ETL expert.

The resulting data pipeline uses Prefect for workflow orchestration, Dask to scale the data processing across a cluster, Delta Lake for storage, and Coiled to deploy Dask on the cloud.

I really like the outcome, but wanted to get more balanced feedback since lately I’ve been more on the side of building these tools rather than using them heavily for data engineering. Some questions I’ve had include:

- **Prefect vs. Airflow vs. Dagster?** For the users I’ve been working with at Coiled, Prefect is the most commonly used tool. I also know Dagster is quite popular and could easily be swapped into this example.
- **DeltaLake or something else?** To be honest I mostly see vanilla Parquet in the wild, but I’ve been curious about Delta for a while and mostly wanted an excuse to try it out (pandas and Dask support improved a lot with delta-rs).

Anyway, if people have a chance to read things over and give feedback I’d welcome constructive critique.
Blog post: https://docs.coiled.io/blog/easy-scalable-production-etl.html
Code: https://github.com/coiled/etl-tpch

0 Comments
2024/04/11
13:52 UTC

1

IoT-Powered Smart Warehouse Management: A Detailed Guide

0 Comments
2024/04/11
11:11 UTC

1

Complete Survey on Database Tech Education for a Chance to Win a $100 Amazon Gift Card!

$100 Amazon gift card opportunity for participating in a 10-minute survey. We're inviting students from universities across the globe to participate in a brief survey conducted by Valley Consulting Group at UC Berkeley, in collaboration with Oracle Corporation. Your valuable perspectives will contribute to understanding database technology instruction in higher education globally. As a token of our appreciation, participants who complete the survey will be entered into a drawing for a chance to win a $100 Amazon gift card!

https://berkeley.qualtrics.com/jfe/form/SV_6wWbyuBP7pBJUmG

0 Comments
2024/04/10
21:24 UTC

0

I want the data science roadmap that you used when you studied data science

I'm a semester 4 computer science student at university. I want to become a data scientist within 6 months, and I want the fastest way to gain data science knowledge. I plan to spend 4-6 hours every day studying data science.

0 Comments
2024/04/09
09:39 UTC

1

Big Data Paradigms and LLMs

  1. Big data paradigms such as batch processing, stream processing, and hybrid processing are used by big companies like Google and Apple. For which applications, and where, do these companies use such paradigms?

  2. What are the future prospects of big data with LLMs? Please suggest some good research papers or articles.
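As a rough illustration of the difference between the first two paradigms (a toy sketch, not how any particular company implements them): a batch job sees the whole dataset and computes its answer once, while a streaming job keeps running state and updates its answer as each event arrives. The event data and function names are made up for the example.

```python
from collections import Counter

def batch_counts(events):
    """Batch: the full dataset is available; compute the result in one pass."""
    return Counter(e["page"] for e in events)

def stream_counts(event_iter):
    """Stream: keep running state and emit an updated result per event."""
    state = Counter()
    for e in event_iter:
        state[e["page"]] += 1
        yield dict(state)  # snapshot of the counts after this event

# Made-up page-view events for illustration only.
events = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]

final_batch = batch_counts(events)
snapshots = list(stream_counts(iter(events)))
```

Both paths end at the same totals; the streaming version simply has an answer available after every event. Hybrid designs (often described as the Lambda architecture) run both and combine the results.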

2 Comments
2024/04/09
04:17 UTC

2

A Questionnaire on Big Data and Digital Governance

Dear Folks!

I am a PhD Research Scholar at Central University of Punjab. I am seeking your expert opinion on some questions. Here is the attached link to the questionnaire. This will take approximately 10-20 minutes to complete. Your input would be greatly appreciated.

Thanks for your kind cooperation.

Link: https://forms.gle/FVEdAQoUkrzqd1JbA

0 Comments
2024/04/09
03:59 UTC

1

Companies to apply for

Please suggest companies in Egypt that have a stable data team and regularly hire juniors or fresh graduates, or remote companies that offer internships in data science/engineering.

1 Comment
2024/04/09
02:22 UTC

1

Smart Factory Case Study

0 Comments
2024/04/05
07:10 UTC

1

Technical Theme for 17 April, 7:45-9am PST: 'Cyber teams leading with 100% Private LLMs: A cyber/CISO perspective on Large Language Models.'

You are cordially invited to join us for an invite-only Zoom session, limited to CISOs, CIOs, CTOs, and cloud SMEs. (No sales executives, please.)

Presenter: Tim Rohrbaugh (former CISO of JetBlue). Thanks, DLH

0 Comments
2024/04/02
12:55 UTC

2

Invitation: Technical Theme for 17 April, 7:45-9am PST: 'Cyber teams leading with 100% Private LLMs: A cyber/CISO perspective on Large Language Models.'

[ Removed by Reddit on account of violating the content policy. ]

1 Comment
2024/04/02
12:53 UTC

2

Metrics-Focused Data Strategy with Model-First Data Products

Purpose-driven data, metric dependency trees, product prototypes, and more!

Concept: Quenching the perpetual business curiosity
Tech: Collection of Model-First Data Products
💡 The Metric Dependency Tree
💠 The Model-First Data Product(s)
Step 1: Retrieving Requirements
Step 2: Modeling the Product Prototype
Step 3: Validating the Prototype
Step 4: Materialising Product from Prototype

Read more: https://moderndata101.substack.com/p/model-first-data-products

0 Comments
2024/04/02
12:40 UTC

9

Apache Hive 4.0 has been released

Hi Guys,

Apache Hive 4.0 has been released. It's a really cool project, do check it out.

https://github.com/apache/hive

https://hive.apache.org/general/downloads/

https://hive.apache.org/

5 Comments
2024/03/30
18:38 UTC

2

Smart Supply Chains: Driving eCommerce Success with IoT

0 Comments
2024/03/29
05:38 UTC

2

Academic Survey on the Challenges of Big Data Security

https://forms.office.com/r/3qC141C9Xd
This survey asks for your opinion on how concerning (on a scale from 1 to 5) a number of specified challenges in Big Data security are. It also asks, in open-ended questions that allow long-form answers, what you believe the challenges to be. I'll admit that this survey is not at as high a level as some of the other posts on this sub, but I am just a student and this is my first serious foray into Big Data.
This survey takes just under 5 minutes to complete, and you are the exact demographic that I would love to hear from. Thank you in advance.

0 Comments
2024/03/29
02:48 UTC

1

Apache Ranger UserSync Configuration HELP!!

I am trying to configure Apache Ranger usersync with Unix, and I am stuck at this point:

After I execute this: sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ ./setup.sh

Then this error pops up:

teka@t3:/usr/local/ranger-usersync$ sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 ./setup.sh
[sudo] password for teka:
INFO: moving [/etc/ranger/usersync/conf/java_home.sh] to [/etc/ranger/usersync/conf/.java_home.sh.28032024144333] .......
Direct Key not found:SYNC_GROUP_USER_MAP_SYNC_ENABLED
Direct Key not found:hadoop_conf
Direct Key not found:ranger_base_dir
Direct Key not found:USERSYNC_PID_DIR_PATH
Direct Key not found:rangerUsersync_password
Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper
    at org.apache.ranger.credentialapi.CredentialReader.getDecryptedString(CredentialReader.java:39)
    at org.apache.ranger.credentialapi.buildks.createCredential(buildks.java:87)
    at org.apache.ranger.credentialapi.buildks.main(buildks.java:41)
Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 3 more
ERROR: Unable update the JCKSFile(/etc/ranger/usersync/conf/rangerusersync.jceks) for aliasName (usersync.ssl.key.password)

Can anyone help me with that?

Tools I am using:

Host Device: MacBook M1

Guest Device: Ubuntu 20.04 LTS

Apache Ranger: 2.4 (built from source)
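A possible first diagnostic step, offered as a sketch rather than a confirmed fix: the missing class `com.ctc.wstx.io.InputBootstrapper` belongs to the Woodstox XML parser (the `woodstox-core` jar), so it can help to check whether any jar under the install directory actually contains it. The script below only lists matching jars. Whether `setup.sh` puts that directory on its classpath is an assumption to verify, and the default path is simply taken from the shell prompt in the log.

```python
import zipfile
from pathlib import Path

# Archive entry for the class named in the NoClassDefFoundError.
MISSING_CLASS = "com/ctc/wstx/io/InputBootstrapper.class"

def jars_containing(root, entry=MISSING_CLASS):
    """Return jars under root whose archive listing includes the entry."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = []
    for jar in sorted(root.rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except (zipfile.BadZipFile, OSError):
            continue  # skip unreadable or corrupt jar files
    return hits

# Path assumed from the shell prompt in the post; adjust as needed.
for jar in jars_containing("/usr/local/ranger-usersync"):
    print(jar)
```

If nothing is printed, the build output probably does not ship that jar where the setup script looks for it. If a jar is printed, the next thing to inspect would be the classpath that the script builds.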

0 Comments
2024/03/28
14:55 UTC

1

Seeking Opinions: What's the Equivalent of my Degree Internationally?

Hi everyone!

Three years ago I completed my bachelor's degree in what would be translated as information science. However, after some research, it seems more like what other countries would call computer science.

I have been wondering what to call my degree when applying for international jobs.
I was therefore hoping someone here could give me a pointer on what my degree equates to, and how it stacks up against similar bachelor's degrees around the world.

Here are a few examples of the courses I've taken during my degree:

  • Programming (Basic and advanced Python)
  • Machine Learning
  • Artificial Intelligence
  • Data Management
  • System Development
  • Knowledge graphs
  • Knowledge representation and Reasoning
  • Human-Computer Interaction

Thanks in advance for any opinions!

1 Comment
2024/03/27
17:47 UTC
