/r/bigdata
For all big data gurus everywhere, from hedge funds (quant finance) to biotech (drug discovery) to social media (Twitter), to discuss the latest trends, topics, career opportunities and tricks of the trade!
Rules: No advertising; don't blatantly link to your own product(s). Posts must be relevant to big data technologies or discussions.
I am planning to buy a laptop and am confused about which one to pick. Looking for high performance on a budget under 40k. Thanks in advance!
So I have a CSV containing football data about goals, where each goal has a scorer, GCA1 (the player who gave the assist), and GCA2 (the player who gave the pass to the assister).
I want to discover patterns of player positions that lead to a goal, i.e. buildups to a goal.
Example: an RB passed to a CAM, who assisted a goal scored by an ST; or a CB passed to an RW, who assisted a goal scored by an LW.
I want to find the most frequent buildups. Think of it as finding frequent itemsets for a supermarket to derive discount decisions, except my goal is to know which buildups are most common and to draw up coaching plans that strengthen the relationships between the players in those buildups.
I was thinking of using the Apriori algorithm or FP-Growth. I tried ChatGPT but it didn't help me much (I'm only getting one association, between FW players and nobody else, as if forward players always score solo, which is definitely not logical given my dataset), and Gemini was even worse. Seriously, my grandma could do better: I gave it a prompt, rephrased it three times, and it still answered 'Rephrase your prompt and try again'.
So does anyone know a way I can do this, or a better way to approach it? I'm still a junior data scientist, so I'm still learning, and I would gladly appreciate any feedback or advice.
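One note on why Apriori/FP-Growth may be giving degenerate results here: both treat each transaction as an unordered set of unique items, so the direction of the buildup (GCA2 → GCA1 → scorer) is lost and repeated positions (e.g. ST assists ST) collapse into one item. Since every goal yields exactly one fixed-length, ordered sequence of positions, plain counting of ordered tuples answers "which buildups are most common" directly. A minimal sketch, assuming the CSV has columns `scorer_pos`, `gca1_pos`, `gca2_pos` (column names are my guess, adjust to the actual file):

```python
# Count frequent goal buildups as ordered position sequences.
# Assumed columns: scorer_pos, gca1_pos, gca2_pos hold the position
# (ST, CAM, RB, ...) of each player involved in the goal.
import csv
from collections import Counter

def frequent_buildups(path, top_n=10):
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Order matters (GCA2 -> GCA1 -> scorer) and positions can
            # repeat, so an unordered itemset would lose information.
            buildup = (row["gca2_pos"], row["gca1_pos"], row["scorer_pos"])
            counts[buildup] += 1
    return counts.most_common(top_n)
```

If you later want variable-length chains or "position X somewhere in the buildup" patterns, sequence-mining algorithms (e.g. PrefixSpan) are a better fit than itemset mining, since they preserve order.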
I’m an OSS developer (primarily working on Dask) and lately I’ve been talking to users about how they’re using Dask for ETL-style production workflows, and this inspired me to build something myself. I wanted a simple example that met the following criteria:
- **Runs locally (optionally)**. Should be easy to try out locally.
- **Scalable to cloud**. I didn’t want to think hard about cloud deployment.
- **Python forward**. I wanted to use tools familiar to Python users, not to ETL experts.
The resulting data pipeline uses Prefect for workflow orchestration, Dask to scale the data processing across a cluster, Delta Lake for storage, and Coiled to deploy Dask on the cloud.
I really like the outcome, but wanted to get more balanced feedback since lately I’ve been more on the side of building these tools rather than using them heavily for data engineering. Some questions I’ve had include:
- **Prefect vs. Airflow vs. Dagster?** For the users I’ve been working with at Coiled, Prefect is the most commonly used tool. I also know Dagster is quite popular and could easily be swapped into this example.
- **DeltaLake or something else?** To be honest I mostly see vanilla Parquet in the wild, but I’ve been curious about Delta for a while and mostly wanted an excuse to try it out (pandas and Dask support improved a lot with delta-rs).
Anyway, if people have a chance to read things over and give feedback I’d welcome constructive critique.
Blog post: https://docs.coiled.io/blog/easy-scalable-production-etl.html
Code: https://github.com/coiled/etl-tpch
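For readers who haven't seen the pattern, the pipeline follows a standard extract/transform/load shape. A stdlib-only sketch of that shape (function names and the JSON sink are illustrative, not from the repo; in the actual example each stage is a Prefect task, the transform runs on a Dask cluster, and Delta Lake replaces the file sink):

```python
# Minimal ETL shape, stdlib only, to show how the pieces slot together.
# In the linked example: each function would be a Prefect @task, etl() a
# @flow, transform() a Dask DataFrame computation, and load() a Delta
# Lake append (deployed on the cloud via Coiled).
import json

def extract(raw_rows):
    # In production: read newly arrived files from object storage.
    return [r for r in raw_rows if r is not None]

def transform(rows):
    # In production: a distributed Dask DataFrame computation.
    return [{**r, "total": r["price"] * r["quantity"]} for r in rows]

def load(rows, path):
    # In production: append to a Delta Lake table instead of JSON lines.
    with open(path, "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")

def etl(raw_rows, path):
    load(transform(extract(raw_rows)), path)
```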
$100 Amazon gift card opportunity for participating in a 10-minute survey. We're inviting students from universities across the globe to participate in a brief survey conducted by Valley Consulting Group at UC Berkeley, in collaboration with Oracle Corporation. Your valuable perspectives will contribute to understanding database technology instruction in higher education globally. As a token of our appreciation, participants who complete the survey will be entered into a drawing for a chance to win a $100 Amazon gift card!
I'm a semester-4 computer science student at university. I want to move into data science within 6 months, so I'm looking for the fastest way to gain data science knowledge, and I'm willing to spend 4-6 hours every day studying it.
Big data paradigms such as batch processing, stream processing, hybrid processing, etc. are used by big companies like Google and Apple, but for which applications? Where do these companies use such paradigms?
What are the future prospects of big data combined with LLMs? Please suggest some good research papers or articles.
Dear Folks!
I am a PhD research scholar at the Central University of Punjab, and I am seeking your expert opinion on some questions. Here is the link to the questionnaire; it will take approximately 10-20 minutes to complete. Your input would be greatly appreciated.
Thanks for your kind cooperation.
Suggest companies in Egypt that have a stable data team and regularly hire juniors/fresh graduates, or remote companies that offer internships in data science/engineering.
You are cordially invited to join us for an invite-only Zoom session, limited to CISOs, CIOs, CTOs and cloud SMEs. (No sales executives, please.)
Technical theme for 17 April, 7:45-9am PST: 'Cyber teams leading with 100% private LLMs: a cyber/CISO perspective on Large Language Models.'
Presenter: Tim Rohrbaugh (former CISO of JetBlue). Thanks, DLH.
Purpose-driven data, metric dependency trees, product prototypes, and more!
Concept: Quenching the perpetual business curiosity
Tech: Collection of Model-First Data Products
💡 The Metric Dependency Tree
💠 The Model-First Data Product(s)
Step 1: Retrieving Requirements
Step 2: Modeling the Product Prototype
Step 3: Validating the Prototype
Step 4: Materialising Product from Prototype
Read more: https://moderndata101.substack.com/p/model-first-data-products
Hi Guys,
Apache Hive 4.0 has been released. It's a really cool project; do check it out.
https://github.com/apache/hive
https://forms.office.com/r/3qC141C9Xd
This survey asks your opinion on how concerning (on a scale from 1 to 5) a number of specified challenges in Big Data security are. It also asks what you believe those challenges to be, in open-ended questions that allow long-form answers. I'll admit this survey is not at as high a level as some of the other posts on this sub, but I am just a student and this is my first serious foray into Big Data.
This survey takes just under 5 minutes to complete, and you are the exact demographic that I would love to hear from. Thank you in advance.
I am trying to configure Apache Ranger usersync with Unix, and I am stuck at this point. After I execute this:
sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ ./setup.sh
the following error pops up:
teka@t3:/usr/local/ranger-usersync$ sudo JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 ./setup.sh
[sudo] password for teka:
INFO: moving [/etc/ranger/usersync/conf/java_home.sh] to [/etc/ranger/usersync/conf/.java_home.sh.28032024144333] .......
Direct Key not found:SYNC_GROUP_USER_MAP_SYNC_ENABLED
Direct Key not found:hadoop_conf
Direct Key not found:ranger_base_dir
Direct Key not found:USERSYNC_PID_DIR_PATH
Direct Key not found:rangerUsersync_password
Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper
at org.apache.ranger.credentialapi.CredentialReader.getDecryptedString(CredentialReader.java:39)
at org.apache.ranger.credentialapi.buildks.createCredential(buildks.java:87)
at org.apache.ranger.credentialapi.buildks.main(buildks.java:41)
Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 3 more
ERROR: Unable update the JCKSFile(/etc/ranger/usersync/conf/rangerusersync.jceks) for aliasName (usersync.ssl.key.password)
Can anyone help me with that?
Tools I am using:
Host Device: MacBook m1
Guest Device: Ubuntu 20.04 LTS
Apache Ranger: 2.4 (Build from source code)
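For anyone hitting the same error: `com.ctc.wstx.io.InputBootstrapper` is a Woodstox class, and this `NoClassDefFoundError` typically means the Woodstox/StAX2 jars are missing from the usersync classpath in a source build. A quick diagnostic sketch (the paths are assumptions based on the install prefix shown in the post, not verified against this exact build):

```shell
# Check whether the Woodstox/StAX2 jars that provide
# com.ctc.wstx.io.InputBootstrapper are present in the usersync lib dir.
# (Path assumed from the install prefix /usr/local/ranger-usersync above.)
LIB=/usr/local/ranger-usersync/lib
ls "$LIB" 2>/dev/null | grep -iE 'woodstox|stax2' \
  || echo "woodstox/stax2 jars not found in $LIB"
# If they are missing, copying them from a Hadoop distribution is one
# common fix (example paths, adjust to your HADOOP_HOME):
#   cp "$HADOOP_HOME"/share/hadoop/common/lib/woodstox-core-*.jar "$LIB"/
#   cp "$HADOOP_HOME"/share/hadoop/common/lib/stax2-api-*.jar    "$LIB"/
# then re-run setup.sh.
```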
Hi everyone!
Three years ago I completed my bachelor's degree in what would be translated as "information science". However, after some research it seems more like what other countries would call computer science.
I have been wondering what to call my degree when applying for international jobs.
I was therefore hoping someone here could give me a pointer on what my degree equates to, and how it stacks up against similar bachelor's degrees around the world?
Here are a few examples of the courses I've taken during my degree:
Thanks in advance for any opinions!