/r/bigdata


For all big data gurus everywhere, from hedge funds (quant finance) to biotech (drug discovery) to social media (Twitter), to discuss the latest trends, topics, career opportunities, and tricks of the trade!

Rules: No advertising; don't blatantly link to your own product(s). Posts must be relevant to big data technologies or discussions.

Related subreddits:

r/datascience

r/bigdatajobs

r/machinelearning

r/datagangsta


55,088 Subscribers

1

Survey on the Role of Artificial Intelligence and Big Data in Enhancing Cancer Treatment

Hello everyone, I am currently writing my dissertation on Big Data and AI. Below is the questionnaire I prepared for my primary research.

All responses will remain anonymous.

  1. Background Information

• What is your professional background? (Options: Healthcare, IT, Data Science, Education, Other)

• How familiar are you with AI and big data applications in healthcare? (Scale: Not familiar - Extremely familiar)

  2. Perceptions of AI and Big Data in Healthcare

• In your opinion, what are the most promising applications of AI and big data in healthcare?

• How do you think AI and big data can improve cancer tumor detection and treatment?

  3. Challenges and Barriers

• What do you see as the biggest challenges or barriers to implementing AI and big data solutions in healthcare settings?

• How concerned are you about privacy and security issues related to using AI and big data in healthcare? (Scale: Not concerned - Extremely concerned)

  4. Effectiveness and Outcomes

• Can you provide examples (if any) from your experience or knowledge where AI and big data have significantly improved healthcare outcomes?

• How effective do you believe AI is in personalizing cancer treatment compared to traditional methods?

  5. Future Trends

• What future developments in AI and big data do you anticipate will have the most impact on healthcare in the next 5-10 years?

• What role do you think cloud computing will play in the future of AI and big data in healthcare?

  6. Personal Insights

• What advice would you give to healthcare organizations looking to integrate AI and big data into their operations?

• What skills do you think are essential for professionals working at the intersection of AI, big data, and healthcare?

  7. Open-Ended Response

• Is there anything else you would like to add about the role of AI and big data in healthcare that has not been covered in this questionnaire?

Thank you for your time!

0 Comments
2024/04/29
02:33 UTC

2

I recorded a Python PySpark Big Data Course and uploaded it on YouTube

Hello everyone, I uploaded a PySpark course to my YouTube channel. I tried to cover a wide range of topics, including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), the DataFrame and Dataset APIs, data cleaning and preprocessing, exploratory data analysis, data transformation and manipulation, group by and window functions, user-defined functions, and machine learning with Spark MLlib. I'm leaving the link in this post, have a great day!

https://www.youtube.com/watch?v=jWZ9K1agm5Y&list=PLTsu3dft3CWiow7L7WrCd27ohlra_5PGH&index=9&t=1s

0 Comments
2024/04/28
13:52 UTC

0

Open Source SQL Databases - OLTP and OLAP Options

Are you leveraging open source SQL databases in your projects?

Check out the article here to see the options out there: https://www.datacoves.com/post/open-source-databases

Why consider Open Source SQL Databases? 🌐

  • Cost-Effectiveness: Dramatically reduce your system's total cost of ownership.
  • Flexibility and Customization: Tailor database software to meet your specific requirements.
  • Robust Community Support: Benefit from rapid updates and a wealth of community-driven enhancements.

Share your experiences or ask questions about integrating these technologies into your tech stack.

0 Comments
2024/04/24
19:02 UTC

2

Data Project - Personal Finance

0 Comments
2024/04/23
17:55 UTC

7

WAL is a broken strategy?

Hi,

I'm studying big data systems a bit.

I've come across this article from 2019, written by the founder of VictoriaMetrics, which argues that the write-ahead log (WAL) is a broken and inefficient strategy. In short, he says: flush to disk every second in an SSTable format (of your choice), and use background compaction to slowly merge the results into decent-sized blocks. He notes two systems using this strategy: VictoriaMetrics and ClickHouse.

Would love to hear some expert Big Data take on this.
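The flush-and-compact strategy the article describes can be pictured with a toy in-memory sketch (my own illustration, not VictoriaMetrics or ClickHouse code): writes buffer in a memtable with no WAL, small sorted runs are flushed periodically, and compaction merges them into one larger run.

```python
class ToyLSM:
    def __init__(self, flush_threshold=4, compact_at=3):
        self.memtable = {}              # in-memory write buffer (no WAL)
        self.runs = []                  # flushed sorted runs, oldest first
        self.flush_threshold = flush_threshold
        self.compact_at = compact_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # "Flush every second": persist the memtable as one sorted run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable.clear()
        if len(self.runs) >= self.compact_at:
            self.compact()

    def compact(self):
        # Background compaction: merge small runs into one big sorted run;
        # later runs win when keys collide.
        merged = {}
        for run in self.runs:
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run first
            d = dict(run)
            if key in d:
                return d[key]
        return None
```

The trade-off the article's critics usually raise is durability: anything still in the memtable at crash time is lost, which is exactly the window a WAL exists to cover.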

0 Comments
2024/04/23
17:17 UTC

1

Strategies for Handling Missing Values in Data Analysis

As data scientists and analysts delve into the intricate world of data, they often encounter a common challenge: gaps in the data. Values can go missing for several reasons, for instance human error, sensor failure, or lapses in data collection. Handling missing values correctly is critical, because if they are not treated properly they can seriously degrade machine learning models and statistical estimates. Click here to read more >>

0 Comments
2024/04/23
14:48 UTC

1

How can I share BigQuery reports with non-technical folks?

Want to easily share BigQuery insights with your external clients, partners, or vendors?

If complex BI tools or clunky CSV exports are your current solutions, it's time for an upgrade! Softr now integrates with BigQuery, allowing you to connect to your BigQuery database and create dedicated dashboards and reports, without coding or complex analytics tools.

Here’s what you can do:

  • Data portals: Create intuitive, customized dashboards directly within Softr, so third parties and non-technical team members don't need to master complex analytics software.
  • Secure access control: Fine-tune permissions to determine exactly what data each external user can see.

Transform the way you share your BigQuery insights.

0 Comments
2024/04/23
11:20 UTC

3

Strategies for Handling Missing Values in Data Analysis

As data scientists and analysts delve into the intricate world of data, they often encounter a common challenge: gaps in the data. Values can go missing for several reasons, for instance human error, sensor failure, or lapses in data collection. Handling missing values correctly is critical, because if they are not treated properly they can seriously degrade machine learning models and statistical estimates. This article covers the skills and methodologies data scientists need to manage missing data effectively. Click here to read more >>
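For context, the most common strategies such articles cover can be sketched in a few lines of pandas (the column names and data here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor": [1.0, np.nan, 3.0, np.nan, 5.0],
    "label":  ["a", "b", None, "b", "a"],
})

# Strategy 1: drop rows with any missing value
dropped = df.dropna()

# Strategy 2: mean imputation for a numeric column
mean_filled = df.assign(sensor=df["sensor"].fillna(df["sensor"].mean()))

# Strategy 3: forward fill (carry the last observed value forward)
ffilled = df.ffill()
```

Which strategy is appropriate depends on why the data is missing; mean imputation, for example, shrinks the column's variance, which can bias downstream models.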

0 Comments
2024/04/23
04:26 UTC


1

ClickHouse Performance Master Class – Tools and Techniques to Speed up any ClickHouse App Webinar

We'll discuss tools to evaluate performance, including ClickHouse system tables and EXPLAIN. We'll demonstrate how to evaluate and improve performance for common query use cases, ranging from MergeTree data on block storage to Parquet files in data lakes. Join our webinar to become a master at diagnosing query bottlenecks and curing them quickly. https://hubs.la/Q02t2dtG0

0 Comments
2024/04/22
13:11 UTC

2

Data Integration Unlocked: From Silos to Strategy for Competitive Success

0 Comments
2024/04/22
10:13 UTC


1

Seeking Data Sets of 2023 Headlines from Major Publications

0 Comments
2024/04/21
19:41 UTC

2

The Future of Healthcare: How AI is Revolutionizing Medical Diagnostics

Hey everyone, stumbled upon this fascinating article discussing the urgent need for AI integration in healthcare diagnostics. In today's rapidly evolving world, it's crucial for the healthcare sector to adapt, and this piece dives deep into why AI is the way forward.

Check it out: The Integration of AI in Healthcare: Enhancing Diagnostic Accuracy and Patient Outcomes

From highlighting the burden of diagnostic errors to exploring the promise of AI in addressing these challenges, this article offers a comprehensive overview. It delves into real-world examples, showcasing how AI is already making a tangible difference in patient outcomes.

What's particularly intriguing is the discussion on upcoming innovations in AI and the skills healthcare professionals need to develop to thrive in this AI-integrated environment.

Definitely worth a read for anyone interested in the intersection of technology and healthcare! Let's spark some discussions on how AI is shaping the future of medicine.

0 Comments
2024/04/21
09:52 UTC

3

Reporting system for microservices

Hi, we are trying to implement a reporting system for our microservices: our goal is to build a business intelligence service that correlates data between multiple services.

Right now, for legacy services, there is an ETL service that reads data (via SQL queries) from source databases and stores it in a data warehouse, where it is enriched and prepared for the end user.

For microservices, and in general for everything that is not legacy, we want to avoid this approach, because multiple kinds of databases are involved (e.g. PostgreSQL and MongoDB) and our ETL service needs to read a high volume of data every day, including data that has not changed (very slow and inefficient).

Because the people on the "data team" (who manage ETL jobs and business intelligence) are not on the dev teams, every time a dev team decides to change something (e.g. schema, database engine, etc.), our ETL service stops working, and this requires a lot of coordination and sharing of low-level implementation details.

We want the same level of backwards compatibility and abstraction that we have for service-to-service interaction (REST APIs), but for data, delegating to each dev team the maintenance of that backwards-compatibility layer (its contract with the data team), also because direct access to source databases and implementation details is an anti-pattern for microservices.

A first test was made using Debezium to stream changes from the source databases to Kafka and then to S3 (using Iceberg as the table format) in a kind of data lake, with Trino as the query engine. This approach seems very experimental and difficult to maintain and operate (e.g. what happens with a huge amount of inserted/updated data?). In addition, it is not clear how to maintain the data backwards-compatibility/abstraction layer: one possible approach could be to delegate it to the dev teams by allowing them to create views on the data lake.

Any ideas/suggestions?
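One way to picture the "data contract" idea from the post is a dev-team-owned translation from the internal (freely changeable) schema to a stable schema the data team consumes. A hedged Python sketch, with entirely hypothetical field names (`order_id`, `total_eur` are not from the post):

```python
# Stable contract the data team depends on; the dev team may change its
# internal schema freely as long as this mapping keeps producing it.
STABLE_CONTRACT = {"order_id": str, "amount_cents": int}

def to_contract(internal_row: dict) -> dict:
    # Internal field names may change over time; only this function
    # (owned by the dev team) has to be updated when they do.
    row = {
        "order_id": str(internal_row["id"]),
        "amount_cents": int(round(internal_row["total_eur"] * 100)),
    }
    # Validate against the contract before publishing downstream.
    for field, typ in STABLE_CONTRACT.items():
        assert isinstance(row[field], typ), f"contract violation: {field}"
    return row
```

In practice this mapping could live as a view on the data lake (as the post suggests) or as a transform in the CDC pipeline; the key point is ownership, not where it runs.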

2 Comments
2024/04/20
09:06 UTC

1

adapt() gives error while using Normalization Layer in Sequential Models?

While using the Normalization layer in a Sequential model, I get an UnboundLocalError when calling adapt():

normalizer = Normalization()

normalizer.adapt(X_train)

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[198], line 2
      1 normalizer = Normalization()
----> 2 normalizer.adapt(X_train)

File /usr/local/lib/python3.10/site-packages/keras/src/layers/preprocessing/normalization.py:228, in Normalization.adapt(self, data)
    225     input_shape = tuple(data.element_spec.shape)
    227 if not self.built:
--> 228     self.build(input_shape)
    229 else:
    230     for d in self._keep_axis:

UnboundLocalError: local variable 'input_shape' referenced before assignment
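For reference, adapt() just computes per-feature statistics from the data. A pure-NumPy sketch of the equivalent computation (an illustration, not the Keras implementation; one plausible cause of the error above is passing adapt() something that is not array-like, such as a plain Python object whose shape can't be inferred):

```python
import numpy as np

def adapt_normalizer(X):
    # Coercing to an ndarray up front avoids shape-inference surprises.
    X = np.asarray(X, dtype="float64")
    mean = X.mean(axis=0)   # per-feature mean
    var = X.var(axis=0)     # per-feature variance
    return mean, var

def normalize(X, mean, var, eps=1e-7):
    # Forward pass of a normalization layer: (x - mean) / sqrt(var)
    return (np.asarray(X, dtype="float64") - mean) / np.sqrt(var + eps)
```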
0 Comments
2024/04/19
16:50 UTC

1

Apache kudu build

I'm working on a big data architecture for real-time analytics with Kudu and Impala. I've been trying to install Apache Kudu following the documentation step by step from the official website, but the build always exits with errors.

  • Yes, I tried the quickstart and it works great, but I need the full version.

Has anyone faced the same problem with the build, and how did you solve it? Note: I'm using Ubuntu machines with 32 GB of memory, 8 cores, and more than enough SSD disk space.

0 Comments
2024/04/19
11:14 UTC

2

The Role of Smart Maritime IoT Solutions in Enhancing Maritime Safety

0 Comments
2024/04/19
06:19 UTC

1

Artificial Intelligence in the Business World [Tecnología E3]

0 Comments
2024/04/18
00:42 UTC

1

Building Customizable Database Software and Apps with Blaze No-Code Platform

A cloud database is a collection of data, or information, specially organized for rapid search, retrieval, and management via the internet. The guide below shows how, with the Blaze no-code platform, you can host your database without writing code and store your data in one centralized place so you can easily access and update it: Online Database - Blaze.Tech

0 Comments
2024/04/17
16:34 UTC

1

Flink SQL—Misconfiguration, Misunderstanding, and Mishaps

0 Comments
2024/04/17
14:22 UTC

0

Help me pick a laptop for Data engineering/Big data work

I am planning to buy a laptop and am confused about which one to pick. I'm looking for high performance on a budget under 40k. Thanks in advance!

20 Comments
2024/04/14
05:00 UTC
