/r/algorithms
Computer Science for Computer Scientists
I'm trying out this codewars problem. I have to simulate the following game - There are 2 players and an array of integers. A player can only pick an integer from either the start or end of the array.
Once a player picks an int, it gets removed from the array and the value of that int is added to that player's score. Then the next player does the same with the now smaller array.
They keep taking turns until the array is empty.
It is the goal of each player to make their score as big as possible. So it is not just a question of seeing which end of the array has the bigger value; it is also a question of asking yourself, "What will my opponent get in the next turn if I pick this value?"
So I've built a recursive function, where a player will first consider picking the value at the start of the array, and then assume the mutated new array. It will then consider both possibilities of the opponent picking off the left side of the mutated array and the right side. Repeat this entire process until the array is empty. There is a boolean variable called "skip" which decides if it's Player A's turn to pick or Player B's.
Then it will do all this again after considering picking the last value of the array.
Then there is a showdown where it will see which journey produced the maximum score. And it will pick the index accordingly and return it to a main loop so that we know which part of the array to use.
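For comparison, the textbook way to model optimal play in this kind of take-from-either-end game (which may or may not be what the kata intends) is a recurrence on the score difference over the remaining subarray arr[i..j]: the player to move picks the end that maximizes their own lead, knowing the opponent will then do the same on what is left. A minimal sketch:

// A sketch of the optimal-play recurrence, memoized on the subarray [i..j].
// diff(i, j) = best achievable (mover's score - opponent's score) on arr[i..j].
function pickDifference(arr) {
  const n = arr.length;
  const memo = new Map();
  function diff(i, j) {
    if (i > j) return 0;
    const key = i * n + j;
    if (memo.has(key)) return memo.get(key);
    const best = Math.max(
      arr[i] - diff(i + 1, j),  // take the left end; the opponent then moves on arr[i+1..j]
      arr[j] - diff(i, j - 1)   // take the right end; the opponent then moves on arr[i..j-1]
    );
    memo.set(key, best);
    return best;
  }
  const total = arr.reduce((a, b) => a + b, 0);
  const d = diff(0, n - 1);     // first player's lead over the second under optimal play
  return [(total + d) / 2, (total - d) / 2];
}

For [4,7,2,9,5,2] this evaluates to [18, 11].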
function distributionOf(paramArr){
  let arr = paramArr; let aTotal = 0; let bTotal = 0;
  while(arr.length > 0){
    let mockArrA = createArrCopy(arr);
    let aix = grab(mockArrA, 0, false, 0);
    aTotal = aTotal + arr[aix];
    arr.splice(aix, 1);
    if(arr.length <= 0){
      break;
    }
    let mockArrB = createArrCopy(arr);
    let bix = grab(mockArrB, 0, false, 0);
    bTotal = bTotal + arr[bix];
    arr.splice(bix, 1);
  }
  console.log("Answer: " + aTotal + " , " + bTotal);
  return [aTotal, bTotal];

  function grab(arr, total, skip, count){
    //emergency base case
    if(arr.length <= 0){
      return 0;
    }
    //base case
    if(arr.length == 1){
      if(count == 0){
        return 0;
      }
      else if(count > 0){
        return (total + arr[0]);
      }
    }
    //leftside
    let currLsTotal = total + arr[0];
    let lsArr = createArrCopy(arr);
    lsArr = lsArr.splice(0, 1);
    let futureLsTotal;
    if(skip == false){
      futureLsTotal = grab(lsArr, currLsTotal, true, count+1);
    }
    else if(skip == true){
      futureLsTotal = grab(lsArr, total, false, count+1);
    }
    //rightside
    let currRsTotal = total + arr[arr.length-1];
    let rsArr = createArrCopy(arr);
    rsArr = rsArr.splice(rsArr.length-1, 1);
    let futureRsTotal;
    if(skip == false){
      futureRsTotal = grab(rsArr, currRsTotal, true, count+1);
    }
    else if(skip == true){
      futureRsTotal = grab(rsArr, total, false, count+1);
    }
    //showdown
    if(futureLsTotal > futureRsTotal){
      if(count > 0){
        //return value
        return futureLsTotal;
      }
      else if(count == 0){
        return 0;
      }
    }
    else if(futureLsTotal <= futureRsTotal){
      if(count > 0){
        return futureRsTotal;
      }
      else if(count == 0){
        return arr.length-1;
      }
    }
  }

  function createArrCopy(arr){
    let destArr = new Array();
    for(let x=0; x<arr.length; x++){
      destArr.push(arr[x]);
    }
    return destArr;
  }
}
However, I'm finding that my answer is not the same one that codewars tells me.
For the array [4,7,2,9,5,2], codewars tells me the answer should be [18,11]. However, my answer is [15,14]. I have even tried it manually on paper, and it seems logical for it to be [15,14].
Could someone please tell me where I'm going wrong in my approach?
Here's the codewars problem: https://www.codewars.com/kata/59549d482a68fe3bc2000146
Anyone have suggestions on how to simulate a wind vector field from an atmospheric pressure field and a terrain description? I am looking to create varying wind maps for a serious game about sailing.
I figured I could generate a field of simulated atmospheric pressures (anticyclones, depressions) and take its gradient to infer wind vectors at every point of the map. Yet, I wonder how to account for terrain (wind blocked by the coastline). Thank you for any suggestion or pointer.
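In case a concrete starting point helps, here is a rough sketch (plain 2D arrays, hypothetical argument names) that derives a wind vector per cell from a pressure grid by central differences and simply zeroes out land cells. Note this is only a game-level approximation: real large-scale wind is roughly geostrophic, i.e. it blows along isobars rather than straight down the pressure gradient, and terrain does more than just block flow.

// Rough sketch: wind vectors from a 2D pressure grid via finite differences.
// pressure[y][x] is a scalar; land[y][x] is true where terrain blocks the wind.
function windField(pressure, land, strength = 1.0) {
  const h = pressure.length, w = pressure[0].length;
  const wind = [];
  for (let y = 0; y < h; y++) {
    const row = [];
    for (let x = 0; x < w; x++) {
      if (land[y][x]) { row.push({ u: 0, v: 0 }); continue; }
      // Central differences, clamped at the map borders.
      const pxm = pressure[y][Math.max(x - 1, 0)];
      const pxp = pressure[y][Math.min(x + 1, w - 1)];
      const pym = pressure[Math.max(y - 1, 0)][x];
      const pyp = pressure[Math.min(y + 1, h - 1)][x];
      // Point the wind from high pressure toward low pressure.
      row.push({ u: -strength * (pxp - pxm) / 2, v: -strength * (pyp - pym) / 2 });
    }
    wind.push(row);
  }
  return wind;
}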
I have started a Rust project to perform a Tseitin transformation. This includes a parser and lexer for boolean expressions, as well as functionality to Tseitin-transform these and store the Tseitin-transformed boolean expression in DIMACS format.
This transformation is useful if we want to check the satisfiability of boolean formulas which are not in CNF.
The project is hosted on github.
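As a generic illustration of what the transformation produces (not necessarily this project's exact output): introducing a fresh variable x3 for the subformula x1 AND x2 yields the clauses (NOT x3 OR x1), (NOT x3 OR x2) and (NOT x1 OR NOT x2 OR x3), plus a unit clause asserting the variable of the root subformula. In DIMACS CNF that could look like:

p cnf 3 4
-3 1 0
-3 2 0
-1 -2 3 0
3 0

The point of the transformation is that the resulting CNF is only linearly larger than the original formula and is equisatisfiable with it, unlike naively distributing into CNF, which can blow up exponentially.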
Let's say I have 5 sequences with the following lengths: 10, 6, 6, 4, 3
I would need to cut these sequences so that the total length of these sequences is 23 and all sequences are as equal as possible.
In this example, the end result could be: 6, 5, 5, 4, 3
Any ideas?
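One simple approach, assuming sequences may only be shortened (never extended): repeatedly take one unit off a currently longest sequence until the total reaches the target. For your example this reproduces the multiset {6, 5, 5, 4, 3}. A sketch:

// Trim the lengths so they sum to `target`, keeping them as equal as possible.
// Assumes the current total is at least `target`.
function trimToTotal(lengths, target) {
  const result = lengths.slice();
  let total = result.reduce((a, b) => a + b, 0);
  while (total > target) {
    // Find a currently longest sequence and shorten it by one unit.
    let maxIdx = 0;
    for (let i = 1; i < result.length; i++) {
      if (result[i] > result[maxIdx]) maxIdx = i;
    }
    result[maxIdx] -= 1;
    total -= 1;
  }
  return result;
}

trimToTotal([10, 6, 6, 4, 3], 23) returns [5, 5, 6, 4, 3], the same cuts as in your example. If the totals are large, the same idea can be done arithmetically (level the longest sequences down to a common height) instead of one unit at a time.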
In the question I wrote below (it doesn't let me attach images), I was wondering why the worst-case runtime of insertion sort is Theta(n^2). It says "most precise", so I'm not sure if O(n^2) just wasn't an option and it's supposed to be a trick question? I just don't fully understand why it's right.
Question: Match the best (most precise) complexity statements.
Worst-case runtime of insertion sort is: Theta(n^2)
Best-case runtime of insertion sort is: Theta(n)
Hi Everyone, If you are developing algorithms in TypeScript, there is a TypeScript Data Structures library you might find useful. It is zero-dependency, fast, lightweight, and fully tested. See: https://github.com/baloian/typescript-ds-lib
I have a robot that needs to travel to a given coordinate in the fastest possible time. The map has a grid, and so I can simply use BFS to reach the given coordinate... But I was wondering if I could instead travel along the diagonals in some way and decrease the travel time? Travelling diagonally (the hypotenuse) would be much quicker than travelling along the grid lines.
Is there an algorithm or a modified version of BFS that could solve such a problem?
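One option, assuming the robot really can move between diagonally adjacent cells: keep the grid, but give each cell its 8 neighbours with real edge costs (1 for orthogonal steps, sqrt(2) for diagonal steps) and run Dijkstra (or A* with a Euclidean or octile heuristic) instead of plain BFS; BFS only yields shortest paths when every edge has the same cost, which is no longer true here. A sketch, with a hypothetical grid encoding where grid[y][x] === 1 means blocked:

// Dijkstra on a grid with 8-directional movement.
// Returns the shortest travel distance from start to goal, or Infinity.
function shortestDistance(grid, start, goal) {
  const h = grid.length, w = grid[0].length;
  const dist = Array.from({ length: h }, () => new Array(w).fill(Infinity));
  const moves = [
    [1, 0, 1], [-1, 0, 1], [0, 1, 1], [0, -1, 1],
    [1, 1, Math.SQRT2], [1, -1, Math.SQRT2], [-1, 1, Math.SQRT2], [-1, -1, Math.SQRT2],
  ];
  dist[start.y][start.x] = 0;
  // A binary heap would be better for large grids; a plain sorted list keeps the sketch short.
  const queue = [{ x: start.x, y: start.y, d: 0 }];
  while (queue.length > 0) {
    queue.sort((a, b) => a.d - b.d);
    const cur = queue.shift();
    if (cur.d > dist[cur.y][cur.x]) continue;        // stale entry
    if (cur.x === goal.x && cur.y === goal.y) return cur.d;
    for (const [dx, dy, cost] of moves) {
      const nx = cur.x + dx, ny = cur.y + dy;
      if (nx < 0 || ny < 0 || nx >= w || ny >= h || grid[ny][nx] === 1) continue;
      const nd = cur.d + cost;
      if (nd < dist[ny][nx]) {
        dist[ny][nx] = nd;
        queue.push({ x: nx, y: ny, d: nd });
      }
    }
  }
  return Infinity;
}

If the robot can move at arbitrary angles rather than only multiples of 45°, any-angle planners such as Theta*, or planning on a visibility graph of the obstacles, get closer to true Euclidean shortest paths.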
Iterative Fractal-Enhanced Error Correction (IF-ECC): Integrating the Mandelbrot Set with Error Correction Codes for Adaptive Data Integrity
By : Dustin Sean Coffey
Abstract
This paper proposes a novel approach to data resilience by integrating the recursive nature of fractal mathematics, specifically the Mandelbrot set, with traditional error correction codes (ECC). This integration creates a unified error correction model capable of dynamically adjusting to complex and noisy environments, which we term the Iterative Fractal-Enhanced Error Correction (IF-ECC) model. By leveraging the self-similarity properties of fractals and adaptive parity distribution, IF-ECC introduces an innovative error correction method that holds potential for applications in digital communications, data storage, and emerging fields such as quantum computing. Initial theoretical analysis suggests this approach can enhance data integrity and adaptability in challenging environments, with further investigation recommended into algorithm optimization, parameter tuning, and practical deployment.
Introduction
In the modern world, maintaining data integrity across various media and channels is essential for reliable communication and storage. Error correction codes (ECC), such as Hamming codes, Reed-Solomon codes, and Low-Density Parity-Check (LDPC) codes, are widely used in digital systems to detect and correct errors that may arise during data transmission or storage. However, these methods often encounter limitations in highly dynamic or noisy environments, where static redundancy levels may fail to adapt to changing error rates.
This paper explores an unconventional approach by introducing fractal mathematics—specifically the Mandelbrot set—into ECC systems. The Mandelbrot set is known for its recursive, self-similar nature, which could provide enhanced redundancy and adaptability to the error correction process. We propose the Iterative Fractal-Enhanced Error Correction (IF-ECC) model, which iteratively combines fractal and ECC properties to dynamically adjust redundancy. The IF-ECC model introduces a new layer of adaptability, using fractal properties to enhance traditional error correction and potentially expand ECC capabilities in data communication, multimedia, and quantum computing.
2.1 Fractal Mathematics and the Mandelbrot Set
The Mandelbrot set is a complex fractal structure generated through iterative application of the function:
z_{n+1} = z_n^2 + c
where z_n and c are complex numbers and z_0 = 0. A point c belongs to the Mandelbrot set if the sequence remains bounded as n → ∞ (Mandelbrot, 1983). This recursive process produces a fractal pattern, exhibiting self-similarity and structural complexity that has been explored in areas like graphics and physics. Fractals, by their nature, offer multiple scales of redundancy, which we hypothesize can serve as a dynamic error-correcting mechanism.
2.2 Error Correction Codes
Error correction codes involve appending redundancy to transmitted data to detect and correct potential errors. Methods like Hamming and Reed-Solomon codes work by encoding data with additional parity bits, which enable receivers to correct errors within certain bounds (Hamming, 1950; Reed & Solomon, 1960). Despite their reliability, traditional ECCs are limited by static redundancy levels and may not be optimal in highly dynamic or unpredictable conditions. This research seeks to bridge this gap by combining ECC with fractal-inspired recursion for a more adaptable error correction solution.
The IF-ECC model applies the recursive, self-similar characteristics of the Mandelbrot set to generate adaptive parity information across iterations. This iterative, fractal-based approach adds an additional layer of error resilience by distributing redundancy in a dynamic manner. Here, we present the steps involved in the encoding and decoding processes.
3.1 Encoding Process
Mapping Data to the Complex Plane: Given a data set of bits d_1, ..., d_m, the data bits are mapped to a complex constant c:
c = \sum_{i=1}^{m} d_i \cdot w_i
where w_i are complex weights designed to uniquely map each bit sequence.
Fractal Iteration: the encoded sequence z_n (with z_0 = 0) is then generated by iterating
z_{n+1} = z_n^2 + c + e_n
where e_n is an error correction term derived from the parity bits of previous iterations:
e_n = \sum_{k=1}^{r} p_k \cdot z_{n-k}
where p_k represents parity bits derived from data and prior iterations, providing redundancy.
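To make the encoding step concrete, here is a purely illustrative sketch of the iteration described above; the weights w_i and the parity coefficients p_k are placeholders (represented here as complex numbers), since their construction is left open:

// Illustrative only: the IF-ECC style iteration z_{n+1} = z_n^2 + c + e_n,
// using {re, im} objects for complex numbers. Weights and parity terms are placeholders.
function complexAdd(a, b) { return { re: a.re + b.re, im: a.im + b.im }; }
function complexMul(a, b) {
  return { re: a.re * b.re - a.im * b.im, im: a.re * b.im + a.im * b.re };
}
function encode(bits, weights, parity, iterations) {
  // c = sum of d_i * w_i over the data bits.
  let c = { re: 0, im: 0 };
  bits.forEach((d, i) => { if (d) c = complexAdd(c, weights[i]); });
  const zs = [{ re: 0, im: 0 }];                     // z_0 = 0
  for (let n = 0; n < iterations; n++) {
    // e_n = sum of p_k * z_{n-k} over the last r iterations.
    let e = { re: 0, im: 0 };
    parity.forEach((p, k) => {
      const idx = n - (k + 1);
      if (idx >= 0) e = complexAdd(e, complexMul(p, zs[idx]));
    });
    zs.push(complexAdd(complexAdd(complexMul(zs[n], zs[n]), c), e));
  }
  return zs;                                         // the "codeword" of iterates
}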
3.2 Decoding Process
The decoding process involves reconstructing the data from the received codeword, which may be corrupted by errors:
Reconstruction of Iterations: Fractal iterations are used to reconstruct the z_n values.
Error Correction: Discrepancies are identified and corrected using parity bits based on fractal self-similarity.
Data Recovery: The original data is recovered by reversing the mapping process on the corrected complex constant c.
Advantages of the IF-ECC Model
Enhanced Redundancy: Self-similar fractal properties add natural layers of redundancy.
Adaptive Error Correction: Dynamic adjustments to redundancy provide improved resilience in variable error conditions.
Scalability: Parameters like the number of iterations and redundancy factor can be adjusted to balance error resilience and resource demands.
The IF-ECC model theoretically provides a recursive, adaptable approach to error correction. Initial analysis suggests that this fractal-enhanced model could address errors in dynamic environments more effectively than traditional ECC alone. Potential applications include digital communications, data storage, and multimedia encoding, with significant promise for use in fields requiring adaptive error resilience, such as deep-space communication and quantum computing.
While promising, implementing IF-ECC presents several challenges:
Complexity: The recursive nature of the model may increase computational complexity.
Parameter Tuning: Parameters such as the number of iterations and the redundancy factor r must be optimized for different data environments.
Hardware Constraints: Deploying fractal-based ECC in real-time systems may require specialized hardware accelerators.
Digital Communications: Enhancing resilience to noise in wireless and deep-space channels.
Data Storage: Increasing reliability of data storage on optical and magnetic media.
Quantum Computing: Potential use in stabilizing quantum states against decoherence.
Algorithm Optimization: Further research on algorithm efficiency and scalability in complex environments.
Conclusion
The Iterative Fractal-Enhanced Error Correction (IF-ECC) model presents a promising new approach to data integrity, combining the strengths of fractal mathematics and error correction codes. By leveraging the Mandelbrot set's recursive properties, IF-ECC could enable adaptive error correction that dynamically adjusts to changing error rates. While practical implementation requires further research and optimization, the theoretical foundation laid by this model has the potential to improve resilience in data communication, storage, and emerging fields such as quantum computing.
References
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147-160.
Mandelbrot, B. B. (1983). The fractal geometry of nature. New York: W. H. Freeman.
Reed, I. S., & Solomon, G. (1960). Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8(2), 300-304.
I’m having trouble understanding the De Bruijn Sequence problem on CSES. Here’s the link to the problem: CSES De Bruijn Sequence Problem.
One approach I thought of was to model each n-length binary string as a node and then connect the nodes based on whether they can follow each other. This way, the problem would essentially become about finding a Hamiltonian path, but given the constraints and time limits, I’m not sure this would be feasible.
The other approach I came across was from the USACO guide, which seems to be about Eulerian paths and circuits. I’m not able to fully understand the proof and why this method works in this context. Can anyone explain why this approach is valid and why it must work?
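The Eulerian view, in short: take the 2^(n-1) binary strings of length n-1 as nodes, and give every node two outgoing edges, "append 0" and "append 1"; each edge then corresponds to exactly one n-bit string. A circuit that uses every edge exactly once therefore spells out every n-bit string exactly once, i.e. a de Bruijn sequence, and such a circuit always exists because the graph is connected and every node has in-degree equal to out-degree (namely 2). That is why the Eulerian formulation is tractable where the Hamiltonian one is not. A sketch of the construction with Hierholzer's algorithm (treat it as an outline, not a tested submission):

// De Bruijn sequence of order n over {0,1} via an Eulerian circuit (Hierholzer).
function deBruijn(n) {
  if (n === 1) return "01";
  const m = 1 << (n - 1);                   // 2^(n-1) nodes: the (n-1)-bit strings
  const nextEdge = new Array(m).fill(0);    // next outgoing bit (0, 1, or 2 = exhausted) per node
  const stack = [0];                        // start from node 00...0
  const circuit = [];                       // Eulerian circuit of nodes, built in reverse
  while (stack.length > 0) {
    const v = stack[stack.length - 1];
    if (nextEdge[v] < 2) {
      const b = nextEdge[v]++;
      stack.push(((v << 1) | b) & (m - 1)); // the edge "append bit b" leads to this node
    } else {
      circuit.push(stack.pop());
    }
  }
  circuit.reverse();
  // Spell the sequence: the first node's n-1 bits, then the last bit of each following node.
  let result = circuit[0].toString(2).padStart(n - 1, "0");
  for (let i = 1; i < circuit.length; i++) result += (circuit[i] & 1).toString();
  return result;                            // length 2^n + n - 1
}

For n = 2 this produces "00110", which contains all of 00, 01, 11, 10 as substrings.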
I was reading about how certain AI systems, like AISHE, can be trained to make trading decisions autonomously. The idea is that these algorithms use historical market data to improve their decision-making. I’m curious how these types of algorithms are evolving and whether they could one day replace traditional trading methods. Any thoughts on how effective this could be in a real-world trading environment?
A, B, C, D, E, F, and G are decision problems. Assume that G is NP-complete and that polynomial Karp reductions between the problems are known as follows (a reduction from A to B is written here as A → B).
Which of the problems must be in NP, and which of the problems must be NP-hard and NP-complete, respectively?
I hope you guys can correct and teach me.
This was my reasoning:
NP-hard: if G is NP-complete, it means that all problems that G can be reduced to are NP-hard, which means that problems D, F, C and A are NP-hard, which according to the solution is correct. But is my reasoning right?
Then it comes to checking if the problem is in NP; that is where I am having a hard time.
The definition of an NP problem is a problem that can be verified in polynomial time, or alternatively can be solved in polynomial time by an NDTM. But here I don't really know much about the problems. I guess if we can reduce a problem X to G, then it is in NP, but I cannot quite understand why. It is correct, by the way, as in the answer is B, D, E.
The only NP-complete problem here is D since it is both NP-hard and in NP so that one is easy.
Hi all,
I’m working on a problem involving a directed graph where each node is assigned a number and optionally a char. Some nodes have fixed numbers or chars that cannot change (e.g., input/output nodes), while other nodes can have their numbers and chars adjusted—but under certain constraints.
I am trying to solve a variation of the coin change problem. Given a target amount k and a list of n coins with infinite supply (e.g., c[1] = 2, c[2] = 5, c[3] = 10), the goal is to determine the minimum number of coins needed to reach an amount that is either exactly k or the smallest amount greater than k.
This differs from the standard coin change problem, as the solution must account for cases where it's not possible to form the exact amount k, and in those cases, the smallest possible amount greater than k should be considered.
I need to understand the recurrence relation and how to handle cases where k cannot be exactly reached.
I tried adapting the standard coin change problem by adding conditions to handle amounts greater than k, but I am not sure how to properly modify the recurrence relation. I expected to find a clear way to define the state transition when the exact amount cannot be reached, but my approach doesn't seem to work in all cases.
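One way to set it up, assuming the intended reading is "use exactly k if it is reachable; otherwise use the smallest reachable amount above k": the smallest reachable amount that is at least k is always below k + max(c), because from any larger reachable amount you could drop one coin and still be at or above k. So the usual min-coin DP over exact amounts up to k + max(c) - 1 suffices, followed by a scan upward from k. A sketch:

// Sketch: if k itself is reachable, return the min coins for k; otherwise find the
// smallest reachable amount above k and return the min coins for that amount.
function minCoinsForTarget(k, coins) {
  const limit = k + Math.max(...coins) - 1;      // the answer's amount never exceeds this
  const INF = Number.POSITIVE_INFINITY;
  const dp = new Array(limit + 1).fill(INF);     // dp[a] = min coins to make exactly a
  dp[0] = 0;
  for (let a = 1; a <= limit; a++) {
    for (const c of coins) {
      if (c <= a && dp[a - c] + 1 < dp[a]) dp[a] = dp[a - c] + 1;
    }
  }
  for (let a = k; a <= limit; a++) {             // first reachable amount at or above k
    if (dp[a] !== INF) return { amount: a, coins: dp[a] };
  }
  return null;                                   // only happens if the coin list is empty
}

// Example: with coins [2, 5, 10] and k = 13, the exact amount 13 is reachable
// (5+2+2+2+2), so this returns { amount: 13, coins: 5 }; with k = 3 it returns
// { amount: 4, coins: 2 }, since 3 itself cannot be formed.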
I have a problem that seems to be a Knapsack problem, however I find it hard to apply the Knapsack algorithm because all the weights change depending on what is already in the Knapsack.
The problem is: I have a DB of movie directors and their movies. And I have a db of streaming providers. I want to select one or multiple movie directors and find the cheapest combination of streaming services that allows me to watch the most movies from those directors.
Brute-forcing through all the possible streaming services is infeasible. Applying Knapsack doesn't work because one movie can be streamed by multiple platforms, so the value of putting streaming platform A into the Knapsack depends on all the items already in the Knapsack.
Is there a way to transform this problem into a Knapsack problem, or how can I approach this problem?
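This looks closer to weighted set cover (or maximum coverage under a budget) than to knapsack: each provider is a set of watchable movies with a price, and the marginal value of adding a provider depends on what is already covered, exactly as you describe. Set cover is NP-hard too, but the standard greedy heuristic, repeatedly taking the provider with the best new-movies-per-cost ratio, is simple and comes with a known logarithmic approximation guarantee; for exact answers with a modest number of providers, an integer programming solver is the usual route. A sketch of the greedy, with hypothetical data shapes:

// Greedy weighted-cover heuristic.
// providers: [{ name, price, movieIds: [...] }]; wantedMovieIds: movies by the chosen directors.
function greedyProviders(providers, wantedMovieIds) {
  const uncovered = new Set(wantedMovieIds);
  const chosen = [];
  while (uncovered.size > 0) {
    let best = null;
    let bestRatio = 0;
    for (const p of providers) {
      if (chosen.includes(p)) continue;
      const gain = p.movieIds.filter(id => uncovered.has(id)).length;
      if (gain > 0 && gain / p.price > bestRatio) {
        best = p;
        bestRatio = gain / p.price;
      }
    }
    if (best === null) break;            // the remaining movies are not streamable anywhere
    chosen.push(best);
    for (const id of best.movieIds) uncovered.delete(id);
  }
  return chosen;
}

Note this keeps adding providers until every coverable movie is covered; if you would rather trade a few movies for a lower total price, you can stop the loop early once the marginal coverage per cost drops below some threshold.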
Let T(n) = 8T(n/2) + f(n) where f(n) ∈ Θ(n) and T(1) is a constant. Is it true that T(n) ∈ Ω(n^2)?
First of all, since f(n) belongs to Theta(n), shouldn't big Oh also be O(n)? Then I can use the master theorem: log_2(8) = 3, and 3 > 1, which means that the time complexity is n^(log_2 8) = n^3, and since T(n) ∈ O(n^3) then it is by default O(n^2). Is this correct reasoning? The answer explains it in a bit different way.
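For reference, here is the usual master-theorem bookkeeping for this recurrence (a sketch of the standard argument, which may be phrased differently from your course's answer):

a = 8, \quad b = 2, \quad f(n) = \Theta(n), \qquad n^{\log_b a} = n^{\log_2 8} = n^3
f(n) = O(n^{3 - \varepsilon}) \text{ for } \varepsilon = 2 > 0 \;\Rightarrow\; \text{(case 1)} \quad T(n) = \Theta(n^3)
T(n) = \Theta(n^3) \;\Rightarrow\; T(n) \in \Omega(n^3) \subseteq \Omega(n^2)

So case 1 applies, T(n) = Θ(n^3), and since anything that is Ω(n^3) is also Ω(n^2), the claim T(n) ∈ Ω(n^2) holds.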
Hi everyone,
I'm diving into the Traveling Salesperson Problem (TSP) and am curious to learn about the most efficient mathematical formulations. I know "efficient" is a wide concept; by that I mean something like minimizing the number of variables, or a formulation that fits some powerful algorithm particularly well. I saw some formulations on the internet (Miller-Tucker-Zemlin and Dantzig-Fulkerson-Johnson), but I wonder if there is a known best formulation. I could not find anything.
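For reference, the Miller-Tucker-Zemlin formulation you mention is the compact one (O(n^2) variables and constraints), written roughly as:

\min \sum_{i \neq j} c_{ij} x_{ij}
\text{s.t.} \quad \sum_{j \neq i} x_{ij} = 1 \ (\forall i), \qquad \sum_{i \neq j} x_{ij} = 1 \ (\forall j)
u_i - u_j + n\, x_{ij} \leq n - 1 \qquad (2 \leq i \neq j \leq n)
x_{ij} \in \{0, 1\}, \qquad 2 \leq u_i \leq n \ (i = 2, \dots, n)

Its LP relaxation is known to be weaker than Dantzig-Fulkerson-Johnson with subtour-elimination constraints, which is why exact codes generally work with the DFJ formulation and separate the exponentially many subtour cuts on the fly rather than using MTZ.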
Additionally, what are the best solvers currently known for tackling huge TSP instances (e.g., thousands of cities)? I'm particularly interested in both exact solvers and heuristics/metaheuristics. If you have experience with tools like OR-Tools, Gurobi, or specialized algorithms, I'd love to hear your recommendations. I am also considering exploring heuristic solvers (Simulated Annealing, Genetic Algorithms, ...).
Thanks in advance!
I mean the sort of place where people are interested in finding efficient algorithmic solutions to new problems.
Hi everyone,
I’m currently taking an algorithms course at university, and while the professor is great, I feel like I’m only scratching the surface of the subject. With the Christmas break coming up, I want to use this short time effectively to deepen my understanding.
My goal is to really grasp the key ideas and concepts. Since the break is short, I'm looking for compact, high-impact resources that balance theory and practical application. I don't just want to review what we've learned, but also try to understand it from other resources.
Here's my background:
• I'm a computer science student and familiar with Java and C, although our course doesn't involve coding.
• We've covered topics like sorting algorithms, divide and conquer, selection algorithms, binary trees, knapsack, SAT problems, dynamic programming, polynomial reductions, complexity, P/NP, and NTMs.
I'd love suggestions for:
1. Concise online courses or video tutorials that cover algorithms in a digestible way.
2. Books or PDFs that are structured for quick learning.
3. Interactive tools or platforms for practicing, coding, or visualizing algorithms efficiently.
Thanks so much for your help! I want to make the most of this short break, and your recommendations would mean a lot.
Sebastian
To preface this, I want to clarify that I am not interested in collision-free routing (which the title might lead you to believe, due to the popular constraint-based search algorithm for such problems).
We are given a graph with undirected edges and weights on them. We have a number of agents, each with a start and a goal. We have some objective function (let's say minimise the longest path an agent takes). We have a set of constraints, such as the maximum and minimum number of visits each vertex needs to receive across all agent paths. We need to find a valid solution (a collection of paths, one for each agent) that together satisfies all constraints. We aim to find the minimum such solution.
Does a formulation of such a problem exist? If yes, are there algorithms that can solve it somewhat efficiently (I am aware it's an NP-hard problem)?
So far, I have yet to come across a technique other than genetic algorithm to solve symbolic regression. Is there any promising research on this problem?
Given thousands of data points, where each point has time, position, pointing vector (i.e. an array of sensors looking at a car jumping over a canyon), what's a good way to filter out only those points that contribute to the observation of this moving car/object? I'm having trouble with the concept of altitude when I only have position/vector.
I'd like to put into practice something in python as a learning project for now if possible
TIA
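If you can estimate the object's trajectory (even roughly, e.g. by interpolating between a few trusted observations), one simple geometric filter is: for each data point, look up where the object should have been at that timestamp and keep the point only if its pointing ray passes within some tolerance of that position. The distance from a point to a ray is the length of the component of (point - origin) perpendicular to the ray direction, with the projection clamped so that targets behind the sensor are rejected. A sketch with hypothetical field names (position, direction, time), assuming direction is a unit vector:

// Distance from point p to the ray starting at origin o with unit direction d.
function rayPointDistance(o, d, p) {
  const v = { x: p.x - o.x, y: p.y - o.y, z: p.z - o.z };
  let t = v.x * d.x + v.y * d.y + v.z * d.z;   // projection length along the ray
  if (t < 0) t = 0;                            // target behind the sensor: measure to the origin
  const c = { x: o.x + t * d.x, y: o.y + t * d.y, z: o.z + t * d.z };
  return Math.hypot(p.x - c.x, p.y - c.y, p.z - c.z);
}

// Keep only observations whose pointing ray passes near the object's estimated
// position at that time. objectAt(t) is assumed to return {x, y, z}.
function filterObservations(points, objectAt, tolerance) {
  return points.filter(pt =>
    rayPointDistance(pt.position, pt.direction, objectAt(pt.time)) <= tolerance);
}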
In most of the problems I am tasked with proving that a problem A is NP-complete. I show that A is in NP, then I reduce an NP-hard problem B to A. Then I am required to prove that a yes instance of B is a yes instance of A. But it also says that I need to prove that a yes instance of A will be a yes instance of B. This is a bit confusing, because isn't it basically the same thing from another angle?
I also got the understanding that not all yes instances of A will be yes instances of B. Given that the reduction is from B to A, all yes-instance inputs of A won't even be defined for B unless I also reduce A to B. What am I supposed to do when asked to prove that yes in A -> yes in B?
I am not at all a specialist in sorting algorithms, so I am wondering if there is some gold standard solution for this very specific case, where the constraints are not the usual ones. I am going to present the problem context, its constraints, and an attempt at a solution. I would appreciate any feedback, both positive and negative.
The problem context:
1: There is a sporting competition, where the entrants are club teams from various countries.
2: The federations of the countries with club teams entered all have intra-national club rankings.
3: This initial sorting, based on the match results in the initial rounds, should result in an initial cross-national ranking which is then used for the subsequent rounds. We do not have to concern ourselves with those subsequent rounds, that is a matter for another day. Also, that is a far easier problem.
4: In each round n number of matches are played. The total number of entered teams is significantly higher than 2*n. Each match is played on exactly one pitch/court.
The constraints:
Given the constraint list above, the following is what I have come up with:
After all of this, let me make examples which hopefully will make the whole thing clearer.
Let us, for the sake of the example, assume that we have a floorball competition. Assume that we have ten teams each from Sweden, Finland, USA, Canada, and also lesser numbers of teams from other countries. Assume that we have ten floorball courts available. The choice of floorball as an example sport is intentional, for reasons which hopefully become apparent soon.
Assume that some of you are tasked with creating an overall ranking which fulfills all the listed constraints. You are – unless you come from a small number of countries, not including USA, Canada, and most of the rest of the world – well versed in sorting algorithm usage, selection and optimization, but completely ignorant of the specifics of floorball.
If you select Sweden and Finland to play in the beginning, and match them up so that court #1 will feature the match between SWE1 – FIN1, court #2 having SWE2 – FIN2, and so on, you will have created a set of matches that will overall be a fairly good set of matches, and that without knowing anything about floorball. Likewise if you create a set of USA – CAN matchups. Starting from those results, anyone with a reasonable knowledge of sorting algorithms would reach the desired initial sorting of those synthetic nations in short order, even without any knowledge of floorball.
However, the same idea would break down – massively – if you allotted all ten courts to matchups featuring teams from one side of the Atlantic versus the other. You would get ten blowouts, and waste a lot of time and court space on getting information that anyone knowledgeable with floorball could have told you beforehand.
A quicker way to arrive at transatlantic ranking would be to pit the SWE10 against USA1 (or, for that matter, CAN1) and watch the carnage on the court when stars&stripes gets absolutely shellacked against the also-rans of the big blond machine. Yes, there would be a blowout, but only one game, and then we would have an initial sorting of the synthetic nation of Greater Minnesota which looks like this: SWE1---SWE10-USA1---USA10.
(As an aside: There have been several matches featuring Sweden and USA national teams, in both genders and for both age categories. USA has never won a match. USA has never reached a tied result. USA has never lost a nailbiter. USA has never lost by merely clear and convincing numbers. Every single match has ended in an absolute slamdanger, with blue&yellow on top. USA would not have a realistic chance of winning a game featuring USA 20+ age category players against SWE juniors, provided that both countries play with teams of the same gender. Testosterone is one h-ll of a drug, so a game featuring USA men versus SWE women is not a foregone conclusion. However, your men would have their hands absolutely full against our women, and I would hold our team as the slight favorite. We are that dominant. End of aside.)
However, that facile matchup, even if it results in a quick sorting, is not acceptable. People would be livid about the competition leadership creating matchups in which a planeload of players end up not playing a single match in the beginning, without them having any say in the matter. So that is a non-starter with regard to stakeholder acceptance.
If one instead transfers the decision power regarding which teams play against each other to the respective team captains, then one bypasses that problem. It is more difficult to accuse someone else of corrupt choices, if you yourself are making said choices. Foist the decision on the team captain for USA 1 team, and no one else is responsible.
That, and the other constraints/criteria listed above, is why I came up with the system listed above. Has anyone seen this set (or something similar) of constraints/criteria before? Do you see any faults that I have overlooked?
A related optimization problem: Assuming that the mergesort-adjacent idea outlined above is not fatally flawed, what is the best way to pair up countries? If one does (USA/CAN)-(SWE/FIN) one will have two mergings in the beginning that will require a bit of match resources, but in the end one will have two larger lists which will be quickly sorted into the final list. In either of the two other possible ways to pair those countries up, one will start out with very little resources used to get the two larger lists, but then there will be more work to get the final merging right.
Any idea on what is the right approach, and how one would find that right approach (or at least one that is not especially bad) in the more general case? Assume that the person doing that deciding has good knowledge of national team results – which are indicative of club team results – but no useful data on club team performances against teams outside of their country aside from the very top teams.
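For what it is worth, the "mergesort-adjacent" core can be written down directly: merging two already-sorted national lists takes at most len(a) + len(b) - 1 matches, and each match is literally the comparison step of a standard merge. A sketch, where playMatch is a placeholder for actually scheduling and playing a game (draws, and the fairness/acceptance concerns you describe, are ignored here):

// Merge two intra-national rankings (best team first) into one cross-national ranking.
// playMatch(x, y) is a placeholder: it should return the winner of a real match.
function mergeRankings(a, b, playMatch) {
  const merged = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    // One real match decides which current front-runner ranks higher.
    if (playMatch(a[i], b[j]) === a[i]) merged.push(a[i++]);
    else merged.push(b[j++]);
  }
  return merged.concat(a.slice(i)).concat(b.slice(j));
}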
Given the 3D bin packing problem (a special case where the box is a cube, everything you want to fill it with are cuboids, and there should be no empty space once you are done). The input it takes is the side of the cube box and the dimensions of the cuboids.
I decided to reduce subset sum to this bin packing problem. I make all the cuboids have the same sides except for the width, which is decided by the numbers that would have made up the subset sum of the target. So if I can get a subset summing to the target t, then I could fill in the cube with cuboids. To sum up, each cuboid's sides will be: subset element, target, target, etc. Does it sound logical, or am I missing something?
Let's define a playlist as a collection of song IDs: [20493, 19840, 57438, 38572, 09281]. Order doesn't matter. All playlists are the same length.
Playlists are considered near-duplicates if, for some minimum distance D, they differ by less than D elements. For the example above, [20493, 19840, 57438, 47658, 18392] is a near-duplicate for minimum distance 3 since the playlists differ by two elements each (the last two songs). All playlists are the same length, so the near-duplicate relationship is reciprocal: if A is a near-duplicate of B, B is a near-duplicate of A.
The playlists are sorted by some metric (let's say popularity) and I want to remove all near-duplicates, leaving a list that is still sorted by popularity but that has playlists that are meaningfully different from each other.
Right now I'm just doing a quadratic algorithm where I compare each new playlist to all of the non-near-duplicate playlists that came before it. Sorting the song IDs ahead of time lets you do a fairly efficient linear comparison of any two playlists, but that part of the algorithm isn't the problem anyway since playlists tend to be short (<10 songs). The issue is that as number of playlists gets very high (tens of thousands), the quadratic part is killing the performance.
However, I don't see an easy way to avoid comparing to all previous non-near-duplicates. Order doesn't matter, so I don't think a trie would help. The song IDs have no relation to each other, so I don't see how I could use something like an r-tree to do a nearest neighbors search.
My other idea was to store a map from song ID -> collection of playlists that contain it. Then for each new playlist, I could maintain a temporary map from candidate near-duplicate playlists to number of overlapping songs. As I pass through the new playlist's song IDs, I'd look up the playlists that contain it, adjust my temporary map, and then eventually be able to tell if the new playlist was a near-duplicate of the previous ones. Does that seem reasonable? Are there other approaches?
# of distinct songs: ~3k
# of playlists: up to 100k
# of songs per playlist: up to 15
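The inverted-index idea you describe sounds reasonable: with roughly 3k distinct songs and at most 15 songs per playlist, each new playlist only has to be scored against the kept playlists it actually shares a song with, which is typically far fewer than all of them. A sketch of that approach (two same-length playlists differ in fewer than D songs exactly when they share more than length - D songs):

// Keep playlists in popularity order, dropping near-duplicates of anything already kept.
function dedupePlaylists(playlists, minDistance) {
  const kept = [];
  const songToKept = new Map();               // song id -> indices into `kept`
  for (const playlist of playlists) {
    const overlap = new Map();                // kept index -> number of shared songs
    for (const song of playlist) {
      for (const idx of songToKept.get(song) || []) {
        overlap.set(idx, (overlap.get(idx) || 0) + 1);
      }
    }
    const threshold = playlist.length - minDistance;  // sharing more than this = near-duplicate
    let isDuplicate = false;
    for (const shared of overlap.values()) {
      if (shared > threshold) { isDuplicate = true; break; }
    }
    if (isDuplicate) continue;
    const idx = kept.length;
    kept.push(playlist);
    for (const song of playlist) {
      if (!songToKept.has(song)) songToKept.set(song, []);
      songToKept.get(song).push(idx);
    }
  }
  return kept;
}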
I know Fibonacci trees is one. What else?