 Originally Posted by ninja9578
If it's for production product, use my code, it's the fastest.
I beg to differ. Not so much with the "fastest" claim, but with the idea that it would be wise to use that code in production.
By the way, this is not meant to be any kind of peacocking or slap in the face, but I think there is a really great lesson here that our budding junior engineer would do well to learn early in his career.
I had some free time this evening, so I decided to put your "fastest" claim to the test. I took your code, as-is, and compiled and ran it. Then, I changed it extremely slightly (not the algo, mind you). My changes consisted of:
1) Instead of the bunch of list.push_backs, I changed it to read strings from stdin:
Code:
string input;
while(cin >> input) {
    list.push_back(input);
}
2) I shortened your output line to be:
Code:
cout << "Found duplicate: " << *i << endl;
You will see why later.
Then, I put together a simple hash-table-based implementation, using the unordered_set container that comes pretty standard with any C++ distribution these days, and find_dups came out looking like this (I kept the vector around for consistency, but really, I'd want to use a list here):
Code:
void find_dups(const vector<string> & list){
    // Remember every string we've already seen in a hash set (average O(1) lookup).
    unordered_set<string> stringSet;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(stringSet.count(*i) > 0) {
            // Already seen this string, so it's a duplicate.
            cout << "Found duplicate: " << *i << endl;
        } else {
            stringSet.insert(*i);
        }
    }
}
Note that my output format is exactly the same as the one I changed ninja.cpp to use.
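For completeness, here's roughly the harness I wrapped around both versions. This main() is my reconstruction from the description above, not a verbatim dump of what I ran (the variable names and exact messages are just for illustration); donothing is simply the same program with the find_dups call taken out:
Code:
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

using namespace std;

// Same find_dups as above.
void find_dups(const vector<string> & list){
    unordered_set<string> stringSet;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(stringSet.count(*i) > 0) {
            cout << "Found duplicate: " << *i << endl;
        } else {
            stringSet.insert(*i);
        }
    }
}

int main() {
    vector<string> list;
    string input;

    // Read whitespace-separated strings from stdin, as described above.
    cout << "Reading input..." << endl;
    while(cin >> input) {
        list.push_back(input);
    }
    cout << "Done reading " << list.size() << " strings." << endl;

    // donothing is this exact program with the next line removed.
    find_dups(list);
    return 0;
}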
Now, we need a decent source of input data. I thought, what better input data to use than my all-time favourite work of fiction: the King James Bible! (oh snap!)
So let's run our benchmarks.
First, the "do nothing" which is just to time reading the biggish input file:
Code:
time cat kjv12.txt | ./donothing
Reading input...
Done reading 821160 strings.
real 0m0.807s
user 0m0.748s
sys 0m0.044s
Ok great, that basically takes about 0.8 seconds on my laptop.
Now the two solutions:
Code:
time cat kjv12.txt | ./ninja | wc -l
628031
real 0m4.144s
user 0m2.916s
sys 0m1.172s
Code:
time cat kjv12.txt | ./replicon | wc -l
787572
real 0m3.251s
user 0m1.612s
sys 0m1.616s
So it looks like my short-and-sweet solution beat yours, time-wise. At first, I wasn't using unordered_set. I started with a regular map, which, as you know, is an rbtree under the hood, and the two were performing just about the same.
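(For the curious, that ordered variant looked roughly like this. Treat it as a sketch of the idea rather than the exact code I benchmarked; the function name and the int-count detail are just illustrative:)
Code:
#include <iostream>
#include <map>
#include <string>
#include <vector>

using namespace std;

// Ordered variant: std::map (an rbtree underneath) does the bookkeeping
// instead of unordered_set. Same reporting behaviour as find_dups above:
// print on every occurrence after the first.
void find_dups_ordered(const vector<string> & list){
    map<string, int> counts;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(++counts[*i] > 1) {
            cout << "Found duplicate: " << *i << endl;
        }
    }
}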
Then, just to be fair, I tried compiling with -O4, to squeeze all the sweet optimization out of it. In that scenario, both were a bit faster, and the race was quite a bit closer. In fact, ninja.cpp beat replicon.cpp by almost 0.2 seconds consistently (0m2.943s to my 0m3.120s), so it would not come as a surprise if your algorithm is, in fact, the fastest way, or at least closer to the theoretical fastest for this problem.
But there's one little peculiarity. Notice that the output is different. We should be finding pretty much the same duplicates, so why the difference? Let's investigate more closely, with just the first 100 lines of the holy buy-bull:
Code:
head kjv12.txt -n100 | ./ninja | sort | uniq | wc -l
72
head kjv12.txt -n100 | ./replicon | sort | uniq | wc -l
81
Odd, these REALLY should be the same! Let's investigate further:
Code:
head kjv12.txt -n100 | ./ninja | sort | uniq > ninja_out
head kjv12.txt -n100 | ./replicon | sort | uniq > replicon_out
diff ninja_out replicon_out
2a3
> Found duplicate: a
4a6,7
> Found duplicate: And
> Found duplicate: be
13a17
> Found duplicate: divide
16a21
> Found duplicate: earth
27a33
> Found duplicate: he
41a48
> Found duplicate: light
52a60
> Found duplicate: seed
65a74
> Found duplicate: waters
Hmm, it looks like my code is finding duplicates that yours is not.
Sure enough, when I open the first 100 lines of the Bible, I see the word "waters" appear 8 times, "divide" twice, "seed" 4 times, etc.
Now, to be fair, if ninja were coding this for a production system and not some random DV thread, he would have tested it more thoroughly, found the bugs, and fixed them, and it would have been wickedly fast. But the lesson is this:
USE THE LIBRARIES YOU HAVE AT YOUR DISPOSAL!!! That's why they're there! If you're writing fairly general-purpose software, there is absolutely no reason to roll your own here. Unless you're writing a real-time missile guidance system, low-level video game blitting code, or software that must fit into a tiny amount of memory (like 1 KB and such), your best bet really is to use the libraries that are there.
Your code will be more readable (remember: others have to maintain it) and less error-prone, because the standard libraries are thoroughly tested by the community, and bugs are identified and fixed aggressively.
So yeah, in most production systems, you probably wouldn't want to hand-code something that gives you almost no noticeable performance gain, while taking a HUGE hit in terms of maintenance and support burden.