 Originally Posted by ninja9578
If it's for production product, use my code, it's the fastest.
I beg to differ. Not so much with the "fastest" claim, but with the idea that it would be wise to use that code in production.
By the way, this is not meant to be any kind of peacocking or slap in the face, but I think there is a really great lesson here that our budding junior engineer would do well to learn early in his career.
I had some free time this evening, so I decided to put your "fastest" claim to the test. I took your code, as-is, and compiled and ran it. Then, I changed it extremely slightly (not the algo, mind you). My changes consisted of:
1) Instead of the bunch of list.push_backs, I changed it to read strings from stdin:
Code:
string input;
while(cin >> input) {
    list.push_back(input);
}
2) I shortened your output line to be:
Code:
cout << "Found duplicate: " << *i << endl;
You will see why later.
Then, I put together a simple hash-table-based implementation, using the unordered_set container that comes pretty standard with any C++ distribution these days, and find_dups came out looking like this (I kept the vector around for consistency, but really, I'd want to use a list here):
Code:
void find_dups(const vector<string> & list){
    // Remember every string we've already seen in a hash set (average O(1) lookup).
    unordered_set<string> stringSet;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(stringSet.count(*i) > 0) {
            // Already seen this string, so it's a duplicate.
            cout << "Found duplicate: " << *i << endl;
        } else {
            stringSet.insert(*i);
        }
    }
}
Note that my output format is exactly the same as the one I changed ninja.cpp to use.
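For completeness, here's roughly the harness I wrapped around both versions. This main() is my reconstruction from the description above, not a verbatim dump of what I ran (the variable names and exact messages are just for illustration); donothing is simply the same program with the find_dups call taken out:
Code:
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

using namespace std;

// Same find_dups as above.
void find_dups(const vector<string> & list){
    unordered_set<string> stringSet;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(stringSet.count(*i) > 0) {
            cout << "Found duplicate: " << *i << endl;
        } else {
            stringSet.insert(*i);
        }
    }
}

int main() {
    vector<string> list;
    string input;

    // Read whitespace-separated strings from stdin, as described above.
    cout << "Reading input..." << endl;
    while(cin >> input) {
        list.push_back(input);
    }
    cout << "Done reading " << list.size() << " strings." << endl;

    // donothing is this exact program with the next line removed.
    find_dups(list);
    return 0;
}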
Now, we need a decent source of input data. I thought, what better input data to use than my all-time favourite work of fiction: the King James Bible! (oh snap!)
So let's run our benchmarks.
First, the "do nothing" which is just to time reading the biggish input file:
Code:
time cat kjv12.txt | ./donothing
Reading input...
Done reading 821160 strings.
real 0m0.807s
user 0m0.748s
sys 0m0.044s
Ok great, that basically takes about 0.8 seconds on my laptop.
Now the two solutions:
Code:
time cat kjv12.txt | ./ninja | wc -l
628031
real 0m4.144s
user 0m2.916s
sys 0m1.172s
Code:
time cat kjv12.txt | ./replicon | wc -l
787572
real 0m3.251s
user 0m1.612s
sys 0m1.616s
So it looks like my short-and-sweet solution beat yours, time-wise. At first, I wasn't using unordered_set. I started with a regular map, which, as you know, is an rbtree under the hood, and the two were performing just about the same.
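(For the curious, that ordered variant looked roughly like this. Treat it as a sketch of the idea rather than the exact code I benchmarked; the function name and the int-count detail are just illustrative:)
Code:
#include <iostream>
#include <map>
#include <string>
#include <vector>

using namespace std;

// Ordered variant: std::map (an rbtree underneath) does the bookkeeping
// instead of unordered_set. Same reporting behaviour as find_dups above:
// print on every occurrence after the first.
void find_dups_ordered(const vector<string> & list){
    map<string, int> counts;
    for(vector<string>::const_iterator i = list.begin(); i != list.end(); ++i) {
        if(++counts[*i] > 1) {
            cout << "Found duplicate: " << *i << endl;
        }
    }
}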
Then, just to be fair, I tried compiling with -O4, to squeeze all the sweet optimization out of it. In that scenario, both were a bit faster, and the race was quite a bit closer. In fact, ninja.cpp beat replicon.cpp by almost 0.2 seconds consistently (0m2.943s to my 0m3.120s), so it would not come as a surprise if your algorithm is, in fact, the fastest way, or at least closer to the theoretical fastest for this problem.
But there's one little peculiarity. Notice that the output is different. We should be finding pretty much the same duplicates, so why the difference? Let's investigate more closely, with just the first 100 lines of the holy buy-bull:
Code:
head kjv12.txt -n100 | ./ninja | sort | uniq | wc -l
72
head kjv12.txt -n100 | ./replicon | sort | uniq | wc -l
81
Odd, these REALLY should be the same! Let's investigate further:
Code:
head kjv12.txt -n100 | ./ninja | sort | uniq > ninja_out
head kjv12.txt -n100 | ./replicon | sort | uniq > replicon_out
diff ninja_out replicon_out
2a3
> Found duplicate: a
4a6,7
> Found duplicate: And
> Found duplicate: be
13a17
> Found duplicate: divide
16a21
> Found duplicate: earth
27a33
> Found duplicate: he
41a48
> Found duplicate: light
52a60
> Found duplicate: seed
65a74
> Found duplicate: waters
Hmm, it looks like my code is finding duplicates that yours is not.
Sure enough, when I open the first 100 lines of the Bible, I see the word "waters" appear 8 times, "divide" twice, "seed" 4 times, etc.
Now, to be fair, if ninja were coding this for a production system and not some random DV thread, he would have tested it more thoroughly, found the bugs, and fixed them, and it would have been wickedly fast. But the lesson is this:
USE THE LIBRARIES YOU HAVE AT YOUR DISPOSAL!!! That's why they're there! If you're writing fairly general-purpose software, there is absolutely no reason to roll your own here. Unless you're writing a real-time missile guidance system, low-level video game blitting code, or software that must fit into a tiny amount of memory (like 1 KB and such), your best bet really is to use the libraries that are there.
Your code will be more readable (remember: others have to maintain it) and less error-prone, because the standard libraries are thoroughly tested by the community, and bugs are identified and fixed aggressively.
So yeah, in most production systems, you probably wouldn't want to hand-code something that gives you almost no noticeable performance gain, while taking a HUGE hit in terms of maintenance and support burden.