Object Indexing vs. Array Collection
The Setup & Goal
Let's say we have one large text document and a bunch of keywords that we want to parse the document for. We don't care how many times a keyword appears -- we just care that it's been used. When we find a keyword, we need to record that the keyword has been found so that we can check for it later.
Inefficient Method: Array Collection and Search
The first method of recording that a keyword has been found is pushing the keyword into an array:
//Assume an array called "foundKeywords" was defined above
if (shouldSave(keyword)) {
	foundKeywords.push(keyword);
}
At the end of the document search, we'd end up with an array like:
//the foundKeywords array looks like:
//['keyword1', 'keyword2', 'keyword2', ...]
When it comes to checking this array for the existence of a given keyword, this method proves inefficient. Why? Because we'd need to loop through the array, searching until we found the given keyword (if ever). Those are a lot of "wasted," fruitless cycles, even if we break the loop as soon as the keyword is found. Inefficient is the only word to describe this process.
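To make the cost concrete, here is a minimal sketch of what that array check amounts to -- a linear scan. The helper name `wasFound` is made up for illustration; `foundKeywords` is assumed to be the array built above:

```javascript
// Linear scan: in the worst case we touch every element before giving up.
function wasFound(foundKeywords, keyword) {
	for (var i = 0; i < foundKeywords.length; i++) {
		if (foundKeywords[i] === keyword) {
			return true; // break out as soon as we hit a match
		}
	}
	return false; // scanned the whole array for nothing
}
```

The early `return` is the "break the loop" optimization mentioned above -- it helps the average case, but the worst case still walks the entire array.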
The Efficient Method: Object with Index
The fastest method of storing keywords for later reference is an object (in JavaScript) or associative array (in PHP). Instead of pushing the keyword onto an array, we add the keyword as an index on a master object, giving it the value 1:
//Assume an object {} called "foundKeywords" was defined above
if (shouldSave(keyword)) {
	foundKeywords[keyword] = 1;
}
Why is this faster? No wasted cycles looking blindly through an array. The check is quick and simple:
if (foundKeywords[keyword]) {
	//FOUND!
	//do something
}
It's either an index or it's not! In PHP, we'd save the keyword to an associative array:
//Assume an array called "$found_keywords" was defined above
if (shouldSave($keyword)) {
	$found_keywords[$keyword] = 1;
}

//Later: checking if the keyword was there...
if ($found_keywords[$keyword]) { //or array_key_exists($keyword, $found_keywords)
	//FOUND!
}
In a word...awesome. Not only fast but simple!
I cannot provide a benchmark because the speed of execution depends on how large the keyword array is. Suffice it to say that, for the sake of simplicity and speed, using an object with keyword indexes is definitely the way to go!
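To see the whole technique end to end, here's a small self-contained sketch. The `findKeywords` helper, the sample text, and the keyword list are all made up for illustration; `shouldSave` is simplified down to a plain substring test:

```javascript
// Hypothetical helper for illustration: scan a text for keywords and
// record each hit as an object index rather than an array entry.
function findKeywords(text, keywords) {
	var foundKeywords = {};
	for (var i = 0; i < keywords.length; i++) {
		if (text.indexOf(keywords[i]) !== -1) {
			foundKeywords[keywords[i]] = 1; // record by index, not by push
		}
	}
	return foundKeywords;
}

// Later: constant-time existence checks, no array scanning.
var found = findKeywords('the quick brown fox jumps over the lazy dog',
                         ['fox', 'dog', 'cat']);
if (found['fox']) {
	//FOUND!
}
```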
PHP has an “in_array()” function…
Yes, but I imagine that internal method does a similar operation.
Indeed, array lookups (like what in_array does) are linear (O(n)) whereas hashtable lookups (like what David does) are close to constant (O(1)). That means that as the array size increases, the in_array search will get slower on average, whereas the hashtable lookup will stay consistent.
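The same O(n) vs. O(1) contrast can be sketched in JavaScript (timing code is omitted on purpose, since absolute numbers depend on the engine; the point is what each check has to do):

```javascript
// Build both structures with the same n entries.
var n = 100000;
var list = [];
var table = {};
for (var i = 0; i < n; i++) {
	list.push('kw' + i);
	table['kw' + i] = 1;
}

// O(n): indexOf walks the array until it finds a match (or hits the end).
var inList = list.indexOf('kw' + (n - 1)) !== -1;

// O(1) on average: a single hash lookup, independent of n.
var inTable = !!table['kw' + (n - 1)];
```

Searching for the last element is the worst case for the array, but makes no difference at all to the object lookup.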
in_array() = slow
isset($arr[$key]) = fast
i’ve made a test:
the results are:
“indexes” time: 1.93250012398
“keys” time: 1.93542504311
“manual” time: 1.94083094597
they are almost identical, though “indexes” are a little bit faster
@David,
concerning JS, your method has one more drawback – since you are operating with Objects, not Arrays, you can’t use “.length” to know the number of unique found words.
so it is better to store something more than just “foundKeywords[keyword] = 1;”,
but it looks not very pretty (imho).
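For what it's worth, the count can still be recovered from the object. A sketch, assuming an ES5-era environment where Object.keys is available (in older engines, a for...in loop with hasOwnProperty does the same job):

```javascript
var foundKeywords = { keyword1: 1, keyword2: 1, keyword3: 1 };

// Object.keys returns the unique "found" keywords as an array,
// which restores the .length convenience the commenter misses.
var count = Object.keys(foundKeywords).length;

// Fallback for older engines: count own properties manually.
var manualCount = 0;
for (var key in foundKeywords) {
	if (foundKeywords.hasOwnProperty(key)) {
		manualCount++;
	}
}
```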
i’ve changed the test a little bit:
i’ve tried it about 30-40 times. the most noticeable difference is:
“indexes”. populating time: 1.89304304123
“keys”. populating time: 1.9226899147
it means that the main difference is the speed of array population. simple arrays are faster than hashes, but anyway – almost identical.
something has happened with the code:
David, it seems that the parser is not working. Can you please edit/merge.
… $i < 100000; $i++) {
	$a[rndstr()] = 1;
}
$dt = microtime(true) - $t;
echo '"keys". populating time: '.$dt;
?>
I believe using “keyword in foundKeywords” is faster than “foundKeywords[keyword]” in JS.
That’s a good thought. Have metrics?
I just tested it. If the key exists, “obj[key]” is faster, if not, “key in obj” is. However, the difference is negligible. The object size does not seem to affect the speed of either method.
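Beyond timing, the two checks aren't quite interchangeable, which is worth noting here: the truthiness test gives the wrong answer for falsy stored values, while the `in` operator only asks whether the key exists. A small sketch (the key names are made up):

```javascript
var foundKeywords = { apple: 1, banana: 0 };

// Truthiness check: fails when the stored value happens to be falsy.
var viaBracket = !!foundKeywords['banana'];

// "in" check: tests key existence regardless of the stored value.
var viaIn = 'banana' in foundKeywords;
```

With the article's convention of always storing 1, the two behave the same; the difference only bites if other values creep in.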
@David, very interesting tip – after reading other comments, it would be nice to see what your test results are.
My opinion:
@seelts: I don’t know what you want to prove with your code. In both cases you just insert into an array, but you never try to access the data! Therefore your runtime will always be O(n) [big O notation], where n is the number of words you insert. Every single insert is O(1), and your marginal differences in the populating time follow from the runtime of your rand().
@david: theoretically you should be right. If you use indices you will need O(1) to get the data or to know that there is none. If you have to cycle through the array you will need O(n) in the worst case (last or not in the array) or O(1) in the best case (first element) [if the array is not sorted!]. The real runtimes depend on the implementation of array/object and a lot on the number of array elements.
Good tip. Although, instead of setting the value to 1, I’d personally set it to “true”. :)
What about array.indexOf()? It returns -1 when none is found, so you can check for that. It’s bound to be faster than a manual index search. Not sure if this is faster than the object lookup (probably not).

Also, for PHP arrays, unless you’re assuming null values, checking for a key can be done with isset(), which is kind of like the JS check – but if the value is null it returns false. isset() is extremely fast.

I think indexOf still does a linear search. A hashtable lookup would be quicker. So something like this: {'a': true, 'c': true, 'dfg': true} – and then check if data[whatever] is true. :)

I’m pretty sure you’re right (that the hash lookup is faster). But the use case is stripped down. In real life, you would find yourself requiring other operations, and for those a true representation of your stack could be much more useful.

For example – Object iteration is much slower than an Array’s, and an Object lacks many array tools, as simple as the length property. And although indexOf might be slower, I’m not so sure it will be slow enough to think about dumping it. JS is quite fast nowadays. There is no reason not to use the native tools it provides instead of hacking it up.
Pretty cool. Going to try these out soon.
As pointed out, if($found_keywords[$keyword]) requires an isset() test (or array_key_exists($keyword, $found_keywords)) to avoid a notice when the key is missing.
Good ideas about checking up the items in JS!
And what about associative Arrays in JavaScript?
I’m using this very technique to handle string matching for looking up common words in a dictionary. Basically I want to know whether a string exists in a list of terms (in JavaScript). The dictionary is well over 10,000 terms. Even a sorted array with a binary search was slow to look up, and since this check would be done on the user’s interaction with a keyup event, it needed to be fast. So I defined my dictionary as an object literal: {"foo":true,"bar":true}. This was fast.

However, it did raise the question: just how large could that dictionary object get? And since booleans are represented as an integer (right?), that’s 32 bits of memory per term when all I need is one bit. Is that a waste of memory, and should I be concerned with the execution environment? Slim and lean is the key here. Does using null or void 0 offer better memory usage ({"foo":null,"bar":null})? Which is best: dict["foo"] === null vs dict["foo"] === undefined vs dict["foo"] === true?
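One caveat worth sketching for that last question: if the dictionary stores undefined (or void 0) as its values, a strict comparison against undefined cannot distinguish a present term from a missing one, whereas the `in` operator (or hasOwnProperty) can. The term names below are made up:

```javascript
var dict = { foo: undefined, bar: undefined };

// Ambiguous: a stored undefined and a missing key compare the same.
var hasFoo = dict['foo'] === undefined;  // present term
var hasBaz = dict['baz'] === undefined;  // "baz" was never added, yet same result

// Unambiguous: "in" checks for the key itself, not its value.
var inFoo = 'foo' in dict;
var inBaz = 'baz' in dict;
```

So null or true values keep the cheap `dict[term] === value` test unambiguous; undefined values force you into the (still fast, but different) `in` check.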