Object Indexing vs. Array Collection

By  on  

The Setup & Goal

Let's say that we have one large text document and we have a a bunch of keywords that we want to parse the document for.  We don't care how many times times the keyword appears -- we just care that it's been used.  When we find a keyword, we need to record that keyword been found so that we may check at a later time.

Inefficient Method:  Array Collection and Search

The first method to record that a keyword has been found is by pushing the keyword into one array:

//Assume an array called "foundKeywords" was defined above
if(shouldSave(keyword)) {
	foundKeywords.push(keyword);
}

At the end of the document search, we'd end up with an array like:

//the foundKeywords array looks like:
//['keyword1','keyword2','keyword2',...]

When it comes to checking this array for the existence of a given keyword, this method will prove inefficient.  Why?  Because we'd need to loop through the array and search until we found the given keyword (if ever).  Those are a lot of "wasted" or fruitless cycles, even if we break the loop when the keyword was found.  Inefficient is the only word to describe this process.

The Efficient Method:  Object with Index

The fastest method of checking a storing keywords for later reference is by object (in JavaScript) or associative array (in PHP).  Instead of adding the keyword to an array, we add the keyword as an index to a master object, giving the value as 1:

//Assume an object {} called "foundKeywords" was defined above
if(shouldSave(keyword)) {
	foundKeywords[keyword] = 1;
}

Why is this faster?  No wasted cycles looking blindly through an array.  The check is quick and simple:

if(foundKeywords[keyword]) { //FOUND!
	//do something
}

It's either an index or it's not!  In PHP, we'd save the keyword to an associative array:

//Assume an array called "$found_keywords" was defined above
if(shouldSave($keyword)) {
	$found_keywords[$keyword] = 1;
}

//Later: checking if the keyword was there...
if($found_keywords[$keyword]) { //or array_key_exists($keyword,$found_keywords)
	//FOUND!
}

In a word...awesome.  Not only fast but simple!

I cannot provide a benchmark because the speed of execution depends on how large the keyword array is.  Suffice to say, for the sake of simplicity and speed, using an object with keyword index is definitely the way to go!

Recent Features

  • By
    Responsive and Infinitely Scalable JS Animations

    Back in late 2012 it was not easy to find open source projects using requestAnimationFrame() - this is the hook that allows Javascript code to synchronize with a web browser's native paint loop. Animations using this method can run at 60 fps and deliver fantastic...

  • By
    JavaScript Promise API

    While synchronous code is easier to follow and debug, async is generally better for performance and flexibility. Why "hold up the show" when you can trigger numerous requests at once and then handle them when each is ready?  Promises are becoming a big part of the JavaScript world...

Incredible Demos

Discussion

  1. seelts

    PHP has a “in_array()” function…

    • Yes, but I imagine that internal method does a similar operation.

    • Indeed, array lookups (like what in_array does) are linear (O(n)) whereas hashtable lookups (like what David does) are close to constant (O(1)). That means that as the array size increases, the in_array search will get slower on average, whereas the hashtable lookup will stay consistant.

    • EllisGL

      in_array() = slow
      isset($arr[$key]) = fast

    • seelts

      i’ve made a test:

      function rndstr(){
      	$str='';
      	for($i=0;$i<10;$i++){
      		$str.=chr(rand(32,126));
      	}
      	return($str);
      }
      
      // ---------------------------indexes
      $t=microtime(true);
      $b=array();
      for($j=0;$j<100000;$j++){
      	$b[]=rndstr();
      }
      in_array('0123456789',$b);
      $dt=microtime(true)-$t;
      echo '"indexes" time: '.$dt;
      
      echo '';
      
      // ---------------------------keys
      $t=microtime(true);
      $a=array();
      for($i=0;$i<100000;$i++){
      	$a[rndstr()]=1;
      }
      isset($a['0123456789']);
      $dt=microtime(true)-$t;
      echo '"keys" time: '.$dt;
      
      echo '';
      
      // ---------------------------manual
      $t=microtime(true);
      $c=array();
      for($k=0;$k
      

      the results are:
      “indexes” time: 1.93250012398
      “keys” time: 1.93542504311
      “manual” time: 1.94083094597

      they are almost identical, though “indexes” are little bit faster

      2David,
      concerning JS, your method has one more lack – as you operating with Objects, but not Arrays, you can’t use “.length” to know the amount of unique found words.
      so it is better not just “foundKeywords[keyword] = 1;”, but

      if(!foundKeywords[keyword]){
      	foundKeywords[keyword]=1;
      }
      foundKeywords[keyword]++;
      

      but it looks not very pretty (imho).

  2. seelts

    i’ve little bit changed the test:

    <?php
    function rndstr(){
    	$str='';
    	for($i=0;$i<10;$i++){
    		$str.=chr(rand(32,126));
    	}
    	return($str);
    }
    
    // ---------------------------indexes
    $t=microtime(true);
    $b=array();
    for($j=0;$j<100000;$j++){
    	$b[]=rndstr();
    }
    $dt=microtime(true)-$t;
    echo '"indexes". populating time: '.$dt;
    
    echo '';
    
    // ---------------------------keys
    $t=microtime(true);
    $a=array();
    for($i=0;$i
    

    i’ve tried it about 30-40 times. the most noticable difference is:
    “indexes”. populating time: 1.89304304123
    “keys”. populating time: 1.9226899147

    it means that the main difference – is the speed of arrays populating. and simple arrays are faster then hashes, but anyway – almost identical.

  3. seelts

    something has happend with code:

    function rndstr(){
    	$str='';
    	for($i=0;$i<10;$i++){
    		$str.=chr(rand(32,126));
    	}
    	return($str);
    }
    
    // ---------------------------indexes
    $t=microtime(true);
    $b=array();
    for($j=0;$j<100000;$j++){
    	$b[]=rndstr();
    }
    $dt=microtime(true)-$t;
    echo '"indexes". populating time: '.$dt;
    
    echo '';
    
    // ---------------------------keys
    $t=microtime(true);
    $a=array();
    for($i=0;$i
    
    • seelts

      David, it seems that parser is not working. Can you please edit/merge.
      … $i < 100000;$i++){
      $a[rndstr()]=1;
      }
      $dt=microtime(true)-$t;
      echo ‘”keys”. populating time: ‘.$dt;
      ?>

  4. tkazec

    I believe using “keyword in foundKeywords” is faster than “foundKeywords[keyword]” in JS.

    • That’s a good thought. Have metrics?

    • tkazec

      I just tested it. If the key exists, “obj[key]” is faster, if not, “key in obj” is. However, the difference is negligible. The object size does not seem to affect the speed of either method.

  5. @David, very interesting tip – after reading other comments, it would be nice to see what your test results are.

  6. chris

    My opinion:
    @seelts: I don’t know what you want to proof with your code. In both cases you just insert to an array, but you never try to access the data! Therefore your runtime will always be O(n) [big O notation] where n is the number of your words you insert. Every single insert is O(1) and your marginal differences in the populating time follow from the runtime of your rand().

    @david: theoretical you should be right. If you use indices you will need O(1) to get the data or to know that there is none. If you have to cycle through the array you will need O(n) in the worst case (last or not in the array) or O(1) in the best case (first element) [if the array is not sorted!]. The real tuntimes depend on the implementation of array/object and a lot on the amount of array elements.

  7. Good tip. Although, instead of setting the value to 1, I’d personally set it to “true”. :)

  8. What about array.indexOf()?
    It returns -1 when none is found, so you can do:

    ['a','c','dfg'].indexOf('a') //0
    
    ['a','c','dfg'].indexOf('abc') //-1
    

    It’s bound to be faster than manual index search. Not sure if this is faster (probably not).

    Also, for php arrays, unless assuming null values, checking for a key can be done with isset, which is kind of like JS:

    if (isset($arr[$key])){}
    

    but if value is null it returns false. Isset is extremely fast.

    • I think indexOf still does a linear search. A hashtable lookup would be quicker. So something like this: {'a': true, 'c': true, 'dfg: true} and then check if data[whatever] is true. :)

    • I’m pretty sure you’re right (that hash lookup is faster). But the use case is stripped down. In real life, you would find yourself requiring other operations. And for that a true representation of your stack could be much more useful.
      For example – Object iteration is much slower than an arrays. And it will lack many array tools, as simple as lacking the length property.
      And although indexOf might be slower, I’m not so sure it will be slower enough to think about dumping it. JS is quite fast nowadays. There is no reason why not to use the native tools it provides instead of hacking it up.

  9. Pretty cool. Going to try these out soon.

  10. As pointed out,
    if($found_keywords[$keyword]) { //or array_key_exists($keyword,
    requires isset() to test.

    Good ideas about checking up the items in JS!

  11. And what about associative Array in JavaScript ?

  12. I’m using this very technique to handle a string match for looking up common words in a dictionary. Basically I want to know does a string exists in a list of terms (in JavaScript). The dictionary is well over 10,000 terms. Even a sorted array with a binary search was slow to look up and since this check would be done on user’s interaction with a keyup event it needed to be fast. So I defined my dictionary as an object literal: {"foo":true,"bar":true}. This was fast.

    However it did raise the question, Just how large could that dictionary object get? And since booleans are represented as an integer (right?) that’s 32 bits of memory per term when all I need is one bit. Is that a waste of memory and should I be concerned with the execution environment. Slim and lean is the key here. Does using null or void 0 offer better memory usage ({"foo":null,"bar":null})? Which is best: dict["foo"] === null vs dict["foo"] === undefined vs dict["foo"] === true

Wrap your code in <pre class="{language}"></pre> tags, link to a GitHub gist, JSFiddle fiddle, or CodePen pen to embed!