FAQ: Linear Search: Conceptual - Average Case Performance

This community-built FAQ covers the “Average Case Performance” exercise from the lesson “Linear Search: Conceptual”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Search Algorithms

FAQs on the exercise Average Case Performance

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!

Agree with a comment or answer? Like it to up-vote the contribution!

Need broader help or resources? Head here.

Looking for motivation to keep learning? Join our wider discussions.

Learn more about how to use this guide.

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

“Based on Big O simplification rules, which you can learn about in the Big O lesson, we simplify the time complexity in this case to O(N).”

Can you provide a link to the ‘Big O lesson’?

2 Likes

I’m doing this lesson and noticed that the formula is not shown. Please either post it up here (while it gets fixed) or fix it for future learners.

FLAWED INSTRUCTIONS:

The average case performance is the average number of comparisons. To calculate this, you use this formula:

Therefore, the average case performance for linear search is:

[image not displayed; only its alt text, “alt”, renders]
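
For reference while the image is broken (this is the standard result, not recovered from the missing image): if the target is equally likely to be at any of the N positions, the average number of comparisons is (1 + 2 + … + N) / N = (N + 1) / 2, which the Big O simplification rules reduce to O(N). A quick sanity check in Python:

# Average comparisons over all N equally likely target positions.
N = 1000
average = sum(range(1, N + 1)) / N
print(average)        # 500.5
print((N + 1) / 2)    # 500.5 -- matches (N + 1) / 2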

3 Likes

Hi! I’m wondering if they expected you to complete that lesson first, since it, like this one, is in the Computer Science path and comes before this lesson.
@studley
This is a Pro course, so be prepared if you’re not signed up for Pro.
Here is the link: https://www.codecademy.com/paths/computer-science/tracks/cspath-asymptotic-notation/modules/cspath-asymptotic-notation/articles/cspath-why-asymptotic-notation

1 Like

Yeah, I get the alt image too. What is the formula? There is no report-a-bug option within the lesson, either.

EDIT: Found the Report Bug option under the Get Help tab.

3 Likes

Same here on the lack of an image file providing an actual formula, which is something I’m actually quite interested in learning. Just google ‘average case performance of linear search algorithm’. Here’s a link to StackExchange that helped me.

1 Like

This is technical and, tbh, pretty much over my head, but I get the gist and sort of understand the math; your average case, one might say. Here is a broader view of the suggested concept, so we can see the big picture…

Average-case complexity - Wikipedia

1 Like

I was curious about this statement:

On average it would be in the middle of the list, that search would take O(N/2) time. Let’s prove this.

So I’ve made a Python script that tests the statement above.
I used numpy’s randint to generate integers from 0 to 1000.
I created a 1000-cell random number list by appending randomly generated integers (0-1000) to a list, 1000 times. Let’s call it test_list.
I created a 10-cell random integer (0-1000) list to use as a search_item_list.
I ran each item from the search_item_list against the test_list and recorded how many steps it took each time.
I compared the number of steps each time to the length of the test_list to get a float between 0 and 1 representing how far the list was traversed to find the number: 0 meaning no steps and 1 meaning all of them. So half of the steps would be 0.5; I call it steps_depth.
I wrapped it all up in a function to which I can pass an argument specifying how many times a new 1k test array should be created and all of the above steps executed on it; it returns one list of all recorded steps_depths.

Then I created a function that builds a specified number of lists, which I call groups, each covering an even range of 100/number_of_lists. So if you create two groups, then group0 = 0-51, group1 = 51-101, etc.

Then I ran the results from my tests (all recorded step_depths) through another function that checks into which of the created groups each step_depth falls and counts the number of members of each group.

Based on my research, usually about 40% of the time the number is found in the first half and 60% of the time past the first half. Also, surprisingly, about 35-40% falls in the last quarter alone.
And if I make 10 groups (0-11, 11-21, … 91-101), then about 35% lands in group 10.
The numbers don’t seem to be evenly distributed, nor really randomly generated.
I wonder what other “random” integer methods are available in other packages, and how random and evenly distributed they would turn out.
Also note that I’ve performed these steps more than 200,000 times,
each time creating a new random array and drawing 10 new random values to search against it (including scenarios where the value was not found at all).

See the results of running the test a total of 200 times below.
Each line represents the percentage distribution of results according to the part of the array in which they were found: 1k array, 10 numbers, run 10 times.
[q1, q2, q3, q4] - each q is a quarter of the array, and the number itself represents that quarter’s share of the whole.
So I’ve run this 20 times:

FIRST TEST:

[24.0, 19.0, 7.0, 50.0]

In this test, the searched number was found in the first quarter of the test_array 24% of the time, in the second quarter 19% of the time, in the third quarter 7% of the time, and in the fourth quarter 50%(!!!) of the time.

The other 19 tests:

[24.0, 13.0, 16.0, 47.0]
[25.0, 23.0, 8.0, 44.0]
[25.0, 18.0, 8.0, 49.0]
[13.0, 10.0, 16.0, 61.0]
[21.0, 14.0, 13.0, 52.0]
[23.0, 16.0, 13.0, 48.0]
[24.0, 16.0, 12.0, 48.0]
[18.0, 17.0, 21.0, 44.0]
[22.0, 24.0, 16.0, 38.0]
[23.0, 12.0, 13.0, 52.0]
[26.0, 18.0, 17.0, 39.0]
[21.0, 17.0, 16.0, 46.0]
[24.0, 15.0, 11.0, 50.0]
[28.0, 17.0, 12.0, 43.0]
[18.0, 21.0, 14.0, 47.0]
[23.0, 17.0, 8.0, 52.0]
[22.0, 19.0, 12.0, 47.0]
[20.0, 19.0, 11.0, 50.0]
[18.0, 17.0, 8.0, 57.0]
[24.0, 14.0, 10.0, 52.0]
[14.0, 21.0, 18.0, 47.0]
[26.0, 14.0, 21.0, 39.0]
[22.0, 18.0, 12.0, 48.0]
[20.0, 12.0, 9.0, 59.0]
[31.0, 14.0, 13.0, 42.0]
[22.0, 11.0, 13.0, 54.0]
[20.0, 20.0, 18.0, 42.0]
[20.0, 15.0, 9.0, 56.0]
[22.0, 22.0, 9.0, 47.0]

Average for each group: 22.1, 16.76, 12.8, 48.33

Notice how the last quarter dominates every time…
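
A hedged aside, not from the original post: one likely cause is unsuccessful searches. With a 1000-element list of values drawn from 0-1000, roughly a third of the search items never appear in the list at all, and a search that finds nothing walks the entire array, so its steps_depth comes out as 1.0 and lands in the fourth quarter. A minimal sketch of that estimate, assuming the setup described above:

import numpy as np

# Probability that a value drawn from 0-1000 (1001 possibilities) never
# appears in a list of 1000 uniformly drawn values.
p_miss = (1000 / 1001) ** 1000
print(f"analytic miss rate: {p_miss:.3f}")    # ~0.368, close to 1/e

# Empirical check of the same quantity.
trials, misses = 10_000, 0
for _ in range(trials):
    test_values = set(np.random.randint(1001, size=1000))
    if np.random.randint(1001) not in test_values:
        misses += 1
print(f"empirical miss rate: {misses / trials:.3f}")

That ~37% of full-length traversals alone would account for most of the fourth quarter’s share, before counting the items genuinely found there.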

1 Like

Hi, @ljh1830181547 – Very interesting! Upon trying to re-create what you have done, I find that it skews heavily towards the “front” (zero-index end) of the list.

We create multiple instances of a long list (np_search_item_lst) of random integers of up to three digits, each associated with a shorter list (np_test_list) of random integers. In each “run,” we take, in turn, each element from np_test_list and see how deeply into np_search_item_lst we must go to find the first instance of the element, if present. That depth is given by the index where the element is found.

The script returns a list of indices, which can be rendered as a histogram.

import numpy as np
import matplotlib.pyplot as plt

def find_it(search_item_lst, x):
    '''Return the index of the first occurrence of x
    in search_item_lst via linear search, or -1 if absent.
    '''
    for idx in range(len(search_item_lst)):
        if search_item_lst[idx] == x:
            return idx
    return -1

out_search_item_lst = []
for i in range(100):
    # a short list of targets and a long list to search through
    np_test_list = np.random.randint(1000, size=100)
    np_search_item_lst = np.random.randint(1000, size=100000)
    for item in np_test_list:
        out_search_item_lst.append(find_it(np_search_item_lst, item))

# histogram of the indices at which each target was first found
plot_arr = np.array(out_search_item_lst)
my_plt = plt.hist(plot_arr, bins='auto')
plt.show()

Here is a typical plot: [Figure_1, a histogram of the found indices, heavily skewed toward zero]

It clearly skews way towards zero. But this is not surprising, for in each instance, there might well be many instances of the target in the list to be searched, and if they are sprinkled randomly, we would certainly expect to run into an instance somewhere near the front of the list.
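
A small sketch of that intuition, assuming the same sizes as the code above: with values 0-999 drawn uniformly, the index of the first match is approximately geometric with p = 1/1000, so the mean depth is about 1000 comparisons, tiny next to the 100,000-element list.

import numpy as np

# Mean index of the first occurrence of a random target in a long random
# list: approximately geometric with p = 1/1000, so the mean is ~1000.
depths = []
for _ in range(1000):
    lst = np.random.randint(1000, size=100_000)
    target = np.random.randint(1000)
    hits = np.where(lst == target)[0]
    depths.append(int(hits[0]) if len(hits) else len(lst))
print(sum(depths) / len(depths))    # ~1000, i.e. near the "front"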

I think that the statement you quote, along with its antecedent sentence,

… is simply meant to say that, given many lists, and all other things being equal, we can expect the target item to be, “on average,” near the center of the list; in other words, given that most search items would be found near the center, average complexity would be O(N/2). There is no claim as to how a certain distribution is attained.
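
To see the lesson’s O(N/2) claim directly, here is a minimal sketch, under the assumption that the lesson means a list where the target is equally likely to sit at any index (for example, a random permutation): the found index is then uniform and averages about N/2.

import numpy as np

# When every value appears exactly once (a random permutation), the index
# at which a random target is found is uniform over 0..N-1, so the
# average search depth is about N/2.
N = 1000
depths = []
for _ in range(10_000):
    perm = np.random.permutation(N)    # values 0..N-1, each exactly once
    target = np.random.randint(N)
    depths.append(int(np.where(perm == target)[0][0]))
print(sum(depths) / len(depths))       # ~499.5, i.e. about N/2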

2 Likes

Thanks for a cool perspective. Your code is like 10x more compact than mine, and it already includes a graph. Great. I will need to do the data visualization course (after the Python search algorithms); it gives a different level of insight than a console printout :sweat_smile:
I’ve noticed you created a much larger test list: mine was only 1000 elements long, while yours is 100,000. Also, your sample was 100 compared to my 10. I used 1001 possible values (0-1000); you used 1000 (0-999). I think the number of measurements was sufficient in both cases to give a good overview of what kind of results can be expected.
It’s still very curious how, in your experiment, the numbers were also distributed dominantly towards one end of the given range, and that this was regular behavior. I can’t tell from your post how the results looked for each specific run of the for i in range(100): loop, but I was monitoring the output of my experiment, and there wasn’t even a single dramatic deviation from this tendency. At best the numbers were somewhat evenly distributed, which happened very rarely, but there was no exception to the fourth quarter always being the largest group.
Do you know of any other random number generators that could be tested in the same environment and might perform better, in a less predictable way?

EDIT: I think if I plotted my results they would look pretty similar to yours, just flipped. I wonder why the difference.

EDIT2: I’ve run your code and I had the same results as you; the flip may be a result of how I read and saved the results across different functions. Maybe something caused the order to flip? Anyway, my code isn’t that greatly optimized. I then ran your code with my test values and the result was still the same, with the distribution mainly toward 0.

1 Like

All of my runs had more or less the same shape, just a lot smoother for higher N.

I think that the pseudo-random number generators in Python and Numpy are very robust, both being based on an algorithm called the Mersenne Twister, about which I know not a thing except that it’s commonly used in a great many programming languages. It is almost certainly fine for simulations such as this, although we’re warned not to use it for security or cryptographic systems!

I don’t think that any part of your result or mine is related to a defect in randomization.

Alright. It makes sense that they disclose it’s not the best for security systems. I will have a look and try to find something truly random, or at least as random as possible.
I also wonder, straying from the subject a little bit, about projects such as the Global Consciousness Project (and perhaps other, even more scientific projects, because this one is hippie) that rely on truly random sequences to build their models.
Funnily enough, the question of how to achieve randomness seems almost a philosophical one. :thinking:
EDIT: not as funny as I initially considered

EDIT2: So I’ve reworked my code to be more readable, so I can post it here.
I made a random-function wrapper which is supposed to hold the other random methods I am going to test. So far I’ve got numpy.random.randint and Python’s random.uniform, and they both give almost identical results! Interestingly, they’re all at the other end now!

Here’s the code I used.

import random
import numpy as np
import matplotlib.pyplot as plt

#random-number function wrapper; allows including random methods from different packages
def random_number_0_to(top_range, method=None):
    rndnum = 0
    if method is None:  #use Python's random module as the default
        rndnum = random.uniform(0, top_range)
    if method == 'np':  #use numpy.random.randint if method is set to 'np'
        rndnum = np.random.randint(top_range + 1)
    #add other methods here with appropriate checks on the method value
    return rndnum

#creates a list of random ints of the specified size,
#using the random method selected in random_number_0_to()
def create_list(size, method=None):
    test_list = []
    for x in range(size):
        rn = random_number_0_to(1000, method)
        test_list.append(int(rn))
    return test_list

#takes an element and checks for its first appearance in the list;
#returns the number of comparisons it took to find it
def get_search_steps(element, test_list):
    stepper = 0
    for test in test_list:
        stepper += 1
        if test == element:
            break
    return stepper

#sets up and runs the experiment the specified number of times;
#allows setting the size of the test list, the search sample, and the random method;
#returns one list with the chronologically appended results of each run
def run_test(run_times=1, t_list_size=1000, s_list_size=10, method=None):
    steps_to_find = []
    for x in range(run_times):
        my_test_list = create_list(t_list_size, method)
        my_search_list = create_list(s_list_size, method)
        for search_num in my_search_list:
            steps = get_search_steps(search_num, my_test_list)
            steps_to_find.append(steps)
    return steps_to_find


#running the experiment
steps_counted_python = run_test(1000)
steps_counted_numpy = run_test(1000, method='np')

#plot both distributions for comparison
plt.hist(np.array(steps_counted_python), bins='auto')
plt.title('python random')
plt.show()

plt.hist(np.array(steps_counted_numpy), bins='auto')
plt.title('numpy random')
plt.show()

EDIT: I found something called os.urandom, but I haven’t figured out yet how to convert its bytes output into an integer :smiley:
EDIT: I’m running out of edits here, but heck, I found this, which is supposedly cryptographically safe:
[Python 3 Docs - random.SystemRandom](https://docs.python.org/3/library/random.html#random.SystemRandom), but I can’t figure out how to extract any values from this method.
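
Both of those do work, for what it’s worth. A short sketch using only the standard library (nothing here is specific to this lesson): os.urandom returns bytes, which int.from_bytes converts to an integer, and random.SystemRandom is instantiated and then used like the random module itself.

import os
import random

# os.urandom returns raw OS-entropy bytes; int.from_bytes converts them.
raw = os.urandom(4)                  # 4 random bytes
print(int.from_bytes(raw, 'big'))    # an integer in 0..2**32 - 1

# random.SystemRandom exposes the usual random-module methods, backed by
# os.urandom instead of the Mersenne Twister.
sysrand = random.SystemRandom()
print(sysrand.randint(0, 1000))      # inclusive on both ends
print(sysrand.uniform(0, 1000))      # float in [0, 1000]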

Randomness is more achievable if one can use the product of two random numbers as the seed, then give that seed to the random number generator as the first value in the random sequence.

The problem with computer-generated random numbers is that they always start from the same number sequence and just go from there. It’s a sequence already, so we just have to find a random spot in that sequence to begin from, and our program will continue from there. This makes no sense, I know. I’m paraphrasing memories from forty to fifty years ago.

Pretty sure the modern Random is not so constrained, but you never know, and it is definitely a good reason not to rely upon it for security. It is predictable, given that a determined bot can find the start of the sequence and go from there. Also pretty sure this is messed up, but what the hey?

1 Like

So the seed is the step on the sequence from which we start to generate the random number?
Do you mean using one random number generator to create a seed for use in the next iteration? Also, considering that the number generation is already predictable (at least statistically), using the same random generator to generate a seed for itself won’t add that much variation. Using a different implementation of the method (numpy or Python’s built-in) doesn’t seem to be any more beneficial (at least in the case of these two), because they generate almost the same spectrum intensity.
I’ll see what more I learn after figuring out how to use SystemRandom.

1 Like

So far I haven’t done any of my own investigation into how Python (aka C) generates random numbers. Computers of old had a finite sequence (in the range 2E-39..2E+38) that always started from the same place on power-up or reset.

My old BASIC programs used a seed generated from the system time multiplied by the number returned from Random. Recall that that language supported a seed for Random. As I recall, it was RND, a designation I still use to this day.

from random import random as rnd

Again, if it turns out that Python supports seeding, its own timestamp would suffice…

from datetime import datetime
from time import sleep as pause

stamp = lambda: datetime.timestamp(datetime.now())
print(stamp())
pause(1)
print(stamp())
pause(1)
print(stamp())
pause(1)
print(stamp())

The digits returned by the random module are absolutely determined by the seed. If you use the same seed each time, you will get the same digits:

import random

for i in range(3):
    random.seed(666)
    out_lst = []
    for j in range(10):
        out_lst.append(random.randint(1, 10))
    print(out_lst)
'''Output:
[8, 7, 7, 5, 9, 1, 9, 9, 6, 2]
[8, 7, 7, 5, 9, 1, 9, 9, 6, 2]
[8, 7, 7, 5, 9, 1, 9, 9, 6, 2]
'''

This can actually be quite useful for testing, and, of course, can be incorporated into a simple one-time pad encryption system.
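
A minimal sketch of that pad idea (my own illustration, not from the post above; not a true one-time pad, and not secure, since the Mersenne Twister is predictable): both sides share a seed and XOR the message with the same pseudo-random byte stream.

import random

def pad_bytes(seed, n):
    # Rebuild the same pseudo-random byte stream from a shared seed.
    rng = random.Random(seed)
    return bytes(rng.randint(0, 255) for _ in range(n))

msg = b"linear search"
pad = pad_bytes(666, len(msg))
cipher = bytes(m ^ p for m, p in zip(msg, pad))      # encrypt
plain = bytes(c ^ p for c, p in zip(cipher, pad))    # decrypt
print(plain)    # b'linear search'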

If the random.seed() method is omitted, the seed by default comes from here:

import time
print(time.time())
'''Output:
1571061095.976216
'''

That number is the number of seconds since the “Epoch” (1 January, 1970, 00:00 UTC). I understand that newer systems provide a seed based on random internal hardware noise, but I don’t know any details.

Again, for statistical modelling, where they find their heaviest use, I don’t think that the Mersenne Twister algorithms present any significant deviation from “true” randomness.

If you want, you can download random digits derived from atmospheric noise, or you can build your own nuclear random digit generator based on atomic decay. And (speaking of hippies, as we were a few posts ago), there’s even one based on lava lamps!

3 Likes

That is good to know. Had I taken the time to read up some more, I would have learned this. It is interesting that the default seed actually is the timestamp. Had I realized that the timestamp can be accessed via the time module, one could have omitted importing datetime.


Note to self:

It is wiser to find out than to suppose.
Mark Twain

2 Likes

Thanks, guys! You’re great. It’s such a satisfying feeling to feed one’s own hunger for knowledge.
I understand now that nothing that originates from the computer alone is truly random. The only way to achieve practical randomness seems to be through measurements of practically random events occurring in the physical world.

If you want, you can download random digits derived from atmospheric noise, or you can build your own nuclear random digit generator based on atomic decay.

A great read; I recommend it!

And (speaking of hippies, as we were a few posts ago), there’s even one based on lava lamps!

And, speaking like a hippie - there seems to be no true randomness in the world at the large scale; at best, there is too little computing power to run realistic and faithful models…
Not only are the constructs of the physical world not in any way random, they also seem to change in accordance with very specific rules. Where would the “chaotic” element come from? There’s nowhere left :thinking: :sweat_smile: :crazy_face:

This quote from the “lava lamps” link illustrates the essence of the situation at hand:

John Graham-Cumming, programmer-turned-CTO of Cloudflare, found it in lava lamps.
“It turns out that computers are really bad at generating random numbers. The only thing worse than generating random numbers on computers is humans generating random numbers,” he explains. “You end up having to find interesting ways of getting randomness.”

3 Likes

When we look to the universe we can see chaos playing out (the Second Law of Thermodynamics), but when we zoom out so our view spans billions of light-years, there still appears to be some order. What’s random is still very probabilistic. We can find gaseous clouds where it is easy to predict that new stars will form in that region; we just can’t say when they will form. It’s easy to point to very hot stars and predict that one day they will go supernova. We can say with certainty that Betelgeuse will one day explode, just not when, assuming it hasn’t already. If it happened yesterday, we won’t know for ~650 years.

1 Like

It is really wondrous, but can’t even the laws of thermodynamics be simulated? Isn’t weather becoming more predictable with supercomputing? To be honest, I don’t remember the last time a forecast was wrong…
Similarly, in the case of the gaseous clouds and star formation, it’s all a matter of calibrating our sciences further and then translating them into computer models.

It went off already. :smiling_imp: