Python Hash Sets Explained & Demonstrated - Computerphile

100,415
0
Published 2024-02-01
Featuring Mike Pound. Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp --- More below ↓↓↓

Hash Sets in Python work a little bit like the index of a book, giving you a shortcut to looking for a value in a list. Dr Mike Pound explains how they work and demos with some code.

#Python #HashSet #Code #Computerphile

Jane Street’s Academy of Math and Programming is now accepting applications for their summer 2024 program, which will run from June 29th-August 2nd in NYC... Or you can just check out the puzzle for fun too - bit.ly/computerphile-amp (episode sponsor)

www.facebook.com/computerphile
twitter.com/computer_phile

This video was filmed and edited by Sean Riley.

Computer Science at the University of Nottingham: bit.ly/nottscomputer

Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com/

Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com/

All Comments (21)
  • @Davourflave
    "O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D
  • @ejc4684
    Always love a Mike Pound video!!!
  • I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things: 1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically. 2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
  • @Richardincancale
    Implemented exactly the simple form of this is a commercial compiler around 1980 to store the symbol table (list of all identifiers defined in the program being compiled, what type, size etc.). Chosen for lookup speed as the symbol table is accessed frequently in the compilation process
  • @gustafsundberg
    I implemented this in ADA back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.
  • @prkhrsrvstv1
    Bloom filters could be a good follow-up to this.
  • @ibrahimmahrir
    12:02 in the __contains__ function there shouldn't be an else: before the return False at the end, otherwise in case if the list self._data[idx] is not None and the item is not in that list, the return value won't be a boolean.
  • @JaapVersteegh
    Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging
  • @bensmith9253
    I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁
  • @jonny__b
    Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.
  • @jfftck
    I watched a talk on Python dictionaries, the guy that worked on the new implementation had gone into detail how they are more closely related to databases than hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.
  • The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.
  • @Loki-
    I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol
  • @exchange4918
    Thanks for your videos Mike keep it rolling 🎉
  • @cinderwolf32
    The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!
  • Side note: if your list of values is static and known in advance, the Gnu “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
  • @olivergurka2456
    Hi, could you do a video on characteristics of a good hash function used in hashtables and their evaluation as a followup video?