Index all drives connected to macOS with multithreaded Python 3

In a recent project, we were tasked with indexing the path and filename of every file on every drive connected to a Mac, and we needed to do it quickly. In this example, we'll show you how to rapidly index the file system and collect the results in a dictionary. The keys in the dictionary will be the full file paths, since those are always unique. The values will be the files' base names.

So let's get started. And as always, we're using Python 3.

First, let's import the following:


import os
from threading import Thread
from datetime import datetime

Next, let's create a dictionary that will hold all of our results.


dict1 = {}

Then we'll create the get_locations function. Its purpose is to build a list of the top-level folders on each drive, which we'll use to create one thread per folder. We may also encounter regular files at the root of a drive. We don't want to create a thread for a single file, so we'll just add those files to our dictionary directly.


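def get_locations():
    rtn = []
    for volume in os.scandir('/Volumes'):
        if not volume.is_dir():
            continue  # skip anything in /Volumes that is not a directory
        try:
            entries = list(os.scandir(volume.path))
        except OSError:
            continue  # volume is unreadable or was just ejected
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                rtn.append(entry.path)  # top-level folder: gets its own thread
            elif entry.is_file(follow_symlinks=False):
                dict1[entry.path] = entry.name  # root-level file: index it now
    return rtn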
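
To try the same split on a directory other than /Volumes (in a test, for instance), a variant that takes the root as a parameter and returns its results instead of writing to dict1 can be handy. This is only a minimal sketch: the name split_top_level and its return shape are our own assumptions, not part of the original Gist.

```python
import os

def split_top_level(root):
    """Return (folders, files) found one level below each entry of root.

    Hypothetical helper: mirrors get_locations but avoids global state.
    """
    folders = []
    files = {}
    for volume in os.scandir(root):
        if not volume.is_dir():
            continue  # skip stray files directly under root
        try:
            entries = list(os.scandir(volume.path))
        except OSError:
            continue  # unreadable or missing volume
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                folders.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                files[entry.path] = entry.name
    return folders, files
```

Calling split_top_level('/Volumes') should produce the same folder list as get_locations, with the root-level files returned as a separate dictionary.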

Now let's create the function that will walk each of the folders. It receives the folder path as a parameter. As it recursively walks the folder, it adds the path of each file it finds as a key in the dictionary and the filename as the value.


def walker(location):
    # "dirs" is unused; the original name "dir" shadowed the built-in dir()
    for root, dirs, files in os.walk(location, topdown=True):
        for file in files:
            dict1[os.path.join(root, file)] = file
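
One thing to keep in mind: by default os.walk silently skips directories it cannot list, so an index can be incomplete without any warning. The sketch below shows one way to record those failures through the onerror parameter. The function name walk_with_errors and the choice to return a separate dictionary are assumptions made for illustration.

```python
import os

def walk_with_errors(location):
    """Walk location and return (index, errors).

    Hypothetical variant of walker that keeps its results local.
    """
    index = {}
    errors = []
    # os.walk ignores listing errors unless an onerror callback is given;
    # the callback receives the OSError instance.
    for root, _dirs, files in os.walk(location, onerror=errors.append):
        for name in files:
            index[os.path.join(root, name)] = name
    return index, errors
```

Each item in errors is an OSError whose filename attribute names the directory that could not be read, which makes it easy to log or retry later.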

Now let's create the function that will manage our threads. It first gets the target locations from the get_locations function. Then, for each location, we create a thread that runs our walker and start it. Finally, we join each thread, which waits for it to finish.


def create():
    threads = []  # one thread per top-level folder
    targetLocations = get_locations()
    for location in targetLocations:
        thread = Thread(target=walker, args=(location,))
        thread.start()
        threads.append(thread)

    for t in threads:
        t.join()  # wait for the thread to finish
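
If you would rather cap the number of threads and avoid sharing one dictionary between them, the standard library's ThreadPoolExecutor is an alternative. The sketch below is not the approach used in the Gist: index_folder, index_all and the max_workers default of 8 are assumptions picked for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def index_folder(location):
    """Walk one folder and return its own {path: name} dictionary."""
    found = {}
    for root, _dirs, files in os.walk(location):
        for name in files:
            found[os.path.join(root, name)] = name
    return found

def index_all(locations, max_workers=8):
    """Index each location in a worker thread and merge the results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields each worker's dictionary in input order
        for partial in pool.map(index_folder, locations):
            merged.update(partial)
    return merged
```

With this variant, something like index_all(get_locations()) would take the place of create(), and the merged result would be combined with the root-level files already stored in dict1.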

Let's test it out.


t1 = datetime.now()
create()
t2 = datetime.now()
total = t2 - t1
print("Time taken to index ", total)

#In my case with about 2TB of disk we get
#Time taken to index  0:01:29.111082
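
A small aside on timing: datetime.now() reads the wall clock, which can be adjusted while the program is running, whereas time.perf_counter() is intended for measuring elapsed intervals. A minimal sketch of a timing helper follows; the name timed is hypothetical.

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, _, seconds = timed(create) would time the indexing run and give the duration as a float in seconds.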

Now we have a dictionary that we could create a DataFrame with and do some further filtering and analysis. Or perhaps we could send it to Elasticsearch, a database, or store the dictionary in a pickle file to rapidly reload in the future. In this example let's simply create a DataFrame and filter using a regex to get all the .jpg files.


import pandas as pd
df = pd.DataFrame(list(dict1.items()), columns=['Path','File'])
# str.contains returns a boolean mask; index df with it to keep only matching rows
imagesDataFrame = df[df["File"].str.contains(r'\.jpg$', case=False, regex=True)]
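
For the pickle option mentioned above, a minimal sketch might look like the following. The function names and the file name in the usage note are placeholders of our own, and pickle files should only be loaded from sources you trust.

```python
import pickle

def save_index(index, filename):
    """Write the index dictionary to a pickle file."""
    with open(filename, "wb") as fh:
        pickle.dump(index, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_index(filename):
    """Read an index dictionary back from a pickle file."""
    with open(filename, "rb") as fh:
        return pickle.load(fh)
```

After a run, save_index(dict1, "index.pkl") stores the results, and load_index("index.pkl") restores them without walking the drives again.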

You get the idea. So go out and try it for yourself. You can download the Gist here: https://gist.github.com/CoffieldWeb/57dc5b4dcc01d175335b43de1a16db96