Forum Archive

Search for files

midas9087

I’m trying to do a pretty simple search: I want to find files that contain two words. The words are not contiguous. For example, if I input “iPad iPhone” in the search field, it should identify a file that contains the text “iPads are bigger than iPhones”. The default appears to be a search that matches the query string exactly. Is there a way to do this? Do I use regex?

technoway

It would be complicated to use a regular expression to do this.
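For reference, one regex approach does exist: chain one lookahead per word, so the order of the words no longer matters. The following is only a minimal sketch. The helper name build_all_words_pattern is made up for illustration, and it assumes the whole text is already in memory as one string.

```python
import re

def build_all_words_pattern(words):
    """Compile a pattern that matches text containing every word, in any order.

    Hypothetical helper: one lookahead is emitted per word, and re.DOTALL
    lets '.*' cross line breaks.
    """
    lookaheads = "".join("(?=.*{})".format(re.escape(word)) for word in words)
    return re.compile(lookaheads, re.DOTALL)

pattern = build_all_words_pattern("iPad iPhone".split())
print(bool(pattern.match("iPads are bigger than iPhones")))  # True
print(bool(pattern.match("iPads are bigger than laptops")))  # False
```

The pattern gets harder to read as the word list grows, which is one reason a plain substring check per word may still be the easier route.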

If the files are not too large, you can read the file into memory, and check the entire file at once. It's a better general solution to read the lines sequentially though.

If you read the lines sequentially, you might have to read two lines at a time and merge them the first time, and then read one line at a time after that merging each new line with the last line, so you can handle cases like this:

This is line one of the file that ends in iPho-
ne and this is line two that contains the word iPad.

You would always keep the last line around, remove any hyphen at the end, and append the next line, while keeping the original line to do this again after you read the next line. Then you would check this merged line. You'd only need to merge a line if the last line ended with a hyphen character.
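The merge step described above can be sketched as a small generator that sits between the file and the word check. This is a rough sketch only: merged_lines is a made-up name, and it assumes that any trailing hyphen marks a wrapped word, so a line that really ends in a dash would be joined to the next line as well.

```python
def merged_lines(lines):
    """Yield lines, joining a line that ends with a hyphen to the next one.

    Hypothetical helper: 'lines' can be an open file handle or any
    iterable of strings.
    """
    pending = None  # text held back because it ended with a hyphen
    for line in lines:
        text = line.rstrip("\r\n")
        if pending is not None:
            text = pending + text
        if text.endswith("-"):
            # Drop the hyphen and wait for the next line before yielding.
            pending = text[:-1]
        else:
            pending = None
            yield text
    if pending is not None:
        # The input ended on a hyphenated line; emit what is left.
        yield pending

sample = ["This is line one of the file that ends in iPho-\n",
          "ne and this is line two that contains the word iPad.\n"]
for merged in merged_lines(sample):
    print(merged)
```

Leading whitespace on the continuation line is not stripped here, so an indented second line would leave a gap inside the joined word.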

Below is code that reads the lines sequentially, but without the line-merge feature, so this code will not handle words that wrap from one line to the next. I used a generator to obtain each line to make it easier to modify this code to add that feature. I just typed this code into an editor and have not run it at all, so I don't even know if it's syntactically correct, but it should be very close to what you need.

If you don't need to handle words wrapping between lines, you can remove the generator, and just use a regular for-loop that iterates over the file handle. In that case, check if found_word_count == len_word_list inside of the loop, and break when that expression is true.

def make_iterable_item_generator(iterable_item):
    """ Make a generator for any iterable item. """
    for item in iterable_item:
        yield item

def file_contains_words(infh, word_list):
    """ Return True iff file contains all words in the passed word list.
        There must be no duplicate words in word_list.
        infh is an open file handle to the file to test.
    """
    found_word_count = 0
    # Create a list of boolean flags, one flag for each word in word_list.
    len_word_list = len(word_list)
    word_not_found_list = [True] * len_word_list
    # Create a generator to read each line of the file.
    ingen = make_iterable_item_generator(infh)
    try:
        # Stop checking when all words are found.
        while found_word_count < len_word_list:
            line = next(ingen)
            for i in range(0, len_word_list):
                if word_not_found_list[i]:
                    if word_list[i] in line:
                        word_not_found_list[i] = False
                        found_word_count += 1
    except StopIteration:
        pass
    return found_word_count == len_word_list
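
The simpler variant mentioned before the code (a plain for-loop over the file handle with an early exit) might look roughly like this. It is a sketch under assumptions: file_contains_words_simple is a made-up name, and a set of still-missing words stands in for the flag list and counter.

```python
import io

def file_contains_words_simple(infh, word_list):
    """Return True if every word in word_list occurs somewhere in the file.

    Hypothetical variant without line merging: words that wrap across
    lines are not detected.
    """
    remaining = set(word_list)
    for line in infh:
        # Keep only the words that this line does not contain.
        remaining = {word for word in remaining if word not in line}
        if not remaining:
            return True  # all words found, so stop reading early
    return not remaining

# io.StringIO stands in for an open file handle here.
handle = io.StringIO("iPads are bigger\nthan iPhones\n")
print(file_contains_words_simple(handle, ["iPad", "iPhone"]))  # True
```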

ccc

When you want both the index and the element, enumerate() is your friend.

# Instead of
            for i in range(0, len_word_list):
                if word_not_found_list[i]:
                    if word_list[i] in line:
                        word_not_found_list[i] = False
                        found_word_count += 1

# you could write:
            for i, word in enumerate(word_list):
                if word_not_found_list[i] and word in line:
                    word_not_found_list[i] = False
                    found_word_count += 1

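I also believe that your return value could be: return not any(word_not_found_list) (the flags stay True while a word is still missing, so all() would give the wrong answer here), but being sure of that would require a bit more testing.
* https://docs.python.org/3/library/functions.html#any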
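
A quick check of that return value, as a small sketch using the flag list from the code above. The sample line is made up for illustration. Since each flag starts as True and flips to False once its word is seen, "every word found" means no flag is still True.

```python
word_list = ["iPad", "iPhone"]
line = "iPads are bigger than iPhones"

# Flags start as True ("not found yet") and flip to False once a word is seen.
word_not_found_list = [True] * len(word_list)
for i, word in enumerate(word_list):
    if word_not_found_list[i] and word in line:
        word_not_found_list[i] = False

print(word_not_found_list)           # [False, False]
print(all(word_not_found_list))      # False, even though both words were found
print(not any(word_not_found_list))  # True
```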

enceladus

Using a set may simplify the code.

def search(filename, wordlist):
    return set(wordlist.split()).issubset(open(filename).read().split())

ccc

Great one!! But it leaves a file handle open on some Python implementations.
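
One way to keep the one-liner's logic while closing the file deterministically is a with block. A minimal sketch, keeping the same function name and arguments as the post above:

```python
def search(filename, wordlist):
    """Return True if the file contains every whitespace-separated word in wordlist."""
    # The with block closes the file as soon as the read finishes,
    # instead of relying on the garbage collector to do it.
    with open(filename) as fp:
        return set(wordlist.split()).issubset(fp.read().split())
```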

enceladus

OK. I just added code to do proper searching. I still need to test it well and turn it into a proper Editorial workflow.

import os
import fnmatch

def search_files(wordlist, directory, 
      include_pattern=None, exclude_pattern=None):
    fnlist = []
    for dirpath, dirs, files in os.walk(directory):
        for filename in files:
            fname = os.path.join(dirpath, filename)
            to_include = True
            if exclude_pattern:
                if fnmatch.fnmatch(fname, exclude_pattern):
                    to_include = False
            if include_pattern and to_include:
                if not fnmatch.fnmatch(fname, include_pattern):
                    to_include = False
            if to_include:
                if are_all_words_in_file(fname, wordlist):
                    fnlist.append(fname)
    return fnlist

def get_words(filename):
    with open(filename) as fp:
        for line in fp:
            for word in line.split():
                yield word

def are_all_words_in_file(filename, wordlist):
    return set(wordlist.split()).issubset(get_words(filename))

print(search_files('os class', '.', include_pattern="*.py"))
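
One thing to watch with the set-based check: it compares whole whitespace-separated tokens, so "iPad" does not match "iPads" from the original question. The sketch below contrasts the two behaviours on a plain string. Both function names and the ignore_case parameter are made up for illustration and are not part of the code above.

```python
def tokens_match(text, wordlist):
    """Whole-token check, the same test that are_all_words_in_file applies."""
    return set(wordlist.split()).issubset(text.split())

def substrings_match(text, wordlist, ignore_case=False):
    """Substring check: 'iPad' also matches inside 'iPads'."""
    words = wordlist.split()
    if ignore_case:
        # Lower-case both sides so that 'ipad' matches 'iPads'.
        text = text.lower()
        words = [word.lower() for word in words]
    return all(word in text for word in words)

text = "iPads are bigger than iPhones"
print(tokens_match(text, "iPad iPhone"))                        # False
print(substrings_match(text, "iPad iPhone"))                    # True
print(substrings_match(text, "ipad iphone", ignore_case=True))  # True
```

Which behaviour is wanted depends on the search: the token check avoids false hits such as "os" inside "most", while the substring check finds plurals and other suffixed forms.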

enceladus

Here is the Editorial workflow. I hope it helps.
http://www.editorial-workflows.com/workflow/5872578073722880/kmRKF8RqYvQ

ccc

    if ignore_case == '':
        ignore_case = True
    else:
        if ignore_case == 'ON':
            ignore_case = True
        else:
            ignore_case = False

# can be rewritten as:

    ignore_case = ignore_case in ('', 'ON')

enceladus

Thanks @ccc. Replacing if statements (particularly nested if statements) with a simple statement is always better.