In this tutorial, you’ll learn how to detect when a file has changed with a Python program that automatically executes on an interval via a cronjob. Although this set of instructions specifically shows you how to watch PDFs for changes, the concept of comparing file hashes (specifically a SHA1 hash function) holds true for most other files, both local and at a remote URL.
There is also a notification aspect of the Python program which notifies my phone via a push notification. I didn’t want to mess around with setting up an email server; believe me, it’s a pain in the ass (spam filters nowadays are tough).
Anyway, follow along and maybe you’ll learn a thing or two about how you could possibly write a similar Python program to automate some aspect of your life.
Background Story
Note: Feel free to skip this section if you just want the meat of the tutorial.
Let me set the stage for you: I’m invested in this fund at Charles Schwab https://content.schwabplan.com/funddetail/CBFXP.pdf. Its performance is updated approximately every month. The problem is that I don’t know when the PDF receives its updates. Sometimes it’s in the beginning of the month. Sometimes in the middle of the month. And at least once I saw no update for the entire month.
Rather than manually checking the PDF every day or every week for changes, I decided to write a Python program to automatically check for changes to the file every day and send me a notification if it found updates.
Call me lazy or a nerd, but I like to automate certain things to make my life easier. Here’s how I did it.
Python File Change Detector + Notifier
This program, which detects when a file has changed and notifies me about it, has roughly three main parts:
- Download the file from a URL
- Hash the downloaded file and compare it to the previous one
- Notify me of any changes to the file
Additionally, there’s technically a fourth aspect, which is the scheduler (i.e. a cronjob) that executes the program on an interval. We’ll talk about that last. But first, let me walk you through the software.
Note: All code in this tutorial is available on GitHub.
1. Downloader
The first aspect of this program is the downloader. This module downloads the PDF from a URL with a date-time stamped filename and compares its hash to the hash of the latest PDF called latest.pdf.
If the hashes are different, this means the files are different and an updated PDF was found. In this case, a copy of the newly downloaded file is archived and the file itself is renamed to latest.pdf. Additionally, a notification is sent to my phone.
Otherwise if the hashes are the same, we simply delete the newly downloaded file because no change has been detected.
At first run of the program, latest.pdf doesn’t exist, so there’s some additional logic in the try/except block that handles this case.
    import os
    import shutil
    import urllib.request
    from datetime import datetime

    from hasher import sha1
    from notifier import send_notification


    def move_files(filename):
        # archive a copy, then make the new download the latest version
        shutil.copy(filename, '/root/campbell/updates/')
        os.rename(filename, '/root/campbell/latest.pdf')


    def check_for_update():
        # get date and time as string (year, month, day so filenames sort by date)
        now = datetime.now()
        filename = '/root/campbell/{}.pdf'.format(now.strftime("%Y%m%d%H%M%S"))

        # download file
        url = 'https://tonyteaches.tech/test.pdf'
        urllib.request.urlretrieve(url, filename)

        # get hashes
        try:
            hash_latest = sha1('/root/campbell/latest.pdf')
        except FileNotFoundError:
            # first run: there is no latest.pdf to compare against yet
            move_files(filename)
            print('First file saved')
            return
        hash_new = sha1(filename)

        # compare hashes
        if hash_latest != hash_new:
            print('Found update')
            move_files(filename)
            send_notification(url)
        else:
            print('No update')
            os.remove(filename)


    check_for_update()
2. Hasher
The next file is simply a generic SHA1 hashing function that takes in the path to a file and returns the value of its hash.
    import hashlib


    def sha1(filename):
        BUF_SIZE = 65536  # read the file in 64 KiB chunks
        hasher = hashlib.sha1()
        with open(filename, 'rb') as f:
            while True:
                data = f.read(BUF_SIZE)
                if not data:
                    break
                hasher.update(data)
        return hasher.hexdigest()
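To see the comparison in isolation, here is a minimal sketch that hashes a few throwaway local files. It repeats the sha1 function so that it runs on its own; the helpers files_differ and write_temp and the file contents are made up for illustration and are not part of the program above.

```python
import hashlib
import os
import tempfile


def sha1(filename, buf_size=65536):
    """Return the SHA1 hex digest of a file, read in 64 KiB chunks."""
    hasher = hashlib.sha1()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()


def files_differ(path_a, path_b):
    """Hypothetical helper: True if the two files have different hashes."""
    return sha1(path_a) != sha1(path_b)


def write_temp(content):
    """Write bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.pdf')
    with os.fdopen(fd, 'wb') as f:
        f.write(content)
    return path


if __name__ == '__main__':
    old = write_temp(b'performance as of January')
    same = write_temp(b'performance as of January')
    new = write_temp(b'performance as of February')
    print(files_differ(old, same))  # identical bytes, so no change is detected
    print(files_differ(old, new))   # different bytes, so a change is detected
    for path in (old, same, new):
        os.remove(path)
```

Because only the raw bytes are hashed, the same check works for any file type, not just PDFs.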
3. Notifier
The last piece of this Python program is the notification logic. Like I alluded to earlier, email notifications would be ideal in this case, but setting up an email server is a whole project in itself, so I went with the simplest solution I could find.
Enter Pushsafer, a German push notification service that gives you 50 free calls to their API. Additional API calls can be bought for pretty cheap.
Anyway, I tapped into their API with the following code, which sends a notification to the app on my Android phone any time a change in the PDF file is detected.
Since I should only receive up to 12 notifications per year (assuming the PDF file only changes one time per month), this free service should last me over four years.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen


    def send_notification(url):
        post_fields = {
            'k': 'xxxxxxxxxxxxxxxxxxxx',  # API key placeholder
            't': 'New PDF found!',
            'd': 'a',
            'm': 'A new PDF was found.',
            'u': url
        }

        ps = 'https://www.pushsafer.com/api'
        request = Request(ps, urlencode(post_fields).encode())
        # send the POST request; the response body is read but not used
        response = urlopen(request).read().decode()
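One way to avoid hardcoding the API key is to read it from an environment variable. The sketch below is a variation under my own assumptions, not the code from this tutorial: PUSHSAFER_KEY and build_notification_request are hypothetical names, and the request is only built, not sent, so you can inspect it without using up an API call.

```python
import os
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

PUSHSAFER_API = 'https://www.pushsafer.com/api'


def build_notification_request(url, key=None):
    """Build (but do not send) the POST request for a notification.

    PUSHSAFER_KEY is a hypothetical environment variable name.
    """
    if key is None:
        key = os.environ['PUSHSAFER_KEY']  # raises KeyError if it is not set
    post_fields = {
        'k': key,
        't': 'New PDF found!',
        'd': 'a',
        'm': 'A new PDF was found.',
        'u': url,
    }
    # Passing a data argument makes urllib treat this as a POST request
    return Request(PUSHSAFER_API, urlencode(post_fields).encode())


if __name__ == '__main__':
    request = build_notification_request(
        'https://tonyteaches.tech/test.pdf', key='xxxxxxxxxxxxxxxxxxxx')
    print(request.get_method())
    print(parse_qs(request.data.decode()))
```

Passing the returned request to urlopen would then send it, just like in send_notification.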
4. Scheduler
The (technically) final piece to the setup is the cronjob that executes the Python program on an interval. I have it scheduled to run every morning at 6 AM server-time.
0 6 * * * python3 /root/campbell/downloader.py
Next Steps
I’ll be the first to admit that the above code isn’t optimized, could use a lot more error handling, and makes me cringe when I see the hardcoded values. I provided it here for basic demonstration purposes.
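As one example of the error handling this code is missing, here is a minimal sketch of a download step that doesn’t crash the whole program when the URL can’t be reached. The download function and its True/False return value are my own assumptions for illustration, not part of the code on GitHub.

```python
import os
import urllib.error
import urllib.request


def download(url, filename):
    """Download url to filename; return True on success, False on failure."""
    try:
        urllib.request.urlretrieve(url, filename)
    except (urllib.error.URLError, OSError) as err:
        # URLError covers unreachable hosts and HTTP errors such as 404
        print('Download failed: {}'.format(err))
        # remove a partially written file, if there is one
        if os.path.exists(filename):
            os.remove(filename)
        return False
    return True
```

In check_for_update, the call to urllib.request.urlretrieve could then be swapped for something like "if not download(url, filename): return", so a failed download is skipped until the next scheduled run.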
That being said, if you’re more of a visual learner, here’s a video demo I put together that walks you through this program in action.
2 Responses
Thanks for the tutorial Tony!
Is there also a way for the program to tell what has changed in the PDF file?
I don’t think detecting what changed in a PDF is an easy thing to do, because the information in a PDF file is not stored as plain text. Here’s some information on doing this in Python: https://qxf2.com/blog/compare-pdfs-python/