MmStats in Scripts

MmStats is a library I created to expose and read statistics, metrics, and debugging information from running Python processes without the overhead of syscalls (e.g. writing to a socket or file) or extra threads. As many utilities as you like can read those metrics without affecting the performance of the main process exposing them.

I released 0.7 today to ease integration into multithreaded apps, but it made me realize a simpler tutorial would probably be helpful.

While I had web apps, job consumers, and other long-running daemons in mind when I wrote mmstats, it turns out it’s also excellent for long-running scripts.

You know the scripts: maintenance scripts, “fixer” scripts, slow build or deployment scripts, data migration scripts, etc.

If you’re like me, you forget the same two things every time you write and run one of these scripts:

  1. Run it in screen
  2. Add periodic progress output

Luckily for #1 there’s already disown, which can detach a running job from your shell after the fact.

For #2 we need an example script. Let’s pretend you have a Django app with users and you need to update their email addresses in a different system with something like this:

import otherdb
from django.contrib.auth import models
 
for user in models.User.objects.all():
    otherdb.update(user.username, email=user.email)

After forgetting to run it in screen, I’d restart it … and sit there … staring at my terminal … hating myself for not having it output anything.

But then, these scripts never work the first time, so it’d probably die in flames on the first user without an email address or some similar edge case I forgot to take into account.

So on my second attempt I’d probably quickly try to cobble together some progress indicator:

import otherdb
from django.contrib.auth import models
 
BATCH = ...
 
for i, user in enumerate(models.User.objects.all()):
    if i % BATCH == 0:
        print '{0} done'.format(i)
 
    # Only update users who have emails! Otherwise otherdb dies.
    if user.email:
        otherdb.update(user.username, email=user.email)

But what should BATCH be? If I have 10,000 users, BATCH = 1000 seems reasonable, but what if otherdb is really slow? In that case a smaller batch like 100 or 50 might be appropriate, so I’m not left wondering whether otherdb has become unresponsive.

The best option is to always have your precise progress available on demand.

Using MmStats in Scripts

I’ve found mmstats fits this use case beautifully. No more guessing at what might be an appropriate batch size or using the wrong format string in an uncommon case and crashing my script halfway through.

Integrating mmstats is as easy as:

import time
import mmstats
import otherdb
from django.contrib.auth import models
 
# Define your stats in a model
class S(mmstats.MmStats):
    done = mmstats.CounterField(label="done")
    missing_email = mmstats.CounterField(label="missing_email")
    otherdb_timer = mmstats.TimerField(label="otherdb_timer")
    last_user = mmstats.StringField(label="user")
 
# Instantiate the stats model
stats = S(filename="update-emails-{0}.mmstats".format(time.time()), path=".")
 
for i, user in enumerate(models.User.objects.all()):
    # Update the username for readers to see
    stats.last_user = user.username
 
    # Only update users who have emails! Otherwise otherdb dies.
    if user.email:
        with stats.otherdb_timer:
            # Actually do the migration work
            otherdb.update(user.username, email=user.email)
    else:
        stats.missing_email.inc()
 
    # Increment the done counter to show another user has been processed
    stats.done.inc()

That’s it! Now just re-run it in screen, pop back into a shell, and check on the progress with slurpstats:

schmichael@prod9000:~$ slurpstats *.mmstats
==> ./update-emails-1234567890.mmstats
  done               113
  missing_email      12
  otherdb_timer      0.3601293582
  user               rob
  sys.created        1346884490.7
  sys.pid            10298
  sys.gid            549
  ...

This output indicates that 113 users have been processed, that 12 of them had no email, that “rob” is the user currently being processed, and that otherdb.update(...) takes 360ms on average to complete. By default timers average the last 100 values, but that’s customizable via the size keyword argument.
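
For example, if otherdb’s latency is spiky enough that a 100-sample window bounces around, you can widen the window where the field is declared. A small sketch using that size keyword (500 is just an illustrative value):

class S(mmstats.MmStats):
    # Average otherdb timings over the last 500 samples instead of the
    # default 100
    otherdb_timer = mmstats.TimerField(label="otherdb_timer", size=500)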

That’s nice and all, but it’d be more fun to see how many users were updated per second. pollstats is a simple tool for doing just that:

schmichael@prod9000:~$ pollstats done,missing_email *.mmstats
       done         |      missing_email
                213 |                 20
                  3 |                  0
                  5 |                  1
                  1 |                  0
...

pollstats will print out the current value of the given counters initially, and then once per second print the delta. So in our contrived example we’d be processing somewhere between 1 and 5 users per second and less than 1 missing email per second.
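
To make the arithmetic concrete, here’s a rough sketch of that print-absolute-then-deltas behavior, replaying the counter values from the session above. The samples list is a stand-in for the reads pollstats actually does against the .mmstats files:

import time

# Stand-in samples: pollstats reads these values out of the .mmstats
# files once per second
samples = iter([
    {'done': 213, 'missing_email': 20},
    {'done': 216, 'missing_email': 20},
    {'done': 221, 'missing_email': 21},
    {'done': 222, 'missing_email': 21},
])

prev = next(samples)
# First line: the absolute counter values
print '{0:>12} | {1:>12}'.format(prev['done'], prev['missing_email'])
for cur in samples:
    time.sleep(1)
    # Every line after the first is the change over the last second
    print '{0:>12} | {1:>12}'.format(cur['done'] - prev['done'],
                                     cur['missing_email'] - prev['missing_email'])
    prev = cur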

Sadly pollstats is extremely simplistic at the moment and lacks the ability to intelligently display non-counter fields. (Patches welcome!)

Even better: if your script dies, the mmstats file will be left behind for you to inspect. (Although if you want it perfectly in sync, you should probably stats.flush() on each iteration.)
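
A minimal variation of the loop above with a flush per iteration (same stats, models, and otherdb objects as before; just a sketch):

for i, user in enumerate(models.User.objects.all()):
    stats.last_user = user.username
    if user.email:
        with stats.otherdb_timer:
            otherdb.update(user.username, email=user.email)
    else:
        stats.missing_email.inc()
    stats.done.inc()
    # Sync the memory-mapped file so the values on disk are current even
    # if the script dies partway through the next iteration
    stats.flush()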

mmstats is still young (pre-1.0 for a reason) and simplistic, but I already find it extremely useful not only in web apps and other daemons, but also in simple – or not so simple – one-off scripts. I hope you find it useful as well!

  • aartur

    I use mmstats in worker processes consuming messages from RabbitMQ, for statistics and performance monitoring. I love the low-overhead design. Highly recommended.

  • Matthew Webber

    This looks great, but I notice there’s no Python 3 support. Is that on your roadmap (I only need 3.3 support)? Thanks.

  • Michael Schurter (http://blog.schmichael.com/)

    Not yet, but now that 3.3 is out I should really try. 2.6, 2.7, and 3.3 support would be ideal.