Thursday, December 1, 2016

A Host of Issues: Migrating from UserVoice to GitHub Issues with Canopy and Octokit

This post is part of the F# Advent Calendar for 2016. Thanks to Sergey for putting this on, it's always wonderful to see what everyone highlights this time of year!

On October 6th, Krzysztof Cieslak had had enough. He'd grown tired of spending time working on submissions to UserVoice for F# language enhancements, and after some prodding submitted THE SUGGESTION THAT WOULD LIVE IN INFAMY.  It immediately generated a frenzy, and since I'd had a bad day at work, on a whim I decided to finally learn Canopy, a Selenium library for F#, and try my hand at a first pass of migrating issues from UserVoice to GitHub Issues.

Canopy is really a fantastic library.  In essence it's a tightly-crafted DSL on top of the Selenium WebDriver library, which allows for easier automation of web browsers through code.  It's most commonly used for UI acceptance tests in my experience, but we're going to pervert its intentions a bit.

But first, some links:

And please don't judge the code quality too harshly, Jared and I were kinda blazing through this thing in off hours!

So really what we had to accomplish was three things:
  1. Discover the list of all issues on UserVoice
  2. Pull over metadata for each of the issues:
    1. Submitter
    2. Content
    3. Date
    4. Votes
      1. Submitter
      2. Date
    5. Official Responses
      1. Responder
      2. Date
      3. Content
  3. Use that metadata to create a matching issue on GitHub

So to start with, after hashing out some requirements we landed on the following set of base models:
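The real definitions live in the linked repo; as a rough sketch, the records would look something like this (field names here are illustrative, reconstructed from the metadata list above, not the exact ones we used):

```fsharp
// Hypothetical shapes for the scraped data; the real records are in the repo.
type Vote =
    { Submitter : string
      Date      : System.DateTime }

type Response =
    { Responder : string
      Date      : System.DateTime
      Content   : string }

type Idea =
    { Submitter : string
      Title     : string
      Content   : string
      Date      : System.DateTime
      Votes     : Vote list
      Responses : Response list }
```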

With these models in hand, we can now parse sections of each page to get each Idea.  All of the code directly related to scraping the pages is here.  An interesting point is that there's no master list of all issues for a forum; you've got to scrape through each different category of issues to get the entire list, and that's what we've done in the discoverIdeas function.
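The shape of that category-by-category discovery loop, in Canopy's classic DSL, would be roughly this (the `categoryUrls` list and the CSS selector are placeholders, not the real values from discoverIdeas):

```fsharp
open canopy            // classic DSL: start, url, elements, ...
open OpenQA.Selenium

// Hypothetical category listings; the real forum had several such pages.
let categoryUrls =
    [ "https://fslang.uservoice.com/forums/245727?category=1"
      "https://fslang.uservoice.com/forums/245727?category=2" ]

/// Walk every category listing and collect links to the individual ideas.
let discoverIdeaLinks () =
    start firefox
    categoryUrls
    |> List.collect (fun categoryUrl ->
        url categoryUrl
        // Selector is illustrative; the real one matched UserVoice's idea anchors.
        elements "a.idea-link"
        |> List.map (fun (a : IWebElement) -> a.GetAttribute("href")))
    |> List.distinct   // an idea can show up in more than one listing
```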

Next was the grunt work of cycling through each of the links we just found and parsing them.  This is a hairy chunk, but relatively straightforward:
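Per-page parsing with Canopy boils down to `url`-ing over to the idea and `read`-ing each piece of the markup. A heavily simplified sketch (all the selectors are made up; the real ones came from inspecting UserVoice's HTML, and the real function also pulls votes and official responses):

```fsharp
open canopy

/// Visit one idea page and read out its core fields.
/// Returns a tuple here for brevity; the real code builds the Idea record.
let parseIdeaPage (link : string) =
    url link
    let title     = read "h1.idea-title"      // read returns the element's text
    let submitter = read ".idea-author"
    let content   = read ".idea-description"
    let date      = read ".idea-date"
    title, submitter, content, date
```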

After everything was parsed, we did a quick Json.NET dump of the Idea list into a file so we wouldn't have to scrape it again.  It took about 20 minutes to scrape all of UserVoice because Canopy seems to implicitly use a single WebDriver context.  I thought about parallelizing by passing contexts around, but I couldn't find a way to do that, and once we had the data everything else fell into place.
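The dump itself is a one-liner pair with Json.NET; something along these lines (file path and function names are illustrative):

```fsharp
open System.IO
open Newtonsoft.Json

/// Cache the scraped ideas so a failed upload run doesn't mean
/// re-scraping 20 minutes' worth of pages.
let saveIdeas (path : string) (ideas : obj) =
    File.WriteAllText(path, JsonConvert.SerializeObject(ideas, Formatting.Indented))

let loadIdeas<'T> (path : string) : 'T =
    JsonConvert.DeserializeObject<'T>(File.ReadAllText path)
```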

Having secured the data, we then needed to make markdown issue templates to render the ideas into a form that would look good on a GitHub Issue.  My preferred templating library in .NET is DotLiquid for its simple setup and reliability.  If you'd like to see what the template looks like, take a quick detour and check it out here. Simple, no?  I did have to do a bit of configuration to get the templating system to recognize F# records, but luckily that work had already been done by (I think) Tomas Petricek and Henrik Feldt. Basically, to use the templates in a nice way we have to register the public members of any type we want to use as a model in the template, and there are some special cases around Seq, List, and Option that have to be handled for DotLiquid to work. The code looks like this:
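The core of the registration step can be sketched like so; `registerModel` is a hypothetical helper name, and the Seq/List/Option special-casing from the Petricek/Feldt work is elided here:

```fsharp
open DotLiquid

/// Make an F# record usable as a DotLiquid model by whitelisting
/// all of its public properties for the template engine.
let registerModel (t : System.Type) =
    let memberNames = t.GetProperties() |> Array.map (fun p -> p.Name)
    Template.RegisterSafeType(t, memberNames)

// Usage sketch: register each model type once at startup, then render.
// registerModel typeof<Idea>
// let template = Template.Parse(System.IO.File.ReadAllText "issue.liquid")
// template.Render(Hash.FromDictionary(dict [ "idea", box idea ]))
```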

Finally came the hard part: interacting with the GitHub API.  Thankfully, we had Octokit available to smooth over calling the API directly, but as everyone knows, the GitHub API is a bit aggressive when it comes to rate limiting consumers. You're limited to roughly 20 content-creating calls per minute, on top of an overall allotment of 5,000 calls per hour.  The initial version of the upload code was incredibly parallel, queuing up all the work and making use of Asyncs all over the place.  That code bombed almost immediately because it burned through our API allowance.  We eventually settled on including periodic delays to slow down the overall rate of upload, but the rate limit made the upload process crawl to hours instead of the minutes we had hoped for.  At this moment there are two separate attempts to manage throttling more intelligently in the Github.fs file, but neither was good enough.  In fact, months after we did this work, I started a MailboxProcessor-based version that would handle the caps on my fork of the repo, but it's still not up to par.  So if anyone has pointers on managing rate limiting in a general, intelligent way that adapts based on the API client's output after each call, let me know!
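The periodic-delay approach we settled on amounts to running the upload jobs sequentially with a fixed pause between them. A minimal sketch of that idea (the helper name and interval are illustrative, not the exact code in Github.fs):

```fsharp
/// Naive throttle: run each async job in order, pausing between them so
/// we stay under roughly 20 content-creating requests per minute.
/// Not clever, but enough to stop the API from cutting us off.
let throttled (intervalMs : int) (work : Async<'T> list) =
    async {
        let results = ResizeArray()
        for job in work do
            let! r = job
            results.Add r
            do! Async.Sleep intervalMs
        return List.ofSeq results
    }
```

A smarter version would read the rate-limit headers the API returns after each call and adjust the interval dynamically; that's the part that never got up to par.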

The code for uploading to GitHub isn't really complicated at all, and lives here. It's mostly a straightforward mapping of Idea -> GitHub Issue model, and then we POST that model up.  We do a few interesting things, like assigning labels and automatically closing the issue if it was closed/rejected/etc. on UserVoice, as one would expect.
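The shape of that mapping with Octokit.NET looks roughly like this; the tuple fields stand in for our Idea record, and exact Octokit property types vary slightly across versions, so treat this as a sketch:

```fsharp
open System
open Octokit

/// Create a GitHub issue from a scraped idea, and close it immediately
/// if it was already declined/completed on UserVoice.
let uploadIdea (client : GitHubClient) owner repo
               (title : string, body : string, labels : string list, wasClosed : bool) =
    async {
        let newIssue = NewIssue(title, Body = body)
        labels |> List.iter newIssue.Labels.Add
        let! issue = client.Issue.Create(owner, repo, newIssue) |> Async.AwaitTask
        if wasClosed then
            // A second call flips the freshly created issue to closed.
            let update = IssueUpdate(State = Nullable ItemState.Closed)
            let! _ = client.Issue.Update(owner, repo, issue.Number, update)
                     |> Async.AwaitTask
            ()
        return issue.Number
    }
```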

Overall, I'd say that two people spent ~10 hours each working on this, and A LOT of that time was waiting on uploads to break.  We spent some time tweaking the issue templates for formatting, but that was a very iterative process.  Canopy made it incredibly simple to grab the data we needed to do the migration, DotLiquid made it easy to render nice markdown for the Issues, and OctoKit...OctoKit wasn't the worst.

My overall hope with this post is to show that F# is as great for relatively simple, one-off tasks as it is for more in-depth projects like dependency management, finance, microservices, and any number of other realms (though I do use it there as well). So I encourage everyone who's looking for ways to include it at work to try and tackle your next annoying problem with an fsx script. That's how this project started, until it grew too unwieldy.

Happy Holidays, and Merry F#ing!