BlackDog Foundry

Introducing ScrapeKit

I recently needed to scrape some data from a website from within a Mac OS X application. One thing I was very conscious of was that the site's layout could change occasionally, and each time it did I would have to redistribute a new build of my application.

This was very unappealing, so I started thinking about how a very simple set of text-based rules could describe how my application scrapes data from a page. That way, the application could periodically poll a file on my server to make sure it had the latest set of rules, and continue on its merry way.

So, without further ado, allow me to introduce ScrapeKit. Here is an example of how to use it.

Imagine that your input looks like:

<ol>
  <li>abc</li>
  <li>def</li>
  <li>ghi</li>
</ol>

To extract the list items, the general logic would be:

  • Create an array to hold the resulting items
  • Look for text between <li> and </li> tags
  • Repeat while there are more tags
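
For comparison, here is a rough sketch of that same logic written by hand in plain C (which also compiles inside an Objective-C file). The extract_between and free_items helpers are hypothetical names made up for this illustration and are not part of ScrapeKit. Note that the tag strings end up baked into compiled code, which is exactly the situation a text-based rule file avoids.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of ScrapeKit): releases an array
   returned by extract_between. */
void free_items(char **items, int count)
{
    for (int i = 0; i < count; i++) {
        free(items[i]);
    }
    free(items);
}

/* Hypothetical helper (not part of ScrapeKit): copies the text found
   between each open/close pair into a newly allocated array.
   Returns the number of items found, or -1 on bad arguments or
   allocation failure. The caller releases the result with free_items. */
int extract_between(const char *input, const char *open,
                    const char *close, char ***out_items)
{
    size_t open_len = strlen(open);
    size_t close_len = strlen(close);
    char **items = NULL;          /* step 1: the array holding the results */
    int count = 0;
    const char *cursor = input;

    *out_items = NULL;
    if (open_len == 0 || close_len == 0) {
        return -1;                /* empty delimiters would never advance */
    }

    for (;;) {
        /* step 2: find the next open tag and the close tag after it */
        const char *start = strstr(cursor, open);
        if (start == NULL) break;
        start += open_len;
        const char *end = strstr(start, close);
        if (end == NULL) break;

        size_t len = (size_t)(end - start);
        char **grown = realloc(items, (size_t)(count + 1) * sizeof *grown);
        if (grown == NULL) {
            free_items(items, count);
            return -1;
        }
        items = grown;
        items[count] = malloc(len + 1);
        if (items[count] == NULL) {
            free_items(items, count);
            return -1;
        }
        memcpy(items[count], start, len);
        items[count][len] = '\0';
        count++;

        /* step 3: carry on from just past the close tag */
        cursor = end + close_len;
    }

    *out_items = items;
    return count;
}
```

Called with the list above and the delimiters "<li>" and "</li>", this returns 3 and fills the array with abc, def and ghi.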

A script to achieve this might look something like:

@main
  createvar NSMutableArray elements
  pushbetween <li> exclude </li> exclude
  iffailure end
  :loop
    popIntoVar elements
    pushbetween <li> exclude </li> exclude
    iffailure end
    goto loop
  :end

To run this script with ScrapeKit, you would use something like the following (assuming ARC):

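#import <ScrapeKit/ScrapeKit.h>

NSString *script = ...;  // the rule script shown above
NSString *input  = ...;  // the HTML to scrape

// Compile the script, then run it against the input.
// Passing nil for error: discards any compile error.
SKEngine *engine = [[SKEngine alloc] init];
[engine compile:script error:nil];
[engine parse:input];

// "elements" is the variable the script created with createvar.
NSMutableArray *elements = [engine variableFor:@"elements"];
for (NSString *element in elements) {
  NSLog(@"List element = %@", element);
}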

For more info, please see the ScrapeKit readme and associated documentation.

1 Comment »

  1. Devin Pigera says:

    Just tried this code sample out and it’s awesome. Exactly what I was looking for.

    Solid work!


Copyright © 2012 BlackDog Foundry