Extracting code from blog posts using F# – Part 1

I mentioned some weeks ago that I was looking to automate the validation of a bunch of source files I have on disk before uploading them to GitHub. I decided to get started on that, this week, to see what was involved. As mentioned in the previous post, I wanted to programmatically query posts from Typepad, extract any embedded code from them and compare it with what I had on disk, to see which files were already correct and which needed to be created or removed.

The first step was to choose a language. The fact this is really an OS-level task – with a bunch of string processing and file I/O, but nothing whatsoever to do with AutoCAD – gave me some freedom to choose pretty much any language that can be run on OS X or Windows. I briefly considered Python or Ruby, but ended up going back to F#: it's nicely integrated into Visual Studio (my primary development tool) and a functional approach makes a lot of sense for this kind of problem. I also had the itch to do more with F# after my recent (too brief) foray into machine learning. One final factor in the decision process – not that it precludes the use of F#, at all – is that as this is a one-off activity I'm not at all worried about performance or memory constraints. It doesn't matter if it takes 10 minutes to complete, for instance. Just as long as it completes. 😉

Armed with F#, I then started looking at the problem itself. First up was pulling data down from the blog: I needed an API to access my blog's content on Typepad – thankfully there's a simple REST API which gives access to its posts' complete content – and then had some additional choices to make about how to access the data.

To simplify working with the blog's post data, I decided to use F# Type Providers. These allow you to code against data-oriented services as if they had local object models. Which is exactly what happens, I guess: when the code instantiates a JSON Type Provider against a particular resource – I downloaded a sample JSON file from the Typepad API for this – its contents are then accessible via a locally-generated set of objects and properties.
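As a rough illustration of the idea, here is a minimal sketch that feeds the JSON Type Provider an inline sample instead of a downloaded file. The sample document is hypothetical and heavily trimmed (the real Typepad response carries many more fields), and the script assumes it is run with `dotnet fsi` so that the NuGet reference can be resolved.

```fsharp
// Assumption: run under `dotnet fsi`, which can resolve NuGet references
#r "nuget: FSharp.Data"

open FSharp.Data

// Hypothetical, trimmed-down sample of the JSON shape
[<Literal>]
let Sample =
  """{"entries":[{"title":"First post","content":"<p>Hello</p>"}]}"""

// The provider infers a set of types from the sample at compile time
type SamplePosts = JsonProvider<Sample>

// Parse a document of the same shape and use the generated properties
let doc = SamplePosts.Parse(Sample)

let titles = doc.Entries |> Array.map (fun e -> e.Title)

printfn "%A" titles
```

The property names (Entries, Title, Content) are generated from the keys in the sample, which is why a representative sample file matters.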

The next problem was the "screen scraping": we need to extract HTML code and convert it to plain text to compare with the local files. I opted for the HtmlAgilityPack for this: it comes with an Html2Txt sample that I converted to F#. I haven't followed it exactly, as I wanted to keep some amount of whitespace in the generated plain text, but it got me a good part of the way there.
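To give a feel for what the conversion involves, here is a deliberately simplified sketch that does not use the HtmlAgilityPack at all: it swaps in a regular expression to strip tags and the framework's WebUtility.HtmlDecode to handle entities. The function name and sample string are made up for illustration, and a regex is far less robust than a real HTML parser.

```fsharp
open System.Net
open System.Text.RegularExpressions

// Turn each closing </p> into a line break, drop every other tag,
// then decode entities such as &lt; and &amp;
let htmlToText (html : string) =
  let withBreaks =
    Regex.Replace(html, "</p>", "\r\n", RegexOptions.IgnoreCase)
  let noTags = Regex.Replace(withBreaks, "<[^>]+>", "")
  WebUtility.HtmlDecode(noTags)

// Hypothetical fragment resembling a copied code section
let sample =
  "<div><p>if (a &lt; b &amp;&amp; c &gt; d)</p><p>  return;</p></div>"

printfn "%s" (htmlToText sample)
```

Leading whitespace inside each paragraph survives, which matters when the result is compared with indented source code. Note that HtmlDecode turns &nbsp; into a non-breaking space character rather than a plain space, so a real implementation would still need a replacement step for that.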

In a general sense, here's what the algorithm needs to do:

  1. Parse the various C# files on disk and create an index that links a comma-separated command-list with the source filename
    • The order is significant: if the commands don't come in the same order then the files will be different
  2. Extract the post content from my blog and parse it for code fragments
    • I've used the same CopyAsHtml tool since this blog's inception, so all code sections are enclosed in a similar-looking <div>
    • For now I only care about posts with a single code section. There are certainly posts where I've used this tool to copy smaller fragments for illustrative purposes, so at some point I need the code to pick these up, too
  3. For the posts with a single code section, extract the code and convert it to plain text
  4. Extract the commands implemented in the code and check which files have the same commands – in the same sequence – on disk
  5. Perform a lower-level comparison between the code extracted from HTML and the files on disk
    • This still needs some work: there are files which should match that for some reason don't, right now, but overall it's working quite well
    • For debugging purposes I'm currently writing any unmatched code fragments as files in another folder, so that I can go through and see what problems are worth fixing
  6. The ultimate output is a list of post titles with the matching local filename
    • It'll be a simple matter to copy these files programmatically into a local folder that will sync with GitHub
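Steps 1 and 4 above boil down to a lookup keyed on the ordered command list. A minimal sketch of that lookup, using made-up command names and filenames in place of the real index, might look like this:

```fsharp
// Hypothetical index entries: (comma-separated command list, filename)
let index =
  [ ("CMD1,CMD2", "first.cs")
    ("CMD2,CMD1", "second.cs")
    ("", "no-commands.cs")
    ("CMD1,CMD2", "first-copy.cs") ]

// Candidate files whose commands match exactly, in the same order.
// An empty command list never matches anything.
let candidates (cmds : string list) =
  let key = String.concat "," cmds
  index
  |> List.filter (fun (k, _) -> k <> "" && k = key)
  |> List.map snd

printfn "%A" (candidates [ "CMD1"; "CMD2" ])
printfn "%A" (candidates [ "CMD2"; "CMD1" ])
```

Because the key is an ordered string, the same commands in a different sequence point at a different file, and several files can share one key. That is why the lower-level comparison in step 5 is still needed.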

Here's the code I have, so far:

(*

#r "Z:/GitHub/FSharp.Data/bin/FSharp.Data.dll"

#r "packages\HtmlAgilityPack.1.4.9\lib\Net45\HtmlAgilityPack.dll"

#r "System.Xml"

*)

 

open FSharp.Data

open HtmlAgilityPack

open System.Xml

open System

open System.IO

 

let codeHeader = "<div style="

let blogRoot = "http://api.typepad.com/blogs/6a00d83452464869e200d83452baa169e2/post-assets.json"

let csFolder = @"Z:\data\Blogs\Projects\Basic C# app"

let tmpFolder = @"Z:\data\Blogs\Projects\Basic C# app\Notfound"

let csTest =

  @"Z:\data\Blogs\Projects\Basic C# app\enumerate-sysvars.cs"

let cmdAttrib = "[CommandMethod("

 

type Post = JsonProvider<"data.json">

 

// Use TypePad's REST API to retrieve batches of posts

 

let getPosts m n =

  let url = String.Format("{0}?max-results={1}", blogRoot, m)

  let url2 =

    match n with

    | 0 -> url

    | _ -> url + "&start-index=" + (m * n).ToString()

  let doc = Post.Load(url2)

  doc.Entries

 

// Count the number of times a substring appears in a string

 

let countOccurrences (sub:string) (text:string) =

  match sub with

  | "" -> 0

  | _ ->

    (text.Length - text.Replace(sub, @"").Length) / sub.Length

 

// These are HTML entity codes etc. that need to be replaced

// as we convert from HTML to plain text

 

let reps =

  [("&#0160;"," ");("&#160;"," ");("&nbsp;"," ");("&gt;",">");

   ("&lt;","<");("&#39;","'");("&quot;", "\"");("&ndash;","-");

   ("&amp;","&");("Â","")]

 

let convertText (t : string) =

  List.fold

    (fun (a : string) (b : string, c : string) -> a.Replace(b,c))

      t reps

 

// Use the HtmlAgilityPack to convert from HTML to plain text

 

let rec convertTo (node : HtmlNode ) =

  match node.NodeType with

  | HtmlNodeType.Comment -> ""

  | HtmlNodeType.Document ->

      Seq.map convertTo node.ChildNodes |>

        Seq.fold (fun r s -> r + s) ""

  | HtmlNodeType.Text ->

      // script and style must not be output

      let parentName = node.ParentNode.Name

      if parentName = "script" || parentName = "style" then

        ""

      else

        // get text

        let html = (node :?> HtmlTextNode).Text;

 

        // is it in fact a special closing node output as text?

        if HtmlNode.IsOverlappedClosingElement(html) then

          ""

        else

          convertText html

  | HtmlNodeType.Element ->

      if node.Name = "p" then

        if node.HasChildNodes then

          (Seq.map convertTo node.ChildNodes |>

            Seq.fold (fun r s -> r + s) "") + "\r\n"

        else

          "\r\n"

      else if node.HasChildNodes then

        Seq.map convertTo node.ChildNodes |>

          Seq.fold (fun r s -> r + s) ""

      else

        ""

  | _ -> ""

 

// Take post data and extract the HTML fragment representing code

 

let extractCode (content : string) =

  let start = content.IndexOf(codeHeader)

  let finish = content.LastIndexOf("</div>") + 6

  let html = content.Substring(start, finish - start)

  let doc = new HtmlDocument()

  doc.LoadHtml(html)

  convertTo doc.DocumentNode

 

// If a post contains only 1 code segment, we'll extract it

 

let processPost (ent : Post.Entry) =

  let count = countOccurrences codeHeader ent.Content

  ent.Title,

  count,

  match count with

  | 1 -> extractCode ent.Content

  | _ -> ""

 

// List the files conforming to a pattern in a folder

 

let filesInFolder pat folder =

  try Directory.GetFiles(folder, pat, SearchOption.TopDirectoryOnly)

    |> Array.toList

  with | _ -> []

 

// Get the indices at which a substring occurs in a string

 

let stringIndices (pat:string) (text:string) =

  let rec getIndices (pat:string) (text:string) (start:int) =

    match text.IndexOf(pat, start) with

    | -1 -> []

    | x -> x :: getIndices pat text (x+1)

  getIndices pat text 0

 

// Extract the command name from a CommandMethod attribute

 

let extractCommandName (text : string) =

  let delim = "\""

  let count = countOccurrences delim text

  match count with

  | 0 -> ""

  | 1 -> ""

  | 2 -> text.Substring(1, text.LastIndexOf(delim) - 1)

  | 3 -> ""

  | _ ->

      let idxs = stringIndices delim text

      text.Substring(idxs.[2] + 1, idxs.[3] - idxs.[2] - 1)

 

// Extract the various command names from a code segment

 

let rec commandsFromCode (text : string) =

  match text.Contains(cmdAttrib) with

  | false -> []

  | true ->

    let start = text.IndexOf(cmdAttrib) + cmdAttrib.Length

    let finish = text.IndexOf(")", start + 1)

    let name =

      text.Substring(start, finish - start) |> extractCommandName

    name :: commandsFromCode (text.Substring finish)

 

// Create a comma-separated string from a list of strings

 

let rec commaSepString (cmds : string list) =

  match cmds with

  | [] -> ""

  | x::[] -> x

  | x::xs -> x + "," + commaSepString xs

 

// Get the commands for a particular file on disk as a

// comma-separated list and return them with the filename

 

let commandsForFile file =

  File.ReadAllText file |>

  commandsFromCode |>

  commaSepString |>

  (fun x -> (x, file))

 

// Get the command names for a set of files on disk

 

let rec commandsForFiles files =

  match files with

  | [] -> []

  | file::xs -> commandsForFile file :: commandsForFiles xs

 

// Create an index from commands to files for a particular folder

 

let indexCommands (folder : string) =

  filesInFolder "*.cs" folder |> commandsForFiles

 

// From our index, get the files associated with a command-set

 

let filesForCommandsFromIndex index cmds =

  index |>

  List.filter (fun (a,b) -> a = cmds && a <> "") |>

  List.map (fun (a,b) -> b)

 

// Strip blank lines from a sequence of strings

 

let stripBlanks (s : seq<string>) =

  Seq.filter (fun x -> not(String.IsNullOrWhiteSpace(x))) s

 

// Compare sequences of strings, ignoring non-relevant whitespace

 

let compareSequences (s1 : seq<string>) (s2 : seq<string>) =

  Seq.compareWith

    (fun (a:string) (b:string) -> String.Compare(a.Trim(), b.Trim()))

    (stripBlanks s1) (stripBlanks s2)

 

// Write code to a temp file - for debugging only

 

let writeToTmpFile (code:string) =

  let rec getTmpFile i =

    let file = tmpFolder + "\\" + i.ToString() + ".cs"

    if not(File.Exists(file)) then

      file

    else

      getTmpFile (i+1)

  use wr = new StreamWriter((getTmpFile 0))

  wr.Write(code)

 

// Take a code fragment and a file and check them for equivalence

 

let checkCodeAgainstFile (code:string) (file:string) =

  let clines = code.Split("\n\r".ToCharArray())

  let flines = File.ReadAllLines(file)

  let s1 = Seq.ofArray clines

  let s2 = Seq.ofArray flines

  if compareSequences s1 s2 = 0 then

    [file]

  else

    []

 

// Take a code fragment and a set of files and see if one matches

 

let checkCodeAgainstFiles code files =

  let rec checkAgainstFiles code files =

    match files with

    | [] -> []

    | x::xs ->

      checkCodeAgainstFile code x :: checkAgainstFiles code xs

  checkAgainstFiles code files |> List.concat

 

// Our main function

 

[<EntryPoint>]

let main argv =

 

  // Build an index from commands to source files on the hard drive

 

  let index = indexCommands csFolder

 

  // Pull down post information from TypePad and process it

 

  let posts =

    [|0..25|] |>

    Array.map (getPosts 50) |> // Get up to 1300 posts (26 batches of 50)

    Array.concat |>            // Flatten the nested arrays

    Array.map processPost      // Process the posts

 

  // Separate the posts into posts with code and those without

 

  let postsWith, postsWithout =

    Array.partition (fun (a,b,c) -> b > 0) posts

 

  // Separate the posts with code into those with one section

  // and those with more

 

  let postsWithOne, postsWithMore =

    Array.partition (fun (a,b,c) -> b = 1) postsWith

 

  printfn

    "%d posts with zero, %d posts with one, %d posts with more"

    postsWithout.Length

    postsWithOne.Length

    postsWithMore.Length

 

  // We'll take the posts with a single code section and process

  // them

 

  let res =

    postsWithOne |>

    Array.map (fun (a,b,c) -> commandsFromCode c) |> // Get commands

    Array.map commaSepString |> // Make a comma-delimited cmd list

    Array.map (filesForCommandsFromIndex index) |> // Use our index

    Array.map2 (fun (a,b,c) d -> (a,c,d)) postsWithOne |> // Title, code, files

    Array.filter (fun (a,b,c) -> b <> "") |> // Strip codeless

    Array.map

      (fun (a,b,c) ->

        a,

        let x = checkCodeAgainstFiles b c

        if x = [] then writeToTmpFile b // This is for debugging

        x) |>

    Array.filter (fun (a,b) -> b <> []) // Strip fileless

 

  0 // return an integer exit code

Right now it finds 108 source files that are "correct" on disk. This is a reasonable start, but there are certainly more to be found.
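The follow-on step mentioned in the outline, copying the matched files into a folder that syncs with GitHub, isn't part of the code above. A minimal sketch of how it might look is below; the function name and the throwaway folders under the system temp path are purely illustrative.

```fsharp
open System.IO

// Copy each matched file into a target folder, overwriting any
// existing copy, and return the destination paths
let copyMatches (target : string) (files : string list) =
  Directory.CreateDirectory(target) |> ignore
  files
  |> List.map (fun file ->
      let dest = Path.Combine(target, Path.GetFileName(file))
      File.Copy(file, dest, true)
      dest)

// Demo using hypothetical folders under the system temp path
let root = Path.Combine(Path.GetTempPath(), "copy-matches-demo")
let source = Path.Combine(root, "source")
Directory.CreateDirectory(source) |> ignore

let sampleFile = Path.Combine(source, "sample.cs")
File.WriteAllText(sampleFile, "// sample")

let copied = copyMatches (Path.Combine(root, "github")) [ sampleFile ]

printfn "%A" copied
```

In the real program the file list would come from the second element of each tuple in res, flattened into a single list.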

By the way, while I don't currently have a second part of this series planned, specifically, I know I'm going to need one to share the final version of the code. Which I'll also place on GitHub, of course. 🙂
