Extracting code from blog posts using F# – Part 1

I mentioned some weeks ago that I was looking to automate the validation of a bunch of source files I have on disk before uploading them to GitHub. I decided to get started on that, this week, to see what was involved. As mentioned in the previous post, I wanted to programmatically query posts from Typepad, extract any embedded code from them and compare it with what I had on disk, to see which files were already correct and which needed to be created or removed.

The first step was to choose a language. The fact this is really an OS-level task – with a bunch of string processing and file I/O, but nothing whatsoever to do with AutoCAD – gave me some freedom to choose pretty much any language that can be run on OS X or Windows. I briefly considered Python or Ruby, but ended up going back to F#: it's nicely integrated into Visual Studio (my primary development tool) and a functional approach makes a lot of sense for this kind of problem. I also had the itch to do more with F# after my recent (too brief) foray into machine learning. One final factor in the decision process – not that it precludes the use of F#, at all – is that as this is a one-off activity I'm not at all worried about performance or memory constraints. It doesn't matter if it takes 10 minutes to complete, for instance. Just as long as it completes. 😉

Armed with F#, I then started looking at the problem itself. First up was pulling data down from the blog: I needed an API to access my blog's content on Typepad – thankfully there's a simple REST API which gives access to its posts' complete content – and then had some additional choices to make about how to access the data.

To simplify working with the blog's post data, I decided to use F# Type Providers. These allow you to code against data-oriented services as if they had local object models. Which is exactly what happens, I guess: when the code instantiates a JSON Type Provider against a particular resource – I downloaded a sample JSON file from the Typepad API for this – its contents are then accessible via a locally-generated set of objects and properties.
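
To give a feel for what that looks like, here is a minimal sketch that uses an inline JSON sample instead of a downloaded file. The sample's shape (an "entries" array with "title" and "content" fields) is a cut-down assumption for illustration, not the real Typepad response, and the #r "nuget: ..." line assumes a recent version of F# Interactive.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical, cut-down sample: the provider infers its types from this text
type MiniPost =
  JsonProvider<""" { "entries": [ { "title": "A post", "content": "<p>Hello</p>" } ] } """>

// Parse some data with the same shape as the sample
let doc =
  MiniPost.Parse(""" { "entries": [ { "title": "First", "content": "<p>x</p>" },
                                    { "title": "Second", "content": "<p>y</p>" } ] } """)

// Entries, Title and Content are generated properties, checked by the compiler
let titles = doc.Entries |> Array.map (fun e -> e.Title)
printfn "%A" titles
```

Misspelling a property name here is a compile-time error rather than a runtime surprise, which is most of the appeal for a throwaway script like this one.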

The next problem was the "screen scraping": we need to extract HTML code and convert it to plain text to compare with the local files. I opted for the HtmlAgilityPack for this: it comes with an Html2Txt sample that I converted to F#. I haven't followed it exactly, as I wanted to keep some amount of whitespace in the generated plain text, but it got me a good part of the way there.
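
Entity handling is the fiddly part of that conversion. As a hedged aside, the .NET base class library's System.Net.WebUtility.HtmlDecode covers the standard entities, so a sketch like the following could stand in for a hand-maintained replacement table. It is not what the code further down does, and turning non-breaking spaces into ordinary ones is an assumption about what is wanted when comparing code.

```fsharp
open System.Net

// Decode HTML entities, then turn non-breaking spaces into ordinary spaces
let decodeEntities (html : string) =
  WebUtility.HtmlDecode(html).Replace('\u00A0', ' ')

printfn "%s" (decodeEntities "if (a &lt; b &amp;&amp; s == &quot;x&quot;)&nbsp;{")
```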

In a general sense, here's what the algorithm needs to do:

  1. Parse the various C# files on disk and create an index that links a comma-separated command-list with the source filename
    • The order is significant: if the commands don't come in the same order then the files will be treated as different
  2. Extract the post content from my blog and parse it for code fragments
    • I've used the same CopyAsHtml tool since this blog's inception, so all code sections are enclosed in a similar-looking <div>
    • For now I only care about posts with a single code section. There are certainly posts where I've used this tool to copy smaller fragments for illustrative purposes, so at some point I need the code to pick these up, too
  3. For the posts with a single code section, extract the code and convert it to plain text
  4. Extract the commands implemented in the code and check for which files have the same commands – in the same sequence – on disk
  5. Perform a lower-level comparison between the code extracted from HTML and the files on disk
    • This still needs some work: there are files which should match that for some reason don't, right now, but overall it's working quite well
    • For debugging purposes I'm currently writing any unmatched code fragments as files in another folder, so that I can go through and see what problems are worth fixing
  6. The ultimate output is a list of post titles with the matching local filename
    • It'll be a simple matter to copy these files programmatically into a local folder that will sync with GitHub
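
Steps 1 and 4 boil down to building a key from an ordered command list and looking it up. A minimal, self-contained sketch of that idea follows; the index contents and most of the file names are made up for illustration, and the helper names (keyFor, filesFor) are hypothetical rather than taken from the full listing.

```fsharp
// Hypothetical index: (comma-separated command list, source file) pairs
let sampleIndex =
  [ ("ESV", "enumerate-sysvars.cs")
    ("CMD1,CMD2", "two-commands.cs")
    ("CMD2,CMD1", "two-commands-reversed.cs")
    ("", "no-commands.cs") ]

// Join command names into a lookup key, preserving their order
let keyFor (cmds : string list) = String.concat "," cmds

// Find the files whose key matches exactly; an empty key never matches
let filesFor index cmds =
  let key = keyFor cmds
  index
  |> List.filter (fun (k, _) -> k = key && k <> "")
  |> List.map snd

printfn "%A" (filesFor sampleIndex ["CMD1"; "CMD2"])
```

Because the key preserves order, the same two commands in reverse select a different file, and code with no commands at all never matches anything.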

Here's the code I have, so far:

(*

#r "Z:/GitHub/FSharp.Data/bin/FSharp.Data.dll"

#r "packages\HtmlAgilityPack.1.4.9\lib\Net45\HtmlAgilityPack.dll"

#r "System.Xml"

*)

 

open FSharp.Data

open HtmlAgilityPack

open System.Xml

open System

open System.IO

 

let codeHeader = "<div style="

let blogRoot = "http://api.typepad.com/blogs/6a00d83452464869e200d83452baa169e2/post-assets.json"

let csFolder = @"Z:\data\Blogs\Projects\Basic C# app"

let tmpFolder = @"Z:\data\Blogs\Projects\Basic C# app\Notfound"

let csTest =

  @"Z:\data\Blogs\Projects\Basic C# app\enumerate-sysvars.cs"

let cmdAttrib = "[CommandMethod("

 

type Post = JsonProvider<"data.json">

 

// Use TypePad's REST API to retrieve batches of posts

 

let getPosts m n =

  let url = String.Format("{0}?max-results={1}", blogRoot, m)

  let url2 =

    match n with

    | 0 -> url

    | _ -> url + "&start-index=" + (m * n).ToString()

  let doc = Post.Load(url2)

  doc.Entries

 

// Count the number of times a substring appears in a string

 

let countOccurrences (sub:string) (text:string) =

  match sub with

  | "" -> 0

  | _ ->

    (text.Length - text.Replace(sub, @"").Length) / sub.Length

 

// These are HTML entity codes etc. that need to be replaced

// as we convert from HTML to plain text

 

let reps =

  [("&#0160;"," ");("&#160;"," ");("&nbsp;"," ");("&gt;",">");

   ("&lt;","<");("&#39;","'");("&quot;", "\"");("&ndash;","-");

   ("&amp;","&");("Â","")]

 

let convertText (t : string) =

  List.fold

    (fun (a : string) (b : string, c : string) -> a.Replace(b,c))

      t reps

 

// Use the HtmlAgilityPack to convert from HTML to plain text

 

let rec convertTo (node : HtmlNode ) =

  match node.NodeType with

  | HtmlNodeType.Comment -> ""

  | HtmlNodeType.Document ->

      Seq.map convertTo node.ChildNodes |>

        Seq.fold (fun r s -> r + s) ""

  | HtmlNodeType.Text ->

      // script and style must not be output

      let parentName = node.ParentNode.Name

      if parentName = "script" || parentName = "style" then

        ""

      else

        // get text

        let html = (node :?> HtmlTextNode).Text;

 

        // is it in fact a special closing node output as text?

        if HtmlNode.IsOverlappedClosingElement(html) then

          ""

        else

          convertText html

  | HtmlNodeType.Element ->

      if node.Name = "p" then

        if node.HasChildNodes then

          (Seq.map convertTo node.ChildNodes |>

            Seq.fold (fun r s -> r + s) "") + "\r\n"

        else

          "\r\n"

      else if node.HasChildNodes then

        Seq.map convertTo node.ChildNodes |>

          Seq.fold (fun r s -> r + s) ""

      else

        ""

  | _ -> ""

 

// Take post data and extract the HTML fragment representing code

 

let extractCode (content : string) =

  let start = content.IndexOf(codeHeader)

  let finish = content.LastIndexOf("</div>") + 6

  let html = content.Substring(start, finish - start)

  let doc = new HtmlDocument()

  doc.LoadHtml(html)

  convertTo doc.DocumentNode

 

// If a post contains only 1 code segment, we'll extract it

 

let processPost (ent : Post.Entry) =

  let count = countOccurrences codeHeader ent.Content

  ent.Title,

  count,

  match count with

  | 1 -> extractCode ent.Content

  | _ -> ""

 

// List the files conforming to a pattern in a folder

 

let filesInFolder pat folder =

  try Directory.GetFiles(folder, pat, SearchOption.TopDirectoryOnly)

    |> Array.toList

  with | e -> []

 

// Get the indices at which a substring occurs in a string

 

let stringIndices (pat:string) (text:string) =

  let rec getIndices (pat:string) (text:string) (start:int) =

    match text.IndexOf(pat, start) with

    | -1 -> []

    | x -> x :: getIndices pat text (x+1)

  getIndices pat text 0

 

// Extract the command name from a CommandMethod attribute

 

let extractCommandName (text : string) =

  let delim = "\""

  let count = countOccurrences delim text

  match count with

  | 0 -> ""

  | 1 -> ""

  | 2 -> text.Substring(1, text.LastIndexOf(delim) - 1)

  | 3 -> ""

  | _ ->

      let idxs = stringIndices delim text

      text.Substring(idxs.[2] + 1, idxs.[3] - idxs.[2] - 1)

 

// Extract the various command names from a code segment

 

let rec commandsFromCode (text : string) =

  match text.Contains(cmdAttrib) with

  | false -> []

  | true ->

    let start = text.IndexOf(cmdAttrib) + cmdAttrib.Length

    let finish = text.IndexOf(")", start + 1)

    let name =

      text.Substring(start, finish - start) |> extractCommandName

    name :: commandsFromCode (text.Substring finish)

 

// Create a comma-separated string from a list of strings

 

let rec commaSepString (cmds : string list) =

  match cmds with

  | [] -> ""

  | x::[] -> x

  | x::xs -> x + "," + commaSepString xs

 

// Get the commands for a particular file on disk as a

// comma-separated list and return them with the filename

 

let commandsForFile file =

  File.ReadAllText file |>

  commandsFromCode |>

  commaSepString |>

  (fun x -> (x, file))

 

// Get the command names for a set of files on disk

 

let rec commandsForFiles files =

  match files with

  | [] -> []

  | file::xs -> commandsForFile file :: commandsForFiles xs

 

// Create an index from commands to files for a particular folder

 

let indexCommands (folder : string) =

  filesInFolder "*.cs" folder |> commandsForFiles

 

// From our index, get the files associated with a command-set

 

let filesForCommandsFromIndex index cmds =

  index |>

  List.filter (fun (a,b) -> a = cmds && a <> "") |>

  List.map (fun (a,b) -> b)

 

// Strip blank lines from a sequence of strings

 

let stripBlanks (s : seq<string>) =

  Seq.filter (fun x -> not(String.IsNullOrWhiteSpace(x))) s

 

// Compare sequences of strings, ignoring non-relevant whitespace

 

let compareSequences (s1 : seq<string>) (s2 : seq<string>) =

  Seq.compareWith

    (fun (a:string) (b:string) -> String.Compare(a.Trim(), b.Trim()))

    (stripBlanks s1) (stripBlanks s2)

 

// Write code to a temp file - for debugging only

 

let writeToTmpFile (code:string) =

  let rec getTmpFile i =

    let file = tmpFolder + "\\" + i.ToString() + ".cs"

    if not(File.Exists(file)) then

      file

    else

      getTmpFile (i+1)

  use wr = new StreamWriter((getTmpFile 0))

  wr.Write(code)

 

// Take a code fragment and a file and check them for equivalence

 

let checkCodeAgainstFile (code:string) (file:string) =

  let clines = code.Split("\n\r".ToCharArray())

  let flines = File.ReadAllLines(file)

  let s1 = Seq.ofArray clines

  let s2 = Seq.ofArray flines

  if compareSequences s1 s2 = 0 then

    [file]

  else

    []

 

// Take a code fragment and a set of files and see if one matches

 

let checkCodeAgainstFiles code files =

  let rec checkAgainstFiles code files =

    match files with

    | [] -> []

    | x::xs ->

      checkCodeAgainstFile code x :: checkAgainstFiles code xs

  checkAgainstFiles code files |> List.concat

 

// Our main function

 

[<EntryPoint>]

let main argv =

 

  // Build an index from commands to source files on the hard drive

 

  let index = indexCommands csFolder

 

  // Pull down post information from TypePad and process it

 

  let posts =

    [|0..25|] |>

    Array.map (getPosts 50) |> // Up to 1300 posts: 26 batches of 50

    Array.concat |>            // Flatten the nested arrays

    Array.map processPost      // Process the posts

 

  // Separate the posts into posts with code and those without

 

  let postsWith, postsWithout =

    Array.partition (fun (a,b,c) -> b > 0) posts

 

  // Separate the posts with code into those with one section

  // and those with more

 

  let postsWithOne, postsWithMore =

    Array.partition (fun (a,b,c) -> b = 1) postsWith

 

  printfn

    "%d posts with zero, %d posts with one, %d posts with more"

    postsWithout.Length

    postsWithOne.Length

    postsWithMore.Length

 

  // We'll take the posts with a single code section and process

  // them

 

  let res =

    postsWithOne |>

    Array.map (fun (a,b,c) -> commandsFromCode c) |> // Get commands

    Array.map commaSepString |> // Make a comma-delimited cmd list

    Array.map (filesForCommandsFromIndex index) |> // Use our index

    Array.map2 (fun (a,b,c) d -> (a,c,d)) postsWithOne |> //

    Array.filter (fun (a,b,c) -> b <> "") |> // Strip codeless

    Array.map

      (fun (a,b,c) ->

        a,

        let x = checkCodeAgainstFiles b c

        if x = [] then writeToTmpFile b // This is for debugging

        x) |>

    Array.filter (fun (a,b) -> b <> []) // Strip fileless

 

  0 // return an integer exit code
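
The command-name extraction above works by counting quotes and taking either the first or the second quoted string. As an alternative sketch (not the approach used in the listing), a regular expression can express the same rule in one place. The pattern assumes the attribute is written either as ("CMD", ...) or as ("GROUP", "CMD", ...), and the names commandPattern and commandNames are hypothetical.

```fsharp
open System.Text.RegularExpressions

// First quoted argument, optionally followed by a second quoted argument
let commandPattern =
  Regex("""\[CommandMethod\(\s*"([^"]*)"(?:\s*,\s*"([^"]*)")?""")

// Take the second quoted string when present, otherwise the first
let commandNames (code : string) =
  commandPattern.Matches(code)
  |> Seq.cast<Match>
  |> Seq.map (fun m ->
       if m.Groups.[2].Success then m.Groups.[2].Value
       else m.Groups.[1].Value)
  |> List.ofSeq

let sample =
  "[CommandMethod(\"ESV\")]\n" +
  "[CommandMethod(\"GROUP\", \"LOAD\", CommandFlags.Modal)]\n" +
  "[CommandMethod(\"SESS\", CommandFlags.Session)]"

printfn "%A" (commandNames sample)
```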

Right now it finds 108 source files that are "correct" on disk. This is a reasonable start, but there are certainly more to be found.
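
For the near-misses, one possible debugging aid is to report the first pair of lines that differ after the same normalisation the comparison applies (blank lines dropped, remaining lines trimmed). This is a sketch under that assumption, with hypothetical helper names, and it makes no claim about what actually causes the current mismatches.

```fsharp
open System

// Normalise text as the comparison does: drop blank lines, trim the rest
let normalise (text : string) =
  text.Split([| '\r'; '\n' |])
  |> Array.filter (fun l -> not (String.IsNullOrWhiteSpace l))
  |> Array.map (fun l -> l.Trim())

// First pair of differing lines, or None if the common prefix is identical
// (Seq.zip stops at the shorter input, so length differences are not reported)
let firstDifference (a : string) (b : string) =
  Seq.zip (normalise a) (normalise b)
  |> Seq.tryFind (fun (x, y) -> x <> y)

let fromPost = "void F()\r\n{\r\n  int x = 1;\r\n}"
let fromDisk = "void F()\n{\n\n  int  x = 1;\n}"

printfn "%A" (firstDifference fromPost fromDisk)
```

In this made-up pair the only difference is a doubled space inside a line, which trimming the ends does not remove.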

By the way, while I don't currently have a second part of this series planned, specifically, I know I'm going to need one to share the final version of the code. Which I'll also place on GitHub, of course. 🙂
