Extracting code from blog posts using F# – Part 1

I mentioned some weeks ago that I was looking to automate the validation of a bunch of source files I have on disk before uploading them to GitHub. I decided to get started on that, this week, to see what was involved. As mentioned in the previous post, I wanted to programmatically query posts from Typepad, extract any embedded code from them and compare it with what I had on disk, to see which files were already correct and which needed to be created or removed.

The first step was to choose a language. The fact this is really an OS-level task – with a bunch of string processing and file I/O, but nothing whatsoever to do with AutoCAD – gave me some freedom to choose pretty much any language that can be run on OS X or Windows. I briefly considered Python or Ruby, but ended up going back to F#: it's nicely integrated into Visual Studio (my primary development tool) and a functional approach makes a lot of sense for this kind of problem. I also had the itch to do more with F# after my recent (too brief) foray into machine learning. One final factor in the decision process – not that it precludes the use of F#, at all – is that as this is a one-off activity I'm not at all worried about performance or memory constraints. It doesn't matter if it takes 10 minutes to complete, for instance. Just as long as it completes. 😉

Armed with F#, I then started looking at the problem itself. First up was pulling data down from the blog: I needed an API to access my blog's content on Typepad – thankfully there's a simple REST API which gives access to its posts' complete content – and then had some additional choices to make about how to access the data.

To simplify working with the blog's post data, I decided to use F# Type Providers. These allow you to code against data-oriented services as if they had local object models. Which is exactly what happens, I guess: when the code instantiates a JSON Type Provider against a particular resource – I downloaded a sample JSON file from the Typepad API for this – its contents are then accessible via a locally-generated set of objects and properties.
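
As a rough illustration of the idea, here's a minimal sketch that uses an inline sample rather than a downloaded file. The JSON shown is a made-up, heavily trimmed stand-in for the real Typepad response (only the entries, title and content fields that the listing further down relies on), and it assumes the FSharp.Data package is available via NuGet:

```fsharp
// Assumes the FSharp.Data NuGet package; in an .fsx script it can be
// referenced like this (F# 5 or later)
#r "nuget: FSharp.Data"

open FSharp.Data

// Hypothetical sample: the real response has many more fields
type SamplePosts = JsonProvider<"""
  { "entries": [ { "title": "A post", "content": "<p>Some HTML</p>" } ] }
""">

let doc =
  SamplePosts.Parse("""
    { "entries": [ { "title": "Hello", "content": "<p>World</p>" } ] }
  """)

// Entries, Title and Content are generated from the sample at
// compile time, so typos in property names are caught by the compiler
for e in doc.Entries do
  printfn "%s: %d characters of content" e.Title e.Content.Length
```

The same generated type can load from a URL or file with Load instead of Parse, which is what the full listing does.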

The next problem was the "screen scraping": we need to extract HTML code and convert it to plain text to compare with the local files. I opted for the HtmlAgilityPack for this: it comes with an Html2Txt sample that I converted to F#. I haven't followed it exactly, as I wanted to keep some amount of whitespace in the generated plain text, but it got me a good part of the way there.

In a general sense, here's what the algorithm needs to do:

1. Parse the various C# files on disk and create an index that links a comma-separated command-list with the source filename
   - The order is significant: if the commands don't come in the same order then the files will be different
2. Extract the post content from my blog and parse it for code fragments
   - I've used the same CopyAsHtml tool since this blog's inception, so all code sections are enclosed in a similar-looking <div>
   - For now I only care about posts with a single code section. There are certainly posts where I've used this tool to copy smaller fragments for illustrative purposes, so at some point I need the code to pick these up, too
3. For the posts with a single code section, extract the code and convert it to plain text
4. Extract the commands implemented in the code and check which files have the same commands – in the same sequence – on disk
5. Perform a lower-level comparison between the code extracted from HTML and the files on disk
   - This still needs some work: there are files which should match that for some reason don't, right now, but overall it's working quite well
   - For debugging purposes I'm currently writing any unmatched code fragments as files in another folder, so that I can go through and see what problems are worth fixing
6. The ultimate output is a list of post titles with the matching local filename
   - It'll be a simple matter to copy these files programmatically into a local folder that will sync with GitHub
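
To make the indexing step a little more concrete before the full listing, here's a compact sketch of the idea. It isn't the code I actually use: it swaps the string-slicing approach for a regular expression, and the pattern, the helper names and the sample C# fragment are all assumptions made for illustration.

```fsharp
open System.Text.RegularExpressions

// Hypothetical pattern: captures the first string argument of a
// CommandMethod attribute and, if present, a second one. With two
// strings, the first is a group name and the second the command name.
let cmdPattern =
  Regex(@"\[CommandMethod\(\s*""([^""]*)""(?:\s*,\s*""([^""]*)"")?")

// Command names in the order they appear in the code
let commandNames (code : string) =
  [ for m in Seq.cast<Match> (cmdPattern.Matches code) do
      let g = if m.Groups.[2].Success then m.Groups.[2] else m.Groups.[1]
      yield g.Value ]

// The index key: command names joined with commas, order preserved
let commandKey (code : string) =
  commandNames code |> String.concat ","

// Made-up stand-in for the contents of a file on disk
let onDisk = """
  [CommandMethod("ESV")]
  public void EnumerateSysVars() { }
  [CommandMethod("MYGROUP", "ESV2", CommandFlags.Modal)]
  public void EnumerateSysVars2() { }
"""

// A tiny index from key to an example file name
let index = [ commandKey onDisk, "enumerate-sysvars.cs" ]

// Files whose (non-empty) key matches exactly
let lookup key =
  index |> List.filter (fun (k, _) -> k = key && k <> "") |> List.map snd

printfn "%A" (lookup "ESV,ESV2")   // same commands, same order
printfn "%A" (lookup "ESV2,ESV")   // same commands, different order
```

The first lookup finds the file and the second comes back empty, which is the sense in which command order matters for the index.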

Here's the code I have, so far:

(*
#r "Z:/GitHub/FSharp.Data/bin/FSharp.Data.dll"
#r "packages\HtmlAgilityPack.1.4.9\lib\Net45\HtmlAgilityPack.dll"
#r "System.Xml"
*)
 
open FSharp.Data
open HtmlAgilityPack
open System.Xml
open System
open System.IO
 
let codeHeader = "<div style="
let blogRoot = "http://api.typepad.com/blogs/6a00d83452464869e200d83452baa169e2/post-assets.json"
let csFolder = @"Z:\data\Blogs\Projects\Basic C# app"
let tmpFolder = @"Z:\data\Blogs\Projects\Basic C# app\Notfound"
let csTest =
  @"Z:\data\Blogs\Projects\Basic C# app\enumerate-sysvars.cs"
let cmdAttrib = "[CommandMethod("
 
type Post = JsonProvider<"data.json">
 
// Use TypePad's REST API to retrieve batches of posts
 
let getPosts m n =
  let url = String.Format("{0}?max-results={1}", blogRoot, m)
  let url2 =
    match n with
    | 0 -> url
    | _ -> url + "&start-index=" + (m * n).ToString()
  let doc = Post.Load(url2)
  doc.Entries
 
// Count the number of times a substring appears in a string
 
let countOccurrences (sub:string) (text:string) =
  match sub with
  | "" -> 0
  | _ ->
    (text.Length - text.Replace(sub, @"").Length) / sub.Length
 
// These are HTML entity codes etc. that need to be replaced
// as we convert from HTML to plain text
 
let reps =
  [("&#0160;"," ");("&#160;"," ");("&nbsp;"," ");("&gt;",">");
   ("&lt;","<");("&#39;","'");("&quot;", "\"");("&ndash;","-");
   ("&amp;","&");("Â","")]
 
let convertText (t : string) = 
  List.fold
    (fun (a : string) (b : string, c : string) -> a.Replace(b,c))
      t reps
 
// Use the HtmlAgilityPack to convert from HTML to plain text
 
let rec convertTo (node : HtmlNode ) =
  match node.NodeType with
  | HtmlNodeType.Comment -> ""
  | HtmlNodeType.Document ->
      Seq.map convertTo node.ChildNodes |>
        Seq.fold (fun r s -> r + s) ""
  | HtmlNodeType.Text ->
      // script and style must not be output
      let parentName = node.ParentNode.Name
      if parentName = "script" || parentName = "style" then
        ""
      else
        // get text
        let html = (node :?> HtmlTextNode).Text;
 
        // is it in fact a special closing node output as text?
        if HtmlNode.IsOverlappedClosingElement(html) then
          ""
        else
          convertText html
  | HtmlNodeType.Element ->
      if node.Name = "p" then
        if node.HasChildNodes then
          (Seq.map convertTo node.ChildNodes |>
            Seq.fold (fun r s -> r + s) "") + "\r\n"
        else
          "\r\n"
      else if node.HasChildNodes then
        Seq.map convertTo node.ChildNodes |>
          Seq.fold (fun r s -> r + s) ""
      else
        ""
  | _ -> ""
 
// Take post data and extract the HTML fragment representing code
 
let extractCode (content : string) =
  let start = content.IndexOf(codeHeader)
  let finish = content.LastIndexOf("</div>") + 6
  let html = content.Substring(start, finish - start)
  let doc = new HtmlDocument()
  doc.LoadHtml(html)
  convertTo doc.DocumentNode
 
// If a post contains only 1 code segment, we'll extract it
 
let processPost (ent : Post.Entry) =
  let count = countOccurrences codeHeader ent.Content
  ent.Title,
  count,
  match count with
  | 1 -> extractCode ent.Content
  | _ -> ""
 
// List the files conforming to a pattern in a folder
 
let filesInFolder pat folder =
  try Directory.GetFiles(folder, pat, SearchOption.TopDirectoryOnly)
    |> Array.toList
  with | e -> []
 
// Get the indices at which a substring occurs in a string
 
let stringIndices (pat:string) (text:string) =
  let rec getIndices (pat:string) (text:string) (start:int) =
    match text.IndexOf(pat, start) with
    | -1 -> []
    | x -> x :: getIndices pat text (x+1)
  getIndices pat text 0
 
// Extract the command name from a CommandMethod attribute
 
let extractCommandName (text : string) =
  let delim = "\""
  let count = countOccurrences delim text
  match count with
  | 0 -> ""
  | 1 -> ""
  | 2 -> text.Substring(1, text.LastIndexOf(delim) - 1)
  | 3 -> ""
  | _ ->
      let idxs = stringIndices delim text
      text.Substring(idxs.[2] + 1, idxs.[3] - idxs.[2] - 1)
 
// Extract the various command names from a code segment
 
let rec commandsFromCode (text : string) =
  match text.Contains(cmdAttrib) with
  | false -> []
  | true ->
    let start = text.IndexOf(cmdAttrib) + cmdAttrib.Length
    let finish = text.IndexOf(")", start + 1)
    let name = 
      text.Substring(start, finish - start) |> extractCommandName
    name :: commandsFromCode (text.Substring finish)
 
// Create a comma-separated string from a list of strings
 
let rec commaSepString (cmds : string list) =
  match cmds with
  | [] -> ""
  | x::[] -> x
  | x::xs -> x + "," + commaSepString xs
 
// Get the commands for a particular file on disk as a
// comma-separated list and return them with the filename
 
let commandsForFile file =
  File.ReadAllText file |>
  commandsFromCode |>
  commaSepString |>
  (fun x -> (x, file))
 
// Get the command names for a set of files on disk
 
let rec commandsForFiles files =
  match files with
  | [] -> []
  | file::xs -> commandsForFile file :: commandsForFiles xs
 
// Create an index from commands to files for a particular folder
 
let indexCommands (folder : string) =
  filesInFolder "*.cs" folder |> commandsForFiles
 
// From our index, get the files associated with a command-set
 
let filesForCommandsFromIndex index cmds =
  index |>
  List.filter (fun (a,b) -> a = cmds && a <> "") |>
  List.map (fun (a,b) -> b)
 
// Strip blank lines from a sequence of strings
 
let stripBlanks (s : seq<string>) =
  Seq.filter (fun x -> not(String.IsNullOrWhiteSpace(x))) s
 
// Compare sequences of strings, ignoring non-relevant whitespace
 
let compareSequences (s1 : seq<string>) (s2 : seq<string>) =
  Seq.compareWith
    (fun (a:string) (b:string) -> String.Compare(a.Trim(), b.Trim()))
    (stripBlanks s1) (stripBlanks s2)
 
// Write code to a temp file - for debugging only
 
let writeToTmpFile (code:string) =
  let rec getTmpFile i =
    let file = tmpFolder + "\\" + i.ToString() + ".cs"
    if not(File.Exists(file)) then
      file
    else
      getTmpFile (i+1)
  use wr = new StreamWriter((getTmpFile 0))
  wr.Write(code)
 
// Take a code fragment and a file and check them for equivalence
 
let checkCodeAgainstFile (code:string) (file:string) =
  let clines = code.Split("\n\r".ToCharArray())
  let flines = File.ReadAllLines(file)
  let s1 = Seq.ofArray clines
  let s2 = Seq.ofArray flines
  if compareSequences s1 s2 = 0 then
    [file]
  else
    []
 
// Take a code fragment and a set of files and see if one matches
 
let checkCodeAgainstFiles code files =
  let rec checkAgainstFiles code files =
    match files with
    | [] -> []
    | x::xs ->
      checkCodeAgainstFile code x :: checkAgainstFiles code xs
  checkAgainstFiles code files |> List.concat
 
// Our main function
 
[<EntryPoint>]
let main argv = 
 
  // Build an index from commands to source files on the hard drive
 
  let index = indexCommands csFolder
 
  // Pull down post information from TypePad and process it
 
  let posts =
    [|0..25|] |>
    Array.map (getPosts 50) |> // Up to 1300 posts: 26 batches of 50
    Array.concat |>            // Flatten the nested arrays
    Array.map processPost      // Process the posts
 
  // Separate the posts into posts with code and those without
 
  let postsWith, postsWithout =
    Array.partition (fun (a,b,c) -> b > 0) posts
 
  // Separate the posts with code into those with one section
  // and those with more
 
  let postsWithOne, postsWithMore =
    Array.partition (fun (a,b,c) -> b = 1) postsWith
 
  printfn
    "%d posts with zero, %d posts with one, %d posts with more"
    postsWithout.Length
    postsWithOne.Length
    postsWithMore.Length
 
  // We'll take the posts with a single code section and process
  // them
 
  let res =
    postsWithOne |>
    Array.map (fun (a,b,c) -> commandsFromCode c) |> // Get commands
    Array.map commaSepString |> // Make a comma-delimited cmd list
    Array.map (filesForCommandsFromIndex index) |> // Use our index
    // Recombine each post's title and code with its candidate files
    Array.map2 (fun (a,b,c) d -> (a,c,d)) postsWithOne |>
    Array.filter (fun (a,b,c) -> b <> "") |> // Strip codeless
    Array.map
      (fun (a,b,c) ->
        a,
        let x = checkCodeAgainstFiles b c
        if x = [] then writeToTmpFile b // This is for debugging
        x) |>
    Array.filter (fun (a,b) -> b <> []) // Strip fileless
 
  0 // return an integer exit code

Right now it finds 108 source files that are "correct" on disk. This is a reasonable start, but there are certainly more to be found.
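
One way to track down the fragments that should match but don't might be to report the first pair of lines that differ, rather than just a yes/no answer. The following is only a sketch of that idea, not something that's in the code above: the function names are made up, and it compares lines with plain equality rather than String.Compare, which could behave differently for some culture-sensitive cases.

```fsharp
open System

// Normalise in the same spirit as the comparison in the listing:
// drop blank lines and trim each remaining line
let normalise (lines : seq<string>) =
  lines
  |> Seq.filter (fun l -> not (String.IsNullOrWhiteSpace l))
  |> Seq.map (fun l -> l.Trim())
  |> List.ofSeq

// Return the first position at which the two sequences disagree,
// along with the two lines involved (None for a side that ran out).
// The overall result is None when the sequences are equivalent.
let firstDifference (s1 : seq<string>) (s2 : seq<string>) =
  let rec go i a b =
    match a, b with
    | [], [] -> None
    | x :: xs, y :: ys when x = y -> go (i + 1) xs ys
    | x :: _, y :: _ -> Some (i, Some x, Some y)
    | x :: _, [] -> Some (i, Some x, None)
    | [], y :: _ -> Some (i, None, Some y)
  go 0 (normalise s1) (normalise s2)

// Made-up inputs: one as if extracted from a post, one as if from disk
let fromPost =
  "using System;\r\n\r\n  int x = 1;".Split("\n\r".ToCharArray())
let fromDisk = [| "using System;"; "int x = 2;" |]

match firstDifference fromPost fromDisk with
| None -> printfn "No difference"
| Some (i, a, b) -> printfn "Line %d differs: %A vs %A" i a b
```

The index reported is a position in the normalised (blank-stripped) sequence, so it won't necessarily line up with line numbers in the original file.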

By the way, while I don't currently have a second part of this series planned, specifically, I know I'm going to need one to share the final version of the code. Which I'll also place on GitHub, of course. 🙂

Through the Interface
