Corporate Wiki Part 1: Building Your First Wiki Parser

  • Corporate Wiki Part 2: Writing Your Own Wiki Search Engine
  • Corporate Wiki Part 3: Files Up and Down - Adding More Features

    Simplicity is the watchword. Before you spend thousands of dollars on a collaboration and knowledge management system, try a corporate wiki. You may find that it fills your needs.

    After a brief introduction to the origin and nature of wikis, this article, the first in a series on building a corporate wiki, will focus on creating a wiki parser, the heart of any wiki.

    Each article in this series will leave you with a complete, usable application. The code listings here illustrate the primary topic of discussion, but you can download the entire solution source code from www.sys-con.com/dotnet/sourcec.cfm.

    Many hypertext navigation and authoring systems have been developed over the years, but not until Ward Cunningham created the WikiWikiWeb in 1995 did the concept really take off on the Internet. Wiki is reportedly the Hawaiian word for "quick"; Cunningham chose it as an alternative to QuickWeb, a name that's not nearly as fun.

    Eight years later, search for "wiki" on Google and you'll get nearly 3 million results. Many personal and enthusiast sites now have a wiki where all visitors may read and contribute.

    There are now as many definitions for "wiki" as there are open source and commercial implementations of wiki Web apps. Along with a variety of features, these applications provide two fundamental services: view and edit.

    The wiki viewer fetches a topic's text and parses it, transforming it to HTML for display (see Figure 1). The editor allows you to edit any topic and save it to the data store from which the viewer fetches topics.

    The first step in building your wiki parser is to decide exactly which plain-text syntax you will support. I have chosen the syntax rules found in Table 1, as they seem to be most common in the wiki sites I have visited. You are certainly free to define your own or modify these.

    If you are already familiar with wikis, you will notice that I have made a strong departure from the traditional Pascal case of wiki topic titles in favor of using case-insensitive topic titles connected or trailed by an underscore "_" character.

    After experimenting with a wiki in my own corporate culture, I chose to do this because users were more comfortable with the underscore. Of course, you could easily modify the code to support any topic title formatting scheme you want to use.

    Some of the plain-text syntax rules here may not be found in Cunningham's original wiki implementation, but I have found all of them in one wiki site or another, with the exception of the ::code:: tag I created to let me insert HTML preformatted text into a topic.

    Given the rules established in Table 1, you will discover that there are three types of text blocks possible: (1) standard plain text, one paragraph per line; (2) bulleted or numbered lists, one bullet per line; and (3) code sections, set off by the ::code:: text on a unique line.

    Each text block type requires a slightly different parsing approach, so I found it much simpler to make a quick first pass through the text to isolate each block, then iterate through that collection of blocks and parse each by type. The WikiTextBlock class and WikiTextBlockType enum handle this job (see Listing 1).
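
    The first pass can be pictured with a short sketch. It is written in Python rather than the C# of Listing 1, and it is only an approximation: the class and enum names mirror the article's, but split_into_blocks, the * and # list markers, and the idea that ::code:: both opens and closes a section are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class WikiTextBlockType(Enum):
    PLAIN = 1
    LIST = 2
    CODE = 3


@dataclass
class WikiTextBlock:
    block_type: WikiTextBlockType
    lines: list = field(default_factory=list)


CODE_MARKER = "::code::"     # from the article: sits alone on its own line
LIST_MARKERS = ("*", "#")    # assumed bullet / numbered-item prefixes


def split_into_blocks(text):
    """First pass: group consecutive lines of one kind into WikiTextBlocks."""
    blocks = []
    in_code = False
    for line in text.splitlines():
        if line.strip() == CODE_MARKER:
            in_code = not in_code          # assumed: the marker opens and closes a section
            if in_code:
                blocks.append(WikiTextBlock(WikiTextBlockType.CODE))
            continue
        if in_code:
            blocks[-1].lines.append(line)  # code lines are kept verbatim
            continue
        kind = (WikiTextBlockType.LIST if line.startswith(LIST_MARKERS)
                else WikiTextBlockType.PLAIN)
        # Extend the current block when the type matches, else start a new one.
        if blocks and blocks[-1].block_type is kind:
            blocks[-1].lines.append(line)
        else:
            blocks.append(WikiTextBlock(kind, [line]))
    return blocks
```

    A second loop can then dispatch on block_type and append each block's HTML to a single buffer.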

    The parser builds an ArrayList containing each of the text blocks, then iterates through it, using a StringBuilder to collect the result of each call to the parser code that handles that particular block type. Once all of the blocks are parsed, the StringBuilder's ToString() method is called to return the formatted HTML to the page code.

    The only requirements for the Code block type are to replace HTML tag characters ( <, >, and & ) with their HTML-escaped cousins and to surround the text block with the <pre></pre> tag set. That's the easy block type.
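
    As a rough illustration of the Code block handling, here is a minimal Python sketch (the article's own code is C#; the function name format_code_block is hypothetical). The one detail worth noticing is that & has to be replaced first.

```python
def format_code_block(lines):
    """Escape HTML-significant characters and wrap the block in <pre> tags."""
    text = "\n".join(lines)
    # Replace & first so the entities produced for < and > are not escaped again.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return "<pre>" + text + "</pre>"
```

    For example, the line "if (a < b && c > d)" comes back as "<pre>if (a &lt; b &amp;&amp; c &gt; d)</pre>".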

    The FormatListWikiText method handles the List type and is the most complex to parse because you have to deal with indentation and closing up the <ul> or <ol> tag sets. This is handily done with a Stack: I push the closing tag of each list when its opening tag is created, then pop those same closing tags when the list is completed.
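
    The stack technique can be sketched as follows. This is a Python approximation, not the article's FormatListWikiText: it assumes that each list line starts with one or more * (bulleted) or # (numbered) characters, that the number of markers gives the nesting depth, and that depth grows by one level at a time. Item text is passed through untouched here.

```python
def format_list_block(lines):
    """Convert marker-prefixed lines to nested <ul>/<ol> HTML using a stack."""
    html = []
    closing_tags = []  # stack of closing tags for the lists currently open
    for line in lines:
        marker = line[0]                              # '*' or '#' (assumed syntax)
        depth = len(line) - len(line.lstrip(marker))  # number of leading markers
        item = line[depth:].strip()
        if depth > len(closing_tags):
            # Going deeper: open a list and push its closing tag.
            while len(closing_tags) < depth:
                html.append("<ul>" if marker == "*" else "<ol>")
                closing_tags.append("</ul>" if marker == "*" else "</ol>")
        else:
            html.append("</li>")                      # close the previous item
            while len(closing_tags) > depth:
                html.append(closing_tags.pop())       # close the nested list
                html.append("</li>")                  # and the item that held it
        html.append("<li>" + item)
    # The list is finished: pop every closing tag that is still on the stack.
    while closing_tags:
        html.append("</li>")
        html.append(closing_tags.pop())
    return "".join(html)
```

    With the lines "* a", "** b", "* c" this yields <ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>.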

    After the HTML bullet tags are handled, the code calls the common parser method FormatStandardWikiLine to prepare the text of each line in the list block. This is also the primary formatting method used for the Plain type text block.

    The FormatStandardWikiLine method breaks the task into specific methods, executed in a careful order, each of which uses regular expressions to find the plain-text syntax in a line and replace it with HTML formatting.

    In some cases, if the required order of transformations is not followed, unexpected results will occur. The steps in converting the plain text of the wiki entry into HTML are:

    1. Topic links
    2. Hyperlinks
    3. Horizontal rules
    4. Bold and italics (phrase)
    5. Bold, italic, and underline (word)
    6. Headings
    7. Block quotes
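
    To see why the order of such steps matters, consider this small Python sketch of an ordered rule pipeline. The two rules use quote-style markup ('''bold''' and ''italic'') purely as a stand-in; they are assumptions for illustration, not the syntax from Table 1, and apply_rules is a hypothetical name.

```python
import re

# Each rule is (compiled pattern, replacement); the list order is the run order.
RULES = [
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),   # bold first
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),     # then italic
]


def apply_rules(line, rules):
    """Run every rule over the line, in the order given."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line
```

    Run in the listed order, '''stop''' becomes <b>stop</b>. Run the same two rules in reverse and the italic pattern consumes part of the bold marker, producing <i>'stop</i>' instead.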

    Topic Links
    To parse for topic titles and convert them to HTML links, you need to do two things: (1) find the title in the line; and (2) create a link to the view page or edit page, depending on whether the topic already exists.

    First I created the regular expression pattern string and an instance of the Regex class built from that pattern, both as static members of the parser (see Listing 2). The parser is entirely static, making it more efficient and easier to use throughout the application.

    The FormatTopicLinks method is called for each line in the text block, with the line passed by reference. The code (see Listing 3) finds all topic matches in the text using the Regex object, called RxTopic, then iterates through each match.

    If the match is not the same as the topic of the page being viewed, you process the match to create the link to either the view or edit page. Otherwise, you convert the match to plain text without the underscores, since there is no sense in linking a page to itself.

    To create the link, the match is checked against the TopicManager's RevCount hash table collection of existing topics to determine whether the matched topic exists. If it does exist, the match is transformed using the Regex.Replace static method to create a link to the view.aspx page using the formatted topic title for the text in the link.

    If the topic does not exist, the link points to the edit.aspx page with a different CSS style class and a link title attribute of "create this topic". In this way, links to topics that do not exist can be easily distinguished from links to those that do.
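
    The topic-link logic can be approximated in Python as shown here. This is a sketch, not Listing 3: the regular expression, the "topic" query-string parameter, and the "create" CSS class are all assumptions, a plain dict stands in for the TopicManager's RevCount table, and a replacement callback is used in place of the article's loop over matches.

```python
import re

# Assumed pattern: words joined by underscores, or one word with a trailing underscore.
RX_TOPIC = re.compile(r"\b(?:[A-Za-z0-9]+_)+[A-Za-z0-9]*\b")


def format_topic_links(line, current_topic, rev_count):
    """Replace topic titles in a line with view or edit links.

    rev_count is a dict keyed by lower-cased topic title, standing in for RevCount.
    """
    def link_for(match):
        topic = match.group(0)
        label = topic.replace("_", " ").strip()
        key = topic.lower()                  # topic titles are case-insensitive
        if key == current_topic.lower():
            return label                     # no sense linking a page to itself
        if key in rev_count:
            return '<a href="view.aspx?topic=%s">%s</a>' % (topic, label)
        return ('<a class="create" href="edit.aspx?topic=%s" '
                'title="create this topic">%s</a>' % (topic, label))

    return RX_TOPIC.sub(link_for, line)
```

    Called on the Home_ page with only Project_Plan in the table, a line mentioning Project_Plan, Budget_, and Home_ gets a view link, an edit link, and the plain text "Home", respectively.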

    Linking URLs
    Standard wiki formatting for URLs is a simple <A HREF> whose link text is a copy of the actual URL. This works well for short URLs, but many URLs are quite lengthy. Consider the URL for Microsoft's MSDN coverage of regular expressions (see Resources).

    To solve the long URL problem, the FormatHyperLinks method (see Listing 4, available at www.sys-con.com/dotnet/source.cfm) performs three types of text transformations: (1) a link to a URL preceded by descriptive text between [ ] brackets; (2) a simple URL; and (3) a mailto URL.
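
    The three transformations might be sketched in Python like this. It is not Listing 4: the patterns are assumptions (in particular the exact "[text] URL" layout and the limitation to http and https), and the sketch handles all three cases in one pass so that a URL already placed inside an href is not matched a second time.

```python
import re

RX_LINK = re.compile(
    r"\[(?P<text>[^\]]+)\]\s*(?P<named>https?://[^\s<]+)"   # [text] URL
    r"|(?P<url>https?://[^\s<]+)"                            # simple URL
    r"|mailto:(?P<mail>[^\s<@]+@[^\s<]+)"                    # mailto URL
)


def format_hyperlinks(line):
    """Turn described URLs, bare URLs, and mailto URLs into HTML anchors."""
    def to_anchor(m):
        if m.group("named"):
            return '<a href="%s">%s</a>' % (m.group("named"), m.group("text"))
        if m.group("url"):
            return '<a href="%s">%s</a>' % (m.group("url"), m.group("url"))
        return '<a href="mailto:%s">%s</a>' % (m.group("mail"), m.group("mail"))

    return RX_LINK.sub(to_anchor, line)
```

    A production version would also need to decide what to do with trailing punctuation, which this pattern would swallow into the URL.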

    Finishing the Parsing
    The remaining steps in parsing each line are simpler, but each takes advantage of regular expression language that probably looks more like chicken scratches to someone unfamiliar with this arcane parsing syntax. Even for seasoned regexers, it's very helpful to keep a reference guide handy. For my own reference, I copied a number of pages from the Microsoft online guide into a static HTML page and stuck it right on my desktop.

    Once the horizontal rules, word and phrase formatting, headings, and block quotes are set into HTML, the code adds the formatted string to the StringBuilder instance in the ConvertWikiTextToHTML method of the parser, which iterates through each text block sequentially to get the entire topic completed. The StringBuilder's ToString() method is called to return the completely formatted HTML to the calling page (in this case, the view.aspx page).

    Source Code
    Due to space limitations, not all of the code can be printed here, so download the sample code from www.sys-con.com/dotnet/sourcec.cfm and get started building your own wiki. The source includes the VS.NET 2003 solution source code and the MS SQL 2000 create and stored procedure scripts.

    I have only scratched the surface of the regular expressions language. Check out Microsoft's MSDN reference on regular expressions for .NET.

    This initial stab at a wiki will get you going on your own corporate wiki, but business users will undoubtedly want more. In future installments in this series I will walk you through building the bells and whistles your users will want, while keeping your wiki simple and easy to use. After all, if it's not simple, it's not wiki.

    Next you'll get search, recent changes, revision history, and like topics lookup, as well as delete functionality to remove topics. Future articles will cover uploading and downloading files, parsing images into topics, and implementing teams with forms security and data-based user and groups management.

    Last, I'll help you create a subscription-based topic change e-mail notification service. This will allow users to get immediate e-mail notification when topics in which they are interested are changed by other users.

    Until next time, good luck and have fun building your own corporate wiki.

    Resources

  • Ward Cunningham's WikiWikiWeb: http://c2.com/cgi/wiki
  • .NET Framework Regular Expressions: Microsoft MSDN reference
  • Sparx Systems Enterprise Architect: www.sparxsystems.com.au

    Corporate Wiki Architecture
    When you need a garage, a Sistine Chapel model might be overkill. The architectural framework of the corporate wiki presented here is simple and effective. The view and edit pages use several convenient classes for accessing configuration information, retrieving and persisting topics, parsing topic text to HTML, and maintaining a list of existing topics (see Figure 2).

    Parser
    The parser is the heart of the wiki. It performs the transformation of the easy-to-enter plain wiki topic text into standard HTML. Its static methods and properties make it easier to use and improve the performance of the regular expressions, as they are compiled once and used over and over.

    Topic Manager
    Named simply for its primary functionality, the TopicManager maintains a static list of existing topics with a revision count value in a hash table accessible via the RevCount property. RevCount is incremented when the topic is updated in the SaveWiki method after updating the database. This class also provides the GetWikiTopic method to retrieve the latest version of a specific topic.
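
    The shape of this class can be suggested with a brief Python sketch. The method names follow the article's SaveWiki and GetWikiTopic, but everything else is an assumption: an in-memory dict stands in for the database, and the real GetWikiTopic returns a WikiTopic object rather than a string.

```python
class TopicManager:
    """Keeps a shared table of existing topics and their revision counts."""

    rev_count = {}   # class-level, so one table is shared by every caller
    _store = {}      # stand-in for the database: topic key -> list of revisions

    @classmethod
    def save_wiki(cls, topic, text):
        key = topic.lower()                               # titles are case-insensitive
        cls._store.setdefault(key, []).append(text)       # update the "database" first
        cls.rev_count[key] = cls.rev_count.get(key, 0) + 1  # then bump the count

    @classmethod
    def get_wiki_topic(cls, topic):
        """Return the latest revision of a topic, or None if it does not exist."""
        revisions = cls._store.get(topic.lower())
        return revisions[-1] if revisions else None
```

    Because the table lives on the class, the parser can test whether a topic exists with a simple key lookup instead of a database call.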

    WikiTextBlock and WikiTextBlockType
    The WikiTextBlock provides a simple data container object that gets stuck into an ArrayList collection on the parser's first pass through a topic's text. This makes for much easier parsing of different types of text blocks. The WikiTextBlockType enum identifies each block as Plain, List, or Code.

    WikiTopic
    Another data container class, a WikiTopic object, is returned by the TopicManager in the GetWikiTopic method. This simplifies the page code that deals with the presentation of the data contained in the topic when it's retrieved from the database.

    DataAccess
    The DataAccess class provides simplified access to the database's three stored procedures. It uses Microsoft.ApplicationBlocks.Data's SqlHelper class to make using the stored procedures even easier.

    Config
    I like to create a Config class in all of my ASP.NET applications. This little bit of work makes it very easy to keep your application highly configurable and introduces a convenient layer of abstraction between the web.config file and your code.

    UML Design Tools
    If you're looking for a great UML tool at a great price, look no further. While I have used Rational Rose for building UML models, I prefer to use Sparx Systems Enterprise Architect (EA). Surprisingly affordable compared to other tools, EA provides an IDE-like user interface with all the resources a software architect needs. The image in Figure 1 was created with EA.

  • More Stories By Tyler Jensen

    Engrossed in enterprise application architecture and development for over ten years, Tyler Jensen is a senior technical consultant in a large health intelligence company, designing and developing claims processing and analysis software. In his spare time he does a little writing and outside consulting.
