
Corporate Wiki Part 2: Writing Your Own Wiki Search Engine

  • Corporate Wiki Part 1: Building Your First Wiki Parser
  • Corporate Wiki Part 3: Files Up and Down - Adding More Features

    With the corporate wiki you built using the first article in this series (.NETDJ, Vol. 2, issue 1), you have everything you need to create, edit, and navigate content, but now you need something to help users find what they are looking for. In this article I walk you through creating your own search engine for your corporate wiki.

    To review, the wiki is a content creation and navigation hypertext system in which all readers are potential editors - a true content democracy. In my previous article I led you through the necessary steps to create your own wiki. While that article gives you the most basic features required in any wiki (parse, edit, and navigate), most wiki implementations provide four other basic features: search, recently edited topics, like or similar topics, and topic history.

    This article will focus on the wiki search engine. The other three features are easy to implement, so I will cover them briefly in the accompanying sidebar. The full source code for the complete Corporate Wiki project and the Wiki Index Service is available for download from www.sys-con.com/dotnet/sourcec.cfm.

    In my first wiki implementation, I used the SQL Server full-text index service and the FreeTextTable T-SQL keyword in a query of medium complexity. While the Microsoft Search service and full-text indexing on SQL Server work quite well, producing superbly ranked results, there is one obvious drawback to using them in this case. The search service indexes all rows in a table. The WikiTopic table in my wiki contains all versions of every topic, so the search service would index new and old topic revisions without prejudice. Thus, search results would contain both current and obsolete versions of the same topic.

    Like Recent History, Dude
    In this update to the Corporate Wiki project, in addition to search I have added three features that are common to most wiki implementations: like, recent, and history pages. Each of these pages gives the user a different but quick and easy way to find something he or she is interested in. The code for each of these is rather simple, so I will leave you to download it for yourself from www.sys-con.com/dotnet/sourcec.cfm.

    For short list pages such as these, I like to use a simple HTML table with a runat="server" attribute on the table tag, and fill it dynamically using the table controls in the System.Web.UI.HtmlControls namespace (see Listing 8).

    The Like page contains two tables and executes two stored procedures to fill them. One procedure looks for topics that begin with the same word as the existing topic from which the Like link was clicked. The second pulls up those topics that end with the same word as the current topic. For one-word topics, these tables are the same. The tables also display the time and date the topic was last modified.
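    The Like page's matching rule can be illustrated outside the database. The project implements it as two stored procedures in C#/T-SQL, but here is a minimal Python sketch of the same idea, assuming topic names are CamelCase and "word" means one capitalized run within the name (the function names here are my own, not the article's):

    ```python
    import re

    def camel_words(name):
        """Split a CamelCase topic name into its component words."""
        return re.findall(r"[A-Z][a-z]*", name)

    def like_topics(current, all_topics):
        """Return (begins, ends): topics sharing the current topic's first
        word, and topics sharing its last word. For one-word topics the
        two lists come out identical, as the article notes."""
        first, last = camel_words(current)[0], camel_words(current)[-1]
        begins = [t for t in all_topics
                  if t != current and camel_words(t)[0] == first]
        ends = [t for t in all_topics
                if t != current and camel_words(t)[-1] == last]
        return begins, ends
    ```

    For example, from "ProjectPlan", "ProjectBudget" lands in the begins-with list and "TestPlan" in the ends-with list.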

    The Recent page lists recently edited topics in descending order by the date and time edited. This provides users with a simple way to see where the activity is. A simple form lets you change the number of days to go back in the query. The default is 10. The page lists the topic (linked to the view page), the author of the most recent edit, and the date and time the page was edited.

    The History page displays each of the major revisions to the given topic, in order from most recent to the original. The only trick here is that the link to the most recent topic version is different from the others. Older versions of the topic are linked to the viewversion.aspx page, which is similar to the standard view page but, instead of an Edit link, has a link to the current version of the topic. This prevents the illusion that a user can edit a previous version, which is not possible in the current project and not likely a feature you would care to implement.

    The quick and obvious solution to the stale-search-results problem is to move previous versions of a topic to another table when a new version is added, but that comes with its own set of problems. Solving those problems would probably not be too difficult - and if you took that road, no one would blame you. But it's not nearly as much fun as building your own index and search solution - your own wiki search engine.

    A year ago I did some research into search engine technology, but let me be the first to disclaim any great level of understanding of this topic or experience in building sophisticated indexing and search routines. There are many off-the-shelf solutions that may be used to provide a similar or even better result. But again, what's the fun in that?

    The wiki search engine consists of two primary elements: the WikiIndexer, a Windows Service project; and the wiki query mechanism, an ASP.NET page in the wiki that queries a view in the database. This solution requires the addition of four tables and one view in the database, along with a number of procedures to allow them to be used from the index service and the wiki application. The new tables are WikiWords, WikiWordIndex, WikiTopicRank, and WikiSummary.

    The WikiWords table is a unique index of every searchable word in the wiki. The indexer ignores one- and two-letter words, as well as a set of common three-letter words (see Listing 1). The indexer loads this entire table into a hashtable object so that it will only have to write to the table when words are found in the wiki that do not already exist in the table. You might worry that this table would grow too large, given the several hundred thousand words in the English language claimed by most unabridged dictionaries; however, optimistic estimates of commonly used vocabulary are much lower. I found one reference to a study that determined that people of extraordinary intelligence had an average working vocabulary of only 20,000 words. In my own tests of the indexer, I randomly pulled about 50 news stories from the Internet and ended up with just over 4,000 unique words.
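    The filtering rule is easy to sketch. The indexer itself is C#, but the following Python fragment shows the same tokenize-and-filter step; the three-letter stop-word set here is a made-up stand-in for the real list in Listing 1:

    ```python
    import re

    # Hypothetical stand-in for the common three-letter words the indexer
    # skips (the actual list is in Listing 1 of the source download).
    COMMON_THREE_LETTER_WORDS = {"the", "and", "for", "are", "but", "not", "you"}

    def searchable_words(topic_text):
        """Tokenize topic text and drop the words the indexer ignores:
        one- and two-letter words, plus common three-letter words."""
        words = re.findall(r"[A-Za-z]+", topic_text.lower())
        return [w for w in words
                if len(w) > 2 and w not in COMMON_THREE_LETTER_WORDS]
    ```

    Only the words this function returns ever make it into the WikiWords table, which keeps the table well below dictionary scale.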

    The WikiWordIndex table is a cross-index table containing only three integer columns. The first two are foreign keys into the WikiWords and WikiTopic tables. The third is a count column that holds the number of times a given word occurs in a given topic.
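    In other words, each WikiWordIndex row is a (word ID, topic ID, occurrence count) triple. A quick Python sketch of how those rows are produced for one topic, using a dictionary in place of the in-memory WikiWords hashtable (the word-to-ID mapping shown here is my assumption about how that hashtable is keyed):

    ```python
    from collections import Counter

    def build_word_index(topic_id, words, word_ids):
        """Count each filtered word's occurrences in a topic and emit the
        (word_id, topic_id, count) rows destined for WikiWordIndex."""
        counts = Counter(words)
        return [(word_ids[w], topic_id, n) for w, n in counts.items()]
    ```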

    The Search stored procedure pulls from the WikiSearch view with a specified weight given to the rank of a given topic and to the number of times the words in the search phrase occur in the topic. The WikiTopicRank table contains a rank value for each current topic in the wiki. Previous versions are removed from the index tables upon completion of the indexing routines. The topic rank is simply a count of how many times other wiki topics link to the topic in question. This gives you an indication of how popular this topic is in the wiki.
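    The shape of that weighting is simple to express. The real weights live inside the Search stored procedure; the values below are illustrative only:

    ```python
    def search_score(occurrences, topic_rank,
                     occurrence_weight=1.0, rank_weight=2.0):
        """Illustrative ranking: combine how often the search words occur
        in a topic with how many other topics link to it. The weights here
        are made up; the article's actual weighting is in the stored
        procedure."""
        return occurrence_weight * occurrences + rank_weight * topic_rank
    ```

    A topic mentioned three times with two inbound links would score 1.0*3 + 2.0*2 = 7.0 under these sample weights; tuning the two weights trades off popularity against relevance.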

    The WikiSummary table is a simple repository that keeps track of when a topic was last indexed and stores a summary of the topic. Currently, the summary simply takes the first 256 characters in a topic. In future versions of your wiki, you may wish to enhance the index service with a more sophisticated summarizing routine.

    The WikiSearch view draws data from the three tables just described, as well as the WikiTopic table. This view is used by the Search stored procedure, which has an inner select on the WikiWords table, so each of the tables in the database is used to accomplish the search. With a smaller content dataset this query is quite fast, but as time goes on and your content grows, you will probably want to examine the efficiency of the Search stored procedure and make adjustments to the indexes to improve its performance.

    The indexing process is relatively simple. First I load a list of unique topic names. Then I iterate through each topic, retrieving the most recent version of each, and index the text contained in the topic. If I find new words, I add them to the words hashtable and the WikiWords table. I then pull the WikiWordIndex records for this topic into a hashtable so that I can find out whether each word in the topic has been previously indexed. It also allows me to compare the number of times that word occurred previously - if at all - with the number of times it occurs in the topic now. If the word did not exist previously, a new record is written to the WikiWordIndex table (see Listing 2). If the number of occurrences in a topic changes for a previously indexed word, the record is updated. If nothing changes, no write to the database occurs.
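    The insert/update/no-op decision at the heart of that loop can be sketched as a comparison of two count maps (function and label names are mine, not from the listings):

    ```python
    def index_actions(new_counts, existing_counts):
        """Compare a topic's fresh word counts against its previously
        indexed counts and decide which WikiWordIndex rows need an insert
        or an update. Unchanged counts produce no database write at all."""
        actions = {}
        for word, count in new_counts.items():
            if word not in existing_counts:
                actions[word] = "insert"
            elif existing_counts[word] != count:
                actions[word] = "update"
            # words whose counts are unchanged are skipped entirely
        return actions
    ```

    Skipping unchanged rows is what keeps repeated indexing runs cheap: on a quiet wiki, most topics generate no writes.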

    Once the words are counted and accounted for, I write the summary of the topic to the WikiSummary table. Then the fun begins. I used code from the wiki parser in the wiki ASP.NET application to find the wiki topic links in the topic currently being indexed (see Listing 3). When a link to another topic is found, I increment the linked topic's count in a hashtable; these counts become the topic ranks once I have looped through every current topic.
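    The link-counting pass looks roughly like this in Python. I am assuming the parser's link rule is classic CamelCase (two or more capitalized runs), as in the Part 1 parser; the actual pattern comes from Listing 3:

    ```python
    import re
    from collections import Counter

    # Assumed CamelCase link rule: two or more capitalized word-runs.
    WIKI_LINK = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")

    def count_topic_links(topics):
        """Accumulate, across all current topics, how many times each
        topic name is linked from other topics - the raw WikiTopicRank
        values."""
        ranks = Counter()
        for name, text in topics.items():
            for link in WIKI_LINK.findall(text):
                if link != name:  # a topic should not rank itself
                    ranks[link] += 1
        return ranks
    ```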

    Now that I have completed my loop through the current topics, I turn my attention to the topic ranks. I pull a list of existing topic ranks into another hashtable and compare the ranks that already exist with those that have been counted during this run of the indexer (see Listing 4). If the rank has been changed, an update is run on the WikiTopicRank table. If the rank found during indexing does not exist, then a new record is written to the table.

    The ranking has been updated now, but there may be topics in the index that no longer belong because they have been updated with major revisions since the indexer last updated the index. The cleanup routine loads the existing topic IDs into a hashtable and compares them with the IDs of current topics just indexed. If the topic ID in the existing indexed topic hashtable does not exist in the most recently indexed topic hashtable, then rows referencing that topic ID are deleted from the index tables (see Listing 5).
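    Stripped of the hashtable plumbing, the cleanup comparison is a set difference:

    ```python
    def stale_topic_ids(indexed_ids, current_ids):
        """Topic IDs present in the index tables but absent from the set
        of topics just indexed. Rows referencing these IDs are the ones
        the cleanup routine deletes (Listing 5 shows the real version)."""
        return set(indexed_ids) - set(current_ids)
    ```

    Anything in the result set belongs to an obsolete topic revision and can be purged from the index tables.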

    The WikiIndexer service is easy to install using the .NET Framework's InstallUtil.exe utility. Once you compile the indexer, copy the contents of the bin directory to a nice permanent home, open a command prompt, change to that directory, and execute the install utility. Many tutorials cover creating a Windows Service with VS.NET, so I won't repeat that here. As long as the service has access to the database, it doesn't really matter where you run it. The XML config file contains the two values you need to set for the service to run properly: the connection string and the number of hours between runs of the indexer. You may wish to create a more sophisticated scheduling mechanism for running your indexer, but that's outside the scope of this article.

    Implementing the ASP.NET search page in the wiki is easy enough. The new search.aspx page consists of a simple form to take the user's search query with a Search button. The button click handler prepares and validates the search terms and then executes the search on the database using the WordSearch stored procedure (see Listing 6). The results of the query are bound to a simple ASP.NET Repeater control that contains a hyperlink and literal control for the link and the summary of each search result.
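    The article does not spell out exactly what "prepares and validates" means here, but a reasonable sketch is to normalize the phrase the same way the indexer filters words, so the query never asks for terms the index cannot contain. This Python fragment is my assumption, not Listing 6:

    ```python
    def prepare_search_terms(query):
        """Normalize a user's search phrase: keep only alphabetic terms
        longer than two letters, lowercased, since shorter words were
        never indexed. (An assumed behavior, sketched for illustration.)"""
        return [t.lower() for t in query.split()
                if t.isalpha() and len(t) > 2]
    ```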

    In the ItemDataBound event of the Repeater, I create the link and, using the wiki parser, parse the summary into the literal control (see Listing 7). This can push the results page to a fairly long scroll, so you may wish to consider changing how you display your search results. To keep things simple, I did not add a paging mechanism to the search page, nor did I apply special formatting to the topic link, which means you get it as it exists in the database.

    Now you have a good general-purpose wiki. In Part 3 of this series, I will add some of the corporate-like features your users in the workplace will undoubtedly want, including the ability to add a file repository to your wiki that will let users upload files to the wiki and an expanded parser that will allow users to reference those files in their topics for download or, in the case of GIFs and JPGs, for display within the topic itself. Until then, enjoy building your own search engine.

  • More Stories By Tyler Jensen

    Engrossed in enterprise application architecture and development for over ten years, Tyler Jensen is a senior technical consultant in a large health intelligence company, designing and developing claims processing and analysis software. In his spare time he does a little writing and outside consulting.

