By Joel Pobar | May 26, 2008 06:00 AM EDT | Reads: 13,018
Joel Pobar's Weblog

I was reminiscing recently about the good ol’ days tinkering with computers: Commodore 64s, GW-BASIC, Turbo Pascal 5.0, DOOM and the autoexec.bat/config.sys hacking required to get it running on underprivileged 486s, Amiga 500s, broken Linux 1.0 kernel compiles, EGA video cards and more Sierra games than I can remember. Getting stuff running was hard. Understanding how stuff worked was heaps of fun. Connectivity to other like-minded communities was basically non-existent, so a great book on the topic of interest was like striking gold in Ballarat.

It got me thinking though – if I were to start again in 2007, what would be the equivalent to learning about the flat memory address space of a Commodore 64, or breaking open a copy of Borland’s new Turbo Pascal IDE? I had to ignore my first thought of being mindlessly hooked on Facecrack getting nothing done, and push through to what I believe to be the 8 most interesting software engineering pursuits of the next 5 years – things that really light me up, something worthy of dedicating years of sleepless nights to.
I’m going to make this an 8-part series. Before I started this, I imagined it to be a few pages of lightweight material to get my point across and clarify my thinking – now that I’m finished, it’s a fairly dense 8000-word essay. ;) We'll start with the list, and then I'll talk about my thoughts on each of the technologies one by one over the series.
Joel's 8 most excellent software engineering adventures (in no particular order):
- Comprehending the Cloud (taking HTML and making programmatic sense of it)
- Infrastructure Scalability (scale in the massive sense: Amazon EC2, Grid Computing, GFS, MapReduce, HAMMER, S3 etc.)
- Functional languages (going mainstream baby!)
- Client side parallel programming models (PLINQ, PFX, GPU Programming)
- OS Hardware Virtualization (Cloud, Virtual Machines as OS's)
- Machine learning and Data Mining
- Search (Algorithms)
- Compilers, Languages, and DSLs (Compiler implementation, Phoenix, the Spectrum of languages)
1. Comprehending the Cloud
True programmatic comprehension of the ‘cloud’ (that thing we call the Internet) is only just starting to get underway. We’ve got movement in microformats, RSS, well-formed XHTML, web services and JavaScript, but there’s still a long way to go. One goal of comprehension is extensibility: the ability to programmatically extend a website or URI endpoint to create value for both the source and the extender. We’ve known about this value-add system for years: it’s why Windows is so dominant, and why apps like Photoshop keep their lead through the bazillion extensions you can buy.
Another goal is simplicity. I want to be able to hit a website, pass in my identity, retrieve the data I care about, and have that data loosely bind to other data I care about. Consider the following:
Here in my fake scenario, I’m slurping down the business news for the day, converting it to a list of company names and stock codes, and then sucking down the latest prices of that list from my broker – all in less than 10 lines.
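To make the scenario concrete, here is a minimal Python sketch of that kind of pipeline. Everything in it is hypothetical: `fetch_business_news`, `extract_companies`, `fetch_prices`, the company names, the stock codes and the prices are canned stand-ins rather than real services or data, so it only shows the shape of the code, not a working integration.

```python
from dataclasses import dataclass


@dataclass
class Company:
    name: str
    code: str


# Hypothetical lookup table standing in for a real entity-extraction service.
KNOWN_COMPANIES = {"Contoso": "CTSO", "Fabrikam": "FBKM"}


def fetch_business_news(identity):
    """Stand-in for loading today's headlines from a news endpoint."""
    return ["Contoso beats quarterly estimates", "Fabrikam announces new factory"]


def extract_companies(headlines):
    """Turn free-text headlines into a list of (name, stock code) records."""
    return [
        Company(name, code)
        for headline in headlines
        for name, code in KNOWN_COMPANIES.items()
        if name in headline
    ]


def fetch_prices(identity, codes):
    """Stand-in for asking a broker endpoint for the latest prices."""
    canned = {"CTSO": 12.50, "FBKM": 48.10}
    return {code: canned[code] for code in codes}


def todays_prices(identity):
    """News -> companies -> prices, bound together in a handful of lines."""
    news = fetch_business_news(identity)
    companies = extract_companies(news)
    prices = fetch_prices(identity, [c.code for c in companies])
    return {c.name: prices[c.code] for c in companies}
```

The interesting part is `todays_prices`: once each endpoint hands back structured data, gluing them together is about as hard as joining two tables.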
I equate this experience of slurping data from websites to that of hitting a database and retrieving rows of data I care about. Let’s ignore extensibility for now, and focus on getting at that data.
What are the challenges?
Descriptive formats for software and components have been around since the dawn of operating systems. On the DOS/Windows platforms, we had the .EXE/.COM/.DLL packaging formats, which allowed a very limited amount of extensibility and interaction. Then we moved to software-to-software messaging systems and shared memory (DDE, Dynamic Data Exchange, was the first attempt at this on Windows). Through the years we’ve evolved these packaging and messaging formats to be descriptive and very extensible (VBX/ActiveX/COM/DCOM, and finally .NET/Java etc.).
Formats and languages for data have arguably been around for longer, as databases have traditionally enforced constraints through schema adherence and query languages.
Noting this, the challenge should be clear by now: how do we make cloud comprehension as easy as loading a URI endpoint, reflecting over it, and then slurping down the data that we care about in a structured way? How do we then apply all we know about the evolution of software components to the web? Versioning? Bindings? Reliability?
Then, how do we get there today, using the current minimum standard, unstructured HTML, as the base? Jason Kottke recently wrote that “open and messy trumps closed and controlled in the long run”. I tend to think that this may be true for HTML vs. structured markup (at least in the short term). Sure, we’ll have a bunch of the latter, but the former is always going to be there.
Solutions?
Dapper (http://www.dapper.net) takes a social approach: create a community where people tell the Dapper screen scraper where to find the data in the rendered version of the web pages, and convert that back to descriptive XML. There are issues with accuracy, and when the website layout changes, Dapper breaks, so it’s not terribly reliable either. A novel approach nevertheless.
Another approach is to embed semantic “helpers” into the rendering engines themselves (bulletin boards, blog engines, mailing lists etc.), so that when scraper APIs walk the site, they find navigating to the data easier.
Markup formats like RDF are also gaining traction, but it’s unrealistic to assume that we’ll retrospectively add RDF against all HTML based URI endpoints.
My guess?
My best guess at the short term solution for the worst case in cloud comprehension (just having bare minimum HTML, no RSS or anything)? Marrying late-bound data binding mechanisms with pattern matching/machine learning. You’d have the pattern matching software build up a loose idea of what it believes to be the interesting data content in the HTML (just like you can train software to understand the parts of speech in a corpus, you can presumably train it to look for content vs. navigation vs. ads etc.). Then pass that loose representation of the data to a language/platform which late-binds to the various metadata elements, and allows for meaningful introspection. To illustrate what I’m on about, consider the following imaginary HTML slurped down from a business website:
It’s ugly and unstructured. Clearly, I want something clean, something I can walk over and look at. Let’s pass it to our imaginary pattern matching/machine learning platform that dissects the rendered structure, and pulls out what’s interesting:
Much better. And likely something I can code against. This imaginary service could render RSS, RDF, or a popular webservice format, I don’t care, just give me something with structure + metadata.
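A real pattern matching/machine learning platform is well beyond a few lines of code, but the shape of the transformation can be sketched. The Python below is a rough stand-in that assumes a crude link-density heuristic in place of any actual learning: a block whose text is mostly inside links is guessed to be navigation, everything else content. The class name, the `page`/`navigation`/`content` element names and the 0.5 threshold are all made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class BlockClassifier(HTMLParser):
    """Splits a page into text blocks and guesses the role of each block."""

    BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (role, text) pairs
        self._pieces = []      # text fragments of the block being read
        self._link_chars = 0   # how many of those characters sat inside <a>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        self._pieces.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def flush(self):
        """Close the current block and record a guess about its role."""
        text = " ".join(" ".join(self._pieces).split())
        if text:
            link_ratio = self._link_chars / len(text)
            role = "navigation" if link_ratio > 0.5 else "content"
            self.blocks.append((role, text))
        self._pieces, self._link_chars = [], 0


def html_to_xml(html):
    """Return a loose XML rendering of what the heuristic found in the page."""
    parser = BlockClassifier()
    parser.feed(html)
    parser.close()
    parser.flush()
    root = ET.Element("page")
    for role, text in parser.blocks:
        ET.SubElement(root, role).text = text
    return ET.tostring(root, encoding="unicode")
```

Feeding it a menu bar followed by a paragraph of news should yield one `navigation` element and one `content` element. The heuristic is deliberately naive; the point is only that the output is something you can walk over and query.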
So, clearly this would scale better if the pattern matching & machine learning platform was shared. Anyone that’s tried training a neural net/NLP platform knows that the more accurate training data you have, the more accurate the result. Easily solved. Imagine an HTML->XML web service that allows for incremental training. Developers slurp down the URI endpoint via the webservice, and can let it know where it got it wrong (e.g. you thought this block of text was an ad, but it was actually a comment on a blog post). Over time, a URI’s metadata just gets better and better.
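The feedback loop can be sketched with a tiny classifier that learns one correction at a time. This is a minimal Python sketch assuming a multinomial naive Bayes model over words, which is my choice for illustration and not something the imaginary service is committed to; the class name, the `correct`/`classify` methods and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class IncrementalBlockClassifier:
    """Tiny multinomial naive Bayes that is trained by user corrections."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def correct(self, text, label):
        """Record that this block of text should have been given `label`."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most likely label, or None before any training."""
        if not self.label_counts:
            return None
        words = text.lower().split()
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, examples in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(examples / total_examples)
            for word in words:
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Each call to `correct` is one developer telling the service "you got this block wrong", and every later `classify` call benefits from it. A shared service would also need to weigh conflicting corrections and guard against abuse, which this sketch ignores.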
Further extending this theme, consider the cases where we need to know about named entities: imagine another shared machine learned webservice, where you hand it semi-structured XML, and it hands you back the same XML but with more tags describing all the companies it found in the data.
You could pass it the following:
And it hands you back the following:

With two imaginary calls, we’ve gone from an unstructured HTML endpoint, to a semi-structured representation of what a machine believes to be the data, then we’ve added richer metadata using a specialized named-entity web service. And so starts the virtuous cycle…
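As a rough illustration of that second call, the Python sketch below takes semi-structured XML and hands it back with company names wrapped in extra tags. It assumes a fixed lookup table (`COMPANIES`) in place of a machine-learned recognizer, and the `company` element, its `code` attribute and the company names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical gazetteer standing in for a trained named-entity model.
COMPANIES = {"Contoso": "CTSO", "Fabrikam": "FBKM"}
PATTERN = re.compile("|".join(re.escape(name) for name in COMPANIES))


def tag_companies(xml_text):
    """Return the same XML with known company names wrapped in <company>."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so newly added <company> nodes are skipped.
    for element in list(root.iter()):
        # Only the text before the first child is examined; text that
        # follows a child element (its tail) is left alone for brevity.
        text = element.text or ""
        matches = list(PATTERN.finditer(text))
        if not matches:
            continue
        element.text = text[:matches[0].start()]
        for i, match in enumerate(matches):
            company = ET.Element("company", code=COMPANIES[match.group()])
            company.text = match.group()
            # The tail carries the plain text up to the next match.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            company.tail = text[match.end():end]
            element.insert(i, company)
    return ET.tostring(root, encoding="unicode")
```

Passing in `<page><content>Contoso shares rose as Fabrikam fell.</content></page>` should return the same sentence with each company name wrapped in a `company` element carrying its stock code, while the surrounding text is left untouched.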
To summarize, we’re using a machine to render metadata about URIs for us. It’s not going to be brilliantly accurate, and the structure has to be loose and generic by definition, but we can make up for these deficiencies through machine learning, adding metadata incrementally using specialized services, and adding a social aspect to make training more efficient. As for the generic structure: use your favourite late-binding language or query language to grok/filter/sort that structure to make use of it in a reliable way.
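Late binding over a loose structure might look something like the Python sketch below, where element names are resolved at access time rather than against a schema. The `Loose` wrapper is a hypothetical helper written for this example, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


class Loose:
    """Late-bound view over an XML element: names resolve when accessed."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so `_element`
        # itself never comes through here once it has been set.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown names simply produce an empty list instead of an error,
        # which suits data whose structure is only a best guess.
        return [Loose(e) for e in self._element.iter(name)]

    def __getitem__(self, key):
        """Look up an XML attribute; returns None if it is absent."""
        return self._element.get(key)

    def __str__(self):
        return "".join(self._element.itertext())


page = Loose(ET.fromstring(
    '<page><content><company code="CTSO">Contoso</company> rose.</content>'
    "<ad>Buy now</ad></page>"
))
codes = [company["code"] for company in page.company]
```

Nothing about `company` or `code` is declared anywhere; if the service later starts emitting richer metadata, the same code simply finds more to bind to.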
More food for thought
We’ve barely touched the surface here – we missed code invocation (i.e. if a URI endpoint has JavaScript, what are the semantics for invoking code on that endpoint), handling forms and other “shared memory” like web mechanisms, dealing with embedded non-text content like video players, and how you would go about programmatically exposing that stuff. There’s also the question of consolidation: we already have a bunch of these microformats that are helping us expose URI metadata (RSS is one of them), should we consolidate that stuff? And if so, how would you go about mashing those formats together?
There are a slew of legal issues too: copyright, fair use, adherence to international legislation etc.
Nevertheless, cloud comprehension makes my top 8 because it’s an interesting problem that could blend a bunch of fascinating software engineering technologies: machine learning, pattern matching, social software, scale, and the language late-binding mechanisms to tie it all together. Plenty of curious meat.
And finally, a few links to chew on below. Click away to learn more.
Next in the top 8? Infrastructure Scalability. I’ll be talking about Amazon EC2, Grid Computing, Hadoop, GFS, S3 and more.
Stay tuned.
Links
Semantic Web (Google TechTalk)
Semistructured and Structured Data in the Web: Going Back and Forth
Constructing Hierarchical Information Structures of Sub-page Level HTML Documents
Extracting Structures of HTML Documents
Semantic Web Podcast
RDF
What is RDF
SPARQL
SPARQL and the Semantic Web (Podcast)
Late-binding over XML: Visual Basic 9
Volta and Dynamic Languages
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Joel Pobar
Joel Pobar speaks, consults, and teaches .NET technologies: CLR; programming languages; threading; platforms and more. A former Microsoft Program Manager, since leaving Microsoft he has been tinkering with v.next software: machine learning, natural language processing, programming languages and more.