The World's Eight Most Excellent Software Adventures, Part One
What are the 8 most interesting software engineering pursuits of the next 5 years? Part One: Comprehending the Cloud


Joel Pobar's Weblog

I was reminiscing recently about the good ol’ days tinkering with computers: Commodore 64s, GW-BASIC, Turbo Pascal 5.0, DOOM and the AUTOEXEC.BAT/CONFIG.SYS hacking required to get it running on underprivileged 486s, Amiga 500s, broken Linux 1.0 kernel compiles, EGA video cards and more Sierra games than I can remember. Getting stuff running was hard. Understanding how stuff worked was heaps of fun. Connectivity to other like-minded communities was basically non-existent, so a great book on the topic of interest was like striking gold in Ballarat.

It got me thinking though – if I were to start again in 2007, what would be the equivalent of learning about the flat memory address space of a Commodore 64, or breaking open a copy of Borland’s new Turbo Pascal IDE? I had to ignore my first thought of being mindlessly hooked on Facecrack getting nothing done, and push through to what I believe to be the 8 most interesting software engineering pursuits of the next 5 years – things that really light me up, things worthy of dedicating years of sleepless nights to.

I’m going to make this an 8-part series. Before I started this, I imagined it to be a few pages of lightweight material to get my point across and clarify my thinking – now that I’m finished, it’s a fairly dense 8,000-word essay. ;) We'll start with the list, and then I'll talk about my thoughts on each of the technologies one by one over the series.

Joel's 8 most excellent software engineering adventures (in no particular order):

  1. Comprehending the Cloud (taking HTML and making programmatic sense of it)
  2. Infrastructure Scalability (scale in the massive sense: Amazon EC2, Grid Computing, GFS, MapReduce, HAMMER, S3 etc.)
  3. Functional languages (going mainstream baby!)
  4. Client side parallel programming models (PLINQ, PFX, GPU Programming)
  5. OS Hardware Virtualization (Cloud, Virtual Machines as OS's)
  6. Machine learning and Data Mining
  7. Search (Algorithms)
  8. Compilers, Languages, and DSLs (Compiler implementation, Phoenix, the Spectrum of languages)

Okay, let’s start with the first of eight - comprehending the Cloud:


1. Comprehending the Cloud

True programmatic comprehension of the ‘cloud’ (that thing we call the Internet) is only just starting to get underway. We’ve got movement in microformats, RSS, well-formed XHTML, web services and JavaScript, but there’s still a long way to go. One goal of comprehension is extensibility: the ability to programmatically extend a website or URI endpoint to create value for both the source and the extender. We’ve known about this value-add system for years: it’s why Windows is so dominant, and why apps like Photoshop keep their lead through the bazillion extensions you can buy.

Another goal is simplicity. I want to be able to hit a website, pass in my identity, retrieve the data I care about, and have that data loosely bind to other data I care about. Consider the following:



Here in my fake scenario, I’m slurping down the business news for the day, converting it to a list of company names and stock codes, and then sucking down the latest prices of that list from my broker – all in less than 10 lines.
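
A minimal sketch of that scenario in Python, under loud assumptions: the headlines, the company names, the stock codes and the broker price table are all invented stand-ins for the two remote endpoints, and the "comprehension" step is nothing smarter than a regular expression.

```python
import re

# Hypothetical stand-ins for the two remote endpoints in the scenario.
# A real version would fetch these over HTTP; the data here is invented.
NEWS_ITEMS = [
    "Contoso (CTSO) lifts full-year profit forecast",
    "Fabrikam (FBKM) names new chief executive",
    "Weather delays harvest in the north",
]
BROKER_PRICES = {"CTSO": 12.40, "FBKM": 3.15}

def companies(headlines):
    """Pull (company name, stock code) pairs out of headline text."""
    pattern = re.compile(r"^(\w[\w ]*?) \(([A-Z]{2,5})\)")
    return [m.groups() for m in map(pattern.match, headlines) if m]

def latest_prices(pairs, prices):
    """Join the extracted stock codes against the broker's price table."""
    return {name: prices[code] for name, code in pairs if code in prices}

quotes = latest_prices(companies(NEWS_ITEMS), BROKER_PRICES)
print(quotes)
```

The interesting part is the shape, not the regex: fetch, convert to a list of named things, then join that list against a second source, much as you would join two tables.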

I equate this experience of slurping data from websites to that of hitting a database and retrieving rows of data I care about. Let’s ignore extensibility for now, and focus on getting at that data.

What are the challenges?

Descriptive formats for software and components have been around since the dawn of operating systems. On the DOS/Windows platforms, we had the .EXE/.COM/.DLL packaging formats, which allowed a very limited amount of extensibility and interaction; then we moved to software-to-software messaging systems and shared memory (DDE, Dynamic Data Exchange, was the first attempt at this on Windows). Through the years we’ve evolved these packaging and messaging formats to be descriptive and very extensible (VBX/ActiveX/COM/DCOM and finally .NET/Java etc.).

Formats and languages for data have arguably been around for longer, as databases have traditionally enforced constraints through schema adherence and query languages.

Noting this, the challenge should be clear by now: how do we make cloud comprehension as easy as loading a URI endpoint, reflecting over it, and then slurping down the data that we care about in a structured way? How do we then apply all we know about the evolution of software components to the web? Versioning? Bindings? Reliability?

Then, how do we get there today, using as the base the current minimum standard – unstructured HTML? Jason Kottke recently wrote that “open and messy trumps closed and controlled in the long run”. I tend to think that this may be true for HTML vs. structured markup (at least in the short term). Sure, we’ll have a bunch of the latter, but the former is always going to be there.

Solutions?

Dapper (http://www.dapper.net) takes a social approach: create a community where people tell the Dapper screen scraper where to find the data in the rendered version of the web pages, and convert that back to descriptive XML. There are issues with accuracy, and when the website layout changes, Dapper breaks, so it’s not terribly reliable either. A novel approach nevertheless.

Another approach is to embed semantic “helpers” into the rendering engines themselves (bulletin boards, blog engines, mailing lists etc.), so that when scraper APIs walk the site, they find navigating to the data easier.

Markup formats like RDF are also gaining traction, but it’s unrealistic to assume that we’ll retrospectively add RDF against all HTML based URI endpoints.

My guess?

My best guess at the short term solution for the worst case in cloud comprehension (just having bare minimum HTML, no RSS or anything)? Marrying late-bound data binding mechanisms with pattern matching/machine learning. You’d have the pattern matching software build up a loose idea of what it believes to be the interesting data content in the HTML (just like you can train software to understand the parts of speech in a corpus, you can presumably train it to look for content vs. navigation vs. ads etc.). Then pass that loose representation of the data to a language/platform which late-binds to the various metadata elements, and allows for meaningful introspection. To illustrate what I’m on about, consider the following imaginary HTML slurped down from a business website:



It’s ugly and unstructured. Clearly, I want something clean, something I can walk over and look at. Let’s pass it to our imaginary pattern matching/machine learning platform that dissects the rendered structure, and pulls out what’s interesting:



Much better. And likely something I can code against. This imaginary service could render RSS, RDF, or a popular webservice format, I don’t care, just give me something with structure + metadata.
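
To make the idea concrete, here is a rough sketch of the HTML-to-structure step. It is only a stand-in for the imaginary platform: the sample page is invented, and a hand-written link-density heuristic (a block whose text is mostly link text is probably navigation or advertising) takes the place of real pattern matching or machine learning. The `BlockExtractor`, `classify` and `extract` names are hypothetical.

```python
from html.parser import HTMLParser

# Invented sample page: a navigation bar, one story, and an ad.
PAGE = """
<table><tr><td><a href="/">Home</a> <a href="/markets">Markets</a></td></tr>
<tr><td><font size="4">Contoso lifts profit forecast</font><br>
Contoso said on Tuesday that full-year profit would beat estimates.</td></tr>
<tr><td><a href="/ad?id=7">Cheap flights! Click here</a></td></tr></table>
"""

class BlockExtractor(HTMLParser):
    """Collect the text of each <td>, tracking how much of it sits inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks as (text, link_chars)
        self.parts = []       # text fragments of the block being read
        self.link_chars = 0   # characters of the current block inside <a>
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.parts, self.link_chars = [], 0
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.blocks.append((" ".join(self.parts), self.link_chars))

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse whitespace and newlines
        if data:
            self.parts.append(data)
            if self.in_link:
                self.link_chars += len(data)

def classify(text, link_chars):
    """Guess a block's role: mostly-link text is navigation or advertising."""
    density = link_chars / max(len(text), 1)
    return "content" if density < 0.5 else "navigation-or-ad"

def extract(html):
    """Turn raw HTML into a list of labelled text blocks."""
    parser = BlockExtractor()
    parser.feed(html)
    return [{"kind": classify(t, n), "text": t} for t, n in parser.blocks]

for block in extract(PAGE):
    print(block)
```

A fixed 0.5 threshold is exactly the kind of guess that will be wrong on real pages, which is why a trainable, shared service is the more appealing design.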

So, clearly this would scale better if the pattern matching & machine learning platform was shared. Anyone who’s tried training a neural net/NLP platform knows that the more accurate training data you have, the more accurate the result. Easily solved. Imagine an HTML->XML web service that allows for incremental training. Developers slurp down the URI endpoint via the web service, and can let it know where it got it wrong (e.g. you thought this block of text was an ad, but it was actually a comment on a blog post). Over time, a URI’s metadata just gets better and better.
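
The feedback loop might look something like the toy below. It is an assumption-heavy sketch, not a real learning algorithm: `BlockLabeler` is a hypothetical class that just counts which words callers have associated with which label, and it stands in for whatever model a shared service would actually train.

```python
from collections import Counter, defaultdict

class BlockLabeler:
    """Toy shared labeler that improves as callers send corrections."""

    def __init__(self, default="content"):
        self.default = default
        self.token_counts = defaultdict(Counter)  # label -> token -> count

    def predict(self, text):
        """Return the label whose recorded tokens overlap most with the text."""
        tokens = text.lower().split()
        scores = {label: sum(counts[t] for t in tokens)
                  for label, counts in self.token_counts.items()}
        best = max(scores, key=scores.get, default=None)
        # Fall back to the default label when nothing has been learned yet
        return best if best is not None and scores[best] > 0 else self.default

    def correct(self, text, label):
        """Record a caller's correction so later predictions can use it."""
        self.token_counts[label].update(text.lower().split())

labeler = BlockLabeler()
first_guess = labeler.predict("Cheap flights! Click here")    # nothing learned yet
labeler.correct("Cheap flights! Click here", "ad")            # a developer reports the mistake
second_guess = labeler.predict("Click here for cheap hotels")
print(first_guess, second_guess)
```

Every correction is shared state, so each developer who reports a mistake improves the answer for the next caller of the same service.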

Further extending this theme, consider the cases where we need to know about named entities: imagine another shared machine learned webservice, where you hand it semi-structured XML, and it hands you back the same XML but with more tags describing all the companies it found in the data.

You could pass it the following:



And it hands you back the following:



With two imaginary calls, we’ve gone from an unstructured HTML endpoint, to a semi-structured representation of what a machine believes to be the data, then we’ve added richer metadata using a specialized named-entity web service. And so starts the virtuous cycle…
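
A sketch of that second call, assuming an invented `<news>/<story>` shape for the semi-structured XML and a hard-coded lookup table where the machine-learned entity model would be. The company names and codes are made up, and `tag_companies` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented lookup table standing in for a trained named-entity model.
KNOWN_COMPANIES = {"Contoso": "CTSO", "Fabrikam": "FBKM"}

def tag_companies(xml_text):
    """Return the same XML plus a <company> element per company named in each story."""
    root = ET.fromstring(xml_text)
    for story in list(root.iter("story")):
        text = " ".join(story.itertext())
        for name, code in KNOWN_COMPANIES.items():
            if name in text:  # naive substring match, for illustration only
                ET.SubElement(story, "company", code=code).text = name
    return ET.tostring(root, encoding="unicode")

semi_structured = (
    "<news><story><title>Contoso lifts profit forecast</title>"
    "<body>Shares in Contoso and Fabrikam both rose.</body></story></news>"
)
enriched = tag_companies(semi_structured)
print(enriched)
```

The input comes back untouched apart from the extra tags, so services like this can be chained, each one adding its own layer of metadata.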

To summarize, we’re using a machine to render metadata about URIs for us. It’s not going to be brilliantly accurate, and the structure has to be loose and generic by definition, but we can make up for these deficiencies through machine learning, adding metadata incrementally using specialized services, and adding a social aspect to make training more efficient. As for the generic structure: use your favourite late-binding language or query language to grok/filter/sort that structure to make use of it in a reliable way.
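
One way to picture the late-binding half is a thin wrapper that resolves element names only when they are accessed. This is a sketch under assumptions: `Loose` is a hypothetical helper, and the XML shape and values are invented.

```python
import xml.etree.ElementTree as ET

class Loose:
    """Late-bound view over an XML element: child names are resolved at access time."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        matches = [Loose(child) for child in self._element if child.tag == name]
        if not matches:
            raise AttributeError(name)
        return matches

    def __str__(self):
        return (self._element.text or "").strip()

doc = Loose(ET.fromstring(
    "<news>"
    "<story><title>Fabrikam names new chief</title><price>3.15</price></story>"
    "<story><title>Contoso lifts forecast</title><price>12.40</price></story>"
    "<story><title>Untitled wire item</title></story>"
    "</news>"
))

# Filter to stories that carry a price, then sort by it; no schema was declared.
priced = [s for s in doc.story if hasattr(s, "price")]
ranked = sorted(priced, key=lambda s: float(str(s.price[0])), reverse=True)
titles = [str(s.title[0]) for s in ranked]
print(titles)
```

Because a missing element surfaces as an ordinary `AttributeError`, the calling code can probe the structure with `hasattr` and degrade gracefully when the machine-generated metadata is incomplete.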

More food for thought

We’ve barely touched the surface here – we missed code invocation (i.e. if a URI endpoint has JavaScript, what are the semantics for invoking code on that endpoint?), handling forms and other “shared memory”-like web mechanisms, dealing with embedded non-text content like video players, and how you would go about programmatically exposing that stuff. There’s also the question of consolidation: we already have a bunch of these microformats that are helping us expose URI metadata (RSS is one of them); should we consolidate that stuff? And if so, how would you go about mashing those formats together?

There are a slew of legal issues too: copyright, fair use, adherence to international legislation etc.

Nevertheless, cloud comprehension makes my top 8 because it’s an interesting problem that could blend a bunch of fascinating software engineering technologies: machine learning, pattern matching, social software, scale, and the language late-binding mechanisms to tie it all together. Plenty of curious meat.

And finally, a few links to chew on below. Click away to learn more.

Next in the top 8? Infrastructure Scalability. I’ll be talking about Amazon EC2, Grid Computing, Hadoop, GFS, S3 and more.
Stay tuned.

Links

Semantic Web (Google TechTalk)

Semistructured and Structured Data in the Web: Going Back and Forth


Constructing Hierarchical Information Structures of Sub-page Level HTML Documents

Extracting Structures of HTML Documents

Semantic Web Podcast

RDF

What is RDF

SPARQL

SPARQL and the Semantic Web (Podcast)

Late-binding over XML: Visual Basic 9

Volta and Dynamic Languages

About Joel Pobar
Joel Pobar speaks, consults, and teaches .NET technologies: CLR; programming languages; threading; platforms and more. A former Microsoft Program Manager, since leaving Microsoft he has been tinkering with v.next software: machine learning, natural language processing, programming languages and more.
