Welcome!

Microsoft Cloud Authors: David H Deans, Pat Romanski, Janakiram MSV, Jnan Dash, Andreas Grabner

Blog Feed Post

Diving into H2O

by Joseph Rickert One of the remarkable features of the R language is its adaptability. Motivated by R’s popularity and helped by R’s expressive power and transparency developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R’s strengths. The H2O R package that connects to 0xdata’s H2O software (Apache 2.0 License) is an example of this kind of creativity. According to the 0xdata website, H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”. Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, Kmeans, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net models). Examples with timing information of running all of these models on fairly large data sets are available on the 0xdata website. Execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of View. H2O is a Java Virtual Machine that is optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation a cluster’s “memory capacity is the sum across all H2O nodes in the cluster”. So, as I understand it, if you were to build a 16 node cluster of machines each having 64GB of DRAM, and you installed H2O everything then you could run the H2O machine learning algorithms using a terabyte of memory. Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed JAVA memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H20 swaps to disk. (See Cliff Click’s blog for the details.) The main point here is that the data is not in R. R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O. The R H2O package communicates with the H2O JVM over a REST API. R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this Including: h20.importFile() which imports and parses files from a local directory, h20.importURL() which imports and pareses files from a website, and h2o.importHDFS() which imports and parses HDFS files sitting on a Hadoop cluster. So much for the background: let’s get started with H2O. The first thing you need to do is to get Java running on your machine. If you don’t already have Java the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC: # The following two commands remove any previously installed H2O packages for R. if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) } if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }   # Next, we download, install and initialize the H2O package for R. install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))   library(h2o) localH2O = h2o.init()   # Finally, let's run a demo to see H2O at work. demo(h2o.glm) Created by Pretty R at inside-R.org Note that the function h20.init() uses the defaults to start up R on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H20 running on a cluster. h2o.init(Xmx="10g") will start up the H2O KV store with 10GB of RAM. demo(h2o,glm) runs the glm demo to let you know that everything is working just fine. I will save examining the model for another time. Instead let's look at some other H2O functionality. The first thing to get straight with H2O is to be clear about when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. "H2O supports an R-like language" (See a note on R) but sometimes things behave differently than an R programmer might expect. For example, the R code: y <- apply(iris[,1:4],2,sum)y produces the following result: sepal.length sepal.width petal.length petal.width 876.5    458.6 563.7  179.9 now, let's see how things work in h2o, code loads h2o package, starts a local instance of uploads iris data set into from r package and produces very r-like summary. library(h2o) # load library localh2o =h2o.init() initial locl instance # upload file instance iris.hex >

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@ThingsExpo Stories
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
Have you ever noticed how some IT people seem to lead successful, rewarding, and satisfying lives and careers, while others struggle? IT author and speaker Don Crawley uncovered the five principles that successful IT people use to build satisfying lives and careers and he shares them in this fast-paced, thought-provoking webinar. You'll learn the importance of striking a balance with technical skills and people skills, challenge your pre-existing ideas about IT customer service, and gain new in...
The Internet of Things will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform. In his session at @ThingsExpo, Craig Sproule, CEO of Metavine, demonstrated how to move beyond today's coding paradigm and shared the must-have mindsets for removing complexity from the develop...
SYS-CON Events announced today that Hitrons Solutions will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Hitrons Solutions Inc. is distributor in the North American market for unique products and services of small and medium-size businesses, including cloud services and solutions, SEO marketing platforms, and mobile applications.
For basic one-to-one voice or video calling solutions, WebRTC has proven to be a very powerful technology. Although WebRTC’s core functionality is to provide secure, real-time p2p media streaming, leveraging native platform features and server-side components brings up new communication capabilities for web and native mobile applications, allowing for advanced multi-user use cases such as video broadcasting, conferencing, and media recording.
In his session at @ThingsExpo, Steve Wilkes, CTO and founder of Striim, will delve into four enterprise-scale, business-critical case studies where streaming analytics serves as the key to enabling real-time data integration and right-time insights in hybrid cloud, IoT, and fog computing environments. As part of this discussion, he will also present a demo based on its partnership with Fujitsu, highlighting their technologies in a healthcare IoT use-case. The demo showcases the tracking of patie...
Almost two-thirds of companies either have or soon will have IoT as the backbone of their business. Though, IoT is far more complex than most firms expected with a majority of IoT projects having failed. How can you not get trapped in the pitfalls? In his session at @ThingsExpo, Tony Shan, Chief IoTologist at Wipro, will introduce a holistic method of IoTification, which is the process of IoTifying the existing technology portfolios and business models to adopt and leverage IoT. He will delve in...
SYS-CON Events announced today that Outlyer, a monitoring service for DevOps and operations teams, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Outlyer is a monitoring service for DevOps and Operations teams running Cloud, SaaS, Microservices and IoT deployments. Designed for today's dynamic environments that need beyond cloud-scale monitoring, we make monitoring effortless so you...
Unsecured IoT devices were used to launch crippling DDOS attacks in October 2016, targeting services such as Twitter, Spotify, and GitHub. Subsequent testimony to Congress about potential attacks on office buildings, schools, and hospitals raised the possibility for the IoT to harm and even kill people. What should be done? Does the government need to intervene? This panel at @ThingExpo New York brings together leading IoT and security experts to discuss this very serious topic.
It is one thing to build single industrial IoT applications, but what will it take to build the Smart Cities and truly society changing applications of the future? The technology won’t be the problem, it will be the number of parties that need to work together and be aligned in their motivation to succeed. In his Day 2 Keynote at @ThingsExpo, Henrik Kenani Dahlgren, Portfolio Marketing Manager at Ericsson, discussed how to plan to cooperate, partner, and form lasting all-star teams to change the...
The buzz continues for cloud, data analytics and the Internet of Things (IoT) and their collective impact across all industries. But a new conversation is emerging - how do companies use industry disruption and technology enablers to lead in markets undergoing change, uncertainty and ambiguity? Organizations of all sizes need to evolve and transform, often under massive pressure, as industry lines blur and merge and traditional business models are assaulted and turned upside down. In this new da...
“We're a global managed hosting provider. Our core customer set is a U.S.-based customer that is looking to go global,” explained Adam Rogers, Managing Director at ANEXIA, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
WebRTC services have already permeated corporate communications in the form of videoconferencing solutions. However, WebRTC has the potential of going beyond and catalyzing a new class of services providing more than calls with capabilities such as mass-scale real-time media broadcasting, enriched and augmented video, person-to-machine and machine-to-machine communications. In his session at @ThingsExpo, Luis Lopez, CEO of Kurento, introduced the technologies required for implementing these idea...
In an era of historic innovation fueled by unprecedented access to data and technology, the low cost and risk of entering new markets has leveled the playing field for business. Today, any ambitious innovator can easily introduce a new application or product that can reinvent business models and transform the client experience. In their Day 2 Keynote at 19th Cloud Expo, Mercer Rowe, IBM Vice President of Strategic Alliances, and Raejeanne Skillern, Intel Vice President of Data Center Group and G...
Financial Technology has become a topic of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 20th Cloud Expo at the Javits Center in New York, June 6-8, 2017, will find fresh new content in a new track called FinTech.
@GonzalezCarmen has been ranked the Number One Influencer and @ThingsExpo has been named the Number One Brand in the “M2M 2016: Top 100 Influencers and Brands” by Onalytica. Onalytica analyzed tweets over the last 6 months mentioning the keywords M2M OR “Machine to Machine.” They then identified the top 100 most influential brands and individuals leading the discussion on Twitter.
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
Manufacturers are embracing the Industrial Internet the same way consumers are leveraging Fitbits – to improve overall health and wellness. Both can provide consistent measurement, visibility, and suggest performance improvements customized to help reach goals. Fitbit users can view real-time data and make adjustments to increase their activity. In his session at @ThingsExpo, Mark Bernardo Professional Services Leader, Americas, at GE Digital, discussed how leveraging the Industrial Internet and...
SYS-CON Events announced today that delaPlex will exhibit at SYS-CON's @CloudExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. delaPlex pioneered Software Development as a Service (SDaaS), which provides scalable resources to build, test, and deploy software. It’s a fast and more reliable way to develop a new product or expand your in-house team.