Archive for the New Tech category

Dear Kettle fans,

For those of you who are interested, Pentaho is organizing a free community-oriented WebEx conference tomorrow (Wed June 17th).  Community-oriented means no marketing and open for discussion and feedback.

The goal of the one-hour session is to inform the community of the drastic changes that are taking place in the Pentaho Data Integration product.  More specifically we will cover the challenges that we face with respect to version control of jobs and transformations, central repositories, team development, etc.  We will cover the work we did so far, the choices we made, etc.

Follow this link for more information on the WebEx conference.

See you tomorrow!

Matt

UPDATE : The recording of the session as well as the presentation are available

For some reason, the creation of a mapping to a database table poses a problem for certain people.

This is how it’s done in PDI 3.2.0 or later in the “Table Output” step:

Ogg video available over here

Until next time,
Matt

Dear Kettle fans,

There isn’t a week that goes by where I don’t find myself amazed by the number of contributions and help that the Pentaho Data Integration project receives in all kinds of forms.  There are people contributing anything from small patches to complete steps, folks helping out others on the forum, writing documentation, writing books, translating PDI, etc.  Without any question, this has been a truly amazing experience, not just for me but for the whole Kettle project.

It’s because of that overwhelmingly positive experience that I’ve always tried to be accessible and in contact with my community in all sorts of possible ways.  And because of that positive vibe I have refrained from commenting on the negative flip side to that story for the longest time.

The problem is really that lately things have been changing.  It’s probably caused in general by an increasing attention to open source and specifically by an increase in popularity of Kettle.  In any case, certain types of people do the following:

  • Send me personal email
  • IM me on skype/Yahoo!/MSN/AIM/…
  • Send me all sorts of messages and questions through the forums
  • Ask questions on this blog

Usually it’s a combination of any of the above.  Any time now I expect folks to be sending me direct twitter messages.  The questions are always the same:

I have an urgent Pentaho porblem.  I am incapable of using the forum for some stupid reason and so you have to help me, preferable now or within the next 15 minutes!!!!

This way, the meaning of “The kindness of strangers” becomes more and more like the one from the Nick Cave song.

I’ve just finished reading Linus’ book “Just for fun” (Thanks again Domingo!) and his approach to the problem of staying in reach for people to contribute code and at the same time allowing yourself to have a life and a job is simple: if it ain’t fun, don’t do it.  Well, the barrage of these sorts of questions stopped being fun for me a long time ago.

As such, I’m going to try this approach: any question that could or should be asked on the forum is from now on silently ignored and deleted from my mailbox.  Any person that is not part of my “community” and that needlessly contacts me over IM gets blocked indefinitely.  And yes, that goes for twitter as well.  Off-topic questions on this blog go to the spam folder as well.  I will simply refuse to spend time on non-interesting topics.

I thought about creating a standard response e-mail, but any sort of replying is simply an encouragement to certain types of people and will only make matters worse. (been there, done that)

I’m sure everyone understands that this is the only way to free up time to work on the real problems at hand.  Thank you for your understanding in any case.

Until next time,

Matt

Hello Pentaho friends,

Pentaho BI Server 3.5 RC1 (***) is released and this is great news indeed! There are lots of new features in there, major advances to Pentaho Report Designer, Pentaho Metadata, enhancements to Pentaho Dashboard Designer including a new chart editor and an interactive data grid, along with numerous usability enhancements and maintenance fixes.

Personally I’m particularly pleased with the usability side of the story.  Years ago when I joined Pentaho I rooted for us to pass the “Report ready on my own data in 5 minutes” benchmark. Granted, this is usually not even possible with closed source BI gear but let’s not use that sort of software as a standard, mkay?  Well, I’m happy to report that we’re getting really close to that goal with this 3.5 release.  The auto-generation of Pentaho Metadata models and nifty things like Will Gorman’s in-line ETL support for CSV file reading really are ground-breaking stuff. (Kettle doing the work, yeah!)  The back-end story is interesting and there is a lot to say about that, but the most important thing is that it makes it easy to create reports on your own data.

On top of that, if you used the old report designer, you wouldn’t recognize the new one; it’s that much better. Take a look at the video series on YouTube on the subject:

It’s amazing to see how much work got cranked out by our team at Pentaho in the last year, not just on the “bling bling” but also solving back-end issues like unified file formats, charting standards, etc.

It’s also interesting to note that Pentaho keeps releasing the majority of its software under an open source license at a time when our direct competitors are doing exactly the opposite.

All this and a few “other things” make me really look forward to the year to come.

Go Pentaho!!

Reporting from Pentaho HQ in Orlando, until next time,

Matt

*** RC1 means “Beta 1” as in “for evaluation, not production” purposes folks!

With all the traveling I forgot to blog about the Pentaho Data Integration 3.2.0-RC1 release.

Grab the goodies on SourceForge!

Until next time,
Matt

Dear Kettle & MySQL fans!

I’m really looking forward to going to the MySQL User Conference next week, not just because I’m speaking in 2 sessions again, but perhaps also because these are “interesting” times for MySQL and Sun Microsystems.  Pivotal times it would seem.

Here are the 2 sessions I’m going to do:

  • Cloud Computing with MySQL and Kettle : I’m particularly happy that MySQL accepted this session: it will demonstrate how easy it has become to do cloud computing exercises with tools like MySQL and Kettle.

So please drop in on our sessions and join the fun.  2 years ago my sessions drew quite a crowd and so I hope that this is again the case.  Pentaho is a sponsor of the event and even has a booth (#308) on the main show floor.  You can find me there to chat on Tuesday & Wednesday afternoon (1pm-4:30pm).  I’ll be there together with a group of people from Pentaho including Julian Hyde, James Dixon, Lars Nordwal, Lance Walter, Matt Papertsian & Jared Cornelius.

On Thursday I’ll be visiting the sages from SQLStream in the morning to talk about integrating their technology to create truly real-time data integration solutions without the need to fork over insane amounts of money.  Later that day we’ll all go see John Sichi’s session at the nearby (same building) Percona Performance Conference.

See you soon!

Matt

Dear Kettle fans,

As expected there was a lot of interest in cloud computing at the MySQL conference last week.  It felt really good to be able to pass the Bayon Technologies white paper around to friends, contacts and analysts.  It’s one thing to demonstrate a certain scalability on your blog, it’s another entirely to have a smart man like Nicholas Goodman do the math.

Sorting massive amounts of rows is a hard problem to take on.  Making it scale on low-cost EC2 instances is interesting as it proves a certain level of scalability.  Nick ran 40 EC2 nodes in parallel to do the work and saw that it was good.  450,000 rows/s for US$4.00/hour is not bad. Note: the tests sort 300M, 600M and 1.8B line-item rows from TPC-H (scale factors 50, 100 and 300 respectively).
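As a rough sanity check on those figures, here is a minimal back-of-the-envelope sketch.  It only uses the three numbers quoted above (40 nodes, 450,000 rows/s, 4 US dollars per hour); the function names are made up for illustration, and it assumes that the throughput is the aggregate for the whole cluster and that the hourly price covers all 40 nodes.  Neither assumption is spelled out here, so treat the results as an estimate and not as figures from the white paper.

```python
# Back-of-the-envelope arithmetic on the quoted figures.
# Assumption: 450,000 rows/s is the aggregate cluster throughput and
# 4.00 USD/hour is the price of the whole 40-node cluster.

NODES = 40                  # EC2 nodes run in parallel
ROWS_PER_SECOND = 450_000   # quoted sort throughput
COST_PER_HOUR = 4.00        # quoted price in US dollars

SECONDS_PER_HOUR = 3600


def per_node_throughput(rows_per_second: float, nodes: int) -> float:
    """Average rows per second handled by a single node."""
    return rows_per_second / nodes


def cost_per_node_hour(cost_per_hour: float, nodes: int) -> float:
    """Hourly price of one node, in US dollars."""
    return cost_per_hour / nodes


def cost_per_billion_rows(rows_per_second: float, cost_per_hour: float) -> float:
    """US dollars spent to sort one billion rows at the given rate."""
    rows_per_hour = rows_per_second * SECONDS_PER_HOUR
    return cost_per_hour / rows_per_hour * 1_000_000_000


if __name__ == "__main__":
    print(f"rows/s per node:       {per_node_throughput(ROWS_PER_SECOND, NODES):,.0f}")
    print(f"USD per node-hour:     {cost_per_node_hour(COST_PER_HOUR, NODES):.2f}")
    print(f"USD per billion rows:  {cost_per_billion_rows(ROWS_PER_SECOND, COST_PER_HOUR):.2f}")
```

Under those assumptions each node averages a bit over eleven thousand rows per second, and sorting a billion rows costs a few dollars.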

For certain, the paper seemed to make it easier for me to point to PDI scalability and it opened some doors for further testing on big iron at Sun Microsystems.  It was great to talk to so many people.  I even walked up to the Amazon Web Services booth at the expo to ask about the performance bottleneck in the EBS that was exposed by the white paper.  “It’s being worked on” was the reply :-)

The most interesting thing about the PDI cloud integration work is that there don’t seem to be a lot of other ETL tool vendors doing it.  In fact, after a Google or 2 I could only find Informatica with a SaaS (not even IaaS) offering and I kinda doubt that closed source software is a good match for cloud computing.

So I went out there and did a presentation on the subject to explain to people how they would set it up for themselves.  The open source way is to not only do the marketing but to allow people to run their own tests and see for themselves.  That way you get valuable feedback to improve your offering.

Here is a copy of the presentation I gave: Cloud Computing with MySQL and Kettle.

I thought it was a good session although for once I didn’t get “The Question”, you know the one where people ask me how Kettle is different from Talend and where I get to comment on their lack of scalability.  Oh well, I guess you can’t win them all :-)

Finally, people have been asking me about integration with both SQLStream on the one hand and MapReduce/Hadoop/Hive/HDFS on the other hand.  I’m happy to say that the former is in progress and that I’ve started talks with the fine folks from Cloudera to get started on the latter.  I simply loved Aaron Kimball’s tutorial @ MySQL Conf on the MapReduce subject and think that there is a lot of potential for integration with PDI to make us scale even better.

Until next time,

Matt

Dear Kettle friends,

Because there were so many people wondering what GA or even “Generally Available” meant, we renamed the thing to “Stable”.  So the stable, production-ready version of Pentaho Data Integration 3.2.0 is now called 3.2.0-stable.  Go figure.

In any case, Pentaho Data Integration 3.2.0 is generally available on SourceForge over here.

We changed a lot of things in that release, from small visual improvements in the hop color scheme to larger changes behind the covers with things like named parameters.

Other pet peeves that changed are for example the Table Output step where we now (finally!) are capable of specifying individual fields to be stored.  The Dimension Lookup/Update step got an upgrade as well with support for cache pre-loading, alternative algorithms for date ranges, current row flag and much more.  Obviously, we got a whole batch of new steps.  Simple steps were added like “Replace values in string” or the ultra-fast User Defined Java Expression.

For an overview of the changes, have a look at the What’s new in 3.2 page on the wiki.

Enjoy this release & thank you all for your continued support and help,

Matt

Dear Kettle friends,

Will Gorman and Mike D’Amour, Senior Developers at Pentaho, are presenting Pentaho’s Google integration work at the Google I/O Developer Conference (at the Sandbox area to be specific).  Yesterday, Pentaho announced as much.

Here are a few of the integration points:

  • Google maps dashboard (available in the Pentaho BI server you can download)
  • A new Google Docs step was created for Pentaho Data Integration Enterprise Edition
  • Running the Pentaho BI server on Android (AVI, 30MB)
  • A new Google Analytics step was created for Pentaho Data Integration Enterprise Edition
  • Since version 2.0, the Pentaho BI server depends heavily on Google Web Toolkit (GWT)

To top that off, Will twittered about this new Lego bar-chart + logo they created for the conference:

UPDATE: now with building instructions and action video!

We are all soooo proud of them!

Until next time,

Matt

Simply following the rest of the bloggers on this meme.

Usually I get by with these numbers.  The upload speed is dramatically low though.