|
|
|
|
|
|
|
|
|
|
1/19/2010
I got asked the question of how to generate random numbers with a normal distribution. A quick search online provides all kinds of formulas, it is a simple matter of picking one to implement. I could of course implement this in the formula library in C++, but that doesn't provide instant gratification, so off writing a macro I go. Some of the functions are iterative, which would be slower and more difficult in Alteryx, so I chose a simpler formula. The formulas online also produce a normal distribution with and average of 0 and a standard deviation of 1. Depending on the data you are trying to model, you probably want to change that. Rather than make the use do it, I added some parameters in the macro.
Finally, anytime you write a macro it is nice to test it. I wrote a quick module to generate 1,000,000 random #'s and test their average, StdDev, and also produce a chart so you can visually confirm that it is a norm distribution. The results of that are:
|
Number of Iterations Tested: 1,000,000
Average(should be close to 100): 100.02
Standard Deviation (should be close to 15): 14.98 |
|

|
My favorite part of this is I used the chart that the test module produced as the icon for the macro. Kind of recursive. Anyway, now generating random numbers with a Gaussian Distribution is as easy as dragging out a macro. The macro and test module can be found here... 1/7/2010To start at the end, the source code associated with this article can be found here. The projects are all built with VS 2005.
I have received a bunch of questions about using the Alteryx API lately. It is mostly straightforward, but it is can be confusing how to find the latest installed Alteryx directly and properly wrap calls from C#/,Net to the Alteryx API. In order to simplify using the API, I wrote a simple wrapper that not only wraps the Alteryx API, it also will find the latest installed version and link to it without having to do anything special. The interface is compiled for any CPU, and will detect if you are running in 64bit or 32bit and find the appropriate Alteryx. It you are running in 64bit and you only have a 32bit Alteryx installed, it will fail. All failures are thrown as a AlteryxApiException. In order to use that Alteryx API, all you need to do is to include AlteryxAPI.Net.dll as a reference in your project. That's it, just start calling the API and it will deal with finding the currently installed version.
In addition, there is a new API sample application. It just implements a Mortgage Calculator wizard. It shows running a custom interface on top of a wizard. It allows a user to solve for any of the common Mortgage parameters. It does it all via an Alteryx Wizard. It also shows the version # of the Alteryx that it found installed.
The Interface of the sample app | The Alteryx wizard behind the scenes | 
| 
|
This application does show one advanced API feature. There is a special tool that can be used to send output directly to the API without going through a file. This application uses that feature to directly populate the result into the form. This example shows just how easy it is to link to the Alteryx API to perform a complex operation. There are only a few lines of code that deal with the Alteryx API at all.
In addition to creating the new DLL wrapper, there is an associated help file:
Update: There was a small bug in the Alteryx wrapper from this article that could generate an Null Reference exception if you didn't have the correct Alteryx installed. (32 vs 64 bit). The code (found here) has been updated. Also, although the .Net wrapper will work with Alteryx 4.1, the associated sample uses new features in Alteryx 5.0 so it will not work in 4.1.
12/16/2009Before I get to today's post, I want to talk a little about Bill's recent post. Polar graphs are not for everyone, but what is really interesting about that post is the technique of using the map tool to make custom charts. He used the map tool in a way that had nothing to do with mapping, it was used exclusively as a rendering engine. Anyway, back to today's topic...
A gravity model is a simple model of the flow of trade from demand (the customers) to the supply (stores). I wanted to make a macro that implements the gravity model to make it easy to reuse. Just to review some simple rules of building macros & wizards: The 1st step when building a macro (or wizard) is to build a standalone module that works - so I spent some time building a module and documenting the module. Once that is done, Figure out the list of questions you want to ask. In my case it was some simple gravity model parameters:
Once you know the questions to ask, identify the tools in the module that need to be changed. It is always a good idea to go into the annotation of each tool you want to change and set a descriptive name (it will make it easier to find later):
Now go into the macro settings and set all the appropriate information into the tools:
Since we named the tool, it is really easy to find in the list. In this case we are updating the Friction of Movement (exponent) value in the formula. Once you have all your actions figured out, you can go ahead and configure your inputs and outputs and you have a pretty easy macro to use. I always like to put a new icon on my macros so I can remember what they do.
Now that the macro is so easy to use, it's really easy and quick to experiment with adding or removing store locations and you can see how it affects your network. Wouldn't it be cool if there was some automated way to do that? Oh well, that will have to wait.
This model is of course very simplistic. We have a whole bunch of companies that use Alteryx to produce MUCH more interesting models than this. I will be trying to put together a list of companies and their specialties that can help people build all kinds of Alteryx modules.
The modules in this post can be found here. Note, this macro requires Alteryx 5.0
12/8/2009Now that Alteryx 5.0 is finally out the door, my team and I finally have a chance to breath. Once again, the team did an amazing job. I believe that Alteryx 5.0 is the best release yet. The team that brought this release involves a whole lot more than development, so the list of people to thank is getting bigger all the time.
1st, a big thank you to the development team. I could not do this without Linda, Rob & Kurt. Linda gets extra props for doing anything and everything for the last few weeks to get the product tested, built and out the door. Rob gets extra props for fixing the last minute install bug after we thought we were done and the rest of us were on vacation. Thank you, you guys make me look good.
How on earth could we do this without the products team. Tara came back from having a baby straight into a stressful release cycle and was amazing. Nothing would happen in Boulder without Amy. She is the anchor that keeps the rest of us going. And Rob, the newest member of the team, had quite an awakening about just what it takes to release a huge package like Alteryx, but jumped in with a great attitude and was really helped with testing. Thank you.
Another big thank you to everyone who helped test. The data and professional services team are HUGELY appreciated for helping with the regression tests. In particular a big thank you to Bill - without your help the Calgary link tables would be much worse. Lots of other people in other departments help with our testing blitzes at the end. It is them you can thank for running such a quality product.
I would also like to thank the Web Core team, in particular Ian. They are becoming a major consumer of Alteryx and driving us to improve in all kinds of subtle ways. It is a pleasure working with you guys. Many of the improvements in the next year will be driven by them.
As they say, it takes a village to raise a child. Similarly it takes a whole company to produce a great product. SRC has a fantastic group of talented people and it is a please working with all of you. Thank you for your time and efforts. We couldn't focus on making a great product if we didn't have such a great team backing us up.
And last, but certainly not least, I want to thank all of our users, in particular our beta testers. I won't name any of you, but without customers, Alteryx doesn't exist. For Alteryx 5.0 we got the best feedback from beta testers ever. We also have been getting more and better feedback from other customers. Every one of the new features is in response to something you guys requested (ok demanded).
Thank you,
ned.
12/7/2009In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a fixed point and an angle from a fixed direction.
We can create a polar coordinate system in Alteryx by taking an input file with a distance and degree for each of the 360 degrees in a circle and using our Mapping Tool creatively we can emulate polar graphing.
Sample:
Distance | Degree | 0 | 0 | 0 | 1 | 0.5 | 2 | 1.3 | 3 | 2.6 | 4 | 4.4 | 5 | 7 | 6 | 10.4 | 7 | 15.5 | 8 | 23.2 | 9 | 24.6 | 10 | 20.3 | 11 | 19 | 12 | 19.8 | 13 |
The Alteryx module will determine the max distance value from the input file. With the max distance we can build the distance rings from the center (0.00 Lat and 0.00 Lon) in increments of 10 using the generate rows tool. The outer Ring is 0 distance with the next ring toward the center as distance of 10.
Using the formula tool with the input data from above we determine the distance from the center for each value by taking the Max_Value – Distance. So that when we plot the 0 degree and 0 distance it is located on the outer ring at 0 degrees.
The distance from the center for each direction is used as input into the GetOffsetPoint macro which calculates the actual point in space from 0.00 lat and 0.00 lon, the Offset Point.
Using the Poly-Build Tool Sequence Polygon with the Source Field the Offset Point and the Sequence Field each of the degrees starting at 0 and ending at 360 the graph is created.
The Vertical lines for the graph are built using the GetOffsetPoint every 15 degrees with the distance of the Max_Value calculated above. The Poly-Build Sequence Polyline actual generates the lines for the graph from the Offset Points.
We also create the Label point for the rings using the GetOffsetPoint with increments of 10 for the distance and 0 for the degrees.
Finally the layers are all joined together and Plotted using the Map Tool to view the results below.
A good example of a practical application for Polar Graph is graphing signal strength around a tower.
For example module and Data click here. Download the Polar Graph ZIP file. 11/30/2009
It is official... Alteryx 5.0 was released on Friday November 20, 2009. It is on its way to our devoted users and loaded with new features and functionality.
I will be conducting the What's New in 5.0 webinar tomorrow, Tuesday December 1 at 2pm Mountain Time. If you haven't done so already please register as I will be touching all the new stuff and directing you to where these gems are located in the program. This session will be recorded so you can play back at your leisure. To register click here.
The Customer Advantage group has already scheduled instructional webinars on the new features of Alteryx 5.0. You can register for them here.
We've been blogging about some of the latest features since the summer. Below is a list of the highlights of this release and hyperlinks will take you to the posts on those topics:
· Dynamic Select and Dynamic Rename Tools – these new tools enable developers to create more adaptable wizards and work with a wider variety of data.
· Report Formatting Tools – two new tools in the reporting category – Arrange and Overlay – provide greater flexibility in creating customized reports. The Arrange Tool enables transformation and rearrangement of data fields for more meaningful presentation of information. The Overlay Tool allows report content elements (e.g. maps, tables, images, etc.) to be overlaid on top of each other.
· Reporting Email Tool – this tool creates automated email attachments of Alteryx module outputs, which then get directly sent to the defined recipients.
· Batch Macro - enables automated program looping within a single module. Allows for target analysis of individual records/groups that need to be isolated from each other.
· New Macros – six new macro tools have been added, including several different statistic calculations and a macro that returns a count of how many records are going through the tool.
· Formula Library Additions:
-
Spatial Functions Category – over 20 new spatial functions have been added including calculations of area, bounding rectangle, centroid, distance, length, and others. These spatial functions can be utilized anywhere in Alteryx where formulas are used.
· Finance Functions Category – financial analysis calculations (e.g. NPV; IRR; etc.) can now be natively added to Alteryx analyses (no need to export data to other applications to perform these calculations).
· MetaCarta GeoSearch News Wizard – a new wizard is being added that will launch searches of MetaCarta’s vast library of news feeds and will return the latest articles, blog entries, etc. for any topic in any defined geographic area.
· Calgary Table Linking – new capability to link multiple Calgary data tables together for querying. Build one query that accesses multiple tables and brings back a combined result.
· Module Dependencies – easily see and edit module data dependencies/paths for easy sharing and distribution of modules from one user’s desktop to another.
Hope you enjoy the improvements we made to Alteryx. As always we welcome your feedback!
Tara
10/27/2009It's been a long time since our last blog post. The core team is extremely busy working towards Alteryx 5.0. I hesitate to give an exact date, but we are getting very close to a code freeze with a release a few weeks after that.
One of the issues that has come up in our testing is comparisons to NULLs in formulas. For the purposes of simplicity, we will be focusing on numbers, but strings have all the same issues and produce the same results. Think about the following examples:
1<Null()
0<Null()
-1<Null()
Null()==Null()
Null()<Null()
Null()<=Null()
In the past, we made the decision that Null()==Null() should be true. Without that being true, it makes it difficult to compare streams of data. Many people would argue that Null()!=Null() and in an ideal sense, I would tend to agree with them. The problem is that 2 different fields can be Null() for different reasons and may not be equal in any real sense. The problem with ideals is that they aren't always intuitive. Users expect to be able to compare fields from 2 different data streams for equality. Implementing some Platonic ideal of "right" that confuses users never helps.
We recently found some bugs in the <, <=, > and <= operators with regards to Null() and we were forced to make the decision that nothing is greater or less Null(). If there is a Null() on either side of a < or >, than the answer is false. Clearly Null()<Null() is also false. With <= and >=, if there is a Null() on either side, the only way it can be true is if there is a Null() on the other side - because Null()==Null().
It does get weird though. If 5<Null() is false and 5>Null() is false, doesn't that imply the 5==Null()! Obviously having Null() values puts the comparisons outside the normal rules of logic. Another weird one: Null()==(Null()+1) Since (Null()+1) results in Null(), the result is true. Another argument that Null()==Null() should probably be false in an ideal world. Sometimes you have to compromise and be less "right" to give users what they want. 9/25/2009A question came from our own Professional Services group:
They are requiring a lot of fuzzy matching on business names. Is there any documentation on how the weighting occurs when using Word Frequency Stats. They have traditionally removed common words from the match name before matching using a third party matcher I am not familiar with. They had to do this because there was not the capability of the word stats. I need to be able to explain how this works, and compare this process to the matching results when frequent words are completely removed.
There is a good reason for weighting instead of removing common words. The following are all real company names from the D&B database: "CITY INC", "FIRM INC" "COMPANY INC" "THE COMPANY INC", "STORE INC", "CLUB INC", etc.... If we removed common words we'd have nothing left and couldn't possible match those names.
The solution is to use statistical weighted averages. Say we have a company named HARDING INC, we still count INC, but we count HARDING much higher. This allows us to match names that are only common names while at the same time not having the common words interfere with good match of less common names. Alteryx does this by supplying a statistics file with the word based matching.
The weight for a word is effectively log(totalWordCount/thisWordCount).
TotalWordCount is the sum of all the word counts in the statistics file. thisWordCount is the count from the statistics file or if not present, 1.0
If the statistics file is truncated such that the last word in the stats has a count >1, all of the counts and the total are divided by the # such that the last word has an equivalent weight of an unknown word. This is to prevent a discontinuity in the math.
Below is a math workup from the generic US Company Statistics. The last word is OLSON with a count of 2605, so the whole file gets scaled down with that. The net effect is that an unknown word is weighted 3X higher than INC. and 1.9X higher than CHURCH.
If the statistics file were not truncated at 200 records (or a dummy record added with a count of 1), an unknown word would weight 4.4X INC and 3.5X CHURCH.
Below are some specific counts and their weights from our generic US Company statistics. The results are always better if you calculate the statistics with your own file.
Remember in the Alteryx Runtime directory there is a module you can use to create word frequency statistics on your own databases. That module is in the following directory: \Program Files\SRC\Alteryx4.1\RuntimeData\FuzzyMatch\CollectStats.yxmd
You will want to save your own resulting *.yxdb file to that directory so you can use it in your Fuzzy Match analysis.
Word | Count | Weight | Total | 29190175 | | & | 1653048 | 1.246951 | INC | 1304729 | 1.349716 | SVC | 569248 | 1.709935 | CO | 548111 | 1.726368 | CTR | 328941 | 1.948119 | OF | 313313 | 1.969258 | ASSOC | 231113 | 2.101412 | CHURCH | 226418 | 2.110326 | AUTO | 220870 | 2.1211 | A | 217291 | 2.128195 | MD | 211605 | 2.139711 | CORP | 176711 | 2.217973 | UnkownWord | 2605 | 4.049429 | FullStatsUnknown | 1 | 7.465237 | | | | | | | | |
9/16/2009
Last time I wrote a module that could parse web server logs according to the W3C spec. Although it works well, server logs can get very big and there can be a lot of them - typically 1 per day. It is important that the parsing run as fast as possible. On one machine, parsing 100,000 records with 22 fields takes about 7 minutes. When web server hits often number in the millions, that is just too long just for parsing.
The left side of this module is obviously the crucial part, because that is parsing the bulk of the records. The right side is only parsing the field names, which should only be a single row so we won't worry about speed on the right. Looking at the left side (screen shot below), some obvious speed issues stand out. The following tools require all the data to be sorted: Tile, Summarize, Join, Join, Crosstab. There are potentially 5 different sorts going on here! There may be less than 5 because Alteryx detects if a stream has already been sorted, but it's still bad. This is exacerbated by the fact that the RegEx tool is tokenizing to a stream, which means that we are multiplying the number of records by the # of fields. This means that we are sorting 2.2 million records instead of 100,000!
So, in order to make this run faster, we want to avoid sorts and also avoid parsing to rows. The RegEx tool allows us to parse to columns as well as rows, but you have to tell it how many columns ahead of time. In this case I just picked a number that is too high. I picked 40, since our server logs have 22 fields. If your server logs have more, you might have to change that 1 tool. I then used a MultiFieldFormula instead of a formula to remove quotes and escaping since all fields are in a single row. From here onto a Dynamic Rename to rename the fields and then I get to use the new DynamicSelect tool to filter out all the extra columns that were parsed and not used. It now runs in 27.2 seconds! that is more than a 15X improvement on our throughput.
There is still 1 temp file left. That is in the dynamic rename because it has to wait for the field names to come in from the right before it can release any records from the left. My 1st thought was to have a Sample tool on the right stream to only look at the 1st 100 records or so for the field names. Unfortunately, the way the sample tool was written, it still didn't release the record stream until its input was complete, even though it all pushes the 1st 100 records. Good thing I have access to the source code though, I changed that behavior so it releases the full stream after it gets its record limit. This didn't make any difference in performance for the 100,000 record test, because Alteryx was smart enough to keep that temp file in memory, but for larger data sets it makes a positive speed difference.
As you can see, spending time optimizing Alteryx modules can make a huge difference in performance. Next time we will talk about converting this to a batch macro so we can parse a directory full of logs. The module from this blog can be downloaded here.
A quick addendum: I finally got around to testing a fresh x64 Alteryx. This same test ran in 20.4 seconds. Since there are no temp files, there is no advantage with memory. That extra 25% speed is purely the speed advantage of 64bit over 32bit.
Below is a screen shot of the before and after for the relevant part of the module. You can see it is both simpler and has less tools.
|
Before (7:02 Minutes) |
After (27.1 seconds) |
|

|

|
9/11/2009
This started out as a simple idea for a blog post that has grown out of control. Following up the article on Geocoding IP Addresses, I wanted to write a simple module to parse a web servers log and share it with this blog. It turned out to lead me down a rabbit hole.
First let's look at a sample of a log file:
#Software: Microsoft Internet Information Services 6.0 #Version: 1.0 #Date: 2009-05-19 00:00:10 #Fields: date time cs-uri-stem c-ip cs(Referer) 2009-05-19 00:00:10 /pp/anp.php 88.80.7.248 - 2009-05-19 00:00:28 /pixel.gif 213.17.9.61 http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Hoofdstukken.htm 2009-05-19 00:00:29 /pixel.gif 82.75.139.91 http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Hoofdstukken.htm 2009-05-19 00:01:10 /pixel.gif 62.131.15.194 http://home.hccnet.nl/t.amerongen/Wereldklok.htm 2009-05-19 00:01:21 /pixel.gif 80.127.8.10 http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Systematische%
I don't have any log files I can give away, but there are lots of people on line who leave them available for download. A quick Google search will give you lots of files of all different sizes & styles.
It might also be worth your time to read the specification for these files. http://www.w3.org/TR/WD-logfile.html The good news is that this file format is an open standard and used in most web servers. The bad news is that the standard is very flexible. That means that we have to do a lot of work in Alteryx to parse all the permutations.
1st up is the multi-line header. We have to split off all the header rows and find the one starting with #fields, parse that and get the fields names out.
After that we have to parse all the non-header rows. Mostly it is a matter of splitting on whitespace. The problem is that a value can contain whitespace, in which case it will be in quotes. Further there is a problem that a quoted string can contain quotes, in which case the internal quote will be doubled up. After a bunch of trial and error and help from RegEx Coach, I settled on the following regular expression to do the parsing:
(?:^|\s)((?:[^"\s]+)|(?:"(?:[^"]*(?:"")*)+(?:"|$)))
Breaking that expression down, we get:
It can be helpful to break a regular expression up like that. It really helps you see what you are matching. Anyway using the regular expression tool in Split to Rows mode, we successfully got all the fields parsed out and (I think) matched the parsing rules from the spec.
Joining our fields together with our data was fairly simple, but it involves adding a RecordID, a few joins, an append fields and even a crosstab. When I first did it it sure seemed like the simple way. The module can be found here: ParseServerLog_Part1 Again, you will have to have your own server log, or get some from Google.
This module works great, but there are 2 problems that will be address in future blogs.
1) Most servers write a new log per day. We will want to read a whole batch of them. Here is a hint to how we will do that.
2) When you parse a years worth of data, it is slow. Its a lot of data, but there ought to be a way to speed it up.
|
|
|
|
|
|
|
|
 |
 |
 |
 |
|