Archive for July, 2008

Efficient grouping and debatching of big files using BizTalk 2006

Wednesday, July 16th, 2008

I’ve seen people struggle both on the forums and while doing consulting when in it comes to finding an good way of grouping and transforming content in file before debatching it. Say for example we have a text file like the example below.

0001;Test row, id 0001, category 10;10
0002;Test row, id 0002, category 10;10
0003;Test row, id 0003, category 10;10
0004;Test row, id 0004, category 20;20
0005;Test row, id 0005, category 20;20
0006;Test row, id 0006, category 20;20
0007;Test row, id 0007, category 20;20
0008;Test row, id 0008, category 10;10
0009;Test row, id 0009, category 10;10
0010;Test row, id 0010, category 30;30

Notice how the the ten rows belong to three different categories (10,20 and 30). These kind of export are in my experience quite common batch export from legacy systems and they usually aren’t ten rows (in my last project the sizes ranged from 5 MB to 25 MB) …

The problem

The problem is that the receiving system expects the data to be in separate groups, grouped by the categories the rows belong to. The expected message might look something like the below for category 10 (notice how all rows within the group are from category 10)

<ns1:Group numberOfRows="5" xmlns:ns1="http://Blah/Group">
  <Row>
    <Id>0001</Id>
    <Text>Test row, id 0001, category 10</Text>
    <Category>10</Category>
  </Row>
  <Row>
    <Id>0002</Id>
    <Text>Test row, id 0002, category 10</Text>
    <Category>10</Category>
  </Row>
  <Row>
    <Id>0003</Id>
    <Text>Test row, id 0003, category 10</Text>
    <Category>10</Category>
  </Row>
  <Row>
    <Id>0008</Id>
    <Text>Test row, id 0008, category 10</Text>
    <Category>10</Category>
  </Row>
  <Row>
    <Id>0009</Id>
    <Text>Test row, id 0009, category 10</Text>
    <Category>10</Category>
  </Row>
</ns1:Group>

The problem is now that we need to find a efficient way of first grouping the incoming flat file based message and then to debatch it using those groups. Our ultimate goal is to have separate messages that groups all rows that belongs to the same category and then send these messages to the receiving system. How would you solve this?

I’ve seen loads of different solution involving orchestrations, databases etc, but the main problem they all had in common is that they’ve loaded up to much of the message in memory and finally hit an OutOfMemoryException.

The solution

The way to solve this is to use pure messaging as one of the new features in BizTalk 2006 is the new large messages transformation engine.

Large message transformation. In previous versions of BizTalk Server, mapping of documents always occurred in-memory. While in-memory mapping provides the best performance, it can quickly consume resources when large documents are mapped. In BizTalk Server 2006, large messages will be mapped by the new large message transformation engine, which buffers message data to the file system, keeping the memory consumption flat.

So the idea is the to read the incoming flat file, use the Flat File Disassembler to transform the message to it’s XML representation (step 1,2 and in the figure below) and the to use XSLT to transform in to groups (step 4 and 5). We will then use the XML Disassembler to split those groups into separate messages containing all the rows within a category (step 6 and 7).

GroupingFlow2

Step 1, 2 and 3 are straight forward and pure configuration. Step 4 and 5 will require some custom XSLT and I’ll describe that in more detail in the section below.  Step 6 and 7 will be discussed in the last section of the post.

Grouping

Let’s start by looking at a way to group the message. I will use some custom XSLT and a technique called the Muenchian method. A segment from the XML representation of the flat file message could look something like this.

<Rows xmlns="http://Blah/Incoming_FF">
    <Row xmlns="">
        <ID>0001</ID>
        <Text>Test row, id 0001, category 10</Text>
        <Category>10</Category>
    </Row>
    <Row xmlns="">
        <ID>0002</ID>
        <Text>Test row, id 0002, category 10</Text>
        <Category>10</Category>
    </Row>
...
[message cut for readability]

The XSLT will use could look something like the below. It’s kind of straight forward and I’ve tried commenting the important parts of in the actual script. Basically it will use keys to fins the unique categories and then (again using keys) selecting those rows within the category to loop and write to a group.

<?xml version="1.0" encoding="utf-8"?>

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:ns1="http://GroupAndDebatch.Schemas.Incoming_FF"
                xmlns:ns2="http://GroupAndDebatch.Schemas.Grouped"
                >
    <!--Defining the key we're gonna use-->
    <xsl:key name="rows-by-category" match="Row" use="Category" />

    <xsl:template match="/ns1:Rows">
        <ns2:Groups>

        <!--Looping the unique categories to get a group for-->
        <xsl:for-each select="Row[count(. | key('rows-by-category', Category)[1]) = 1]">

            <!--Creating a new group and set the numberOfRows-->
            <Group numberOfRows="{count(key('rows-by-category', Category))}">

            <!--Loop all the rows within the specific category we're on-->
            <xsl:for-each select="key('rows-by-category', Category)">
                <Row>
                    <ID>
                        <xsl:value-of select="ID"/>
                    </ID>
                    <Text>
                        <xsl:value-of select="Text"/>
                    </Text>
                    <Category>
                        <xsl:value-of select="Category"/>
                    </Category>
                </Row>
            </xsl:for-each>
            </Group>
        </xsl:for-each>
        </ns2:Groups>
    </xsl:template>

</xsl:stylesheet>

noteYou have found all the XSLT and XML related features in Visual Studio - right?

Ok, so the above XSLT will give us a XML structure that looks some like this.

<?xml version="1.0" encoding="utf-8"?>
<ns2:Groups xmlns:ns2="http://Blah/Groups" xmlns:ns1="http://Blah/Group">
    <ns1:Group numberOfRows="5">
        <Row>
            <ID>0001</ID>
            <Text>Test row, id 0001, category 10</Text>
            <Category>10</Category>
        </Row>
        <Row>
            <ID>0002</ID>
            <Text>Test row, id 0002, category 10</Text>
            <Category>10</Category>
        </Row>
        <Row>
            <ID>0003</ID>
            <Text>Test row, id 0003, category 10</Text>
            <Category>10</Category>
        </Row>
        <Row>
            <ID>0008</ID>
            <Text>Test row, id 0008, category 10</Text>
            <Category>10</Category>
        </Row>
        <Row>
            <ID>0009</ID>
            <Text>Test row, id 0009, category 10</Text>
            <Category>10</Category>
        </Row>
    </ns1:Group>
    <ns1:Group numberOfRows="4">
        <Row>
            <ID>0004</ID>
            <Text>Test row, id 0004, category 20</Text>
            <Category>20</Category>
        </Row>
...
[message cut for readability]

Finally! This we can debatch!

Debatching

Debatch the Groups message above is also rather straight forward and I won’t spend much time on in this post. The best way to learn more about it is to have a look ate the EnvelopeProcessing sample in the BizTalk SDK.

And the end result of the debatching are single messages within a unique category, just as the receiving system expects! Problem solved.

Issue #1 - slow transformations

The first time I’ve put a solution like this in test and started testing with some real sized messages (> 1 MB) I really panicked, the mapping took forever. And I really mean forever, I sat there waiting for 2-3 hours (!) for a single file getting transformed. When I had tested the same XML based file in Visual Studio the transformation took about 10 seconds so I knew that wasn’t it. With some digging here I found the TransformThreshold parameter.

TransformThreshold decides how big a message can be in memory before BizTalk start buffering it to disk. The default value is 1 MB and one really has to be careful when changing this. Make sure you thought hard about your solution and situation before changing the value - how much traffic do you receive and how much of that can you afford reading in to memory?

In my case I received a couple of big files spread out over a night so setting parameter with a large amount wasn’t really a problem and that really solved the problem. The mapping finished in below 10 minutes as I now allow a much bigger message to be read into memory and executed in memory before switching over to the large message transformation engine and start buffering to disk (which is always much slower).

Problem #2 - forced to temp storage

Looking at the model of the data flow again you probably see that I’m using the XML Disassembler to split the grouped files (step 5 to step 6).

GroupingFlow3

The only way I’ve found this to work is actually to write the Grouped XML message to file and the to read that file in to BizTalk again and in that receive pipeline debatch the message. Not the most elegant solution, but there really isn’t a another out-of-the-box way of debatching messages (the XML Assembler can’t do it) and I don’t want to use an orchestration to execute the a pipeline as I want to keep the solution pure messaging for simplicity and performance reasons.

Finishing up

Have you solved similar cases differently? I’d be very interested in your experience! I also have a sample solution of this - just send me an email and make sure you’ll get it.

Running MSBuild scripts from Visual Studio

Wednesday, July 9th, 2008

MSBuild2It seems like there more build script one writes, the more often one wants to run them and it’s always a bit annoying (and time consuming) having to leave Visual Studio and start MSBuild from the command line. Brennan Stehling has a cool solution to that problem here were he sets up MSBuild as an external tool and runs it.

One problem for us was the we had our solution files in one place on the file system and our build files in a totally different place. The solution was to add the build file for the current solution as a Solution Folder (as shown in the figure below) and then set MSBuild to use $(ItemDir) as its Initial Directory. That will kick of MSBuild from the directory that the current selected Solution Folder points to and in our case that’s were the XXX.Build.Article.proj file exists.

MSBuild

Looking forward to stackoverflow.com

Wednesday, July 9th, 2008

stackoverflow-logo-250I’m really looking forward to the release of stackoverflow.com - have you heard about it? If not it’s a collaboration between Joel Spolsky and Jeff Atwood. Joel and Jeff are of course the two blogger behind the popular Joel on Software and Coding Horror blogs.

Joel is also one of the founders of the FogCreek Software company. FogCreek has a couple of different products but the most known is probably FogBugz which is a really cool project management/bug tracking/wiki/support tool for software development and management. At first it might not look like a very impressing tool but have a look at this presentation by Joel and I think you’ll start thinking different (I know we did and we’re now using it for both planing and managing big integrations projects).

Anyway, if you missed the story behind stackoverflow have a look here and don’t miss the podcasts that Joel and Jeff record, some of them are really cool. I know I’ve learnt a lot by just listening in to their conversations.

We not have quite a few online BizTalk communities such as the Microsoft forums, Google Groups, BizTalk Gurus, facebook (search for BizTalk related groups) and at LinkedIn (a couple of open BizTalk related groups) - is there room for another one at stackoverflow? Do we need one?