How I defeated one of IntelliMagic Vision's largest bottlenecks: custom data tagging
IBM mainframes report on their state and the processes they're running very frequently, outputting up to several billion small blobs of introspective data per hour. This is useful to know, because these machines are pricey: you pay for the seconds you run them, so you want them to perform in optimal condition. This is where IntelliMagic Vision comes into play; it's one of the biggest performance analysis tools available for these machines. As far as I'm told, it's the best-in-class tool for the job (though the folks claiming this, my colleagues, might be a bit biased).
One of its features is "Business Applications", a system where the customer can tag certain fields in these binary blobs of data.
The user could write a statement in the following format: APPLICATION 'APP', JOBNAME='EXAMPLE'; and whenever the system found a piece of data whose JOBNAME matched "EXAMPLE" exactly, it would assign the name "APP" to the application field in the database. This was written in a script of sorts, and you could write as many applications here as you wanted, all in order, one per line. The system would go down this script line by line, and if a line matched, it would assign that name and not consider any of the lines below it. It did this for every single blob of data that had a field we supported matching on. Can you see where this is going?
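To make the cost model concrete, here's a deliberately simplified sketch of that top-to-bottom matching as I've described it; the names ApplicationRule and TagRecord are mine, and the record is reduced to a string map rather than a binary blob:

#include <map>
#include <optional>
#include <string>
#include <vector>

// One line of the Business Application script: APPLICATION 'APP', JOBNAME='EXAMPLE';
struct ApplicationRule
{
    std::string applicationName; // 'APP'
    std::string field;           // e.g. "JOBNAME"
    std::string expectedValue;   // e.g. "EXAMPLE"
};

// Walk the script top to bottom and return the first rule that matches the record.
std::optional<std::string> TagRecord(const std::vector<ApplicationRule>& script,
                                     const std::map<std::string, std::string>& record)
{
    for (const ApplicationRule& rule : script)  // one comparison per script line...
    {
        auto it = record.find(rule.field);      // ...for the field it references
        if (it != record.end() && it->second == rule.expectedValue)
            return rule.applicationName;        // first match wins, the rest is skipped
    }
    return std::nullopt;                        // the record stays untagged
}

Every record pays for every script line above the first one that matches it, and a record that matches nothing pays for all of them.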
One day we got a support ticket: a customer had a problem with the data processing setup. It would run forever when they used their Business Application script, sometimes crashing after several hours. They had 65,000 lines of applications in this script! Compare that to the default setup, where we had just under 100 lines, and that's incredible. Occasionally, that meant every single piece of data was matched against every single one of those 65,000 script lines. You can see why that would slow the system down to a crawl, especially on data that wouldn't match any line at all.
Inspecting their file, there was an interesting observation to be made. I won't reveal what it actually contained, but it was somewhat like the following example:
APPLICATION 'DE SERV 1', QWHCEUWN="DE_FOO_B???";
...
APPLICATION 'DE SERV 678', QWHCEUWN="DE_BAR_C???";
APPLICATION 'AT SERV 1', QWHCEUWN="AT_BAR_C???";
The triple dot here redacts the contents, but there were nearly 700 lines that all checked the field QWHCEUWN (end-user workstation) for DE_{IDENT1}_{IDENT2}. The way our system worked, it would go over every single field, even ones that weren't present in the script, like the JOBNAME field here, determine they were empty, and continue evaluating. If memory serves, we had 17 of these fields. Not all data would even have them, which made it extra wasteful. When it then started checking QWHCEUWN, it would go over every character or token until the evaluation failed. If it failed, it would move to the next line and repeat the whole process, 65,000 times, for each piece of applicable data. You can see how this gets problematic.
All this for only two lines of the 65,000-line file.
The team was somewhat stuck trying to figure out what to do with it before I got involved. An optimisation had been introduced that would only look at the single field involved, but realistically it was still far too slow. We process data per interval (an hour, for instance), and that data needs to be processed within that interval: if we work with 1 hour of data, it needs to be done within the next hour. If processing takes 1 hour and 10 minutes, we fall behind 10 minutes for every hour we process, which means after 24 hours we've fallen behind 240 minutes. The next day, your data is 4 hours behind, or we ditch 4 hours of data.
So after that fix was accepted and ready to hand to the customer, I went on to find a more appropriate solution to the problem. I've alluded to my discovery over the last two paragraphs and the image of my notes: even in the optimised version, the first character of QWHCEUWN would be compared to "D" over 700 times. A human would skip past all the entries starting with 'D' in one go, but unfortunately, that wasn't the case with our code. I decided to reroute the application checking code into a new setup that would check each character only once. I did this with a trie.
A trie representing some of the examples from the file above.
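As a minimal sketch of the idea (not the production code, which comes below), a plain character trie with an application index on terminal nodes could look like this:

#include <cstdint>
#include <map>
#include <memory>
#include <string_view>

// A plain character trie: each distinct leading character is stored exactly once,
// no matter how many script lines share it.
struct CharTrieNode
{
    std::map<char, std::unique_ptr<CharTrieNode>> children;
    int64_t applicationIndex = -1; // script line index if a value ends here, -1 otherwise
};

void Insert(CharTrieNode& root, std::string_view value, int64_t applicationIndex)
{
    CharTrieNode* node = &root;
    for (char c : value)
    {
        auto& child = node->children[c];           // shared prefixes reuse existing nodes
        if (!child)
            child = std::make_unique<CharTrieNode>();
        node = child.get();
    }
    if (node->applicationIndex < 0)
        node->applicationIndex = applicationIndex; // keep the earliest script line
}

int64_t Lookup(const CharTrieNode& root, std::string_view value)
{
    const CharTrieNode* node = &root;
    for (char c : value)
    {
        auto it = node->children.find(c);
        if (it == node->children.end())
            return -1;                             // no script line matches this value
        node = it->second.get();
    }
    return node->applicationIndex;
}

Insert every value from the script in order, and a single Lookup then touches each character of the field exactly once, no matter how many script lines share that prefix; keeping the earliest index preserves the first-match-wins rule.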
This solved our first issue: "D" would now only be checked once. It was already incredibly fast, but it wasn't as powerful as the original system. The next issue was the special characters you could insert into your tags: ? matches a single digit (0-9), % matches a single alphabetical character, and * matches anything, of any length.
That doesn't work with a trie that only checks single characters. For ? I would have to create 10 branches, one per digit, and for % one per letter of the alphabet! After which, I'd have to copy every sub-trie below into every single one of those branches. That would explode the size of the trie and be unmaintainable. And it was impossible anyway, since we have *, which matches any character, for any length. How would I even approach that problem?
So instead of matching a field to a trie, I decided to integrate whole operations into the trie itself. Rather than a std::map<DataFieldType, Trie>, I would have the TrieNode as the base, and it would contain 'InstructionNodes'.
#include <cassert>
#include <cstdint>
#include <set>
#include <string_view>

enum class PatternType
{
    QWHCEUWN,
    Jobname,
    // ... the other supported fields
};

enum class InstructionType
{
    MatchExact,          // 'a'
    MatchAnyOneDigit,    // ?
    MatchAnyOneAlphabet, // %
    FindNext,            // *
};

class InstructionNode
{
public:
    InstructionNode(InstructionType type, PatternType patternType, uint64_t applicationIndex = ~0ull, char charToMatch = 0);

    bool operator==(const InstructionNode& instruction) const;
    bool operator!=(const InstructionNode& instruction) const;
    bool operator<(const InstructionNode& instruction) const;

    uint64_t Match(std::string_view input) const;

private:
    std::set<InstructionNode> m_subInstructions;
    PatternType m_patternType;
    InstructionType m_type;
    char m_argument;
    uint64_t m_applicationIndex;

    bool MatchChar(char c) const;
};
So, each single one of these nodes covers a pattern type (QWHCEUWN, for instance), an application index if the current node is the end of any trie result, and optionally a character to match against. At the start of data processing, after the Application script has been read, we start building the trie. Now, whenever we evaluate a branch, instead of checking a character directly, we run the Match function: a recursive function that evaluates the InstructionType's matching code and, if it matches, continues down the sub-instructions.
bool InstructionNode::MatchChar(char c) const
{
    switch (m_type)
    {
    case InstructionType::MatchAnyOneDigit:
        return c >= '0' && c <= '9';
    case InstructionType::MatchAnyOneAlphabet:
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    case InstructionType::MatchExact:
        return m_argument == c;
    case InstructionType::FindNext:
        return true; // We'll get into this in a bit!
    default:
        assert(false && "Implementation of instruction is missing!");
        return false;
    }
}
uint64_t InstructionNode::Match(std::string_view input) const
{
    char c = input.empty() ? 0 : input[0];
    if (!MatchChar(c))
        return ~0ull; // No result
    uint64_t result = m_applicationIndex;
    std::string_view nextInput = input.empty() ? input : input.substr(1); // Don't step past the end of the input
    for (auto& instruction : m_subInstructions)
    {
        uint64_t currentResult = instruction.Match(nextInput); // Evaluate instruction on the next character
        if (result > currentResult)
            result = currentResult;
    }
    return result;
}
Note: this is a recreation of what the functionality looks like, off the top of my head. Of course I can't recall the exact details, but it's somewhere along these lines.
So the Match function walks down the input with each instruction, and it matches exactly. This implements most of the functionality for the single-character checks! Since lengths are explicit, and your application has to match the field exactly (unless * characters are involved), I also have to explicitly check for the end of the string. It doesn't show in the image above, but the last instruction of every trie path is a MatchExact on '\0'. That's why the char c = input.empty() ? 0 : input[0]; is necessary, since we'll have run out of string once we reach the end.
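To make that concrete, here's a rough sketch of how a single pattern value could be compiled into such an instruction chain, terminal '\0' included. It reuses the InstructionType enum from above; CompilePattern and Step are hypothetical names of mine, not the actual code:

#include <string>
#include <vector>

// One step of a compiled pattern; a chain of these is what gets merged into the trie.
struct Step
{
    InstructionType type;  // uses the InstructionType enum from above
    char argument = 0;     // only meaningful for MatchExact
};

// Translate a single pattern value such as "DE_FOO_B???" or "FOO*" into steps.
std::vector<Step> CompilePattern(const std::string& pattern)
{
    std::vector<Step> steps;
    for (char c : pattern)
    {
        switch (c)
        {
        case '?': steps.push_back({InstructionType::MatchAnyOneDigit});    break; // single digit
        case '%': steps.push_back({InstructionType::MatchAnyOneAlphabet}); break; // single letter
        case '*': steps.push_back({InstructionType::FindNext});            break; // anything, any length
        default:  steps.push_back({InstructionType::MatchExact, c});       break;
        }
    }
    // Unless the pattern ends with '*', it has to match the whole field,
    // so a terminal MatchExact on '\0' closes the chain.
    if (steps.empty() || steps.back().type != InstructionType::FindNext)
        steps.push_back({InstructionType::MatchExact, '\0'});
    return steps;
}

When a chain is merged into the trie, shared leading steps reuse existing nodes, and the script line's application index ends up on the final node of the chain.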
Something of note is that we take a result and keep the lowest one. This is because the original script goes from top to bottom and returns the first result, but once we turn this into a tree, that order is no longer retained. That means you could be checking for the very last item in the list, and you have to make sure you're not skipping anything that's defined above it.
APPLICATION "A", JOBNAME="FOOBAR*";
APPLICATION "B", JOBNAME="FOO*";
Take this example. In the old system, it would first check whether the job name starts with FOOBAR and assign application index 0 if so; if only FOO matched, it would assign index 1. The trie would look something like this:
A more problematic trie.
This trie would check B first and conclude the answer is B, even though A matches as well and came before it in the script! That's why we take the lowest result. The nodes in the set are sorted by the contents of the instruction node, not by script order. This also means we have to evaluate the full trie. I made an optimisation here where nodes also store the lowest possible result they could produce, so we can skip branches that can't beat our current result. I'm skipping the details of that implementation, since they're not that interesting.
This gets almost everything done, except for the * character, which matches any character, for any length. But if you think about it, that doesn't check anything at all; if everything matches, what is there to check? The answer: whatever the next node is, we evaluate it at every position from the current character until the end of the string.
APPLICATION "A", JOBNAME="*BAR";
APPLICATION "B", JOBNAME="FOO*";
Taking these two examples: for "A", the node after the * is the character 'B', and at every position where it fails to match, we move on to the next character. If it does match somewhere, we continue as usual, checking just a single character for each following instruction. If every position fails, we fall back to the parent of the * instruction and continue from there. Application "B" is easier, since its * would only be looking for '\0', which we can guarantee will appear at the end of the input; we can thus remove those checks altogether.
uint64_t InstructionNode::Match(std::string_view input, bool checkWholeString) const
{
    uint64_t result = ~0ull;
    size_t charsToCheck = checkWholeString ? input.size() : 1;
    for (size_t i = 0; i < charsToCheck; i++)
    {
        char c = input.empty() ? 0 : input[i];
        if (!MatchChar(c))
            continue;
        if (m_applicationIndex < result)
            result = m_applicationIndex; // Keep the lowest script line found so far
        bool isFindNext = m_type == InstructionType::FindNext; // Check the whole remaining string?
        std::string_view nextInput = input.empty() ? input : input.substr(i + 1); // Don't step past the end of the input
        for (auto& instruction : m_subInstructions)
        {
            uint64_t currentResult = instruction.Match(nextInput, isFindNext); // Evaluate instruction on the next character(s)
            if (result > currentResult)
                result = currentResult;
        }
    }
    return result;
}
As you can see, it gets a little more complicated, but the original behaviour stays the same: i is zero for single-character checks, and walks every remaining character when a * instruction precedes it.
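To show the lowest-index behaviour end to end, here's a small self-contained toy version of the matcher, my own simplified re-implementation for illustration rather than the product code, built around the FOOBAR*/FOO* example from earlier with the trailing * checks already dropped:

#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

namespace demo {

struct Node
{
    char match = 0;                    // character this node matches
    bool findNext = false;             // true for a '*' node: children may match at any remaining offset
    uint64_t applicationIndex = ~0ull; // script line index if a pattern ends here, ~0ull otherwise
    std::vector<Node> children;
};

uint64_t Match(const Node& node, std::string_view input, bool checkWholeString = false)
{
    uint64_t result = ~0ull;
    size_t charsToCheck = checkWholeString ? input.size() : 1;
    for (size_t i = 0; i < charsToCheck; i++)
    {
        char c = input.empty() ? 0 : input[i];
        if (!node.findNext && c != node.match)
            continue;
        if (node.applicationIndex < result)
            result = node.applicationIndex;
        std::string_view next = input.empty() ? input : input.substr(i + 1);
        for (const Node& child : node.children)
        {
            uint64_t childResult = Match(child, next, node.findNext);
            if (childResult < result)
                result = childResult;
        }
    }
    return result;
}

} // namespace demo

int main()
{
    using demo::Node;
    // APPLICATION "A", JOBNAME="FOOBAR*";  -> script index 0
    // APPLICATION "B", JOBNAME="FOO*";     -> script index 1
    // The trailing '*' checks are dropped, so index 1 sits on the second 'O' and index 0 on the 'R'.
    Node root{'F'};
    Node* node = &root;
    for (char c : std::string_view("OO"))
    {
        node->children.push_back(Node{c});
        node = &node->children.back();
    }
    node->applicationIndex = 1; // "FOO*" -> application index 1
    for (char c : std::string_view("BAR"))
    {
        node->children.push_back(Node{c});
        node = &node->children.back();
    }
    node->applicationIndex = 0; // "FOOBAR*" -> application index 0

    assert(demo::Match(root, "FOOBARBAZ") == 0);  // both patterns match; the lowest script index wins
    assert(demo::Match(root, "FOOQUX") == 1);     // only "FOO*" matches
    assert(demo::Match(root, "BARFOO") == ~0ull); // nothing matches
    return 0;
}

Both patterns match "FOOBARBAZ", and the result is application index 0, because "FOOBAR*" came first in the script, exactly what the old top-to-bottom evaluation would have decided.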
That's a really short summary of it all. The old code could do 100 lines of application matching in just under 3 minutes on the simple test data we used. With the new code, the customer's 65,000-line script on that same test data, which didn't match at all and represented the worst case for the old code, ran in almost the same amount of time. This lifted a huge restriction from the system, so the product manager got to work and greatly increased the number of pattern types we can match on, letting customers be more creative with their tagging. Aside from that, the number of applications you can match per result is now 8 instead of 1.
I was very happy with the work I had done, and got to tinker many more optimisations into it. When the whole thing was released, I was surprised to hear that I was nominated for, and won, the Outstanding Technical Achievement Award from IBM, together with the other members who built the other parts of the Business Application setup.
I want to thank my team at IntelliMagic for allowing me to explore and research this problem, which gave me the freedom to create something great. If you have any questions whatsoever, feel free to contact me through LinkedIn or GitHub!