A Method for Handling Process Skew with Unknown Defaults

Back in the post Distributed Joins, Process Skew, I mentioned that a common cause of process skew is a default value, such as -1 for an “Unknown” customer, which creates skew when a redistribution happens on that customer key and a significant number of records carry the unknown, or -1, value. I also mentioned that one solution is to spread the unknowns across a range of negative values, and that I was working on a solution that would sort of virtualize this spread without having all the extra unknown negative values in your dimension. Well, this is that solution in detail.

First, I want my solution to support several default values, so mine will support -1 through -9, or 9 defaults. I did not want to physically have to use anything other than the single default value in my fact, and I didn’t want anything other than the single default row in my dimension.

So first I just wanted a dummy table with 10000 rows in it to generate data.

CREATE TEMP TABLE TEMP_DUMMY AS
SELECT PK_SOME_TABLE
FROM SOME_BIG_TABLE
LIMIT 10000;

Then I want to create a table I’m calling a multiplexer table.

CREATE TEMP TABLE DEFAULT_MULTIPLEXER
(BASE_NUMBER BIGINT,
MULTIPLEXED_NUMBER BIGINT)
DISTRIBUTE ON RANDOM;

Then I generate a range of negative values for each default. As an example, I’ll assume we have a -1 unknown. Others can be added in the same way.

INSERT INTO DEFAULT_MULTIPLEXER
(BASE_NUMBER,
MULTIPLEXED_NUMBER)
SELECT
-1 AS BASE_NUMBER,
0 - ((ROW_NUMBER() OVER (ORDER BY PK_SOME_TABLE) - 1) * 10) - 1 AS MULTIPLEXED_NUMBER
FROM
TEMP_DUMMY;

I’ll add a second set for -2 just so it’s clear how that would work.

INSERT INTO DEFAULT_MULTIPLEXER
(BASE_NUMBER,
MULTIPLEXED_NUMBER)
SELECT
-2 AS BASE_NUMBER,
0 - ((ROW_NUMBER() OVER (ORDER BY PK_SOME_TABLE) - 1) * 10) - 2 AS MULTIPLEXED_NUMBER
FROM
TEMP_DUMMY;

So it’s clear, all the values in the -1 set end in 1 and all the values in the -2 set end in 2. The -1 set ranges from -1 to -99991, 10000 values in all, and the -2 set ranges from -2 to -99992 in the same way.
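A quick sanity check on those ranges might look like the query below. It is only a sketch: it assumes both inserts above have run and that TEMP_DUMMY really holds 10000 rows, and the column aliases are made up for illustration.

```sql
-- Sanity check: one row per default, showing how many multiplexed
-- values it has and the range those values cover.
SELECT BASE_NUMBER,
       COUNT(*)                           AS NUM_ROWS,
       COUNT(DISTINCT MULTIPLEXED_NUMBER) AS DISTINCT_VALUES,
       MAX(MULTIPLEXED_NUMBER)            AS HIGHEST_VALUE,
       MIN(MULTIPLEXED_NUMBER)            AS LOWEST_VALUE
FROM DEFAULT_MULTIPLEXER
GROUP BY BASE_NUMBER
ORDER BY BASE_NUMBER DESC;
```

Under those assumptions the -1 row should report 10000 rows, 10000 distinct values, and a range of -1 down to -99991, and the -2 row should report -2 down to -99992.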

Below is the example.

The first part is the MCUST subselect, which in practice should probably be hidden in a view. You outer join to the multiplexer table, use the 10000 values from the multiplexer table when you match on the single base -1 value, and use the original key for all the other keys. This is done with a simple NVL:

NVL(DEFAULT_MULTIPLEXER.MULTIPLEXED_NUMBER,CUST_KEY)

This becomes the new cust_key with the -1 values now scattered from -1 to -99991 in steps of 10.
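To see the fan-out on the dimension side in isolation, something like the following could be run. This is a sketch, not part of the original solution: it assumes the dimension's key column is named CUST_KEY and that the dimension holds exactly one -1 row; the MUX and FANNED_OUT aliases are hypothetical.

```sql
-- Assumption: CUSTOMER_DIM is keyed on CUST_KEY and has a single -1 row.
-- The outer join turns that one row into one row per multiplexed value.
SELECT COUNT(*)          AS NUM_ROWS,
       MAX(NEW_CUST_KEY) AS HIGHEST_KEY,
       MIN(NEW_CUST_KEY) AS LOWEST_KEY
FROM (
    SELECT NVL(MUX.MULTIPLEXED_NUMBER, CUST.CUST_KEY) AS NEW_CUST_KEY
    FROM CUSTOMER_DIM CUST
    LEFT OUTER JOIN DEFAULT_MULTIPLEXER MUX
        ON CUST.CUST_KEY = MUX.BASE_NUMBER
    WHERE CUST.CUST_KEY = -1
) FANNED_OUT;
```

With the 10000-row multiplexer set, the single unknown row should come back as 10000 rows spanning -1 down to -99991, which is the range the fact-side calculation has to reproduce.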

Then on the fact you take a large, high-cardinality integer column that ranges past 10000 and is evenly distributed, and leverage it to generate the range of negative values. If you have a primary key on the table, that’s probably a good one to use. If there is a chance that the leveraged column has negatives, you need to wrap it with an ABS(). To break it down, a MOD with a divisor of 10000 will result in a range of 0-9999. It is then multiplied by 10 to give a range from 0 to 99990 in steps of 10, and subtracted from 0 to make it negative. Then it subtracts 1 for the -1 defaults, and it would need to subtract 2 for the -2 defaults. Use a DECODE (or CASE if you like) to generate the range for -1, or -2, or up to your 9 defaults. This gives the same range of negatives as the dimension.

DECODE(SALE.CUST_KEY,-1,(0-MOD(ABS(SALE.PK_SALE_KEY),10000)*10) -1, SALE.CUST_KEY)

This calculation on the fact table becomes the new customer key. This too could be hidden in a view to make it easier to use with reporting tools. I’ve tested it, and it does distribute on this DECODE/MOD calculated value. I did a test on a fact with 10% of the rows on a -1 unknown. On a TwinFin 24 with 192 dataslices, this solution gave an 18.5x performance gain.
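One way to check how evenly the calculated key spreads is sketched below. It is an approximation rather than the test described above: it materializes only the calculated key into a scratch table distributed on that key, then counts rows per data slice. TEMP_SKEW_CHECK is a hypothetical name, and the CASE form with a -2 branch is just the DECODE above written out for two defaults.

```sql
-- Hypothetical scratch table: keep only the calculated key and distribute
-- on it, so each row lands on the slice that key hashes to.
CREATE TEMP TABLE TEMP_SKEW_CHECK AS
SELECT CASE SALE.CUST_KEY
           WHEN -1 THEN (0 - MOD(ABS(SALE.PK_SALE_KEY), 10000) * 10) - 1
           WHEN -2 THEN (0 - MOD(ABS(SALE.PK_SALE_KEY), 10000) * 10) - 2
           ELSE SALE.CUST_KEY
       END AS NEW_CUST_KEY
FROM SALE_FACT SALE
DISTRIBUTE ON (NEW_CUST_KEY);

-- Rows per data slice; DATASLICEID is the hidden column Netezza
-- exposes on every table.
SELECT DATASLICEID,
       COUNT(*) AS NUM_ROWS
FROM TEMP_SKEW_CHECK
GROUP BY DATASLICEID
ORDER BY NUM_ROWS DESC;
```

If the top few slices hold roughly the same number of rows as the rest, the unknown rows are no longer piling onto a single slice. Swapping the CASE for plain SALE.CUST_KEY in the same script gives the skewed baseline to compare against.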

SELECT MCUST.FIRST_NAME,

MCUST.LAST_NAME,

SUM(SALE.SALE_PRICE)

FROM SALE_FACT SALE

JOIN

(

SELECT

NVL(DEFAULT_MULTIPLEXER.MULTIPLEXED_NUMBER,CUST_KEY) NEW_CUST_KEY,

CUST.*

FROM CUSTOMER_DIM CUST

LEFT OUTER JOIN DEFAULT_MULTIPLEXER

ON CUST.CUST_SKEY= DEFAULT_MULTIPLEXER.BASE_NUMBER