Back in the post Distributed Joins, Process Skew I mentioned a common cause of process skew were default values such as -1 for an “Unknown” customer creating skew when a redistribution happened on that customer key and there was a significant number of unknowns or -1 records. I also mentioned one solution was to spread the unknowns across a range of negative values. And I also mentioned I was working on a solution that would sort of virtualize this spread without having all the extra unknown negative values in you dimension. Well this is that solution in detail.
First I want my solution to support several default values , so mine with support a -1 though -9, or 9 defaults. I did not want to physically have to use anything other than the single default value in my fact. And I didn’t want anything other than the single default row in my dimension .
So first I just wanted a dummy table with 10000 rows in it to generate data.
CREATE TEMP TABLE TEMP_DUMMY AS
SELECT PK_SOME_TABLE
FROM SOME_BIG_TABLE
LIMIT 10000;
Then I want to create a table I’m calling a multiplexer table
CREATE temp TABLE DEFAULT_MULTIPLEXER
(BASE_NUMBER BIGINT,
MULTIPLEXED_NUMBER BIGINT)
DISTRIBUTE ON RANDOM;
Then I generate of range of negative values for each default. As example I’ll assume we have a -1 unknown. Others can be added in the same way
INSERT INTO DEFAULT_MULTIPLEXER
(BASE_NUMBER,
MULTIPLEXED_NUMBER)
SELECT
-1 AS BASE_NUMBER,
0- ((ROW_NUMBER() OVER (ORDER BY PK_SOME_TABLE ) -1) * 10) -1 AS MULTIPLEXED_NUMBER
FROM
TEMP_DUMMY;
I’ll add a second set for -2 just so it’s clear how that would work
INSERT INTO DEFAULT_MULTIPLEXER
(BASE_NUMBER,
MULTIPLEXED_NUMBER)
SELECT
-2 AS BASE_NUMBER,
0- ((ROW_NUMBER() OVER (ORDER BY PK_SOME_TABLE)-1) * 10) -2 AS MULTIPLEXED_NUMBER
FROM
TEMP_DUMMY;
So it’s clear, all the values in the -1 set end in 1 and all the values in the -2 set end in 2. The -1 set ranges from -1 to -99991 , 10000 values.
Below is the example.
First part is the MCUST sub select, which in practice should probably be hidden in a view. You outer join to the multiplexer table , and use the 10000 values from the multiplex table when you match on the single base -1 value, and use the original key for all the other keys. This is done with a simple nvl
(DEFAULT_MULTIPLEXER_EDW.MULTIPLEXED_NUMBER,CUST_KEY)
This becomes the new cust_key with the -1 values now scattered from -1 to -99991 in steps of 10.
Then on the fact you take a large, high cardinality integer column ranging past 10000 and evenly distributed, and leverage it to generate the range of negative values. If you have a primary key on the table that’s probably a good one to use. If their is a chance that leveraged column has negatives, you need to wrap it with an abs() . To break it down, a MOD with a divisor of 10000, with result in a range of 0-9999 . It is then multiplied by 10 to give a range from 0 to 99990 in steps of 10, and then it subtracts 1 for the -1 defaults, and it would need to subtract -2 for the -2 defaults. Use a decode (or case if you like) to generate the range for -1, or -2, or up to your 9 defaults. This gives the same range of negatives as the dimension.
DECODE(SALE.CUST_KEY,-1,(0-MOD(ABS(SALE.PK_SALE_KEY),10000)*10) -1, SALE.CUST_KEY)
This calculation on the FACT table becomes the new customer key. This too could be hidden in a view to make it easier to use with reporting tools. I’ve tested it and it distributed on this decode/mod calculated value. I did a test on a fact with 10% of the rows on a -1 unknown. On a Twinfin 24 with 192 dataslices this solution gave an 18.5x performance gain.
SELECT MCUST.FIRST_NAME,
M.CUST.LAST_NAME,
SUM(SALE.SALE_PRICE)
FROM SALE_FACT SALE
JOIN
(
SELECT
NVL(DEFAULT_MULTIPLEXER.MULTIPLEXED_NUMBER,CUST_KEY) NEW_CUST_KEY,
CUST.*
FROM CUSTOMER_DIM
LEFT OUTER JOIN DEFAULT_MULTIPLEXER
ON CUST.CUST_SKEY= DEFAULT_MULTIPLEXER.BASE_NUMBER
) MCUST
ON ( DECODE(SALE.CUST_KEY,-1,(0-MOD(ABS(SALE.PK_SALE_KEY),10000)*10) -1, SALE.CUST_KEY)
= MCUST.NEW_CUST_KEY)