5 Replies Latest reply: May 26, 2016 12:42 PM by Theo Klemming RSS

    How to count(distinct) in a table ?

    mtondeur

      Hello,

       

      I try to calculate a count(distinct) in Datawatch, using a hierarchical breakdown in a table.

       

      By example we have several client in different Regions.
      Some of the Clients belongs to different Regions.

       

      There are 4 unique Clients in R1 : C1, C2, C3, C4

      There are 3 unique Clients in R2 : C1, C5, C6

      There are 5 unique Clients in R3 : C1, C2, C7, C8, C9

      We have in total 9 unique Clients.
      We can see for example C1 and C2 appear in different Regions.

      If we show in Datawatch a hierarchy (drill down feature) in a table Global --> Region, we will see :

      Global | Region | # of clients
      Global | R1        | 4
      Global | R2        | 3
      Global | L3         | 5

      Which is correct.


      But if we reduce in the breakdown of the table to the Global level in Datawatch we will see the sum of counting :

      Global | # of clients
      Global | 12

      Which is wrong because it is the sum of count(unique) instead a count(unique).
      The expected number is 9.


      Counting unique is not compatible with the drill up/down features.
      There is no count(unique) aggregation option in Datawatch, only Sum, Count, Average, etc...)

       

      Do someone have any idea to do a dynamic Count(Distinct) in Datawatch Designer ?

       

       

      Regards,

       

      Mike

        • Re: How to count(distinct) in a table ?
          Theo Klemming

          Hi Michael

           

          What you are describing is a challenge that R&D has been wrestling for a long time. Let me try to describe what the problem is from the perspective of the internal workings of the product:

           

          All data sets in Datawatch Panopticon are organized into hierarchic data trees in memory as soon as the data set is applied to a visualization.

           

          The data tree is capable of getting re-created and changed often - that happens each time someone changes a breakdown or drills or changes the visible detail level on a breakdown. For each data set and each data tree, there is no meta data at all; this means, there is no information stored about how many items you have of a certain value, what distribution you have in a numeric column and so on. The reason for this is that everything in the product is engineered to cope with continuously updating, streaming data, where aggregate values are calculated on-the-fly for each level of the data tree.

           

          And as a consequence of that, there is no record or "awareness" about how many times a particular value occurs in a data tree, for example the value "Customer 1". That is why a Count of values must be based on a numeric column, instead of just counting values in a text column. The data model knows about the number of rows in the original data set, and the number of nodes in the data tree. Each row in the data set is associated with a value for each numeric column on that row, and each node in the data tree is associated with the aggregate values of numeric columns of all rows that are part of that node and its child nodes.

           

          So, when C1 is a child node of both R1, R2, and R3, it will inevitably contribute with a count of 1 in all three places, since it causes a data tree node to be created in all three places. And putting Client before Region in the breakdown (in the data tree) makes no difference.

           

          The closest thing that we have been able to create, is an aggregation method called Combinations. It works like Count, with this difference:

          Count will count the number of rows in a numeric column of a data set

          Combinations will count the number of nodes at the most granular level of the tree, i.e. the number of leaf nodes in the tree.

           

          This means that if you use a data set like this:

           

          Global,Region,Client,Count,Value

          Global,R1,C1,1,2

          Global,R1,C2,1,2

          Global,R1,C3,1,2

          Global,R1,C4,1,2

          Global,R2,C1,1,2

          Global,R2,C5,1,2

          Global,R2,C6,1,2

          Global,R3,C1,1,2

          Global,R3,C2,1,2

          Global,R3,C7,1,2

           

          and put ONLY Client on the breakdown, you will see that:

          Aggregation method COUNT applied to a numeric column gives a total of 12 (the number of rows)

          Aggregation method COMBINATIONS applied to a numeric column gives a total of 9 (unique values in Client column)

           

          However, as soon as you put Region into the breakdown as well, you will be splitting up the previous single tree node for C1 and C2 into additional nodes (3 for C1 and 2 for C2), which means the number of leaf nodes grow from 9 to 12, which is equal to the number of rows in the data set, so COUNT = COMBINATIONS at that point.

           

          So, the short answer is No, you can't do Count Distinct on a text column in Datawatch Designer, but I hope my response makes it understandable why that is so.

           

          Best regards, Theo Klemming

            • Re: How to count(distinct) in a table ?
              mtondeur

              Thanks Theo.

               

              Problem is that this solution can only work if we have only one counting column in the table.

              I would have to precise I need to have several count(unique) columns in the same table, for example :

              - # of clients

              - # of new clients

              - # of closed clients

               

              It seems "Combination" will give always the same count in all columns like this.

               

               

              Regards,

               

              Mike

                • Re: How to count(distinct) in a table ?
                  Theo Klemming

                  Hi

                   

                  What does the data look like? For example, are there separate columns for NewClient, ClosedClient with logic/boolean values (T/F or 1/0)?

                   

                  // Theo

                    • Re: How to count(distinct) in a table ?
                      mtondeur

                      Exactly Theo the data seem like this :

                       

                      Global,Region,Client,New,Closed

                      Global,R1,C1,Yes,No

                      Global,R1,C2,No,Yes

                      Global,R1,C3,No,No

                      Global,R1,C4,No,No

                      Global,R2,C1,Yes,No

                      Global,R2,C5,Yes,No

                      Global,R2,C6,No,No

                      Global,R3,C1,Yes,No

                      Global,R3,C2,No,Yes

                      Global,R3,C7,Yes,No

                       

                       

                      But can be written like this too with Boolean :

                       

                      Global,Region,Client,New,Closed

                      Global,R1,C1,1,0

                      Global,R1,C2,0,1

                      Global,R1,C3,0,0

                      Global,R1,C4,0,0

                      Global,R2,C1,1,0

                      Global,R2,C5,1,0

                      Global,R2,C6,0,0

                      Global,R3,C1,1,0

                      Global,R3,C2,0,1

                      Global,R3,C7,1,0

                        • Re: How to count(distinct) in a table ?
                          Theo Klemming

                          With that data, and a breakdown containing only the Client column, you would get a correct values for Closed and New if you aggregate them as MEAN. You would use one or the other and aggregate as COMBINATIONS and name it "Count".

                           

                          HOWEVER: This will only look right for each row in a visual table; the Grand Total summary will be wrong for Closed and New since it will show the over all mean. What you would want to do here is aggregate as a mean for each value of Client, but then aggregate all the mean values as a sum across all clients. That would mean using different aggregation formulas on different levels in the same tree, and you can't. Further, calculated columns are no help here, since they can only access values on one and the same row.

                           

                          I suppose what all this boils down to is that it is difficult to get correct counts for unique entities in the data if each unique entity exists more than once in the data set. If the product had supported a count distinct on text columns while regarding an entire multi-level data tree, then the count distinct part would be covered for any kind of breakdown, but at the moment it doesn't.

                           

                          It is also difficult to device a work-around for Count Closed and Count New, due to the requirement to apply the same aggregation method from top to bottom of a data tree. I actually can't see how the Count Closed and Count New would ever be done, as long as each Client exists on more than 1 row in the data set, without using mean aggregation up to one point in the breakdown data tree, and then sum for everything beyond that point down to the root. But that is not available functionality.

                           

                          The last thing I can recommend that you take a look at is to use External Aggregates, which means that you provide the aggregate values for each measure, for each level in the breakdown, on separate rows in the data set.  Please read more on this page:

                          External Aggregates

                           

                          Best regards, Theo