As a BI practitioner,
I want to understand when and how to explore the source data,
So that I can be confident the user stories can actually be built, and so I have an input into estimating the effort required
Often the majority of the AgileBI team are technically oriented, and they love playing with data. There is a tendency for them to want to deep dive into the source data to understand it before they do anything else. I think it is partly the idea that forewarned is forearmed, and partly that they like to be seen as the data experts.
The challenge with this approach is that a lot of effort is expended for minimal value. They are looking at the data with no determinable outcome other than being able to say they have looked at the data (measurable acceptance criteria, anyone?).
In the best case they might produce a data dictionary or some data profiling statistics. But in my experience these never get used at a level that justifies the investment made in producing them. Worse, they will document or profile all the data available in the source application, not just the data required to deliver the next couple of sprints. Hardly an Agile approach.
However, we do need to understand the data at some stage in order to load it into the data warehouse and include it in the content we deliver. The question is what we should do, and when.
There are three things we need to understand about the data, at the right time, to protect the team's velocity:
1) Does the data exist?
2) Is the data fit for purpose?
3) If we apply the required business rules to the data, will the results be valid?
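Under some assumptions about the shape of a source extract, these three checks can be sketched as a single lightweight profiling pass. Everything here is invented for illustration: the column names, the sample rows, and the business rule are not from any real source system.

```python
# Sketch: checking the three questions against a small source extract.
# All column names, sample rows, and the business rule are hypothetical.

sample_rows = [
    {"customer_id": "C001", "signup_date": "2015-03-02", "region": "North"},
    {"customer_id": "C002", "signup_date": "", "region": "South"},
    {"customer_id": "C003", "signup_date": "2016-07-19", "region": None},
]

# Columns the user stories say we need.
required_columns = {"customer_id", "signup_date", "region", "churn_flag"}

# 1) Does the data exist? Compare required columns with what the source provides.
source_columns = set(sample_rows[0])
missing = required_columns - source_columns

# 2) Is the data fit for purpose? Profile the blank/null rate per column.
def blank_rate(rows, column):
    blanks = sum(1 for row in rows if not row.get(column))
    return blanks / len(rows)

fitness = {column: blank_rate(sample_rows, column) for column in source_columns}

# 3) Would the business rule give valid results? Count how many rows
# the rule can actually be evaluated on.
def rule_applies(row):
    # Hypothetical rule: customer tenure is only meaningful with a signup date.
    return bool(row["signup_date"])

evaluable = sum(rule_applies(row) for row in sample_rows)

print("missing columns:", sorted(missing))            # -> ['churn_flag']
print("blank rates:", fitness)
print("rows the rule can be evaluated on:", evaluable)  # -> 2
```

The point of the sketch is scope: it profiles only the columns the next sprint needs, not the whole source application.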
= Does the data exist?
I know it sounds ridiculous, but often we are given data requirements that include data that simply does not exist in any source application.
There may be a need to capture this data in some form of new application, load it from manually maintained spreadsheets on users' PCs, or, even worse, derive it using complex, multi-part business logic.
None of these is insurmountable, but they all increase the effort and complexity required to deliver a sprint. Knowing this upfront means the team can score the sprint's points more accurately and has a much better chance of delivering the sprint successfully.
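To illustrate the third option, deriving an attribute that exists in no source system usually means chaining several business rules, each of which has to be agreed, built, and tested separately. The attribute (`customer_status`) and every rule below are invented purely to show where the extra effort comes from.

```python
# Sketch: deriving a field that does not exist in any source application.
# The attribute name and all of the rules are hypothetical, invented to
# show why a multi-part derivation adds effort and complexity to a sprint.

def derive_customer_status(orders_last_year, complaints_open, account_closed):
    """Each branch is a separate business rule to agree, build, and test."""
    if account_closed:
        return "lapsed"
    if complaints_open > 0:
        return "at-risk"
    if orders_last_year >= 12:
        return "loyal"
    if orders_last_year > 0:
        return "active"
    return "dormant"

print(derive_customer_status(14, 0, False))  # -> loyal
print(derive_customer_status(3, 2, False))   # -> at-risk
print(derive_customer_status(0, 0, False))   # -> dormant
```

Five branches means at least five conversations with the business, five test cases, and five places for the rules to drift over time, which is exactly why discovering this upfront changes the sprint estimate.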
[[profile before BEAM session to get understanding of what exists]]
[[profile after BEAM session based on templates]]
[[land into raw vault and profile there]]
[[put BI tool on raw vault and explore]]
[[steel thread all the way through to bi tool and then discovery]]
[[automated tools, used to be expensive and complex, now easy and often open source]]
[[profiling is different to discovery]]