Big Data projects and their challenges
I have worked on big data projects since 2014, and it took me almost five years to get a feel for the challenges they pose. Most of the challenges appeared after the system went live and business users started using it heavily. In this post I am sharing some of the challenges we have seen most often over the years.
Measuring quality of data
Even before you build the unified data warehouse for your company, my suggestion is to establish a framework to measure the quality of the data in the on-premise systems before it flows into the big data systems. Many applications, both internal and external to the company, provide data in various formats, and all of it must eventually be transformed to a standard format so that its quality can be measured, from veracity through to value. Sometimes an application had bugs and logged inconsistent data in the past, even though the bug may be fixed now. We need to account for variances like these. Overall, it is very important to measure the quality of your data using the 5 V's, which extend the three V's from Gartner's definition of big data:
- Volume
- Veracity
- Velocity
- Variability
- Value
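As a rough illustration of what such a framework could measure, the sketch below computes two simple veracity indicators over a batch of records: completeness (is the field filled in?) and validity (does the value pass a rule?). The records, field names and rules are hypothetical and only stand in for whatever your source applications produce.

```python
from typing import Callable, Dict, List

# Hypothetical records from a source application; field names are illustrative.
records: List[Dict[str, str]] = [
    {"order_id": "A1", "amount": "19.99", "country": "US"},
    {"order_id": "A2", "amount": "", "country": "us"},
    {"order_id": "A3", "amount": "-5", "country": "IN"},
]


def completeness(rows: List[dict], field: str) -> float:
    """Share of rows where the field is present and not blank."""
    if not rows:
        return 0.0
    filled = 0
    for row in rows:
        value = row.get(field)
        if value is not None and str(value).strip() != "":
            filled += 1
    return filled / len(rows)


def validity(rows: List[dict], field: str, rule: Callable[[object], bool]) -> float:
    """Share of rows whose field value passes the given rule."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if rule(row.get(field))) / len(rows)


def is_positive_amount(value: object) -> bool:
    """Assumed rule: an amount must parse as a number greater than zero."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False


def is_country_code(value: object) -> bool:
    """Assumed rule: a country is a two-letter upper-case code."""
    return isinstance(value, str) and len(value) == 2 and value.isalpha() and value.isupper()


report = {
    "amount_completeness": completeness(records, "amount"),
    "amount_validity": validity(records, "amount", is_positive_amount),
    "country_validity": validity(records, "country", is_country_code),
}
print(report)
```

Tracking scores like these per source application and per load date is one way to spot the historical inconsistencies mentioned above, for example a validity score that jumps after a bug fix.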
Finding relevant data for problem at hand
Data cataloging and metadata are very important if we want more people in the enterprise to use the data. I can give an example from the world of web services and APIs. When web services were developed, they had a directory standard called UDDI (Universal Description, Discovery, and Integration). A UDDI registry listed the web services that applications could consume, and it helped developers find the relevant methods for their requirements. Similarly, each big data system offers its own data catalog features and allows developers to enter metadata and tags. This must be done diligently, as it helps users find the right data to solve the problem at hand. There should also be a way to flag data that users will need in the future but that is missing from the system.
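To make the idea concrete, here is a minimal in-memory sketch of a catalog that can be searched by keyword or tag. It is not the API of any particular big data product; the `Catalog` and `CatalogEntry` classes and the table names are hypothetical. Recording searches that return nothing is just one possible way to surface data that is missing from the system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogEntry:
    table: str
    description: str
    owner: str
    tags: Set[str] = field(default_factory=set)


class Catalog:
    """Hypothetical catalog: a list of entries plus a log of unmet searches."""

    def __init__(self) -> None:
        self.entries: List[CatalogEntry] = []
        self.unmet_requests: List[str] = []

    def register(self, entry: CatalogEntry) -> None:
        self.entries.append(entry)

    def search(self, keyword: str) -> List[CatalogEntry]:
        needle = keyword.lower()
        hits = [
            entry
            for entry in self.entries
            if needle in entry.description.lower()
            or needle in {tag.lower() for tag in entry.tags}
        ]
        if not hits:
            # Nothing matched: remember the request so data owners can review it.
            self.unmet_requests.append(keyword)
        return hits


catalog = Catalog()
catalog.register(
    CatalogEntry(
        table="sales.orders_daily",
        description="Daily orders aggregated by store",
        owner="sales-data-team",
        tags={"orders", "retail"},
    )
)

print([entry.table for entry in catalog.search("orders")])
print(catalog.search("weather"), catalog.unmet_requests)
```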
Anonymization
There are a lot of privacy regulations that companies need to comply with to protect personal, payment and other information about users. Anonymizing user data at scale is a challenge. Sometimes different applications anonymize the same data in different ways, which makes it unusable when it is merged with other data in the unified data warehouse. Companies should think about adopting standardized encryption and decryption methods and common anonymization routines. This will make the data more usable in its anonymized form.
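One possible shape for a shared routine is sketched below. It uses keyed hashing (HMAC-SHA256) from the Python standard library, which is strictly pseudonymization rather than encryption or full anonymization: the same input always produces the same token, so records from different applications can still be joined in the warehouse. The key value, the normalization step and the example e-mail address are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a stable token for a sensitive value using HMAC-SHA256."""
    # Assumed normalization: trim whitespace and lower-case before hashing,
    # so that cosmetic differences between applications do not change the token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key; in practice it would come from a secrets manager, not source code.
SHARED_KEY = b"example-key-loaded-from-a-secrets-manager"

crm_token = pseudonymize("Jane.Doe@example.com ", SHARED_KEY)
billing_token = pseudonymize("jane.doe@example.com", SHARED_KEY)

# Both applications produce the same token, so the rows remain joinable.
print(crm_token == billing_token)
```

If every application calls the same routine with the same key, the tokens line up across sources. If each application uses its own scheme, they do not, which is the merge problem described above.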
Slicing
Data science teams always ask the big data team for the best slice of data to train their algorithms. Sometimes they ask for both a training set and a testing set. Many times the big data team is not able to slice the best possible data for training, because the data is inconsistent and because team members may not know which tables and columns must be used in the extract. This creates friction between the big data and data science teams, and companies should help both teams work in harmony by implementing catalog standards, documentation, shared queries, etc.
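One convention the two teams could agree on is a documented, repeatable rule for splitting records into training and testing sets. The sketch below hashes a record key to decide the split, so the same record always lands in the same set no matter who runs the extract. The function name, the key format and the 20 percent test share are assumptions for illustration, not a recommendation for any specific model.

```python
import hashlib


def assign_split(record_key: str, test_percent: int = 20) -> str:
    """Deterministically assign a record to 'train' or 'test' from its key."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "test" if bucket < test_percent else "train"


# Hypothetical customer identifiers standing in for real record keys.
customer_ids = [f"customer-{i}" for i in range(1000)]
splits = {cid: assign_split(cid) for cid in customer_ids}

test_count = sum(1 for split in splits.values() if split == "test")
print(f"train={len(splits) - test_count} test={test_count}")
```

Because the assignment depends only on the key, rerunning the extract after new data arrives does not move existing records between the two sets.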
Traceability
It is very tough to trace the origin, flow and destination of records. Traceability can feel close to impossible in the big data world, because we can't debug the system that easily. Keep a record of traceability outside the big data system and ensure that the big data system is receiving data correctly. Be proactive about how data is inserted and updated in the big data systems.
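A simple way to check that the big data system received a batch correctly is to compare a fingerprint computed at the source with one computed after loading. The sketch below returns a row count and an order-independent checksum per batch. The record layout and the `batch_fingerprint` helper are hypothetical, and a real pipeline would also need to agree on how values are serialized before hashing.

```python
import hashlib
from typing import Dict, List, Tuple


def batch_fingerprint(rows: List[Dict[str, object]]) -> Tuple[int, str]:
    """Return (row_count, checksum); the checksum ignores row order."""
    row_digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(row_digests), combined


# Hypothetical batch as sent by the source application.
source_batch = [
    {"id": 1, "status": "shipped"},
    {"id": 2, "status": "pending"},
    {"id": 3, "status": "cancelled"},
]
# The same rows as read back from the big data system, in a different order.
loaded_batch = [source_batch[2], source_batch[0], source_batch[1]]
# A load that silently dropped one row.
partial_batch = source_batch[:2]

print(batch_fingerprint(source_batch) == batch_fingerprint(loaded_batch))
print(batch_fingerprint(source_batch) == batch_fingerprint(partial_batch))
```

Storing the source-side fingerprint with a batch identifier outside the big data system gives you something to reconcile against when a number looks wrong later.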
Losing Trust
Before users start using the data from the big data system, we should verify that the numbers in the reports are the same in the old system and the new big data system. I have seen companies that skipped this verification phase end up losing the trust of their users. If users find out that the numbers do not line up, it sets off a domino effect: users lose trust, and eventually the system becomes useless. Likewise, when the data science team cannot improve the accuracy of its models with this data, users are prompted to lose trust in the data.
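The verification phase can start with something as small as the sketch below, which compares the headline numbers of a report from the old system with the same report from the new one and lists every metric that is missing or different. The metric names, the values and the tolerance are made up for illustration.

```python
import math
from typing import Dict, Optional, Tuple


def compare_reports(
    legacy: Dict[str, float],
    new: Dict[str, float],
    rel_tol: float = 1e-6,
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return metrics that are missing on one side or differ beyond the tolerance."""
    mismatches = {}
    for metric in sorted(set(legacy) | set(new)):
        old_value = legacy.get(metric)
        new_value = new.get(metric)
        if (
            old_value is None
            or new_value is None
            or not math.isclose(old_value, new_value, rel_tol=rel_tol)
        ):
            mismatches[metric] = (old_value, new_value)
    return mismatches


# Hypothetical report totals from the old system and the new big data system.
legacy_report = {"total_orders": 1200, "revenue": 45310.50, "active_users": 318}
new_report = {"total_orders": 1200, "revenue": 45310.50, "active_users": 305}

print(compare_reports(legacy_report, new_report))
```

An empty result is the signal that a report is ready for users. Anything else should be explained before go-live.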
Conclusion
When initiating a project to create a unified data warehouse, please keep in mind the challenges listed in this post and plan accordingly.
H.Thirukkumaran
Founder & CEO
H.Thirukkumaran has over 20 years of experience in the IT industry. He worked in the US for over 13 years for leading companies in various sectors such as retail and e-commerce, investment banking, the stock market, automobiles and real estate. He is the author of the book Learning Google BigQuery, which explains how to build big data systems using Google BigQuery. He holds a master's in blockchain from Zigurat Innovation and Technology Business School in Barcelona, Spain. He is also the India chapter lead for the Global Blockchain Initiative, a non-profit from Germany that provides free education on blockchain. He currently lives in Chennai, India.