Big Data Solution Offering

This Solution Offering currently receives Major Coverage in the MIKE2.0 Methodology. Most required activities are provided through the Overall Implementation Guide and SAFE Architecture, but some Activities are still missing and there are only a few Supporting Assets. In summary, aspects of the Solution Offering can be used but it cannot be used as a whole.
A Creation Guide exists that can be used to help complete this article. Contributors should reference this guide to help complete the article.


Introduction


The Big Data Solution Offering provides an approach for storing, managing and accessing data of very high volume, variety or complexity. Storing large volumes of data from a wide variety of sources in traditional relational data stores is cost-prohibitive, and conventional data modeling approaches and statistical tools cannot handle data structures of such high complexity. This Solution Offering discusses new types of data management systems based on NoSQL database management systems, with MapReduce as the typical programming model and access method.

Executive Summary

Big Data can be defined as data that will be stored in data stores categorized as “NoSQL” due to their lack of compatibility with the SQL language that is so ubiquitous in the relational database world. Typically, though not always, these solutions will exceed a size threshold that causes the user to willingly “give up” the mature capabilities of a relational database for the ability to cost-effectively store, and lightly access, the volume of data. NoSQL is sometimes referred to as “not only SQL” because of the need for both SQL and NoSQL solutions at almost any company.

Most NoSQL is open source, and most of the well over one hundred NoSQL open source projects are not data stores. However, a significant number of them, perhaps dozens, are.

NoSQL solutions originated out of a need by data-oriented companies like Google, Facebook, eBay and Yahoo to store the massive amounts of information their systems generate. The fate of these companies rests on creating outstanding personal experiences, and they are able to utilize every click and every aspect of a page render in their analysis of customer behaviour. Commercial software proved too expensive, so papers were published and companies took up development of their own solutions, which soon spread to the community through the open source software development model.

It is important to select the correct category of NoSQL database management system to store the data. Due to the high volume, data will typically be stored only once, according to the most important use case. There are five categories:

  1. Hadoop
  2. Key-Value Stores (non-Hadoop)
  3. Column Stores (sometimes referred to as wide column stores)
  4. Graph Databases
  5. Document Stores


Certain aspects of NoSQL are common across all the categories and projects:

  • The Open Source nature of most of the tools, mostly Apache Open Source
  • Use of MapReduce as the data access paradigm, which is batch-oriented and keeps processing as close to the data as possible (see the sketch after this list)
  • Data model implemented as JavaScript Object Notation (JSON)
  • Use of Sharding (horizontal partitioning of data across file systems)
  • Very Near Linear Scaling into petabytes
  • Not strictly ACID compliant (atomicity, consistency, isolation and durability; not meant for transaction processing)
  • Highly Fault Tolerant (but not with RAID)
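
To make the MapReduce access paradigm and JSON data model above concrete, the following is a minimal sketch in Python. It is illustrative only: an in-memory list stands in for blocks spread across a sharded file system, and the record fields ("page", "ms") are hypothetical rather than drawn from this offering.

```python
# Minimal sketch of the MapReduce pattern over JSON records. In Hadoop, the map
# phase runs on the nodes holding each data block; here a list stands in for a block.
import json
from collections import defaultdict

RAW_BLOCK = [
    '{"page": "/home", "ms": 120}',
    '{"page": "/cart", "ms": 340}',
    '{"page": "/home", "ms": 95}',
]

def map_phase(lines):
    """Emit (key, value) pairs from each JSON record."""
    for line in lines:
        record = json.loads(line)
        yield record["page"], record["ms"]

def reduce_phase(pairs):
    """Aggregate all values per key; in Hadoop this runs after a shuffle/sort."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}

if __name__ == "__main__":
    print(reduce_phase(map_phase(RAW_BLOCK)))  # {'/home': 107.5, '/cart': 340.0}
```

A real Hadoop job expresses the same two functions as mapper and reducer tasks that run in parallel, in batch, on the nodes that hold each data block.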


Companies increasingly understand that no matter what business they are in, they are in the business of information, and they need to develop a competency for Information Development to take control of their data. This includes data that is best suited for NoSQL solutions. Some of the data that exceeds the volume, variety and complexity thresholds that make NoSQL attractive includes:

  • Sensor-read data
  • Detailed clickstream data
  • Full transaction logs
  • Heavy-relationship-based data (for Graph Databases)
  • Web-crawled data

The need to manage Big Data and the benefits of mastering it will trump the inertia currently upholding the paradigm of loading all data into a database using ETL and then querying that data. Big Data/NoSQL solutions have a chasm to cross, but it will need to happen soon. The investment in Big Data by the large IT vendors and the investment community certainly supposes adoption well beyond the initial group of Silicon Valley "new economy" companies like eBay, Amazon, Yahoo and Google – companies who, incidentally, wrote many of the solutions. The Big Data landscape could radically change if the adoption does not continue through to the big insurance companies, car manufacturers, healthcare companies and big retailers.

The key benefits that will motivate mainstream IT to adopt Big Data are as follows:

  1. Reducing cost of incumbent database vendor: around annual renewal time, some companies will look at solutions that are less expensive for “seemingly” similar types of solutions. Big Data may serve a different class of solution better than relational databases, but many of those solutions have been force-fit into the relational model.
  2. Improving search/reporting performance via sharding and horizontal (scale-out) scaling
  3. Offering high availability and better disaster recovery
  4. Providing higher flexibility of the data schema: adding a column can take a month for a large production database, but takes no time in a schema-less Big Data store.
  5. Reducing cost of compliance: meeting regulatory requirements to keep multiple years of highly detailed data like sensor data or weblogs by storing data in relational data stores is not cost effective.

Many of the early adopters of Big Data solutions are reaping these benefits already, but are also carrying the burden of trial and error until the technology and management of such solutions mature.

Solution Offering Purpose

The Big Data Solution Offering is a Core Solution Offering. Core Solution Offerings bring together all assets in MIKE2.0 relevant to solving a specific business and technology problem. Many of these assets may already exist in other Solution Offerings and as this solution is built out over time, assets can be progressively added.

A Core Solution Offering contains all the elements required to define and deliver a go-to-market offering on its own. It can use a combination of open, shared and private assets.

The purpose of the MIKE2.0 Solution Offering for Big Data is to provide guidance on the value propositions, applicability, selection and deployment of NoSQL technologies to get Big Data under management and put to work as an asset for the business.

Solution Offering Relationship Overview

The MIKE2.0 Solution Offering for Big Data spans multiple Enterprise Views and uses a number of components from the SAFE Architecture. For any significant implementation, many of the activities will be required from the Overall Implementation Guide. Work across the Information Development work stream is the primary focus of this solution.

It should be noted that in addition to the integration work that is needed, significant work will also typically be required in other areas of Infrastructure Development. This includes deployment of the cluster on which the Big Data will reside if the cluster will be managed in-house. Otherwise, clusters can be managed in the cloud – public or private – but must be well understood by the company. Middleware and operating system deployment will most likely be required in an Infrastructure as a Service (IaaS) or Platform as a Service (PaaS) model. An enterprise cloud strategy is essential: if Big Data is going to be deployed in the cloud, it will be an extreme user of the cloud for the organization. Big Data may actually be the impetus for developing a cloud strategy.

Please see The Dimensions of Cloud Computing for more information on IaaS and PaaS.

Many will find the vendor and community support for these tools to be less than what is available for commercial software, and therefore much more attention must be given to Infrastructure Development.

This work includes platform implementation, security, archiving, storage and backup and recovery.

Solution Offering Definition

Defining Big Data

Elements of "Big Data" include:

  • The degree of complexity within the data set
  • The amount of value that can be derived from innovative vs. non-innovative analysis techniques
  • The degree to which longitudinal information supplements the analysis


Size alone is not the primary definition of big data; the answer lies in the number of independent data sources, each with the potential to interact. Big data does not lend itself well to being tamed by standard data management techniques simply because of its inconsistent and unpredictable combinations.

Another attribute of big data is its tendency to be hard to delete making privacy a common concern. For example, it is nearly impossible to purge all of the data associated with an individual car driver from toll road data. The sensors counting the number of cars would no longer balance with the individual billing records which, in turn, wouldn’t match payments received by the company.

A good definition of big data is to describe “big” in terms of the number of useful permutations of sources making useful querying difficult (like the sensors in an aircraft) and complex interrelationships making purging difficult (as in the toll road example).

Big then refers to big complexity rather than big volume. Of course, valuable and complex datasets of this sort naturally tend to grow rapidly and so big data quickly becomes truly massive.

Big Data can be very small and not all large datasets are big.

Big data that is very small

Modern machines such as cars, trains, power stations and planes all have increasing numbers of sensors constantly collecting masses of data. It is common to talk of having thousands or even hundreds of thousands of sensors all collecting information about the performance and activities of a machine.

Consider a plane on a regular one-hour flight with a hundred thousand sensors covering everything from the speed of air over every part of the airframe to the amount of carbon dioxide in each section of the cabin. Each sensor is effectively an independent device with its own physical characteristics. The real interest is usually in combinations of sensor readings (such as carbon dioxide combined with cabin temperature, or the speed of air combined with air pressure). With so many sensors, the combinations are incredibly complex and vary with the error tolerance and characteristics of individual devices.

The data streaming from a hundred thousand sensors on an aircraft is big data. However the size of the dataset is not as large as might be expected. Even a hundred thousand sensors, each producing an eight byte reading every second would produce less than 3GB of data in an hour of flying (100,000 sensors x 60 minutes x 60 seconds x 8 bytes).
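
As a quick check on the figure above, the arithmetic can be reproduced in a few lines of Python:

```python
# Reproduces the paragraph's calculation: 100,000 sensors x 3,600 seconds x 8 bytes.
sensors, seconds, bytes_per_reading = 100_000, 60 * 60, 8
total_bytes = sensors * seconds * bytes_per_reading
print(f"{total_bytes / 1e9:.2f} GB per one-hour flight")  # ~2.88 GB, i.e. under 3 GB
```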

Large datasets that aren’t big

There are an increasing number of systems that generate very large quantities of very simple data. For instance, media streaming is generating very large volumes with increasing amounts of structured metadata. Similarly, telecommunications companies have to track vast volumes of calls and internet connections.

Even if these two activities are combined and petabytes of data are produced, the content is extremely structured. As search engines, such as Google, and relational databases have shown, datasets can be parsed extremely quickly if the content is well structured. Even though this data is large, it isn't "big" in the same way as the data coming from the machine sensors in the earlier example.


Relationship Data

Keep in mind that the above definition of Big Data spans both operational support needs as well as batch and analytic needs.

In addition to big/large data, this Solution Offering Definition includes Relationship Data, since it is handled by the NoSQL category of Graph Databases. Relationship data is not necessarily big in the sense of large volume. However, it is a new category of data which has long been ignored but is becoming increasingly relevant.

Relationship data can be comprised of numerous nodes of different data types or it can be numerous nodes of the same data type. For example, if you needed to quickly see someone’s relationship to their cable TV package and the channels that package carries, it would be a quick two-node traversal from the node of the person. The more common use of graph databases is with nodes of the same data type such as the social network. A social network would show all the connections between individuals in a social group, like a telecommunications calling circle or who follows who on Twitter.
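
A hedged sketch of that two-hop traversal, using a plain Python adjacency map in place of a real graph database; the node and edge names below are illustrative only.

```python
# Tiny adjacency map: node -> list of (edge_type, destination_node).
graph = {
    "person:alice":    [("SUBSCRIBES_TO", "package:premium")],
    "package:premium": [("CARRIES", "channel:news"), ("CARRIES", "channel:sport")],
}

def traverse(start, depth):
    """Return all nodes reachable from `start` within `depth` edge hops."""
    frontier, seen = [start], set()
    for _ in range(depth):
        frontier = [dst for node in frontier for _, dst in graph.get(node, [])]
        seen.update(frontier)
    return seen

# Person -> package -> channels in two hops.
print(traverse("person:alice", 2))  # {'package:premium', 'channel:news', 'channel:sport'}
```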

Relationship data is interesting for determining how associations influence behavior. This type of relationship data may be the most prominent use of NoSQL in the years to come as companies will manage it for social media analytics, fraud, terrorism, churn management, marketing optimization and other aspects of customer management.

Hadoop

The 800-pound gorilla for the management of big data is Hadoop. Hadoop refers to a couple of open source products – the Hadoop Distributed File System (HDFS) and MapReduce – although the Hadoop family extends into a product set that continues to grow. Although Hadoop is open source, many vendors like Cloudera, Hortonworks, MapR, EMC and IBM have closed-sourced some additional capabilities and/or added Hadoop support into their product sets.

HDFS runs on a large cluster of commodity nodes. Whenever a node is placed in the IP range specified by a "name node" (one of the necessary Java virtual machine processes), it becomes available for data storage in the file system and thereafter reports a heartbeat to the name node. All data is unsequenced, stored in 64 MB blocks (although records can span blocks) and replicated three times to ensure redundancy (instead of using RAID).

The first copy of a block is written to the node creating the file. The second copy is written to a node within the same rack to minimize cross-rack network traffic. The third, and final, copy is written to a node in a different rack to tolerate switch failure. All MapReduce accesses to the data store are full scans and choose one of the three placements to read for each data block. Hadoop, rather than the user, has full control over block placement.
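
The placement rule just described can be sketched in a few lines of Python. This is a simplification for illustration only: the rack and node names are hypothetical, and real HDFS placement involves many more checks.

```python
# Simplified three-copy placement following the rule described above.
import random

def place_replicas(writer_node, topology):
    """Return three (rack, node) placements for one 64 MB block."""
    writer_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # 1st copy: the node creating the file.
    first = (writer_rack, writer_node)
    # 2nd copy: another node in the same rack, minimizing cross-rack traffic.
    second = (writer_rack, random.choice([n for n in topology[writer_rack] if n != writer_node]))
    # 3rd copy: a node in a different rack, to tolerate a switch or rack failure.
    other_rack = random.choice([r for r in topology if r != writer_rack])
    third = (other_rack, random.choice(topology[other_rack]))
    return [first, second, third]

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
print(place_replicas("n1", topology))
```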

This approach is highly scalable to thousands of nodes, and the large block sizes maximize the efficiency of each I/O operation. It is also very fault tolerant, which is necessary when dealing with commodity-class nodes. Hadoop offers the ability to quickly analyze massive collections of records without forcing data to first be modeled, then transformed, then loaded.

Please see Hadoop and the Enterprise Debates for a discussion of enterprise debates about the value of, and place for, Hadoop.

Types of Data Stores

NoSQL database systems are schema-less and often store data as a key followed by a value (e.g., FIRSTNAME: William). Since for many solutions this form of storage is their only one, they are often categorized as Key-Value stores.

In Key-Value stores, the schema can differ from row to row. Key-Value stores are great for stock quotes, parts lists and other forms of high-volume data storage.
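
As an illustration only, a tiny in-memory stand-in for a key-value store follows; real stores add persistence, sharding and replication on top of this idea, and the keys and fields below are hypothetical.

```python
# Key-value storage where the "schema" of the value differs from row to row.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

# Rows under different keys carry entirely different fields.
put("quote:ACME:2024-01-02", {"open": 10.1, "close": 10.4})
put("part:998-A", {"description": "bracket", "bin": "7C", "qty": 140})
print(get("part:998-A"))
```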

Column stores are, again, key-value, but "supercolumns" or "column families" are declared in the schema. For example, a supercolumn could be Name, which is broken down into First_Name and Last_Name columns. This slight deviation from the schema-less concept makes column stores very useful for Big Data, with its mix of known and unknown in every row. Column stores are great for time series data.
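
A sketch of the wide-column idea for time series data, in which a row key maps to column families and each family holds timestamped columns; the family and column names here are illustrative, not taken from any particular product.

```python
from collections import defaultdict

# row_key -> column_family -> {column_name: value}
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

# Time-series style writes: one row per sensor, one column per timestamp.
put("sensor:42", "readings", "2024-01-02T10:00:00Z", 21.4)
put("sensor:42", "readings", "2024-01-02T10:00:01Z", 21.5)
put("sensor:42", "meta", "unit", "celsius")
print(dict(table["sensor:42"]["readings"]))
```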

Graph Databases primarily store the aforementioned relationship information, and they therefore handle a workload very different from other NoSQL data stores. "Edge" is the term for a relationship, and edges can connect any nodes (which are like rows in a table) that have a relationship. For example, customers have social relationships with other customers, they send invitations and information to certain other customers, they place orders and they have addresses. All of these (customers, orders, addresses) are nodes, and edges refer to the social relationships, the sending of communications like invites, the placing of orders and the holding of addresses. Graph databases excel over SQL for relationship navigation. Product catalogs are another common use of Graph Databases.

More commonly, Graph Databases can depict the “social graph” of relationships between entities (customers, employees, suppliers, etc.). Graph databases are also useful for recommendation engines and network analysis. Graph databases are further described in Graph databases.

Finally, there are Document Stores. Document is the term these NoSQL stores use to refer to a JSON row. Document Stores have the richest set of possibilities for the values stored. These include scalars, arrays, attachments and, importantly, links to other documents – so documents can be embedded. Document Stores provide developers with the most flexibility of all.

Document Stores are great for complex modeling and accessing actual documents, such as the workload for electronic health records.
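
A sketch of a document in this sense: a JSON row whose values can be scalars, arrays, embedded documents or links to other documents. The health-record fields below are invented, chosen only to echo the workload mentioned above.

```python
import json

encounter = {
    "_id": "enc-001",
    "patient": {"name": "J. Doe", "dob": "1980-04-01"},          # embedded document
    "diagnoses": ["J45.901", "E11.9"],                            # array of scalars
    "attachments": ["scan-17.pdf"],
    "primary_physician": {"$ref": "physicians", "$id": "dr-9"},   # link to another document
}

print(json.dumps(encounter, indent=2))
```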

Relationship to Solution Capabilities

The MIKE2.0 Solution Offering for Big Data spans multiple Enterprise Views and uses a number of components from the SAFE Architecture. For any significant implementation, the great majority of activities from the Overall Implementation Guide will be required.

Relationship to Enterprise Views

Adding Big Data into the information management environment means taking an approach focused on Information Development, but with an understanding of the significant differences from legacy, SQL-based environments. Work across the Information Development workstream is the primary focus of this solution.

It should be noted that in addition to the integration work that is needed, significant work will also typically be required in other areas of Infrastructure Development. This work includes platform (cluster) provisioning, data screening and capture, and data availability.

Mapping to the Information Governance Framework

This Solution Offering is dependent on an overall Information Governance Framework being in place at the implementing organization. Implementing this solution should be governed by the same guiding principles, standards, policies and practices as the rest of the IM transformation program.

Mapping to the SAFE Architecture Framework

As described above, NoSQL systems require a new approach to extend the enterprise architecture with IaaS and PaaS models, new types of hardware and a cloud computing IT strategy. This Solution Offering will depend on several key components in the SAFE Architecture including:

As Big Data includes large volumes of unstructured data, key components within Enterprise Content Management are particularly relevant.

Mapping to the Overall Implementation Guide

In many cases, most of the Activities from the Overall Implementation Guide will be required for building out a Big Data environment. Users of MIKE2.0 should review each activity as a starting point to see if they are required based on the scope of the project requirements.

Shown below are the most important activities for building a Big Data environment and how they relate to the overall approach.

Business Blueprint (Phase 1)

A comprehensive Big Data environment means developing a vision that impacts people, process, organisation and technology. All activities from the Business Blueprint phase of MIKE2.0 will be required to define this strategic approach although it is possible that other initiatives may cover some activities (i.e. there may be another project underway to define a Data Governance team).

Enterprise Information Management Awareness

An effective big data addition to the information management environment requires recognition of the differences between big data/NoSQL approaches and traditional SQL-oriented approaches. The vast majority of users – whether applications or end users – will be accustomed to the SQL approach. Once NoSQL is decided upon, it is important for the team to continue to socialize the differences so the implications are well understood as the technology blueprint and other phases are developed. Enterprise Information Management Awareness for big data creates a mutual understanding between the development team and the user community of the unique aspects that big data solutions must contend with.

While the lack of complete ACID compliance may pose a small, hard-to-measure risk, the ACID-compliance status of the big data solution should be well understood and should fit the risk profile of the application being supported or the application(s) being developed for big data.

Furthermore, even though there may have been an intellectual understanding of the lack of tooling, culture and similar case studies in NoSQL, this should be accounted for in the timeframes estimated for all subsequent phases, as well as reinforced with wireframes and tool demonstrations to the business at this time.

The lack of a prebuilt schema during the project for users to review and approve also factors in at this time. The business also needs to be informed of the implications of unpredictable data upon the application.

Organizational QuickScan for Information Development

Big data solutions serve either an operational or an analytical purpose, and there could be many NoSQL stores in the environment; they can therefore fit into the overall architecture in several places. An Organisational QuickScan for Information Development should be performed, covering information usage and information stores across the enterprise, how the new solutions fill gaps, and which existing data stores are lined up for eventual replacement.

Future State Vision for Information Management

Once the choice is made to use NoSQL technology, many more decisions are necessary. The Future State Vision for Information Management will determine which category of NoSQL solution, which product and which distribution – and potentially whether an open- or closed-source solution – will be utilized. It will also determine how the solution fits with the enterprise cloud strategy.

Keep in mind that enterprises are moving to a more complex arrangement for their information. It will be an architecture replete with heterogeneous data stores in both operational and analytical realms. There will be NoSQL solutions throughout the enterprise integrated with ERP systems and analytical data warehouse systems. There will be applications that need data spanning systems. There will be systems that feed data to other systems, systems that specialize and systems that generalize. Some systems will keep history data and others will not.

A future state vision for information management must consider existing infrastructure and try to build towards an efficient environment while not losing any of the critical functionality that data provides the organization.

Data Governance Sponsorship and Scope

Perhaps no two domains are more important to understand the intersection of than big data and data governance. Data governance is about people making decisions that stick about how data should be sourced, cleansed, managed, distributed and used.

While many big data projects run ungoverned today, their teams want to do data governance but need a plan. Data Governance Sponsorship and Scope determines the enterprise subject areas in preparation for governance assignment.

Governance Organization

Data governance operates at the enterprise subject area level. Soon, if not now, data governance will need to address mobile data, cloud data, social data and other device data. This should compel its leaders to think about data from an enterprise perspective. The Initial Data Governance Organisation is comprised of members from the business team who represent the business subject areas affected by the big data project.

Occasionally, data governance may need to start with the big data program. Being part of data governance is increasingly becoming necessary for big data initiatives. Though we expect less value per record from big data, semantic analysis and ontologies give this data the meaning it needs to be effective for the organization. Unstructured and very large data sets still need the essentials that data governance programs bring, including data quality, metadata and security.

The benefits of data governance of NoSQL data are similar to those of data governance of SQL data, and the approach and organization are similar. Ultimately, good data governance programs can and should be extended to big data and NoSQL technologies.

Return on Investment of Information Assets

Many initial big data projects are focused on harnessing the data and are less cognizant of the business use of the data and how the project fits into the overall business plan. This is a mistake that eventually bites most of these projects.

Big data projects can be either operational or analytical in nature. A big data project is either perceived as being in support of another project, such as an operational environment serving content to the website, or the data layer is so important that the project itself is seen as a big data project.

In the former case, Return on Investment of Information Assets should demonstrate that the big data solution provides the lowest total cost of ownership (TCO) for delivering the data asset to the application. This often means proving a lower TCO than a relational environment.

In the latter case, where big data IS the project, the project should be proven out by showing a return on investment (ROI) framework to the business. This ROI shows the return to the business in the form of increased sales or reduced expenses that exceed the cost of the project itself (people, hardware and software).

Technology Blueprint (Phase 2)

All activities from the Phase 2 - Technology Assessment and Selection Blueprint phase of MIKE2.0 will typically be required to define the strategic approach to building a big data environment. Some activities (such as the definition of Data Standards) may only require a review of existing artifacts as they may already be in place.

Strategic Requirements for Application Development

With Strategic Requirements for BI Application Development, the project will determine the means of accessing the data stored as part of the big data project. This could be through other open source products like Pig or Hive, or a commercial product that has been ported to work with the big data solution, as Jaspersoft has done with MongoDB, for example. It could also be direct use of the MapReduce approach, with mappers and reducers coded in Java to work in parallel across the nodes where the data is located. This code will run as full scans without workload management.

There can be many access methods for the data, but this is the phase to determine what they all will be.
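
As one possible illustration of such an access method (not prescribed by this offering), the following Python sketch queries a document store, assuming a locally running MongoDB instance; the database, collection and field names are hypothetical.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is reachable at this address.
client = MongoClient("mongodb://localhost:27017/")
collection = client["bigdata_demo"]["clicks"]

# Insert a few schema-flexible documents; fields may differ per document.
collection.insert_many([
    {"page": "/home", "ms": 120, "user": "a"},
    {"page": "/cart", "ms": 340},
])

# Aggregate average latency per page; the work runs server-side, close to the data.
pipeline = [{"$group": {"_id": "$page", "avg_ms": {"$avg": "$ms"}}}]
for row in collection.aggregate(pipeline):
    print(row)
```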

Strategic Requirements for Technology Backplane Development

Strategic Requirements for Technology Backplane Development map the functional capabilities that are required for Information Development and Infrastructure Development against the conceptual architecture capabilities as defined by the SAFE framework. This is a key activity as the back-end capabilities required for developing the big data systems are the area where many initiatives tend to run into issues.

Strategic Non-Functional Requirements

As with the business return on investment, Strategic Non-Functional Requirements is an activity often overlooked in big data projects, at the project's peril. While the service levels supported by the non-functional requirements (NFRs) of big data projects tend to be lower than those of relationally-based projects, they should still be defined, strived for and understood.

Such NFRs for big data include measures for availability, performance, reliability, scalability, maintenance, security, usability, connectivity, systems management and disaster recovery.

Current State Logical Architecture

Current-State Logical Architecture is important for Big Data because the Big Data project will bring entirely new technology into an environment that mostly does not understand it. Since "NoSQL" originally referred to SQL replacement (and was subsequently redefined as "not only SQL"), NoSQL practitioners have to guard against overreach with their solutions. Most NoSQL solutions will be for applications that are net new to the business. It is therefore important to take inventory of what exists in the architecture to know how NoSQL will fit in.

Future State Logical Architecture and Gap Analysis

Future-State_Logical_Architecture_and_Gap_Analysis will take the current state logical architecture and update it for the to-be Big Data/NoSQL solution. The architecture will mostly be amended with new structures and connection points between the new structures and existing structures. This is a good time to identify the method of integration with the enterprise. The two forms are data virtualization, which leaves data in place and provides the means of doing federated queries, and data integration, which moves data from one environment to the other for processing. In any robust environment with NoSQL solutions, both will be needed.

Roadmap and Foundation Activities (Phase 3)

This phase is one of the key features of MIKE2.0 in that it involves laying the foundation for success for the Big Data initiative. Given the newness of projects of this nature in the enterprise, the foundation is critically important.

Testing and Deployment Plan

Since Big Data projects involve large volumes of data and involve exploration of that data, these projects often do not follow a strict development-test-production approach. To the degree testing is necessary, it is important to establish a test plan. Testing and Deployment Plan will forge the testing plan as well as the path-to-production, which will frequently consist of two, instead of the usual three, environments. Initial development may happen in production.

Data Profiling

Data Profiling takes on tremendous importance in Big Data projects. Given the lack of a schema, profiling the source data (or the data generated by the application) is akin to data modeling. While a formal data model is not created for Big Data projects, a documented profile of the source data should be developed showing the expected form of the data. Without this tool for anticipation, data processing will not have a basis to proceed.
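
A minimal profiling sketch along these lines, assuming the source arrives as newline-delimited JSON with field names and types unknown in advance; the sample records are invented for illustration.

```python
import json
from collections import Counter, defaultdict

def profile(lines):
    """Document how often each field appears and which value types were observed."""
    field_counts, field_types, total = Counter(), defaultdict(Counter), 0
    for line in lines:
        record = json.loads(line)
        total += 1
        for field, value in record.items():
            field_counts[field] += 1
            field_types[field][type(value).__name__] += 1
    return {
        field: {
            "fill_rate": field_counts[field] / total,   # how often the field appears
            "types": dict(field_types[field]),          # observed value types
        }
        for field in field_counts
    }

sample = ['{"sensor": "t1", "value": 21.4}', '{"sensor": "t2", "value": "n/a"}']
print(profile(sample))
```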

Business Intelligence Initial Design and Prototype

Business Intelligence Initial Design and Prototype is the next step and entails the development of wireframes and other representations of the resulting processing of the data stored in the NoSQL solution. Though the developers tend to be very close, organizationally and physically, to the business and user teams, it is nonetheless important for the development team to prototype the look of the expected results for the user community.

Detailed Design (Phase 4)

This phase extends the foundation into action. It is focused on the development staff’s implementation of the design in the determined environment.

User Support & Operational Procedures Design

User Support & Operational Procedures Design will detail the post-production support measures that the development team will put in place to support their work. For Big Data projects, development will be exceptionally iterative so this step is critical to establish the legitimacy and acceptance of the development.

Information Security Design

Since Big Data projects are often cloud-based and the software lacks granular security mechanisms, Information Security Design is an important step for taking inventory of the security possibilities and the breach opportunities. Many times in big data the user community is very limited, but this is only the community that will be involved in the project, not every person who could breach the data.

Data Integration Logical Design

Data Integration Logical Design will design the sourcing of the data into the NoSQL data store. This will be the longest and most resource-intensive step of the solution, yet it is easily explained as the implementation of the Data Profiling from Phase 3. While transformative capabilities are limited in Big Data, screening capabilities are abundant. The pace of loading will be more rapid than most enterprise architects are used to.
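
To illustrate the screening idea (a hedged sketch rather than a prescribed implementation), the following applies lightweight checks derived from a hypothetical data profile while loading newline-delimited JSON.

```python
import json

def screen(line, profile):
    """Accept a raw JSON line only if its fields match the profiled expectations."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None                      # reject unparseable rows
    for field, expected_type in profile.items():
        if field in record and not isinstance(record[field], expected_type):
            return None                  # reject rows that contradict the profile
    return record

profile = {"sensor": str, "value": float}
rows = ['{"sensor": "t1", "value": 21.4}', '{"sensor": "t2", "value": "n/a"}', 'not json']
loaded = [r for r in (screen(line, profile) for line in rows) if r is not None]
print(loaded)  # only the first row passes the screen
```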

User Interface Design

While the data is being loaded per the Data Integration Logical Design, and with the knowledge of the data set from Data Profiling, User Interface Design proceeds to implement the Business Intelligence Initial Design & Prototype with the tool chosen in Strategic Requirements for BI Application Development. In Big Data projects, given the exploratory nature and limited audience of the analysis, expect there to be orders of magnitude fewer user interfaces and reports to develop than in other solutions.

Test Design

Taking the Testing and Deployment Plan down another level, in Test Design the testing team will work with the development team to develop the functional, system, end-to-end and stress & volume tests. The development team will also outline their unit testing.

Implement and Improve (Phase 5)

The implementation activities in Phase 5 provide a mechanism for testing the development, going to production, succeeding in production and incrementally improving the solution.

Testing Activities

Testing activities for a Big Data solution include multiple cycles that are largely executed serially: Functional Testing, System Integration Testing, End-to-End Testing, and Stress and Volume Testing. All are required for building a complex Big Data solution. Although scale-out solutions are highly scalable, it is important to understand the upper limit of the current cluster; this is best re-evaluated after the initial load of several terabytes of data.

Production Deployment

At this stage, Production Deployment should be a drama-free move of the code to the production environment, initial load of data and initial business intelligence deployment, supported by the execution of the User Support & Operational Procedures Design. In the event that development was done in production, this step involves the establishment of the pre-production environment(s).

Evaluation and Launch

Evaluation and Launch primarily evaluates the data and its usage. The data should meet the expectations set in the Data Profiling step, and usage should be reflected in user satisfaction with the Business Intelligence Initial Design & Prototype. Note that 'business intelligence' is used here broadly to refer to usage of the Big Data; much of that usage is not interactive, but batch.

Continuous Improvement Activities

As for a data warehouse, Big Data continuous improvement activities ensure the Big Data solution is a living, breathing entity in the organization and that it is possible to make changes to existing feeds and create new feeds into the Big Data store, as well as do the same for the data access environment. Continuous improvement also means monitoring the Non-Functional Requirements developed earlier and the scaling of the solution within its current architecture. Capacity calculations for Big Data often underestimate the sheer volume involved in capturing sensor and webclick data.

Mapping to Supporting Assets

Logical Architecture, Design and Development Best Practices

A number of artefacts help support the MIKE2.0 Solution for Big Data:

In addition, there are MIKE2.0 Solutions that are specifically focused on two of the back-end issues that tend to be problematic in building a better Big Data environment:

Product-Specific Implementation Techniques

Though all steps of the offering will be applicable to all big data (data store) product implementations, the major implementation differences will be between operational Big Data and analytical Big Data (Hadoop) as well as between non-graph databases and graph databases.

Operational Big Data supports the online, real-time needs of applications like online advertisement targeting and content serving, customer-facing dashboards updated in real time, infrastructure monitoring, recommendation engines, social monitoring, web data collection for real-time needs, immediate product serving, high-frequency trading and game state storage. The Business Intelligence aspects of the solution will be less interesting for operational Big Data, whereas data profiling takes on a heightened importance.

Hadoop handles the analytic workload much like the data warehouse does, only for a different, more unstructured, data set. The Business Intelligence aspects of the solution are important to Hadoop implementations as they serve the data to users. The data governance aspects of the solution are also more critical for the integration data set that is Hadoop than for operational big data.

The primary difference in the solution when applied to Graph Databases will be that the data may not be as large and therefore the non-functional requirements would be different. Also the Data Profiling step will be quite different since it will apply to relationship data. The modeling will be that of node-relationship-property as opposed to profiling small changes in large numbers of records of mostly similar data.

Product Selection Criteria

The Executive Summary and Defining Big Data sections above step through placing the Big Data project into the correct Big Data category. Once in the correct category, the product set can become quite limited. As "sledgehammers," these NoSQL solutions may not appear at first to have much technology uniqueness; some of the uniqueness may be found in softer factors such as the company culture.

For Hadoop, there is the important question of which distribution to use.

Relationships to other Solution Offerings

The MIKE2.0 Big Data Solution Offering delivers a fully operable Big Data environment. It comprises business considerations, technology planning, design, development, testing and beyond into iteration management. It is mindful that multiple Big Data solutions in the enterprise may be necessary, may share data and may even have simultaneous development occurring.

Since Big Data solutions are still source-store-access solutions, the Big Data Solution Offering shares some of the data integration and business intelligence elements of the Business Intelligence Solution, but otherwise it will have more differences than similarities to other solution offerings.

Extending the Open Methodology through Solution Offerings

The Big Data Solution Offering has numerous implications for, and requires revisions to, the Open Methodology.

These include a greater focus on open source procurement in the Technology Selection QuickScan, a revision of the definition of business intelligence to include batch access techniques in the BI_and_PM_Offering_Group, data integration when transformation is lacking in the Data Integration Solution and a very new way to do data modeling in Data_Modelling_Solution.
