COVID-related data has been the most closely watched data across the globe. Data enthusiasts everywhere contributed to it, disputed it, and drew inferences from it to provide relevant information and insights. Public and health departments across the world shared datasets on public marketplaces to enable these enthusiasts to deliver insights. This has given us an opportunity to understand the value that citizen data scientists can unlock when given access to data that has been vetted, certified, and hosted on a data exchange or data marketplace.
This influx of data could be the beginning of initiatives such as bring your own data (BYOD), along the lines of the familiar bring your own device.
We can see possibilities for data marketplaces, such as self-service data platforms that handle on-demand general and specialized workload clusters, enabling enterprises to be agile in making data-driven decisions, both human and augmented.
Apache NiFi provides low-code, easy-to-build data flow pipelines with a multitude of connectors across databases, clouds, and middleware. It acts as the conduit and a single place to visually design, run, and monitor data collection.
StreamSets Data Collector is an easy-to-use, modern execution engine for fast data ingestion and light transformation that can be used by anyone.
Confluent Kafka, the commercial counterpart of Apache Kafka, has expanded on its feature set with Kafka connectors, which enable data movement across several sources and sinks, along with a schema registry, transformers, and converters.
Change Data Capture (CDC) provides a quick and non-intrusive way to integrate new data ecosystems with enterprise applications, including ERP, CRM, and on-premises transactional systems. With the recent trend toward the elastic cloud, CDC products provide the right components to enable data flow with guaranteed delivery. HVR and Informatica CDC are leading products in this space.
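To make the idea concrete, here is a minimal, stdlib-only sketch of deriving change events by diffing two snapshots of a table keyed by primary key. Note this is only an illustration of the event model: production CDC tools such as HVR and Informatica CDC read the database's transaction log rather than diffing snapshots, which is what makes them non-intrusive. All names and sample rows below are invented.

```python
# Illustrative sketch: derive insert/update/delete events that turn one
# table snapshot into another. Real CDC reads the transaction log instead.

def capture_changes(before: dict, after: dict) -> list:
    """Return (operation, key, row) events that transform `before` into `after`."""
    events = []
    for key, row in after.items():
        if key not in before:
            events.append(("insert", key, row))
        elif before[key] != row:
            events.append(("update", key, row))
    for key in before:
        if key not in after:
            events.append(("delete", key, None))
    return events

before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}
print(capture_changes(before, after))
```

Downstream systems consume such an event stream to keep replicas in sync, which is exactly the flow-with-guaranteed-delivery role CDC products play between transactional systems and the cloud.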
Amazon Redshift is a cloud data warehouse that allows enterprises to scale from a few hundred gigabytes of data to a petabyte or more. Redshift variants include Redshift Spectrum, which enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
Delta is a new lakehouse paradigm that provides atomicity, consistency, isolation, and durability (ACID) for transactions by maintaining metadata on top of data lakes and cloud blob stores. Key features include time travel, schema enforcement, and unified real-time and batch processing, with Parquet as the storage format.
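The mechanism behind time travel and atomic commits can be sketched in a few lines of plain Python. This is a deliberately simplified model, not the Delta Lake implementation: real Delta stores immutable Parquet data files plus a JSON commit log, whereas this toy class records a full snapshot per commit purely to illustrate the idea that an append-only log makes writes atomic and every version readable.

```python
# Toy model of a transaction log over immutable snapshots, in the spirit
# of a lakehouse table format. Each commit becomes visible atomically
# when appended; any committed version remains readable (time travel).

class TinyDeltaTable:
    def __init__(self):
        self._log = []  # ordered list of committed snapshots

    def commit(self, rows):
        # Appending to the log is the single atomic "publish" step.
        self._log.append(list(rows))

    def read(self, version=None):
        # Time travel: read the table as of any committed version.
        if not self._log:
            return []
        version = len(self._log) - 1 if version is None else version
        return self._log[version]

t = TinyDeltaTable()
t.commit([{"id": 1, "qty": 10}])
t.commit([{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])
print(t.read(version=0))  # table as of the first commit
print(t.read())           # latest snapshot
```

Readers never see a half-written commit because a version either exists in the log or it does not, which is how metadata alone can add ACID behavior on top of a blob store.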
Apache Hudi brings stream processing to data lakes and adds upsert support on top of them. Key features include snapshot isolation between writers and queries, asynchronous compaction, timeline metadata to track lineage, and atomic publish with rollback. It can be seen as an alternative to HBase and an upgrade over plain-vanilla HDFS.
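The upsert semantics at the heart of Hudi can be shown with a small stdlib-only sketch: incoming records either update an existing row (matched by record key) or are inserted as new rows. Real Hudi achieves this with indexed file groups and compaction over lake storage; this dict-based version, with invented sample data, demonstrates only the merge logic.

```python
# Illustrative upsert: merge incoming records into a table keyed by a
# record key. Matching keys are updated; new keys are inserted.

def upsert(table: dict, records: list, key: str = "id") -> dict:
    merged = dict(table)
    for rec in records:
        merged[rec[key]] = rec  # update if key exists, else insert
    return merged

base = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
incoming = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
print(upsert(base, incoming))
```

On plain HDFS this operation would require rewriting whole files by hand, which is why native upsert support is the headline feature here.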
Apache Hive is a data warehouse infrastructure tool used for processing structured data in Hadoop. The Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive offers a native ORC format, support for external tables, and the expressive HiveQL language.
Parquet is an open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is a flat columnar storage format whose record shredding and assembly algorithm is optimized to work with complex data in bulk, and it offers several efficient data compression and encoding schemes. Parquet can read only the required columns, thereby minimizing IO significantly.
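Why a columnar layout minimizes IO can be illustrated with a toy, stdlib-only columnar store: records are "shredded" into one array per column on write, so a query touching two columns never reads the rest. Real Parquet layers row groups, encodings, and compression on top of this idea; the sample data below is invented.

```python
# Toy columnar store: shred row-oriented records into per-column arrays,
# then materialize only the columns a query actually needs.

rows = [
    {"id": 1, "name": "Ada", "country": "UK", "score": 9.0},
    {"id": 2, "name": "Bob", "country": "US", "score": 7.5},
]

# "Write": record shredding — one list per column.
columns = {col: [r[col] for r in rows] for col in rows[0]}

def read_columns(store: dict, wanted: list) -> dict:
    """Read only the requested columns, skipping all others entirely."""
    return {col: store[col] for col in wanted}

print(read_columns(columns, ["id", "score"]))
```

In a row-oriented file the same query would have to scan `name` and `country` too; column pruning is what makes analytical scans over wide tables cheap.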
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources—at scale.
Google BigQuery, a serverless, highly scalable, and cost-effective multi-cloud data warehouse, has been upgraded with new features such as BigQuery ML for query-based ML model building and operations, while BigQuery Omni, BigQuery BI Engine, and BigQuery GIS enable multi-cloud querying, in-memory BI, and geospatial analysis, respectively.
Snowflake is a hyperscaler-independent cloud data warehouse that runs on all three top cloud providers. Key features include complete compute-storage decoupling, results caching, and automatic dynamic query optimization.
The Databricks platform leads the data lakehouse concept, positioning itself as a replacement for the data warehouse by bringing warehouse semantics to data lakes. Support for Delta Lake, Databricks notebook environments, Databricks SQL analytics, the scalable Apache Spark processing engine with optimization extensions, and MLflow make Databricks a preferred platform in the data space.
Apache Spark is a distributed, scalable processing engine written in Scala with bindings for Python and R. It is the most preferred processing engine for unified real-time and batch processing.
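A defining trait of Spark's RDD and DataFrame APIs is lazy evaluation: transformations only build up a plan, and nothing runs until an action is called. The stdlib-only class below sketches that contract in a single process; it is not distributed execution, and the class and sample pipeline are invented for illustration.

```python
# Sketch of lazy, chained transformations in the style of an RDD:
# map/filter record a plan; collect() (the "action") executes it.

class TinyRDD:
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    def map(self, fn):
        # No work happens here — just extend the plan.
        return TinyRDD(self._data, self._plan + (("map", fn),))

    def filter(self, fn):
        return TinyRDD(self._data, self._plan + (("filter", fn),))

    def collect(self):
        # The action: stream the data through the recorded plan.
        out = iter(self._data)
        for op, fn in self._plan:
            out = map(fn, out) if op == "map" else filter(fn, out)
        return list(out)

rdd = TinyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # → [0, 4, 16]
```

Deferring execution until the action is what lets a real engine see the whole pipeline at once and optimize it before any data moves.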
The data build tool (dbt) is a data transformation tool that enables data analysts and engineers to transform, test, and document data in the cloud data warehouse. Engineers can transform data in their warehouse simply by writing select statements, which dbt turns into tables and views. dbt excels at transforming data that is already loaded, and its foundational elements for composable SQL constructs provide a fresh approach for analysts.
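The "write a select, get a table or view" idea can be demonstrated in miniature with the standard library's sqlite3 standing in for a cloud warehouse. This is not dbt's implementation or API; the model name, helper function, and sample data are all invented to show the materialization pattern dbt automates.

```python
# Miniature version of dbt's core move: wrap an analyst's SELECT
# statement into a CREATE VIEW/TABLE in the warehouse (sqlite3 here).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 10.0, 'paid'), (2, 5.0, 'void'),
                              (3, 7.5, 'paid');
""")

# The analyst's "model": nothing but a select statement.
model_sql = "SELECT status, SUM(amount) AS total FROM orders GROUP BY status"

def materialize(conn, name, select_sql, as_view=True):
    """Turn a select statement into a named view or table."""
    kind = "VIEW" if as_view else "TABLE"
    conn.execute(f"CREATE {kind} {name} AS {select_sql}")

materialize(conn, "order_totals", model_sql)
print(conn.execute("SELECT * FROM order_totals ORDER BY status").fetchall())
# → [('paid', 17.5), ('void', 5.0)]
```

Because the analyst only ever writes selects, the boilerplate of DDL, dependencies, and refreshes becomes the tool's job rather than theirs.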
Apache Airflow is an open-source orchestration platform that enables data engineering teams to design, develop, deploy, and manage complex workflows and data processing pipelines. A large selection of connectors, sensors, and third-party integrations with cloud hyperscalers makes Airflow a preferred platform. GCP provides a fully managed version via Cloud Composer, which is built on top of Airflow.
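At its core, an orchestrator executes tasks as a directed acyclic graph, running each task only after all of its upstream dependencies have finished. The stdlib-only sketch below shows just that ordering logic; it is not Airflow code (real Airflow adds scheduling, retries, sensors, and operators), and the task names are invented.

```python
# Minimal DAG executor: run each task once all of its upstream
# dependencies are done; refuse to run if the graph has a cycle.

def run_dag(tasks: dict, deps: dict) -> list:
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t not in done and all(u in done for u in deps.get(t, []))]
        if not ready:
            raise ValueError("cycle detected in DAG")
        for t in sorted(ready):  # deterministic order for this sketch
            tasks[t]()
            done.add(t)
            order.append(t)
    return order

log = []
dag_tasks = {name: (lambda n=name: log.append(n))
             for name in ["extract", "transform", "load"]}
print(run_dag(dag_tasks, {"transform": ["extract"], "load": ["transform"]}))
# → ['extract', 'transform', 'load']
```

Everything else an orchestration platform offers — backfills, alerting, connector libraries — is layered on top of this dependency-respecting execution loop.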
The Informatica Intelligent Data Management Cloud (IDMC) platform provides complete, comprehensive, cloud-native, and AI-powered data management capabilities, including data catalog, data integration, API and application integration, data prep, data quality, master data management, and a data marketplace, on a foundation of governance and privacy. IDMC is powered by CLAIRE®, an AI and machine learning engine optimized for intelligence and automation, and is built on a modern, elastic, serverless microservices stack that connects data consumers to relevant data sources. It enables you to intelligently discover and understand all the data within and outside the enterprise; access and ingest all types of data when required; curate and prepare data in a self-service mode so it is usable; and deliver an authoritative, trusted single view of all your data.
The Collibra Data Intelligence platform enables data governance, catalog, privacy, lineage, and data quality rules management on a single platform. The Collibra marketplace provides connectors across the data chain from ingestion to business consumption.
Alation provides an open and intelligent platform that supports a wide variety of metadata management applications, from search and discovery and data governance to digital transformation. The foundational element is the behavioral analysis engine (BAE), which improves virtually all areas of the platform with advanced artificial intelligence and machine learning technology. Discovery is enhanced through natural language and popularity-driven relevancy rankings. Stewardship is streamlined with an emphasis on the data sets that see the most empirical usage. Governance is implemented in the workflow through flags and suggestions about the relevant policies that govern data assets.
Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. You can easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. Data consumers are able to find valuable, trustworthy data.
Power BI is a cloud-based business intelligence service from Microsoft. You can connect to on-premises and cloud data sources to create dashboards and interactive visualizations using a desktop-based interface called Power BI Desktop. You can build end-to-end business solutions by connecting Power BI across the entire Microsoft Power Platform - Office 365, Dynamics 365, and Azure. Connect Power BI with Power Apps and Power Automate to easily build custom business applications and automate workflows. All users can utilize BI, as Power BI offers self-service analytics at enterprise scale, analyzes large volumes of data, finds insights in both structured and unstructured data, and performs real-time stream analytics to convert insights into action.
Tableau is a business intelligence software that helps you create rich interactive dashboards with limitless visual analytics capabilities. It is a market leader in modern business intelligence that supports self-service analytics from data preparation and analysis through to sharing, along with governance and data management. With Tableau, you can securely consume your data via browser, desktop, or mobile, or embed it into any application. You can deploy in the cloud, on-premises, or natively integrated with Salesforce CRM. Tableau Prep, a self-service data preparation tool with a visual interface for data cleansing, automation, and administration, makes data preparation easier than ever.
Looker is an enterprise cloud-based platform for business intelligence, data applications, and embedded analytics. Looker, part of Google Cloud Platform, helps you explore, share, and visualize data. You can perform data analysis and visualization across multiple clouds - Google Cloud, AWS, and Azure - as well as on-premises databases. Looker provides tools to power myriad data experiences, from modern BI and embedded analytics to integrated workflows and custom applications. You can also augment business intelligence from Looker with the leading-edge machine learning, AI, and advanced analytics capabilities built into the Google Cloud Platform.
Fivetran is a cloud-based SaaS product that provides automated data integration. It is built on a fully managed ELT architecture that delivers zero-maintenance pipelines and ready-to-query schemas. Fivetran brings in several ready out-of-box connectors to enable quick development cycles.
Stitch is a cloud-first, open-source platform for rapidly moving data. A simple, powerful ETL service, Stitch connects to all your data sources – from databases like MySQL and MongoDB to SaaS applications like Salesforce and Zendesk – and replicates that data to a destination of your choosing, such as your data warehouse. Companies both large and small ingest billions of rows a day using Stitch, which also offers an enterprise edition.
Data Fusion is built on the open-source project CDAP, and this open core ensures data pipeline portability for users. CDAP's broad integration with on-premises and public cloud platforms allows Cloud Data Fusion users to break down silos and deliver insights that were previously inaccessible. Cloud Data Fusion offers pre-built transformations for both batch and real-time processing. Users can also create an internal library of custom connections and transformations that can be validated, shared, and reused across teams. It lays the foundation for collaborative data engineering and improves productivity, reducing waiting time for ETL developers and data engineers and, more importantly, leading to less concern about code quality.
Firebolt is yet another cloud data warehouse offering, with a difference: it promises query capabilities on top of data lakes with a high price-performance ratio and support for queries over semi-structured data.
Presto SQL, with its origins at Facebook, is a distributed query engine to address problems related to low-latency interactive analytics over Hadoop-based data stores.
ThoughtSpot is a search- and AI-driven analytics platform for enterprises. It allows you to search for and find insights in your company's data, so that all users can create, consume, and act on data-driven insights. You can build interactive data apps on a developer-friendly, low-code platform with flexible APIs. ThoughtSpot Everywhere enables you to build interactive data apps that integrate with your existing cloud ecosystem and deliver insights at massive scale.
Yellowbrick is based on PostgreSQL and natively supports stored procedures, reducing migration timelines and allowing your team to be productive from day one. It works out of the box with the most common industry tools that use ANSI SQL, such as Tableau, MicroStrategy, SAS, and Microsoft Power BI, as well as with the Python and R programming languages.
Matillion is a modern cloud-based ETL and integration platform for cloud data warehouses. It comprises connectors to all popular cloud data warehouses with a low-code approach.
Dataplex is an intelligent data fabric that provides unified analytics and data management across data lakes, data warehouses, and data marts. It provides a single pane of glass for end-to-end data management through metadata-led data management, centralized security and governance, and an integrated task-based analytics experience.
The Denodo platform is a data virtualization product that provides agile, high-performance data integration and data abstraction across the broadest range of enterprise, cloud, big data, and unstructured sources, delivered as real-time data services that support business transactions and analytics.
Delta Sharing is the industry's first open protocol for the secure sharing of data. You can easily share data with other organizations regardless of which computing platforms they use. It leverages Apache Parquet as the standard format for data exchange.
AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. You can choose from over 250 pre-built transformations to automate data preparation tasks, all without writing any code. You can automate filtering anomalies, converting data to standard formats, correcting invalid values, and other tasks.