
The 3Ds of Migrating Teradata Workloads to the Databricks Lakehouse Platform

Many large enterprises have used Teradata data warehouses for years, but the storage and processing costs of on-premises infrastructure severely restricted who could use the resource and for what purposes. On top of that, an upgrade is a lengthy process, and in the event of an outage Teradata needs to ship the hardware to the customer and install it in the customer’s data center.

Migrating your legacy Teradata data warehouse to the Databricks Lakehouse Platform can accelerate your data modernization journey. However, it may seem complex and daunting, requiring a well-orchestrated and planned effort. During the initial scoping phase, you may discover that your organization has accumulated hundreds of jobs and thousands of SQL scripts over the years.

However, fear not! Enterprise customers like Walgreens, Sam’s Club and many others have successfully migrated their legacy Teradata data warehouses to Databricks, enabling them to save millions in infrastructure costs while at the same time accelerating innovation and productivity. This blog post presents a systematic strategy to accelerate your journey toward adopting the lakehouse, in a framework encapsulated in an acronym, 3Ds: Discovery, Development, and Deployment.

If you are wondering how Databricks is different from Teradata, the summary matrix below illustrates how the Databricks Lakehouse Platform compares to a traditional data warehouse like Teradata:

Capabilities comparison

Capability | Databricks | On-Prem Teradata
Data format | Delta (open source) | Proprietary
Data types | Structured, Semi-structured, Unstructured | Structured, Semi-structured
Languages supported | SQL, Python, Scala, R | SQL only
Use cases | BI, SQL, Machine Learning/Data Science, Real-Time Streaming | BI, SQL
Reliability | High-quality, reliable data with ACID transactions | High-quality, reliable data with ACID transactions
Governance | Fine-grained access control for tables, rows, columns with Unity Catalog | Fine-grained access control for tables, rows, columns
Architectural paradigm | Lakehouse Platform | Traditional Data Warehouse
Licensing model | Consumption based | Annual subscription + Additional feature(s) + Support/Maintenance + Upgrades cost

Step 1: Discovery

Profile the Teradata Environment

The first step in the migration process is to comprehensively understand your Teradata environment to determine the overall scope and effort required for the initiative. Below are two key questions you’d want to answer about your Teradata deployment:

Question #1: What does my Teradata topology look like regarding hardware specifications? (e.g., storage utilization, system utilization, warehouse objects information, query types)

To answer this question, extracting and analyzing syslogs from your Teradata warehouse environment is a great place to start. To accelerate this process, the Databricks migrations team has developed a Teradata Profiler tool. The profiler uses Teradata’s system tables and a service called PDCR (Performance Data Collection and Reporting), a data utility that provides an understanding of system performance, workload utilization, and management. This migration assessment utility helps you automatically discover and profile the current Teradata warehouse environment. In addition, the tool also helps with DBU (Databricks Unit) forecasting. The screenshot below illustrates the key insights generated by the Teradata Profiler dashboard (for more information on running the profiler, please get in touch with your Databricks representative):

This is a sample Teradata profiler dashboard

Question #2: How many workloads do I need to migrate to Databricks? How easy or complex are my environment’s jobs and code patterns?

To answer this second question, you need to review the actual code, scripts, data warehouse objects, and jobs accumulated in your Teradata warehouse environment and create a summary inventory of these artifacts. To accelerate this assessment process, Databricks highly recommends using a code profiler or analyzer (e.g., BladeBridge, LeapLogic, DataMetica, etc.) or solution accelerators built by one of our BrickBuilder Migration Solution partners. These tools typically provide detailed reports of what’s inside your environment and break down data pipelines and jobs into various buckets based on complexity. This allows you to scope out the effort required for the migration initiative and any code refactoring that may be necessary during this process.

In addition to analyzing job complexity, these analyzer tools produce several useful summaries, including an itemized inventory of assets and artifacts in the Teradata environment; examples include:

  • SQL scripts
  • Dynamic SQL
  • Functions
  • Referenced Objects
  • Program-Object Cross Reference
  • Total Lines of Code
Sample summary output of the BladeBridge Code Analyzer
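
To give a feel for the kind of inventory such an analyzer produces, here is a minimal sketch that walks a folder of exported scripts and counts files, lines, and a few statement types. It is not how BladeBridge or any other commercial tool works internally; the function name `summarize_scripts`, the `.sql`/`.bteq` extensions, and the keyword patterns are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical keyword patterns; a real analyzer parses the SQL properly.
STATEMENT_PATTERNS = {
    "create_table": re.compile(
        r"^\s*CREATE\s+(MULTISET\s+|SET\s+)?(VOLATILE\s+)?TABLE\b", re.I | re.M
    ),
    "merge": re.compile(r"^\s*MERGE\s+INTO\b", re.I | re.M),
    "call_procedure": re.compile(r"^\s*CALL\s+\w", re.I | re.M),
    "bteq_command": re.compile(
        r"^\s*\.(IF|RUN|LOGON|QUIT|EXPORT|IMPORT)\b", re.I | re.M
    ),
}


def summarize_scripts(root, extensions=(".sql", ".bteq")):
    """Return file count, total lines, and statement counts under `root`."""
    counts = Counter()
    files = 0
    total_lines = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        files += 1
        total_lines += len(text.splitlines())
        for name, pattern in STATEMENT_PATTERNS.items():
            # Only statements that start a line are counted.
            counts[name] += len(pattern.findall(text))
    return {"files": files, "total_lines": total_lines, "statements": dict(counts)}
```

Calling `summarize_scripts("/path/to/exported/scripts")` returns a dictionary that can be dumped to a spreadsheet as a first-pass scoping aid. Complexity bucketing, cross references, and dynamic SQL detection are what the dedicated tools add on top.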

The analyzer gives you a good understanding of your Teradata warehouse environment by auto-analyzing the code and scripts, so you can perform a detailed migration assessment and effort estimation. You’re ready to embark on the next step in your migration journey!

Step 2: Development

Now that you have assessed your Teradata workloads in the discovery step, the next step is the actual migration of historical data and associated workloads to the Databricks Lakehouse Platform. This section will walk you through the development activities to achieve that.

2.1 Data Warehouse Extraction

To get started with data migration, the Teradata Parallel Transporter (TPT) is a client utility that provides scalable, high-speed, and parallel data extraction and loading. Using TPT, you can extract the data from all your tables in Teradata at scale and then push the data files into cloud object stores such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage. Using TPT to unload data offers several critical benefits, listed below:

  • Ability to define the field delimiter, date formats, and encoding type
  • Control over resource allocation for data unloading to achieve better performance
  • Ability to define the number of generated output files and their corresponding file type
  • Support for checkpointing and resume operations in case of failures and interruptions

Alternatively, you can use an ETL tool of your choice to perform the above operation as well. You can push these extracted files to cloud storage using a cloud-native CLI or managed services, or use any open source/third-party ingestion tools.

Once the extracted load-ready files in CSV or text formats have landed on the cloud storage, you can use Databricks Auto Loader for automatic incremental ingestion. It will handle the historical data ingestion.
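
The ingestion step can be sketched roughly as follows, assuming the landed files are delimited text with a header row and that the code runs in a Databricks notebook where a `spark` session is available. The function names, paths, and table name are hypothetical; `cloudFiles` is the Auto Loader source and `availableNow` makes the stream process all pending files and then stop, which suits a one-off historical load.

```python
def autoloader_options(file_format, schema_location, sep=",", header=True):
    """Build reader options for Auto Loader (the `cloudFiles` source)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
        "sep": sep,
        "header": str(header).lower(),
    }


def ingest_landed_files(spark, source_path, target_table, checkpoint_path, sep=","):
    """Incrementally load files landed in cloud storage into a Delta table.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    options = autoloader_options("csv", schema_location=checkpoint_path, sep=sep)
    stream = spark.readStream.format("cloudFiles").options(**options).load(source_path)
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers which files were loaded
        .trigger(availableNow=True)  # process everything pending, then stop
        .toTable(target_table)
    )
```

A call such as `ingest_landed_files(spark, "s3://my-bucket/landing/orders/", "bronze.orders", "s3://my-bucket/_checkpoints/orders")` (all names illustrative) can be rerun after each new extract; the checkpoint ensures already-ingested files are skipped.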

Diagram illustrating the transfer of data from Teradata to AWS S3

From an incremental load standpoint, you need to ensure that the ingestion process pushes the data to the cloud storage location for all the tables in scope. Usually, this is a CDC source, which is taken care of by ingestion tools like Fivetran (HVR), Airbyte, Debezium, Azure Data Factory, AWS DMS or others, depending on your choice of ingestion tooling and source system(s). Where existing logic uses MLoad, TPT or FastLoad scripts to perform incremental loads into Teradata today, that logic can be handled with the MERGE INTO functionality in Databricks.
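
As an illustration of the MERGE INTO approach, the sketch below renders an upsert statement from a list of key and non-key columns. The helper name `build_merge_sql` and the table and column names in the usage note are hypothetical; in a notebook the resulting string would typically be passed to `spark.sql`.

```python
def build_merge_sql(target, source, key_columns, update_columns):
    """Render a Delta Lake MERGE INTO statement that upserts `source` into `target`."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    if not update_columns:
        raise ValueError("at least one column to update is required")
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_columns)
    all_columns = list(key_columns) + list(update_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

For example, `build_merge_sql("silver.orders", "staging.orders_changes", ["order_id"], ["status", "amount"])` yields a statement that updates matching orders and inserts new ones. Delete handling for CDC feeds is left out of this sketch.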

2.2 Code conversion and pipeline development

When converting code from your Teradata warehouse environment to Databricks, the primary goal is to leverage automated methods as much as possible. Converting the Teradata logic and functionality using one of our migration tooling ISV partners or a BrickBuilder solution simplifies and accelerates the modernization effort to a large extent. As a best practice for migration, we recommend that you group related code belonging to an end-to-end data application or subject area together and trace it backwards from the reporting layer to the base tables.

Migrating code that has accumulated over the years may seem like an intimidating and daunting task. Let’s break it down into the five major categories listed below and explore each area in more detail to approach the code migration systematically:

  • Data Type Conversion
  • Table DDLs
  • Table DMLs
  • BTEQ Scripts
  • Stored Procedures

Teradata has its own dialect of the SQL language but conforms closely to the ANSI SQL that Databricks adheres to. Below are the indicative guidelines for code conversion between Teradata and Databricks:

1. Data Type conversion

The conversion of SQL data types from Teradata to Databricks is straightforward, due to the ANSI compliance of Databricks SQL. DDL statements and scripts in Teradata can be ported over to Databricks seamlessly, with most source data types being retained.

On certain occasions, type promotion will occur, which is the process of casting a type into another type of the same type family that contains all possible values of the original type. To illustrate with an example, TINYINT has a range from -128 to 127, and all its possible values can be safely promoted to INTEGER. For a full list of supported SQL data types in Databricks and their type precedence during the conversion process, kindly refer to the Databricks documentation and our release notes.

A graphical representation of the type precedence hierarchy
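
A minimal sketch of a type-mapping helper is shown below. The mapping table is an assumption covering a handful of common Teradata types, not an official or complete list, and `convert_type` is a hypothetical name; conversion tools apply far more detailed rules (character sets, intervals, periods, and so on).

```python
import re

# Assumed mapping from common Teradata types to Databricks SQL types.
TYPE_MAP = {
    "BYTEINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "DOUBLE",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "CHAR": "STRING",
    "VARCHAR": "STRING",
    "CLOB": "STRING",
    "BYTE": "BINARY",
    "VARBYTE": "BINARY",
    "BLOB": "BINARY",
}


def convert_type(teradata_type):
    """Map a Teradata column type such as 'VARCHAR(100)' to a Databricks type."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*(\(.*\))?\s*", teradata_type)
    if match is None:
        raise ValueError(f"unsupported type expression: {teradata_type!r}")
    base = match.group(1).upper()
    args = match.group(2) or ""
    if base in ("DECIMAL", "NUMERIC"):
        # Precision and scale carry over unchanged.
        return f"DECIMAL{args}"
    if base not in TYPE_MAP:
        raise ValueError(f"no mapping defined for {base}")
    # Length and precision arguments are dropped for the remaining types.
    return TYPE_MAP[base]
```

Types with extra clauses, such as `CHAR(10) CHARACTER SET LATIN`, are deliberately rejected by this sketch so that they surface for manual review rather than being converted silently.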

2. Table DDLs using Identity Columns

Identity Columns are now GA (Generally Available) in Databricks Runtime 10.4 and beyond. With identity columns, you can now enable all your data warehousing workloads to have all the benefits of a lakehouse architecture.

An example of a Teradata DDL statement converted to Databricks equivalent
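
The original figure is not reproduced here, so the following sketch shows one plausible shape of the converted DDL: a helper that renders a Delta `CREATE TABLE` statement with an identity column. The helper name and the example table are hypothetical. Note that Databricks identity columns are declared as BIGINT, so a Teradata INTEGER identity column is widened in the process.

```python
def identity_table_ddl(table, identity_column, other_columns, start=1, step=1):
    """Render a Delta CREATE TABLE statement with an identity column.

    `other_columns` is a list of (column_name, databricks_type) pairs.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    column_lines = [
        f"  {identity_column} BIGINT GENERATED ALWAYS AS IDENTITY "
        f"(START WITH {start} INCREMENT BY {step})"
    ]
    column_lines += [f"  {name} {dtype}" for name, dtype in other_columns]
    return f"CREATE TABLE {table} (\n" + ",\n".join(column_lines) + "\n) USING DELTA"
```

For instance, `identity_table_ddl("sales.customers", "customer_sk", [("customer_name", "STRING"), ("created_at", "TIMESTAMP")])` produces a statement that can be run with `spark.sql` or in the SQL editor.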

3. Table DMLs and Function Substitutions

Databricks SQL (DB SQL) supports many standard SQL functions; hence the most commonly used SQL functions in Teradata are also interoperable with DB SQL code without requiring refactoring. Any Teradata functions not supported natively in DB SQL can be handled using User-Defined Functions (UDFs). The Databricks documentation contains an alphabetically ordered list of built-in functions and operators.
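
Before reaching for a UDF, many Teradata-specific functions can be rewritten to standard built-ins. The sketch below illustrates the idea for two of them, `ZEROIFNULL` and `NULLIFZERO`, using regular expressions. This is a simplification chosen for brevity: the function name `substitute_functions` is hypothetical, and real conversion tools parse the SQL rather than pattern-matching it.

```python
import re

# Teradata-specific functions and their standard SQL equivalents.
# The patterns only match calls whose argument contains no parentheses.
FUNCTION_REWRITES = [
    (re.compile(r"\bZEROIFNULL\s*\(\s*([^()]+?)\s*\)", re.I), r"COALESCE(\1, 0)"),
    (re.compile(r"\bNULLIFZERO\s*\(\s*([^()]+?)\s*\)", re.I), r"NULLIF(\1, 0)"),
]


def substitute_functions(sql):
    """Rewrite simple ZEROIFNULL/NULLIFZERO calls to COALESCE/NULLIF."""
    for pattern, replacement in FUNCTION_REWRITES:
        sql = pattern.sub(replacement, sql)
    return sql
```

Nested calls such as `ZEROIFNULL(SUM(x))` are left untouched by this sketch, which is exactly the kind of case where a parser-based converter or a manual review is needed.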

4. BTEQ Scripts

If you have BTEQ scripts, you need to convert them into SQL-based logic wrapped in Python and import them into your Databricks workspace environment as notebooks. A quick summary of the five most common BTEQ commands and their converted equivalents in Databricks is shown below:

# | Teradata BTEQ Command | Databricks Equivalent
3 | RUN | dbutils.notebook.run
4 | IF THEN | Python if block
5 | IF, ELSEIF, ELSE, ENDIF | Python if…elif…else block
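
To make the IF THEN row concrete, the common BTEQ idiom `.IF ERRORCODE <> 0 THEN .QUIT n` after each statement can be expressed as an ordinary Python loop with exception handling. The sketch below is one possible shape, not the output of any particular conversion tool; `run_bteq_style_steps` is a hypothetical name, and `run_sql` stands for whatever executes a statement (for example `spark.sql` in a notebook).

```python
def run_bteq_style_steps(run_sql, statements):
    """Run SQL statements in order and stop at the first failure.

    Mirrors the BTEQ pattern of checking ERRORCODE after every statement.
    Returns 0 when all statements succeed, otherwise the 1-based position
    of the statement that failed.
    """
    for position, statement in enumerate(statements, start=1):
        try:
            run_sql(statement)
        except Exception as error:  # any failure ends the script, as .QUIT would
            print(f"step {position} failed: {error}")
            return position
    return 0
```

In a notebook the return code could then drive a Python if…elif…else block, or be turned into a raised exception so that the surrounding job run is marked as failed.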

5. Stored Procedures

Stored procedures in Teradata can be converted into Python User-Defined Functions (UDFs) in Databricks. The recommended approach is leveraging the automated code conversion tools mentioned above to accelerate this process. The following is a high-level summary of how most auto conversion tools handle Teradata Stored Procedures and their equivalent functionality in Databricks.

  • CREATE Stored Procedure statements from the input code are converted to Databricks notebooks using Python and SQL in the output
  • Each Stored Procedure maps to an equivalent Databricks notebook.
  • CALL Stored Procedure statements are converted to equivalent dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization
An example of a Teradata Stored Procedure converted to Databricks

The tabular matrix below summarizes specific Stored Procedure functionality in Teradata and how to migrate its features into Databricks:

# | Teradata Stored Procedure Construct | Migration Process / Equivalent Component in Databricks
1 | SQL Statements | Stored Procedures contain SQL statements that undergo the same conversion rules to Databricks as described in this blog’s SQL conversion section
2 | Parameters | Parameters are converted to output Python notebook parameters through Databricks’ widgets functionality. Data type conversion from Teradata SQL types to Python types is taken care of in the conversion process
3 | Variable declarations | Converted to Python variables with appropriate data type conversions
4 | IF THEN | Converted to Python if block
5 | IF, ELSEIF, ELSE, and ENDIF | Converted to Python if…elif…else block
6 | CASE | Converted to Python if…elif…else block
7 | CALL statements | Stored Procedure CALL statements are converted to dbutils.notebook.run calls with appropriate parameter serialization and return value deserialization
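
The CALL row can be sketched as a small wrapper around `dbutils.notebook.run`. Notebook arguments and return values are strings, so this sketch serializes both with JSON; that convention, the helper name `call_procedure_notebook`, and the notebook path in the usage note are assumptions for illustration rather than what any specific conversion tool emits.

```python
import json


def call_procedure_notebook(dbutils, notebook_path, params, timeout_seconds=3600):
    """Call a notebook that replaces a stored procedure and decode its result.

    `dbutils` is the utility object available in Databricks notebooks.
    """
    # Notebook arguments are passed as strings, so each value is JSON-encoded.
    arguments = {name: json.dumps(value) for name, value in params.items()}
    raw_result = dbutils.notebook.run(notebook_path, timeout_seconds, arguments)
    # The callee is expected to end with dbutils.notebook.exit(json.dumps(result)).
    return json.loads(raw_result) if raw_result else None
```

On the receiving side, the called notebook would read each parameter with `json.loads(dbutils.widgets.get("region"))` and finish with `dbutils.notebook.exit(json.dumps(result))`. A call might look like `call_procedure_notebook(dbutils, "/Repos/etl/sp_update_orders", {"region": "EU", "batch_size": 500})`.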

You can also share context between different tasks using task values, if you want to break a large piece of code up logically and leverage Databricks Workflows effectively.

2.3 Data modeling

Apart from the code, if you are worried about migrating your custom data model on Teradata, Databricks supports all data modeling paradigms. You can use your model as-is on the Lakehouse. Data modelers and architects can quickly re-engineer or reconstruct databases and their underlying tables or views on Databricks. You can leverage tools like erwin Data Modeler with the Databricks Lakehouse Platform to serve these needs and reverse engineer from the existing model to fast-track migration to Databricks. We recommend that you follow our blogs for data modeling best practices.

Step 3: Deployment

Now that you have converted your core Teradata logic into its Databricks equivalent, you are ready for deployment. There are several best practices of which you should be aware.

3.1 Workspace setup

When designing your workspace, there are various options to set up logical boundaries based on your existing data model, governance model and enterprise architectural design decisions:

  1. A three-workspace approach based on environments, e.g., dev, stage and prod. In addition, we expect it to align with the logical separation of the corresponding schemas for the data.
  2. Isolation by line of business (LOB), where each LOB has its own dev, stage and prod workspaces. You could also have sub-LOBs within each LOB and, within those, different value streams or projects to isolate ownership. This could be aligned with the cloud account setup as well. This strategy works well for enterprises bound by privacy and regulatory requirements.
  3. Separate workspaces for each team that owns the data. This, in turn, allows each team to control the data it produces and helps ensure data ownership is clear. It works well for enterprises that want to implement a data mesh architecture.

For more information, we recommend you follow these best practices on workspace setup.

Once the workspaces are set up, you can test your migrated workloads and deploy them into production. For CI/CD, you can use Databricks Repos and the best practices around it.

We typically help you perform a Total Cost of Ownership (TCO) analysis and consumption planning for the required Databricks Units (DBUs) to support these workloads from a budgeting standpoint. Please get in touch with your Databricks representative for this exercise.

3.2 Orchestration using Workflows

Traditionally, Teradata workloads are orchestrated using schedulers like Control-M, Autosys or similar tools with Unix-based wrapper scripts. Enterprises also embed the ELT logic as part of the enterprise ETL components, which push down the SQL logic on execution.

With Databricks, you can use Workflows out of the box and orchestrate anything anywhere. Workflows are free of cost, and you can orchestrate any combination of notebooks, SQL, Spark, ML models, etc., as a Jobs workflow, including calls to other systems. These Workflows can be scheduled using the Databricks scheduler.

As part of the migration activity, you can modernize your Teradata workloads to Databricks and, as an option, eliminate licensed scheduling tools entirely to adopt the modern data stack. For example, if you have converted your BTEQ scripts into notebooks, you can now orchestrate them as Tasks using Workflows with the required dependencies for deploying the end-to-end pipeline.
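
As a rough sketch of what such a pipeline definition might look like, the helpers below build a job specification with notebook tasks and dependencies in the shape used by the Databricks Jobs API (`tasks`, `task_key`, `depends_on`, `notebook_task`, `schedule`). The helper names, job name, and notebook paths are hypothetical, and compute settings are deliberately omitted, so treat this as a starting point to check against the Jobs API reference rather than a complete payload.

```python
import json


def notebook_task(task_key, notebook_path, depends_on=()):
    """Describe one notebook task and the tasks it must wait for."""
    task = {"task_key": task_key, "notebook_task": {"notebook_path": notebook_path}}
    if depends_on:
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task


def build_job(name, tasks, cron=None, timezone="UTC"):
    """Assemble a job specification, optionally with a cron schedule."""
    job = {"name": name, "tasks": list(tasks)}
    if cron:
        job["schedule"] = {"quartz_cron_expression": cron, "timezone_id": timezone}
    return job


# Hypothetical pipeline: three converted BTEQ scripts run in sequence.
job_spec = build_job(
    "orders_daily_load",
    [
        notebook_task("stage_orders", "/Repos/etl/stage_orders"),
        notebook_task("load_orders", "/Repos/etl/load_orders", ["stage_orders"]),
        notebook_task("refresh_marts", "/Repos/etl/refresh_marts", ["load_orders"]),
    ],
    cron="0 0 2 * * ?",  # every day at 02:00
)
print(json.dumps(job_spec, indent=2))
```

The `depends_on` entries play the role that job predecessors played in the legacy scheduler.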

3.3 Data validation and user acceptance testing

To deploy workloads successfully into production, you will need to plan for data validation by the end users and business analyst teams. The business analyst teams take row counts and summaries of key attributes or metrics of the tables in scope and compare them by running their SQL models on both Teradata and Databricks. We recommend that you keep an overlapping window between the two systems for side-by-side validation. Once the teams sign off on the completion of User Acceptance Testing (UAT), you can plan a cutover for all the related workloads. Usually, these capabilities are a subset of what the BrickBuilder solutions or migration tooling ISV partners offer and can be easily automated for ease of use and an accelerated journey.
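
A simple way to picture the comparison step: collect the same metrics (row counts, sums of key measures) from both systems and diff them. The sketch below assumes those metrics have already been gathered into dictionaries; the function name and the sample tables are hypothetical.

```python
def compare_metrics(teradata, databricks, tolerance=0.0):
    """Compare per-table metrics gathered from both systems.

    Each argument maps table name -> {metric name -> numeric value}.
    Returns a list of (table, metric, teradata_value, databricks_value)
    tuples for every metric that is missing on one side or differs by
    more than `tolerance`.
    """
    mismatches = []
    for table in sorted(set(teradata) | set(databricks)):
        left = teradata.get(table, {})
        right = databricks.get(table, {})
        for metric in sorted(set(left) | set(right)):
            a, b = left.get(metric), right.get(metric)
            if a is None or b is None or abs(a - b) > tolerance:
                mismatches.append((table, metric, a, b))
    return mismatches


td = {"orders": {"row_count": 1000, "sum_amount": 52500.5}, "customers": {"row_count": 200}}
dbx = {"orders": {"row_count": 1000, "sum_amount": 52500.5}, "customers": {"row_count": 198}}
for row in compare_metrics(td, dbx):
    print(row)
```

An empty result for every table in scope is a reasonable precondition for UAT sign-off; a small non-zero `tolerance` can absorb floating-point differences in aggregated measures.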

Another critical factor during user acceptance testing is meeting the performance SLAs. You will get best-in-class performance by migrating to Databricks with a much lower TCO, as it uses the Photon engine, which provides high-speed query performance at a low cost for all types of workloads directly on top of the lakehouse.

For more details, please visit this blog post on the data warehousing performance record.
To ensure you get the best performance, we recommend you follow the Delta optimization best practices.

3.4 Data governance strategy

Databricks brings fine-grained governance and security to lakehouse data with Unity Catalog. Unity Catalog allows organizations to manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to unlock their lakehouse for consumption safely. It works uniformly across clouds and data types.

Unity Catalog moves beyond managing tables to other data assets, such as machine learning models and files. As a result, enterprises can simplify how they govern all their data and AI assets. It is a critical architectural tenet for enterprises and one of the key reasons customers migrate to Databricks instead of using a traditional data warehousing platform.

In this case, you can easily migrate the Teradata-based access controls over to Databricks using Unity Catalog.
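
One way to approach this is to export the existing table-level rights from Teradata and translate them into Unity Catalog GRANT statements. The sketch below assumes the rights have been exported as (grantee, database, table, access right code) tuples and that the single-letter codes R, I, U and D stand for retrieve, insert, update and delete; both the input shape and the code-to-privilege mapping are assumptions to verify against your own environment.

```python
# Assumed mapping from Teradata access-right codes to Unity Catalog privileges.
RIGHT_TO_PRIVILEGE = {"R": "SELECT", "I": "MODIFY", "U": "MODIFY", "D": "MODIFY"}


def grants_from_rights(rights, catalog):
    """Translate exported table rights into Unity Catalog GRANT statements.

    `rights` is an iterable of (grantee, database, table, access_right_code).
    Codes without a mapping are skipped and should be reviewed by hand.
    """
    statements = set()  # a set removes duplicates, e.g. I/U/D all mapping to MODIFY
    for grantee, database, table, code in rights:
        privilege = RIGHT_TO_PRIVILEGE.get(code)
        if privilege is None:
            continue
        statements.add(
            f"GRANT {privilege} ON TABLE {catalog}.{database}.{table} TO `{grantee}`"
        )
    return sorted(statements)
```

Keep in mind that in Unity Catalog a principal also needs USE CATALOG and USE SCHEMA on the parent objects before table-level grants take effect, and that Teradata roles generally map better to account-level groups than to individual users.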

3.5 Repointing BI workloads

One of the critical requirements of a successful Teradata to Databricks migration is ensuring business continuity, enabling adoption and alleviating any downstream impacts. Databricks has validated integrations with your favorite BI tools, including Databricks Dashboards, Power BI, Tableau, Redash, Preset, AWS QuickSight, Looker and others, allowing you to work with data through Databricks SQL warehouses. The general norm for a given set of reports for a given KPI is to ensure all the upstream tables and views are migrated, along with their associated workloads and dependencies.

Assuming the metadata is migrated to Unity Catalog, we could take the following approach for seamless repointing of reports, as applicable. Let us assume that the new tables or views under UAT have the suffix _delta. Once UAT is completed for the tables/views in scope, and the associated pipelines are migrated, you should rename the existing Teradata tables/views with a suffix (e.g., _td) and rename the new tables/views (e.g., _delta) to the current table or view names. This approach ensures that end users do not have to refactor the table or view names within the SQL models or reports, and existing reports can be migrated using your automated solution with minimal syntax changes. Note: You could also follow an approach with separate database/schema names maintained for the lakehouse, as dictated by your data strategy team.
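
The renaming step can be scripted. The sketch below generates the two sets of rename statements for a list of tables, one set for each system. The helper name is hypothetical, views would need their own handling, and the statements should be reviewed before being run during the cutover window.

```python
def rename_plan(tables, schema, new_suffix="_delta", legacy_suffix="_td"):
    """Generate rename statements for cutover.

    Legacy Teradata tables get `legacy_suffix` appended, and the new
    lakehouse tables drop `new_suffix` so they take over the original names.
    """
    teradata = [
        f"RENAME TABLE {schema}.{name} TO {schema}.{name}{legacy_suffix};"
        for name in tables
    ]
    databricks = [
        f"ALTER TABLE {schema}.{name}{new_suffix} RENAME TO {schema}.{name};"
        for name in tables
    ]
    return {"teradata": teradata, "databricks": databricks}


plan = rename_plan(["orders", "customers"], schema="sales")
for system, statements in plan.items():
    print(system)
    for statement in statements:
        print("  " + statement)
```

Generating the plan from the same table list used for UAT sign-off helps ensure that only validated tables are swapped.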

Once you have migrated the first set of KPI dashboards or reports, you can iterate through the remainder of the reporting layer and its migration.


A seamless migration is an important step to ensure the success of your business outcomes. In the above blog sections, we walked you through the important aspects of completing your migration journey.

Next steps

Many enterprises today are running a hybrid architecture, with data warehouses for business analytics and data lakes for machine learning. But with the advent of the data lakehouse, you can now unify both on a single modern platform. The Databricks Lakehouse Platform overcomes traditional MPP data warehouse limitations because it is designed to manage all types of data (structured, semi-structured, and unstructured) and supports traditional BI workloads and Machine Learning / AI natively. It adds all this functionality to your data lake, creating a unified, single, and multicloud platform.

Migrating your Teradata environment to Databricks delivers significant business benefits, including:

  • Reduction of operational costs,
  • Increased productivity of your data teams,
  • Unlocking advanced analytics use cases while retaining full data warehouse capabilities.

Please see this EDW Migration page for more information regarding migrating to Databricks, and schedule a customized Teradata Migration assessment by reaching out to your Databricks representative.

Try Databricks free for 14 days.
