Architecture
This section describes the main components of Thales Data Discovery and Classification (DDC) and Data Discovery and Classification Machine Learning (DDC ML) and how they interact to provide the DDC solution. Before you start the deployment, review the diagrams in this section to see what a typical DDC and DDC ML deployment looks like. The concepts used in these diagrams are introduced in the later sections of this document and explained at length in the Data Discovery and Classification Administration Guide.
DDC architecture

DDC ML architecture

At the heart of the DDC solution is CipherTrust Manager, which runs the DDC Server. From here, users interact with the DDC GUI or use the DDC APIs to create classification profiles, add data stores, launch scans, and generate reports.
To understand the architecture components in depth, see the Overview section in the Data Discovery and Classification Administration Guide.
Where to install the DDC agents
DDC supports a number of different data stores. To access these data stores, the DDC Server communicates with one or more DDC agents. The DDC agent is a software component that scans a data store for the Infotypes (such as credit card numbers and email addresses) that are part of a classification profile. The agent sends all collected data to the DDC Server, which stores it, together with any user-requested reports, on an external Hadoop cluster.
Generally speaking, if you are scanning data stores that are local to a Windows or Linux server (no network shares), install the DDC agent on the server where the data is located. For all other types of storage (top part of the figure), install the DDC agent on a proxy server.
Note
A Windows Proxy is needed to connect to databases.
As an example, let’s assume that you wish to scan an NFS share. In this case, the NFS share should be mounted on the proxy server and the DDC agent should be installed on the proxy server. To scan the share, specify the mount point of the NFS share when creating the scan. For DDC agent requirements and the types of data stores supported, see DDC agents. For information on securing the deployment, refer to Hardening guidelines.
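The placement rules above can be summarized as a small decision function. This is only an illustrative sketch: the category labels and the `agent_placement` helper are hypothetical names introduced here, not DDC identifiers or part of any DDC API.

```python
from dataclasses import dataclass

# Hypothetical category labels for illustration only (not DDC identifiers).
LOCAL_STORAGE = "local_storage"  # files stored on the Windows or Linux host itself
NFS_SHARE = "nfs_share"          # an NFS share, as in the example above
DATABASE = "database"
OTHER = "other"                  # any other type of storage


@dataclass(frozen=True)
class Placement:
    host: str  # "data_host" or "proxy"
    note: str  # short reminder taken from the guidance in this section


def agent_placement(store_kind: str) -> Placement:
    """Return where the DDC agent should be installed for a kind of data store."""
    if store_kind == LOCAL_STORAGE:
        return Placement("data_host", "install the agent on the server where the data is located")
    if store_kind == DATABASE:
        return Placement("proxy", "a Windows proxy is needed to connect to databases")
    if store_kind == NFS_SHARE:
        return Placement("proxy", "mount the share on the proxy and scan its mount point")
    # All other types of storage are scanned through a proxy server.
    return Placement("proxy", "install the agent on a proxy server")


if __name__ == "__main__":
    for kind in (LOCAL_STORAGE, NFS_SHARE, DATABASE, OTHER):
        placement = agent_placement(kind)
        print(f"{kind}: {placement.host} ({placement.note})")
```

For the exact list of supported data stores and agent platforms, the DDC agents section remains the authoritative reference.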
How DDC uses Thales Data Platform
Thales provides an on-premises platform, Thales Data Platform (TDP (On-prem)), and a cloud-based platform, Thales Data Platform as a Service (TDPaaS), for storing big data. The user can switch between TDP (On-prem) and TDPaaS as many times as needed until the first successful scan (see the note that follows). When switching from TDP (On-prem) to TDPaaS or from TDPaaS to TDP (On-prem), the configuration settings for the previous platform type are lost. Switching platforms does not impact the remaining DDC features.
Note
After a successful DDC scan is executed, users cannot switch from one platform to another.
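The switching rules can be modeled as a small state object. This is a minimal sketch of the behavior described in this section, not DDC code: the `ReportStorage` class, its methods, and the example settings keys are all hypothetical.

```python
class PlatformSwitchError(Exception):
    """Raised when a platform switch is attempted after a successful scan."""


class ReportStorage:
    """Hypothetical model of the platform selection rules described above."""

    PLATFORMS = ("TDP (On-prem)", "TDPaaS")

    def __init__(self, platform: str, settings: dict):
        if platform not in self.PLATFORMS:
            raise ValueError(f"unknown platform: {platform}")
        self.platform = platform
        self.settings = dict(settings)
        self.scan_succeeded = False

    def record_successful_scan(self) -> None:
        # After the first successful scan, the platform choice is locked.
        self.scan_succeeded = True

    def switch(self, platform: str, settings: dict) -> None:
        if self.scan_succeeded:
            raise PlatformSwitchError("cannot switch platforms after a successful scan")
        if platform not in self.PLATFORMS:
            raise ValueError(f"unknown platform: {platform}")
        # Settings of the previous platform type are discarded, not merged.
        self.platform = platform
        self.settings = dict(settings)


if __name__ == "__main__":
    # The settings keys and values below are made-up placeholders.
    storage = ReportStorage("TDP (On-prem)", {"knox_host": "knox.example.com"})
    storage.switch("TDPaaS", {"tenant": "example"})
    print(storage.platform, storage.settings)
```

The point of the sketch is the order of operations: settle on a platform before running the first scan, because the earlier platform's settings do not survive a switch and no switch is possible afterwards.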
Thales Data Platform (On-Prem)
DDC uses Hadoop to generate reports from scans and to store their results (report data). TDP (On-prem) is the only Hadoop flavor available for this purpose. This cluster is distinct from any Hadoop cluster that DDC scans as a data store, which is where the user's own data resides.
DDC uses Spark and Livy to process the data and stores the results in HDFS. Tez is required in order to use Spark.
The DDC Server retrieves the scan results from the DDC agent and stores them in on-prem TDP together with any generated reports. Your TDP cluster must be highly available to avoid losing data store scans or reports.
DDC also requires Apache Knox as a single point of access to the TDP cluster (both Livy and HDFS). Knox ensures that all communications are protected with TLS and handles authentication. Therefore, you only need to connect DDC to Knox. For information about configuring DDC to use on-prem TDP, see Configuring TDP (On-prem).
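To illustrate what "single point of access" means in practice, the sketch below builds the URL a client would use to reach HDFS (through WebHDFS) via a Knox gateway rather than contacting the cluster directly. It follows the general Apache Knox URL layout (`https://host:port/gateway-path/topology/service`). The host name, the port `8443`, the `gateway` path, the `default` topology, and the HDFS path are assumptions for illustration; your Knox deployment may use different values, and this is not how DDC itself is configured.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(
    gateway_host: str,
    hdfs_path: str,
    op: str,
    port: int = 8443,               # assumed gateway port
    gateway_path: str = "gateway",  # assumed gateway path
    topology: str = "default",      # assumed topology (cluster) name
) -> str:
    """Build a WebHDFS URL that is routed through a Knox gateway over TLS."""
    encoded_path = quote(hdfs_path.lstrip("/"))  # keeps "/" separators intact
    query = urlencode({"op": op})
    return (
        f"https://{gateway_host}:{port}/{gateway_path}/{topology}"
        f"/webhdfs/v1/{encoded_path}?{query}"
    )


if __name__ == "__main__":
    # Hypothetical host and HDFS directory, used only to show the URL shape.
    print(knox_webhdfs_url("knox.example.com", "/ddc/reports", "LISTSTATUS"))
```

Every request targets the gateway host, so the individual cluster services do not need to be reachable from the client.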
Thales Data Platform as a Service
DDC uses the cloud-based Thales Data Platform as a Service (TDPaaS) platform, also referred to as Data Management Service, to store scan and report data in the cloud. It is a SaaS component that provides an alternative to the Hadoop services offered by on-prem TDP. TDPaaS is serverless and does not require any manual administration or management of the services.
For information about configuring DDC to use TDPaaS, see Configuring TDPaaS.