DataGrip 2020.1 Help

Configure Big Data Tools environment

Before you start working with Big Data Tools, you need to install the required plugins and configure connections to servers.

Install the required plugins

  1. Everything you do in DataGrip happens within a project, so open an existing project (File | Open) or create a new one (File | New | Project).

  2. In the Settings/Preferences dialog (Ctrl+Alt+S), select Plugins | Marketplace.

  3. Install the Big Data Tools plugin.

  4. Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

    The view of DataGrip after the Big Data Tools plugin is installed

Once Big Data Tools support is enabled in the IDE, you can configure connections to Spark, Google Storage, and S3 servers. You can also connect to HDFS, WebHDFS, AWS S3, and a local drive using configuration files and URIs.

Configure a server connection

  1. In the Big Data Tools window, click Add a connection and select the server type. The Big Data Tools Connection dialog opens.

  2. In the Big Data Tools Connection dialog, specify the following parameters depending on the server type:

    • File Systems: FS | Local, FS | HDFS, SFTP

    • Storages: AWS S3, Minio, Linode, DigitalOcean Spaces, GS, Azure

    • Monitoring: Spark, Hadoop

    Local FS

    Mandatory parameters:

    • Root path: a path to the root directory.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    HDFS connection

    Mandatory parameters:

    • Root path: a path to the root directory on the target server.

    • Config path: a path to the directory with the HDFS configuration files. See the samples of configuration files.

    • File system URI: an explicit URI of an HDFS server. If you select this option, specify the file system URI, for example, localhost:9000, and a username to connect with.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    • Enable tunneling: creates an SSH tunnel to the remote host. This can be useful if the target server is in a private network but an SSH connection to a host in that network is available.

      Select the checkbox and specify a configuration of an SSH connection (press ... to create a new SSH configuration), any free port on a local host, address of the target remote host, and the port of the target application.
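
The tunneling parameters listed above (an SSH configuration, a free local port, the target host, and the target port) correspond to what OpenSSH calls local port forwarding. The following Python sketch is only an illustration of that idea, not the plugin's implementation: the function names and the example host names are hypothetical. It asks the operating system for a free local port and builds the equivalent ssh -L command line.

```python
import socket
from typing import List


def find_free_local_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def build_tunnel_command(ssh_host: str, ssh_user: str, local_port: int,
                         target_host: str, target_port: int) -> List[str]:
    """Build an OpenSSH local port-forwarding command.

    -N: do not run a remote command, only forward ports.
    -L: forward local_port to target_host:target_port via the SSH host.
    """
    forward = f"{local_port}:{target_host}:{target_port}"
    return ["ssh", "-N", "-L", forward, f"{ssh_user}@{ssh_host}"]


# Hypothetical example values; replace them with your own hosts and ports.
command = build_tunnel_command(
    ssh_host="gateway.example.com",
    ssh_user="alice",
    local_port=find_free_local_port(),
    target_host="namenode.internal",
    target_port=9000,
)
print(" ".join(command))
```

With such a tunnel running, a client on the local machine would connect to localhost and the chosen local port instead of the private target address.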

    Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.
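
The fallback order described in the note can be sketched in Python as follows. This is an approximation under stated assumptions: the function name is hypothetical, an empty HADOOP_USER_NAME is treated as undefined, and getpass.getuser() stands in for the Java user.name property, which normally holds the operating system login name.

```python
import getpass
import os
from typing import Callable, Mapping, Optional


def resolve_hadoop_user(environ: Optional[Mapping[str, str]] = None,
                        fallback: Callable[[], str] = getpass.getuser) -> str:
    """Return the user name a Hadoop client would be expected to log in with."""
    env = os.environ if environ is None else environ
    user = env.get("HADOOP_USER_NAME")
    if user:  # assumption: an empty value counts as "not defined"
        return user
    # Stand-in for the JVM's user.name system property.
    return fallback()
```

To connect as a different user, you would set HADOOP_USER_NAME in the environment from which the IDE is started.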

    See more examples of the Hadoop File System configuration files.

    SFTP connection

    Mandatory parameters:

    • SSH Config: Select any of the available SSH configurations or click ... and create a new SSH configuration.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    Configure S3 connection

    Mandatory parameters:

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Region: the AWS region of the specified bucket. You can select one from the list or let DataGrip detect it automatically.

    • Root path: a path to the root directory in the specified bucket.

    • Authentication type: the authentication method. You can use your AWS account credentials (the default) or opt to enter the access and secret keys.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    • Use custom endpoint: select if you want to specify a custom endpoint and a signing region.
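
If you are unsure which access and secret keys your AWS account credentials resolve to, you can inspect the shared AWS credentials file (commonly ~/.aws/credentials). The sketch below reads one profile from text in that file format. It assumes, without confirmation from this page, that the default authentication type relies on the standard AWS credentials file; the function name is hypothetical and the key values are placeholders.

```python
import configparser
from typing import Dict


def read_aws_profile(credentials_text: str, profile: str = "default") -> Dict[str, str]:
    """Extract the access and secret keys of one profile from AWS credentials text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    section = parser[profile]  # raises KeyError if the profile is missing
    return {
        "access_key": section["aws_access_key_id"],
        "secret_key": section["aws_secret_access_key"],
    }


# Placeholder content in the shared credentials file format.
sample_credentials = (
    "[default]\n"
    "aws_access_key_id = sample_access_key\n"
    "aws_secret_access_key = sample_secret_key\n"
)
print(read_aws_profile(sample_credentials)["access_key"])
```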

    Configure Minio connection

    Mandatory parameters:

    • Endpoint: specify an endpoint to connect to.

    • Bucket: a globally unique Minio bucket name.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Root path: a path to the root directory in the specified bucket.

    • Access credentials: Access Key and Secret Key.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    Configure Linode connection

    Mandatory parameters:

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Region: the region of the specified bucket. You can select one from the list or let DataGrip detect it automatically.

    • Root path: a path to the root directory in the specified bucket.

    • Access credentials: Access Key and Secret Key.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    Configure DigitalOcean Spaces connection

    Mandatory parameters:

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Region: the DigitalOcean region of the specified bucket. You can select one from the list or let DataGrip detect it automatically.

    • Root path: a path to the root directory in the specified bucket.

    • Access credentials: Access Key and Secret Key.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    Connection settings for Google Storage

    Mandatory parameters:

    • Bucket: a name of the basic container to store your data in Google Storage.

    • Cloud store JSON location: a path to the Cloud Storage JSON file.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Base directory (root by default): storage base directory.
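
Before pointing the connection at a JSON file, it can help to check that the file looks like a usable key. The sketch below assumes that the Cloud Storage JSON file is a Google Cloud service account key, which this page does not state explicitly. Such keys contain, among others, the fields type, project_id, private_key, and client_email. The function name is hypothetical.

```python
import json
from typing import List

# Fields that a Google Cloud service account key file is expected to contain.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(json_text: str) -> List[str]:
    """Return the expected fields that are absent from the key file content."""
    data = json.loads(json_text)
    return [field for field in EXPECTED_FIELDS if field not in data]
```

An empty result suggests the file has the expected shape; it does not prove that the key is valid or has access to the bucket.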

    Connection settings for Azure Storage

    Mandatory parameters:

    • Endpoint: specify an endpoint to connect to.

    • Container: a name of the basic container to store your data in Microsoft Azure.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Root path: a path to the root directory in the specified container.

    • Authentication type: the authentication method. You can access the storage by username and key, by a connection string, or using a SAS token.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.
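
When you authenticate with a connection string, the account name, key, and endpoint suffix are all packed into a single semicolon-separated value. The following sketch splits an Azure Storage connection string into its parts, for example to verify which account it refers to. The function name is hypothetical and the account values are placeholders.

```python
from typing import Dict


def parse_connection_string(connection_string: str) -> Dict[str, str]:
    """Split a 'Key=Value;Key=Value' connection string into a dictionary."""
    parts: Dict[str, str] = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only: Base64 account keys may end with '='.
        key, _, value = segment.partition("=")
        parts[key.strip()] = value.strip()
    return parts


# Placeholder values, not a real account.
sample = ("DefaultEndpointsProtocol=https;AccountName=myaccount;"
          "AccountKey=c2FtcGxl==;EndpointSuffix=core.windows.net")
print(parse_connection_string(sample)["AccountName"])
```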

    Configure Spark connection

    Mandatory parameters:

    • URL: the path to the target server.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Enable tunneling: creates an SSH tunnel to the remote host. This can be useful if the target server is in a private network but an SSH connection to a host in that network is available.

      Select the checkbox and specify a configuration of an SSH connection (press ... to create a new SSH configuration), any free port on a local host, address of the target remote host, and the port of the target application.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    • Enable HTTP basic authentication: connect using HTTP basic authentication with the specified username and password.

    • Enable HTTP proxy: connect through an HTTP proxy with the specified host, port, username, and password.

    • HTTP Proxy: connect through an HTTP or SOCKS proxy with authentication. Select whether to use the IDEA HTTP Proxy settings or custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.
      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the DataGrip log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings apply to all your Spark connections.
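
To check the URL and the HTTP basic authentication credentials outside the IDE, you can prepare a request to the Spark monitoring REST API, which lists applications under /api/v1/applications. The sketch below only builds the request and does not send it. The function name, the example URL, and the credentials are assumptions for illustration; the port depends on whether you target a running application UI or a History Server.

```python
import base64
import urllib.request


def build_spark_request(base_url: str, username: str,
                        password: str) -> urllib.request.Request:
    """Prepare a GET request for the Spark REST API with a Basic auth header."""
    url = base_url.rstrip("/") + "/api/v1/applications"
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical server address and credentials.
request = build_spark_request("http://localhost:18080/", "user", "pass")
print(request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it; a JSON list in the response indicates that the URL and credentials are accepted.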

    Configure Hadoop connection

    Mandatory parameters:

    • URL: the path to the target server.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Enable tunneling: creates an SSH tunnel to the remote host. This can be useful if the target server is in a private network but an SSH connection to a host in that network is available.

      Select the checkbox and specify a configuration of an SSH connection (press ... to create a new SSH configuration), any free port on a local host, address of the target remote host, and the port of the target application.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. Newly created connections are enabled by default.

    • Enable HTTP basic authentication: connect using HTTP basic authentication with the specified username and password.

    • Enable HTTP proxy: connect through an HTTP proxy with the specified host, port, username, and password.

    • HTTP Proxy: connect through an HTTP or SOCKS proxy with authentication. Select whether to use the IDEA HTTP Proxy settings or custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.
      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the DataGrip log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings apply to all your Spark connections.

    • You can also reuse any of the existing Spark connections. Just select it from the Spark Monitoring list.
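
The Kerberos options listed above resemble the standard JVM system properties for Kerberos and JGSS. As an illustration only, the sketch below builds the corresponding -D options for a Java process. That the dialog fields map one-to-one onto these properties is an assumption, not something this page states; the function name and file paths are hypothetical.

```python
from typing import List


def kerberos_jvm_options(krb5_conf: str, jaas_conf: str,
                         use_subject_creds_only: bool = True,
                         debug: bool = False) -> List[str]:
    """Build JVM -D options for the standard Kerberos and JGSS properties."""
    options = [
        f"-Djava.security.krb5.conf={krb5_conf}",          # Krb5 config file
        f"-Djava.security.auth.login.config={jaas_conf}",  # JAAS login config file
        "-Djavax.security.auth.useSubjectCredsOnly="
        + str(use_subject_creds_only).lower(),
    ]
    if debug:
        # Verbose Kerberos and JGSS logging.
        options.append("-Dsun.security.krb5.debug=true")
        options.append("-Dsun.security.jgss.debug=true")
    return options


# Hypothetical file locations.
for option in kerberos_jvm_options("/etc/krb5.conf", "/etc/jaas.conf", debug=True):
    print(option)
```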

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.
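
If Test connection fails, a quick way to rule out basic network problems is to check whether the host and port accept TCP connections at all. The sketch below does only that and is not what the plugin itself runs; the function name and the example address are hypothetical.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, is_reachable("localhost", 9000) checks the address used in the HDFS file system URI example. A successful check confirms only that the port is open, not that the credentials or paths are correct.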

Samples of Hadoop File System configuration files

Type / Sample configuration

HDFS

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://example.com:9000/</value>
      </property>
    </configuration>

S3

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>sample_access_key</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>sample_secret_key</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>s3a://example.com/</value>
      </property>
    </configuration>

WebHDFS

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070/</value>
      </property>
    </configuration>

WebHDFS and Kerberos

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070</value>
      </property>
      <property>
        <name>hadoop.security.authentication</name>
        <value>Kerberos</value>
      </property>
      <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>testuser@EXAMPLE.COM</value>
      </property>
      <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
      </property>
    </configuration>

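
To confirm which file system a configuration file points to before using it as the Config path, you can read its properties with a few lines of Python. The sketch below parses configuration text in the format of the samples above and returns the name and value pairs; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def read_hadoop_properties(xml_text: str) -> Dict[str, str]:
    """Collect <name>/<value> pairs from a Hadoop <configuration> document."""
    root = ET.fromstring(xml_text)
    properties: Dict[str, str] = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


# The HDFS sample from the table above, written on one logical document.
sample = (
    '<?xml version="1.0"?>'
    "<configuration>"
    "<property><name>fs.hdfs.impl</name>"
    "<value>org.apache.hadoop.hdfs.DistributedFileSystem</value></property>"
    "<property><name>fs.defaultFS</name>"
    "<value>hdfs://example.com:9000/</value></property>"
    "</configuration>"
)
print(read_hadoop_properties(sample)["fs.defaultFS"])
```

The fs.defaultFS value determines the scheme (hdfs, s3a, or webhdfs) and the server the connection will use.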
Last modified: 21 August 2020