DataGrip 2020.1 Help

Configure Big Data Tools environment

Before you start working with Big Data Tools, you need to install the required plugins and configure connections to servers.

Install the required plugins

  1. Whatever you do in DataGrip, you do it in a project. So, open an existing project (File | Open) or create a new project (File | New | Project).

  2. In the Settings/Preferences dialog (Ctrl+Alt+S), select Plugins | Marketplace.

  3. Install the Big Data Tools plugin.

  4. Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

    The view of DataGrip after the Big Data Tools plugin is installed

Once Big Data Tools support is enabled in the IDE, you can configure connections to Spark, Google Storage, and S3 servers. You can also connect to HDFS, WebHDFS, S3a, and a local drive using configuration files and a URI.

Configure a server connection

  1. In the Big Data Tools window, click Add a connection and select the server type. The Big Data Tools Connection dialog opens.

  2. In the Big Data Tools Connection dialog, specify the following parameters depending on the server type.

    Configure S3 connection

    Mandatory parameters:

    • Bucket: the name of the S3 bucket to connect to.

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: the AWS region of the specified bucket. You can select a region from the list or let DataGrip auto-detect it.

    • Root path: a path to the root directory in the specified bucket.

    • Authentication type: the authentication method. You can use your AWS account credentials (by default) or opt to enter the access and secret keys.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect it if, for some reason, you want to disable this connection. By default, newly created connections are enabled.

    • Use custom endpoint: select to connect through a custom endpoint (for example, an S3-compatible storage) instead of the default AWS one; see the configuration sketch below.
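
    If you prefer to connect through Hadoop configuration files rather than the dialog (see Samples of Hadoop File System configuration files at the end of this page), a custom endpoint can also be expressed with the standard fs.s3a.endpoint property. The following is a minimal sketch, assuming a hypothetical S3-compatible server at storage.example.com and placeholder credentials:

    <?xml version="1.0"?>
    <configuration>
      <!-- Use the S3A file system implementation -->
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
      <!-- Hypothetical custom (S3-compatible) endpoint instead of the default AWS one -->
      <property>
        <name>fs.s3a.endpoint</name>
        <value>https://storage.example.com</value>
      </property>
      <!-- Placeholder credentials -->
      <property>
        <name>fs.s3a.access.key</name>
        <value>sample_access_key</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>sample_secret_key</value>
      </property>
    </configuration>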

    Configure Spark connection

    Mandatory parameters:

    • URL: the URL of the target server.

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect it if, for some reason, you want to disable this connection. By default, newly created connections are enabled.

    • Enable HTTP basic authentication: connect with HTTP basic authentication using the specified username and password.

    • Enable HTTP proxy: connect through an HTTP proxy using the specified host, port, username, and password.

    • HTTP Proxy: connect through an HTTP or SOCKS proxy. Select whether to use the IDE HTTP proxy settings or custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.
      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the DataGrip log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.

    Local FS

    Mandatory parameters:

    • Root path: a path to the root directory.

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect it if, for some reason, you want to disable this connection. By default, newly created connections are enabled.

    HDFS connection

    Mandatory parameters:

    • Root path: a path to the root directory on the target server.

    • Config path: a path to the directory with the HDFS configuration files. See the samples of configuration files at the end of this page.

    • File system URI: an explicit URI of the HDFS server. Once you select this option, you need to specify the file system URI (for example, localhost:9000) and a username to connect with (a configuration-file equivalent is sketched below).

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect it if, for some reason, you want to disable this connection. By default, newly created connections are enabled.

    Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.
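
    For reference, the explicit file system URI maps to the fs.defaultFS property in a Hadoop configuration file. The following is a minimal core-site.xml sketch, assuming an HDFS NameNode listening on localhost:9000 (see the fuller samples at the end of this page):

    <?xml version="1.0"?>
    <configuration>
      <!-- Equivalent of the explicit file system URI localhost:9000 -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000/</value>
      </property>
    </configuration>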

    Connection settings for Google Storage

    Mandatory parameters:

    • Bucket: the name of the bucket (the basic container that stores your data in Google Storage).

    • Cloud store JSON location: a path to the Cloud Storage JSON file.

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

    • Base directory (root by default): storage base directory.

    SFTP connection

    Mandatory parameters:

    • SSH Config: Select any of the available SSH configurations or click ... and create a new SSH configuration.

    Optionally, you can set up:

    • Name: the name of the connection to distinguish it from other connections.

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.

Samples of Hadoop File System configuration files

HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>

S3

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>sample_access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>sample_secret_key</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example.com/</value>
  </property>
</configuration>

WebHDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070/</value>
  </property>
</configuration>

WebHDFS and Kerberos

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>Kerberos</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>testuser@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>

Last modified: 26 May 2020