IntelliJ IDEA 2019.3 Help

Configure Big Data Tools environment

Before you start working with the notebooks using the Big Data Tools support, you need to install the required plugins and configure connections to servers.

Install the required plugins

  1. Whatever you do in IntelliJ IDEA, you do it in a project. So, open an existing project (File | Open) or create a new project (File | New | Project).

  2. In the Settings/Preferences dialog (Ctrl+Alt+S), select Plugins | Marketplace.

  3. Install the following plugins:

    • Big Data Tools

    • Scala

    Also, check that the Python plugin is enabled. It should be installed by default for IntelliJ IDEA Ultimate.

  4. Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

    The view of IDEA Ultimate after the Big Data Tools plugin is installed

Once the Big Data Tools support is enabled in the IDE, you can configure connections to Zeppelin, Spark, and S3 servers. You can also connect to HDFS, WebHDFS, S3A, and a local drive using configuration files and URIs.

Configure a server connection

  1. In the Big Data Tools window, click Add a connection and select the server type. The Big Data Tools Connection dialog opens.

  2. In the Big Data Tools Connection dialog, specify the following parameters depending on the server type.

    Connection Settings

    Mandatory parameters:

    • URL: the path to the target server.

    • Login and Password: your credentials to access the target server.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Login as anonymous: select to log in without using your credentials.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.

    • Scala Version, Spark Version, and Hadoop Version: these values are derived from the plugin bundles. If needed, specify any alternative version values.

    • Enable HTTP basic authentication: connect using HTTP authentication with the specified username and password. Select this option if the target server is protected by the HTTP basic authentication mechanism, for example, through NGINX.

    • Enable HTTP proxy: connect through an HTTP proxy using the specified host, port, username, and password.
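To illustrate what the Enable HTTP basic authentication option implies on the wire, here is a minimal Python sketch of how a basic authentication header is generally built (RFC 7617). It is not the plugin's code: the function name basic_auth_header and the sample credentials are hypothetical and serve only as an illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Authorization header for basic authentication."""
    # RFC 7617: join the credentials with a colon, base64-encode the result,
    # and prefix it with the "Basic " scheme name.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Hypothetical credentials, for illustration only.
print(basic_auth_header("user", "pass"))
```

Note that base64 is an encoding, not encryption, so basic authentication should normally be combined with HTTPS.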

    Configure S3 connection

    Mandatory parameters:

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Region: the AWS region of the specified bucket. You can select one from the list or let IntelliJ IDEA detect it automatically.

    • Root path: a path to the root directory in the specified bucket.

    • Authentication type: the authentication method. You can use your AWS account credentials (the default) or enter the access and secret keys.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.

    • Use custom endpoint:

    Configure Spark connection

    Mandatory parameters:

    • URL: the path to the target server.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.

    • Advanced settings
      • Enable HTTP basic authentication: connect using HTTP authentication with the specified username and password.

      • Enable HTTP proxy: connect through an HTTP proxy using the specified host, port, username, and password.

      • Kerberos authentication settings: opens the Kerberos authentication settings.
        Kerberos settings

        Specify the following options:

        • Enable Kerberos auth: select to use the Kerberos authentication protocol.

        • Krb5 config file: a file that contains Kerberos configuration information.

        • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

        • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

        • To include additional login information in the IntelliJ IDEA log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.

    Local FS

    Mandatory parameters:

    • Root path: a path to the root directory.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.

    HDFS connection

    Mandatory parameters:

    • Root path: a path to the root directory on the target server.

    • Config path: a path to the HDFS configuration files directory. See the samples of configuration files.

    • File system URI: an explicit URI of an HDFS server. If you select this option, specify the file system URI, for example, localhost:9000, and a username to connect with.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.

    • Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.

    Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.
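The user name fallback and the file system URI format described above can be sketched in Python as follows. This is only an illustration of the lookup order, not the plugin's implementation: the function names are hypothetical, getpass.getuser() stands in for the Java user.name property, and the hdfs default scheme is an assumption for URIs given as host:port.

```python
import getpass
import os
from urllib.parse import urlsplit


def resolve_hdfs_user(env=None):
    """Return HADOOP_USER_NAME if it is set, otherwise the current OS login name."""
    env = os.environ if env is None else env
    # Assumption: getpass.getuser() approximates Java's user.name property.
    return env.get("HADOOP_USER_NAME") or getpass.getuser()


def parse_fs_uri(uri, default_scheme="hdfs"):
    """Split a file system URI such as 'localhost:9000' into (scheme, host, port)."""
    # A bare host:port has no scheme, so prepend the assumed default.
    if "://" not in uri:
        uri = f"{default_scheme}://{uri}"
    parts = urlsplit(uri)
    return parts.scheme, parts.hostname, parts.port


# Illustrative values only; "hadoop" is a hypothetical user name.
print(resolve_hdfs_user({"HADOOP_USER_NAME": "hadoop"}))
print(parse_fs_uri("localhost:9000"))
```

Setting HADOOP_USER_NAME before starting the IDE is therefore one way to connect under a different user than your OS account.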

    Connection settings for Google Storage

    Mandatory parameters:

    • Bucket: the name of the bucket, the basic container that stores your data in Google Storage.

    • Cloud store JSON location: a path to the Cloud Storage JSON file.

    Optionally, you can set up:

    • Name: a name that distinguishes this connection from the others.

    • Base directory (root by default): storage base directory.

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.
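If Test connection fails, it can help to check first whether the server's host and port are reachable at all. The following Python sketch does only that: it opens a plain TCP connection and does not reproduce what the IDE checks (credentials, protocol, or configuration files). The function name can_connect and the host and port in the comment are hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with hypothetical values:
# can_connect("localhost", 8080)
```

A False result points to a network, firewall, or address problem rather than to the connection settings in the dialog.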

Now that you have established a connection to the server, you can start working with your notebooks. However, it might be a good practice to ensure that all the libraries and packages required for execution on a particular server are installed and available.

Configure notebook dependencies

  1. From the main menu, select File | Project Structure.

  2. In the Project Structure dialog, select Modules in the list of the Project Settings. Then select any of the configured connections in the list of the modules and double-click System Dependencies.

  3. Inspect the list of the added libraries. Click the list and start typing to search for a particular library.

    Configure dependencies

  4. If needed, modify the list of the libraries:

    • Click the Add button to add a new library.

    • Click the Specify Documentation URL button and specify the URL of the external documentation.

    • Click the Exclude button to select the items that you want IntelliJ IDEA to ignore (folders, archives, and folders within the archives), and click OK.

    • Click the Remove button to remove the selected ordinary item from the library or to restore the selected excluded items. The restored items will stay in the library.

Samples of Hadoop File System configuration files

Type

Sample configuration

HDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>

S3

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>sample_access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>sample_secret_key</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example.com/</value>
  </property>
</configuration>

WebHDFS

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070/</value>
  </property>
</configuration>

WebHDFS and Kerberos

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>Kerberos</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>testuser@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
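Before pointing Config path at a directory, you may want to verify that a configuration file is well-formed and contains the expected fs.defaultFS value. The following Python sketch reads the name/value pairs of a Hadoop-style configuration document with the standard library. The read_properties helper is hypothetical and is not part of the plugin or of Hadoop; the embedded document mirrors the HDFS sample.

```python
import xml.etree.ElementTree as ET

# Mirrors the HDFS sample; example.com is a placeholder host.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Return the name/value pairs of a Hadoop-style configuration document."""
    # ET.fromstring raises ET.ParseError if the document is not well-formed.
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("root element must be <configuration>")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


props = read_properties(SAMPLE)
print(props["fs.defaultFS"])
```

A malformed file, such as one with a broken closing tag, fails at the parsing step, which makes this kind of check a quick way to catch copy-and-paste errors.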
Last modified: 20 March 2020