Configure Big Data Tools environment

Before you start working with the notebooks using the Big Data Tools support, you need to install the required plugins and configure connections to servers.

Install the required plugins

Whatever you do in IntelliJ IDEA, you do it in a project. So, open an existing project (File | Open) or create a new project (File | New | Project).
In the Settings/Preferences dialog (Ctrl+Alt+S), select Plugins | Marketplace.
Install the following plugins:
- Big Data Tools
- Scala
Also, check that the Python plugin is enabled. It should be installed by default for IntelliJ IDEA Ultimate.
Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

Once the Big Data Tools support is enabled in the IDE, you can configure connections to Zeppelin, Spark, and S3 servers. You can also connect to HDFS, WebHDFS, S3a, and a local drive using configuration files and URIs.

Configure a server connection

In the Big Data Tools window, click and select the server type. The Big Data Tools Connection dialog opens.
In the Big Data Tools Connection dialog, specify the following parameters depending on the server type.
Server type: Zeppelin
Mandatory parameters:
- URL: the path to the target server.
- Login and Password: your credentials to access the target server.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Login as anonymous: select to log in without using your credentials.
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.
- Library Versions: Scala Version, Spark Version, and Hadoop Version: these values are derived from the plugin bundles. If needed, specify any alternative version values.
- Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.
- HTTP Proxy: connection with the HTTP or SOCKS Proxy authentication. Choose whether to use the IDEA HTTP Proxy settings or custom settings with the specified host name, port, login, and password.
Server type: S3
Mandatory parameters:
- Bucket: a globally unique Amazon S3 bucket name.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Region: the AWS region of the specified bucket. You can select one from the list or let IntelliJ IDEA auto-detect it.
- Root path: a path to the root directory in the specified bucket.
- Authentication type: the authentication method. You can use your AWS account credentials (the default) or opt to enter the access and secret keys.
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.
- Use custom endpoint:
Server type: Spark
Mandatory parameters:
- URL: the path to the target server.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.
- Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.
- Enable HTTP proxy: connection with the HTTP proxy using the specified host, port, username, and password.
- HTTP Proxy: connection with the HTTP or SOCKS Proxy authentication. Choose whether to use the IDEA HTTP Proxy settings or custom settings with the specified host name, port, login, and password.
- Kerberos authentication settings: opens the Kerberos authentication settings.
  
  Specify the following options:
  - Enable Kerberos auth: select to use the Kerberos authentication protocol.
  - Krb5 config file: a file that contains Kerberos configuration information.
  - JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.
  - Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.
  - To include additional login information in the IntelliJ IDEA log, select the Kerberos debug logging and JGSS debug logging checkboxes.

    Note that the Kerberos settings apply to all your Spark connections.
Server type: Local
Mandatory parameters:
- Root path: a path to the root directory.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.
Server type: HDFS
Mandatory parameters:
- Root path: a path to the root directory on the target server.
- Config path: a path to the HDFS configuration files directory. See the samples of configuration files.
- File system URI: an explicit URI of an HDFS server. Once you select this option, you need to specify the file system URI, for example localhost:9000, and a username to connect.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Per project: select to enable these connection settings only for the current project. Deselect it if you want this connection to be visible in other projects.
- Enable connection: deselect if, for some reason, you want to restrict the use of this connection. By default, newly created connections are enabled.
Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.
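
The fallback order in the note above can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the function name effective_hadoop_user is hypothetical, and the operating system login name is assumed to stand in for the Java user.name property.

```python
import getpass
import os


def effective_hadoop_user(environ=None, system_user=getpass.getuser):
    """Return the user name chosen by the fallback order in the note above."""
    environ = os.environ if environ is None else environ
    # HADOOP_USER_NAME wins whenever the variable is defined.
    if "HADOOP_USER_NAME" in environ:
        return environ["HADOOP_USER_NAME"]
    # Assumption: the OS login name approximates Java's user.name property.
    return system_user()
```

Calling effective_hadoop_user() with no arguments reads the real environment, which makes it easy to see which name would be sent before you open the connection.
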
Server type: Google Storage
Mandatory parameters:
- Bucket: a name of the basic container to store your data in Google Storage.
- Cloud store JSON location: a path to the Cloud Storage JSON file.
Optionally, you can set up:
- Name: the name of the connection, to distinguish it from other connections.
- Base directory (root by default): storage base directory.
Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.
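
If Test connection fails, a quick check outside the IDE can show whether the host and port from the URL are reachable at all. The Python sketch below is not part of Big Data Tools; the function names split_host_port and can_connect and the sample address are hypothetical and only illustrate the idea.

```python
import socket
from urllib.parse import urlsplit


def split_host_port(address, default_port=None):
    """Return (host, port) for a URL or a bare host:port string."""
    # urlsplit only recognises the host part when a "//" prefix is present.
    if "://" not in address:
        address = "//" + address
    parts = urlsplit(address)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical address; replace it with your own server URL.
    host, port = split_host_port("http://localhost:8080")
    print(host, port, can_connect(host, port))
```

A successful TCP connection only shows that something is listening on that port. It does not verify credentials, proxy settings, or Kerberos configuration, so Test connection remains the authoritative check.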

Now that you have established a connection to the server, you can start working with your notebooks. However, it might be a good practice to ensure that all the libraries and packages required for execution on a particular server are installed and available.

Configure notebook dependencies

From the main menu, select File | Project Structure.
In the Project Structure dialog, select Modules in the list of the Project Settings. Then select any of the configured connections in the list of the modules and double-click System Dependencies.
Inspect the list of the added libraries. Click the list and start typing to search for a particular library.
If needed, modify the list of the libraries:
- Click to add a new library.
- Click and specify the URL of the external documentation.
- Click to select the items that you want IntelliJ IDEA to ignore (folders, archives and folders within the archives), and click OK.
- Click to remove the selected ordinary library from the library or restore the selected excluded items. The items themselves will stay in the library.

Samples of Hadoop File System configuration files

HDFS sample configuration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>

S3 sample configuration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>sample_access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>sample_secret_key</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example.com/</value>
  </property>
</configuration>

WebHDFS sample configuration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070/</value>
  </property>
</configuration>

WebHDFS and Kerberos sample configuration:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.webhdfs.impl</name>
    <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://master.example.com:50070</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>Kerberos</value>
  </property>
  <property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>testuser@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
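
Before pointing Config path at a directory, it can be useful to confirm that a configuration file is well-formed XML and defines the properties you expect. The following Python sketch uses only the standard library; the function name read_hadoop_properties is hypothetical, and the embedded sample repeats the HDFS values shown above.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return the <property> entries of a Hadoop configuration as a dict."""
    # ET.fromstring raises ET.ParseError if the XML is not well-formed.
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


# Sample text copied from the HDFS configuration in this section.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000/</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, value in read_hadoop_properties(SAMPLE).items():
        print(key, "=", value)
```

A file with an unbalanced tag, such as a closing configuration tag that is missing its opening angle bracket, fails at the parsing step, which is usually easier to diagnose than a failed connection in the IDE.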

Last modified: 26 April 2020