DataSpell 2022.3 Help

Configure Big Data Tools environment

Before you start working with Big Data Tools, you need to install the required plugins and configure connections to servers.

Install the required plugins

  1. Whatever you do in DataSpell, you do it in a project. So, open an existing project (File | Open) or create a new project (File | New | Project).

  2. Press Ctrl+Alt+S to open the IDE settings and select Plugins | Marketplace.

  3. Install the Big Data Tools plugin.

  4. Restart the IDE. After the restart, the Big Data Tools tab appears in the rightmost group of the tool windows. Click it to open the Big Data Tools window.

Once the Big Data Tools support is enabled in the IDE, you can configure connections to Zeppelin, Spark, Google Storage, and S3 servers. You can connect to HDFS, WebHDFS, AWS S3, and a local drive using configuration files and URIs.

Configure a server connection

  1. In the Big Data Tools window, click Add a connection and select the server type. The Big Data Tools Connection dialog opens.

  2. In the Big Data Tools Connection dialog, specify the following parameters depending on the server type:

    • File Systems: HDFS, Local, SFTP

    • Storages: AWS S3, Minio, Linode, DigitalOcean Spaces, GS, Azure, Yandex Object Storage, Alibaba OSS

    • Monitoring: Hadoop, Kafka, Spark, Hive Metastore, Flink

    • Notebooks: Zeppelin

    • Data Processing Platforms: AWS EMR

    Configure AWS EMR connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get clusters from.

    • Authentication type lets you select the authentication method:

      • Default credential providers chain: use the credentials from the default provider chain. For more info on the chain, refer to Using the Default Credential Provider Chain.

      • Explicit access key and secret key: enter your credentials manually.

      • Profile from credentials file: select a profile from your credentials file. Click Open Credentials to locate the directory where the credentials file is stored (for AWS, it's usually ~/.aws/credentials on Linux or macOS, or C:\Users\<USERNAME>\.aws\credentials on Windows). You can also select Use custom configs to use a profile file and credentials file from another directory.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Click the Open SSH Key Settings link to create an SSH connection authenticated with a private key file. You need to specify the Amazon EC2 key pair private key in the EMR SSH Keystore dialog.
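
    If the connection fails, it can help to verify the same credentials outside the IDE. Below is a minimal sketch using boto3; the profile name, region, and cluster states are placeholder assumptions, not values from this dialog:

      import boto3

      # Use the same named profile you selected in the dialog
      session = boto3.Session(profile_name="default")
      emr = session.client("emr", region_name="us-east-1")

      # List the clusters the connection would discover
      for cluster in emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]:
          print(cluster["Id"], cluster["Name"])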

    Local FS

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Root path: a path to the root directory.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    HDFS connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Root path: a path on the target server to be the root for HDFS connection.

      When the connection is successfully established, the Driver home path field shows the target address of the connection, including the port number. Example: hdfs://127.0.0.1:65224/.

    • Config path: a path to the directory with the HDFS configuration files. See the samples of configuration files.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Username: enter a username to log in to the server. If not specified, the HADOOP_USER_NAME environment variable is used. If this variable is not defined, the user.name property is used. If Kerberos is enabled, it overrides any of these three values.

    • Enable tunneling (NameNode operations only). Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available. SSH tunneling currently works only for the following NameNode operations: listing files and getting meta info.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    Note that the Big Data Tools plugin uses the HADOOP_USER_NAME environment variable to log in to the server. If this variable is not defined, the user.name property is used.
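
    As a quick way to see which username these rules resolve to on your machine, here is a minimal sketch (it approximates the JVM user.name property with the current OS user):

      import getpass
      import os

      # HADOOP_USER_NAME wins; otherwise fall back to the current OS user,
      # which approximates the user.name property
      user = os.environ.get("HADOOP_USER_NAME") or getpass.getuser()
      print("HDFS operations would run as:", user)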

    See more examples of the Hadoop File System configuration files.

    SFTP connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • SSH config: select an SSH configuration, which contains the needed server address and credentials.

    • Root path: a path on the target server to be the root for the SFTP connection.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Use sudo to run SFTP server: select if your target server requires root access. With this option selected, you will be prompted to enter a root user password while connecting to the SFTP server.

    • Use custom command to start SFTP server: select if you want to customize the server startup command. With this option selected, the following parameters become available:

      • Command to start SFTP server: enter a path to the SFTP server or provide SFTP connection options. If the Use sudo to run SFTP server option is selected, you can leave the field empty and let DataSpell detect the path to the SFTP server. Click Test connection to view the detected path.

      • Suggest using sudo for files with restricted permission: with the option selected, DataSpell will ask if you want to use the sudo password each time you try to read or write files with restricted access. If not selected, accessing such files will result in the "Permission denied" error. The option is available if Use sudo to run SFTP server is not selected.

      • Use password from SSH configuration: select if, for accessing files, you want to use the password provided by the selected SSH configuration. If the password is empty or if Authentication type of the SSH configuration is not Password, DataSpell will require you to enter a password while running the server and accessing files.

        The option is available if Use sudo to run SFTP server or Suggest using sudo for files with restricted permission is selected.

      • Command to run sudo: customize the sudo command; for example, you can enter the full path to sudo or provide options such as sudo -k.
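
    To verify outside the IDE that the server accepts an SFTP session with key-based authentication, you can use a minimal paramiko sketch; the host, username, key path, and root directory are placeholders:

      import os
      import paramiko

      client = paramiko.SSHClient()
      client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      client.connect("sftp.example.com", username="deploy",
                     key_filename=os.path.expanduser("~/.ssh/id_rsa"))

      # List the directory you plan to use as Root path
      sftp = client.open_sftp()
      print(sftp.listdir("/data"))
      sftp.close()
      client.close()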

    Configure S3 connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Select the storage type: AWS S3 or a custom S3-compatible storage.

    • Region: select a region to get buckets from.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Authentication type lets you select the authentication method:

      • Default credential providers chain: use the credentials from the default provider chain. For more info on the chain, refer to Using the Default Credential Provider Chain.

      • Explicit access key and secret key: enter your credentials manually.

      • Profile from credentials file: select a profile from your credentials file. Click Open Credentials to locate the directory where the credentials file is stored (for AWS, it's usually ~/.aws/credentials on Linux or macOS, or C:\Users\<USERNAME>\.aws\credentials on Windows). You can also select Use custom configs to use a profile file and credentials file from another directory.

      • Anonymous: select if you want to connect to the server without authentication.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Use custom endpoint: select if you want to specify a custom endpoint and a signing region.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.
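
    For a custom endpoint, it can be useful to confirm that the endpoint and signing region are reachable before configuring the connection. A minimal boto3 sketch; the endpoint, region, credentials, and bucket name are placeholders:

      import boto3

      s3 = boto3.client(
          "s3",
          endpoint_url="https://storage.example.com",  # custom endpoint
          region_name="us-east-1",                     # signing region
          aws_access_key_id="ACCESS_KEY",
          aws_secret_access_key="SECRET_KEY",
      )
      response = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=5)
      for obj in response.get("Contents", []):
          print(obj["Key"])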

    Configure MinIO connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Endpoint: specify an endpoint to connect to.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Access credentials: Access Key and Secret Key.

    Optionally, you can set up:

    • Bucket filter and Filter type help define a specific set of buckets to preview and work with. You can set filters that check whether a bucket name contains, matches, or starts with a pattern, or matches a regular expression.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.
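
    The same endpoint and keys can be checked with the MinIO Python SDK; a minimal sketch with a placeholder endpoint and credentials:

      from minio import Minio

      # The endpoint is host:port without a scheme; use secure=False for plain HTTP
      client = Minio("minio.example.com:9000",
                     access_key="ACCESS_KEY",
                     secret_key="SECRET_KEY",
                     secure=True)
      for bucket in client.list_buckets():
          print(bucket.name)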

    Configure Linode connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get buckets from.

    • Access credentials: Access Key and Secret Key.

    Optionally, you can set up:

    • Bucket filter and Filter type help define a specific set of buckets to preview and work with. You can set filters that check whether a bucket name contains, matches, or starts with a pattern, or matches a regular expression.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.

    Configure DigitalOcean Spaces connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get buckets from.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Access credentials: Access Key and Secret Key.

    Optionally, you can set up:

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.

    Alibaba connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get buckets from.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Authentication type: the authentication method. You can use your account credentials (by default) or opt to enter the access and secret keys.

      You can also use a named profile that is located in the default OSS config location (~/.oss/credentials on Linux or macOS, or C:\Users\<USERNAME>\.oss\credentials on Windows). If needed, you can specify any profile from a custom credentials file.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.

    Connection settings for Yandex Object Storage

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Authentication type: the authentication method. You can use your account credentials (by default) or opt to enter the access and secret keys. You can also use a named profile that is located in the default Yandex Object Storage config location. If needed, you can specify any profile from a custom credentials file.

    Optionally, you can set up:

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    • Trust all SSL certificates: select it if you trust the SSL certificate used for this connection and do not want to verify it. This can be useful if, for development purposes, you have a host with a self-signed certificate – verifying it could result in an error.

    Connection settings for Google Cloud Storage

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Google app credentials: a path to the Cloud Storage JSON file (required if the bucket is not publicly shared).

    Optionally, you can set up:

    • Project ID: available if you have selected All buckets in the account. This overrides the project ID specified in the JSON credentials file. Enter a project ID to use buckets from a project other than the one specified in the credentials file.

    • Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    • Custom host: enter your custom endpoint URL, for example, if you want to use it to mock a Google Cloud Storage server.
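
    To check that the JSON credentials file grants access to the expected buckets, here is a minimal sketch with the google-cloud-storage client; the file name is a placeholder:

      from google.cloud import storage

      # The same service account JSON file as in Google app credentials
      client = storage.Client.from_service_account_json("service-account.json")
      for bucket in client.list_buckets():
          print(bucket.name)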

    Connection settings for Azure Storage

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Endpoint: specify an endpoint to connect to.

    • Choose the way to get Microsoft Azure containers:

      • Select Custom roots and, in the Container field, specify the name of the container or the path to a directory in the container. You can specify multiple names or paths by separating them with a comma.

      • Select All containers in the account. You can then use the container filter to show only containers with particular names.

    • Authentication type: the authentication method. You can access the storage using an account name and key, a connection string, or a SAS token.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.
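
    To verify the endpoint and credentials outside the IDE, here is a minimal sketch using azure-storage-blob with connection-string authentication; the connection string is a placeholder:

      from azure.storage.blob import BlobServiceClient

      conn_str = ("DefaultEndpointsProtocol=https;AccountName=myaccount;"
                  "AccountKey=ACCOUNT_KEY;EndpointSuffix=core.windows.net")
      service = BlobServiceClient.from_connection_string(conn_str)
      for container in service.list_containers():
          print(container.name)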

    Tencent connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get buckets from.

    • Choose the way to get buckets:

      • Select Custom roots and, in the Roots field, specify the name of the bucket or the path to a directory in the bucket. You can specify multiple names or paths by separating them with a comma.

      • Select All buckets in the account. You can then use the bucket filter to show only buckets with particular names.

    • Access key: access key of your Tencent Cloud account.

    • Secret key: secret key of your Tencent Cloud account.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • APPID: specify your Tencent Cloud APPID if you want to create buckets using the IDE.

    • Show bucket versioning: show the history of versions for files in buckets. Note that DataSpell will show versions only for those buckets that have versioning enabled in your Tencent Cloud account.

    • Operation timeout (s): enter a timeout (in seconds) for operations performed on the remote storage, such as getting file info, listing or deleting objects. The default value is 15 seconds.

    Configure Hadoop connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: the name of the connection to distinguish it from other connections.

    Optionally, you can set up:

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Enable HTTP proxy: connection with the HTTP proxy using the specified host, port, username, and password.

    • HTTP Proxy: connection with the HTTP or SOCKS proxy authentication. Select if you want to use the IDE proxy settings or use custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.

      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries; each specifies which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the DataSpell log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.

    • You can also reuse any of the existing Spark connections: just select one from the Spark Monitoring list.
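
    The URL usually points to the YARN ResourceManager web address. A minimal sketch that checks it responds; the host is a placeholder, and 8088 is the conventional ResourceManager port:

      import requests

      info = requests.get("http://hadoop.example.com:8088/ws/v1/cluster/info",
                          timeout=10).json()
      print(info["clusterInfo"]["state"])  # STARTED for a healthy cluster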

    Configure Kafka connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: the name of the connection to distinguish it from other connections.

    Optionally, you can set up:

    • Properties source: select Field to manually enter Kafka configuration properties or File to specify the path to a properties file. With the Field option selected, you can start typing a property name, and DataSpell will suggest matching property names and show the quick documentation for a selected property.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Click the question mark next to the Kafka support is limited message to preview the list of the currently supported features.
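
    A minimal kafka-python sketch to confirm the broker is reachable with your connection settings; the bootstrap server address is a placeholder:

      from kafka import KafkaConsumer

      consumer = KafkaConsumer(bootstrap_servers="kafka.example.com:9092")
      print(sorted(consumer.topics()))  # a topic list means the broker answered
      consumer.close()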

    Configure Spark connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Name: the name of the connection to distinguish it from other connections.

    Optionally, you can set up:

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Enable HTTP proxy: connection with the HTTP proxy using the specified host, port, username, and password.

    • HTTP Proxy: connection with the HTTP or SOCKS proxy authentication. Select if you want to use the IDE proxy settings or use custom settings with the specified host name, port, login, and password.

    • Kerberos authentication settings: opens the Kerberos authentication settings.

      Kerberos settings

      Specify the following options:

      • Enable Kerberos auth: select to use the Kerberos authentication protocol.

      • Krb5 config file: a file that contains Kerberos configuration information.

      • JAAS login config file: a file that consists of one or more entries, each specifying which underlying authentication technology should be used for a particular application or applications.

      • Use subject credentials only: allows the mechanism to obtain credentials from some vendor-specific location. Select this checkbox and provide the username and password.

      • To include additional login information in the DataSpell log, select Kerberos debug logging and JGSS debug logging.

        Note that the Kerberos settings are effective for all your Spark connections.
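
    The URL typically points to the Spark History Server or master UI. A minimal sketch against the History Server REST API; the host is a placeholder, and 18080 is the conventional port:

      import requests

      apps = requests.get("http://spark.example.com:18080/api/v1/applications",
                          timeout=10).json()
      for app in apps:
          print(app["id"], app["name"])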

    Configure Hive connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Properties: select how to specify your Hive configuration properties: enter them explicitly or load them from a configuration folder. If you select Explicit, you can enter a value for the metastore.thrift.uris property in the URL field and enter any other properties in the Other properties field.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Database pattern: if you want to view only some of your Hive databases in the editor tab, use this field to enter a regular expression for the database names.

    • Table pattern: if you want to view only some of your database tables in the editor tab, use this field to enter a regular expression for the table names.

    Configure AWS Glue connection

    Mandatory parameters:

    • Name: the name of the connection to distinguish it from other connections.

    • Region: select a region to get data from.

    • Authentication type lets you select the authentication method:

      • Default credential providers chain: use the credentials from the default provider chain. For more info on the chain, refer to Using the Default Credential Provider Chain.

      • Explicit access key and secret key: enter your credentials manually.

      • Profile from credentials file: select a profile from your credentials file. Click Open Credentials to locate the directory where the credentials file is stored (for AWS, it's usually ~/.aws/credentials on Linux or macOS, or C:\Users\<USERNAME>\.aws\credentials on Windows). You can also select Use custom configs to use a profile file and credentials file from another directory.

    Optionally, you can set up:

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • HTTP Proxy: select if you want to use IDE proxy settings or if you want to specify custom proxy settings.

    Configure Zeppelin connection

    Mandatory parameters:

    • URL: the path to the target server.

    • Login and Password: your credentials to access the target server.

    • Name: the name of the connection to distinguish it from other connections.

    Optionally, you can set up:

    • Login as anonymous: select to log in without using your credentials.

    • Enable connection: deselect if you want to disable this connection. By default, newly created connections are enabled.

    • Library Versions: Scala Version, Spark Version, and Hadoop Version: these values are derived from the plugin bundles. If needed, specify any alternative version values.

    • Enable HTTP basic authentication: connection with the HTTP authentication using the specified username and password.

    • Proxy: connection with the HTTP or SOCKS proxy authentication. Select if you want to use the IDE proxy settings or use custom settings with the specified host name, port, login, and password.

    • Enable tunneling. Creates an SSH tunnel to the remote host. It can be useful if the target server is in a private network but an SSH connection to the host in the network is available.

      Select the checkbox and specify a configuration of an SSH connection (click ... to create a new SSH configuration).

    • Notifications. Select Enable cell execution notification if you want to be notified when execution time exceeds the specified time interval (60 seconds by default).
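
    To confirm the URL outside the IDE, you can call Zeppelin's REST API; a minimal sketch with a placeholder address (it assumes anonymous access is enabled; otherwise authenticate first):

      import requests

      resp = requests.get("http://zeppelin.example.com:8080/api/notebook",
                          timeout=10)
      for note in resp.json()["body"]:
          print(note["id"], note.get("path", note.get("name")))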

  3. Once you fill in the settings, click Test connection to ensure that all configuration parameters are correct. Then click OK.

You can disable any connection if you temporarily do not need it. Right-click the corresponding item in the Big Data Tools window and select Disable Connection from the context menu. The server changes its visual appearance and behavior: you cannot preview its content. To restore the connection, right-click it and select Enable Connection from the context menu, or just double-click the connection.

For your convenience, you can rename the server root and copy a path to it. To quickly access all the required actions, right-click the target server in the Big Data Tools window and select the corresponding command from the context menu.

Samples of Hadoop File System configuration files

HDFS

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://example.com:9000/</value>
    </property>
</configuration>

S3

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>sample_access_key</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>sample_secret_key</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>s3a://example.com/</value>
    </property>
</configuration>

WebHDFS

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070/</value>
    </property>
</configuration>

WebHDFS and Kerberos

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.webhdfs.impl</name>
        <value>org.apache.hadoop.hdfs.web.WebHdfsFileSystem</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>webhdfs://master.example.com:50070</value>
    </property>
    <property>
        <name>hadoop.security.authentication</name>
        <value>Kerberos</value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>testuser@EXAMPLE.COM</value>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
</configuration>
Last modified: 01 December 2022