For more information about compilation and usage, please visit the Spark Doris Connector documentation.

- Download and compile the Spark Doris Connector from https://github.com/apache/doris-spark-connector. We recommend compiling it inside the official Doris build image:

```bash
docker pull apache/doris:build-env-ldb-toolchain-latest
```

- Before building, copy `customer_env.sh.tpl` to `customer_env.sh` and configure it. Then build:

```bash
git clone [email protected]:apache/doris-spark-connector.git
cd doris-spark-connector/spark-doris-connector
./build.sh
```

- The compiled jar is named like `spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar`.
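The jar name encodes the Spark and Scala versions it was built against. A small illustrative parser, assuming the `spark-doris-connector-<spark>_<scala>-<version>.jar` layout inferred from the example name above (this helper is not part of the connector):

```python
import re

def parse_connector_jar(name):
    """Split a connector jar name into (spark, scala, connector version).

    Layout assumed from the example name in this document:
    spark-doris-connector-<spark>_<scala>-<version>.jar
    """
    m = re.match(
        r"spark-doris-connector-(?P<spark>[\d.]+)_(?P<scala>[\d.]+)-(?P<version>.+)\.jar$",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized jar name: {name}")
    return m.group("spark"), m.group("scala"), m.group("version")

# The name produced by the build above:
print(parse_connector_jar("spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar"))
# → ('3.1', '2.12', '1.0.0-SNAPSHOT')
```

Make sure the Scala suffix (`2.12` here) matches the Scala version of your Spark distribution.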
- Download Spark from https://spark.apache.org/downloads.html. If you are in China, the Tencent mirror is a good choice: https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/

```bash
# download
wget https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
# decompress
tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
```

- Configure the Spark environment:
 
```bash
vim /etc/profile
export SPARK_HOME=/your_path/spark-3.1.2-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
```

- Copy `spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar` to the Spark `jars` directory:

```bash
cp /your_path/spark-doris-connector/target/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar $SPARK_HOME/jars
```
- Create a Doris database and table:

```sql
create database mongo_doris;
use mongo_doris;
CREATE TABLE data_sync_test_simple
(
    _id VARCHAR(32) DEFAULT '',
    id VARCHAR(32) DEFAULT '',
    user_name VARCHAR(32) DEFAULT '',
    member_list VARCHAR(32) DEFAULT ''
)
DUPLICATE KEY(_id)
DISTRIBUTED BY HASH(_id) BUCKETS 10
PROPERTIES("replication_num" = "1");
INSERT INTO data_sync_test_simple VALUES ('1','1','alex','123');
```
- Enter the following code in spark-shell:

```scala
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
  tableIdentifier = Some("mongo_doris.data_sync_test"),
  cfg = Some(Map(
    "doris.fenodes" -> "127.0.0.1:8030",
    "doris.request.auth.user" -> "root",
    "doris.request.auth.password" -> ""
  ))
)
dorisSparkRDD.collect()
```

- `mongo_doris`: Doris database name
- `data_sync_test`: Doris table name
- `doris.fenodes`: Doris FE IP:http_port
- `doris.request.auth.user`: Doris username
- `doris.request.auth.password`: Doris password
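The options listed above can be collected in one place and reused across jobs. A minimal plain-Python sketch; the helper and the required-key check are illustrative, not part of the connector API (only the option names come from this document):

```python
# Illustrative helper: assemble the connector options listed above and
# sanity-check that the required ones are present. Not a connector API.
REQUIRED_KEYS = (
    "doris.fenodes",
    "doris.request.auth.user",
    "doris.request.auth.password",
)

def doris_rdd_options(fenodes, user, password):
    return {
        "doris.fenodes": fenodes,                # Doris FE address, IP:http_port
        "doris.request.auth.user": user,         # Doris username
        "doris.request.auth.password": password, # Doris password
    }

cfg = doris_rdd_options("127.0.0.1:8030", "root", "")
assert all(k in cfg for k in REQUIRED_KEYS)
print(cfg["doris.fenodes"])  # → 127.0.0.1:8030
```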
 
- If Spark runs in cluster mode, upload the jar to HDFS and add the doris-spark-connector jar's HDFS URL to `spark.yarn.jars` (see apache/doris#9486):

```
spark.yarn.jars=hdfs:///spark-jars/doris-spark-connector-3.1.2-2.12-1.0.0.jar
```
- In pyspark, enter the following code in the pyspark shell:

```python
dorisSparkDF = spark.read.format("doris") \
    .option("doris.table.identifier", "mongo_doris.data_sync_test") \
    .option("doris.fenodes", "127.0.0.1:8030") \
    .option("user", "root") \
    .option("password", "") \
    .load()
# show 5 rows of data
dorisSparkDF.show(5)
```

Doris and Spark type mapping:

| Doris | Spark |
|---|---|
| BOOLEAN | BooleanType | 
| TINYINT | ByteType | 
| SMALLINT | ShortType | 
| INT | IntegerType | 
| BIGINT | LongType | 
| LARGEINT | StringType | 
| FLOAT | FloatType | 
| DOUBLE | DoubleType | 
| DECIMAL(M,D) | DecimalType(M,D) | 
| DATE | DateType | 
| DATETIME | TimestampType | 
| CHAR(L) | StringType | 
| VARCHAR(L) | StringType | 
| STRING | StringType | 
| ARRAY | ARRAY | 
| MAP | MAP | 
| STRUCT | STRUCT | 
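The mapping above can be expressed as a small lookup, which is handy when deriving a Spark schema from Doris DDL. A sketch in plain Python, transcribed from the table; the parameterized `DECIMAL(M,D)` handling and the helper itself are illustrative:

```python
import re

# Doris -> Spark type-name mapping, transcribed from the table above.
DORIS_TO_SPARK = {
    "BOOLEAN": "BooleanType",
    "TINYINT": "ByteType",
    "SMALLINT": "ShortType",
    "INT": "IntegerType",
    "BIGINT": "LongType",
    "LARGEINT": "StringType",
    "FLOAT": "FloatType",
    "DOUBLE": "DoubleType",
    "DATE": "DateType",
    "DATETIME": "TimestampType",
    "CHAR": "StringType",
    "VARCHAR": "StringType",
    "STRING": "StringType",
}

def spark_type(doris_type):
    """Map a Doris column type (e.g. VARCHAR(32), DECIMAL(10,2)) to its Spark type name."""
    m = re.match(r"(\w+)(?:\((.*)\))?$", doris_type.strip().upper())
    if m is None:
        raise ValueError(f"cannot parse Doris type: {doris_type}")
    base, args = m.group(1), m.group(2)
    if base == "DECIMAL":  # DECIMAL(M,D) -> DecimalType(M,D)
        return f"DecimalType({args})"
    return DORIS_TO_SPARK[base]

print(spark_type("VARCHAR(32)"))    # → StringType
print(spark_type("DECIMAL(10,2)"))  # → DecimalType(10,2)
```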
If you find any bugs, feel free to file a GitHub issue or fix it by submitting a pull request.
Contact us through the following mailing list:
| Name | Scope | | | |
|---|---|---|---|---|
| [email protected] | Development-related discussions | Subscribe | Unsubscribe | Archives | 
- Doris official site - https://doris.apache.org
 - Developer mailing list - [email protected]. Mail to [email protected] and follow the reply to subscribe to the mailing list.
 - Slack channel - Join the Slack