/r/DistributedComputing
For those who want to cure cancer and make first contact in their spare time
To discuss distributed computing projects such as Folding@Home or the many BOINC programs.
The centre for reddit teams:
/r/DistributedComputing
In a distributed transaction, 2PC is used to reach consensus, but I don't get what actually happens in the prepare phase versus the commit phase.
Can someone explain (an in-depth answer would be even more helpful)? I read that the databases/nodes start writing locally during the prepare phase while saving the status as "PREPARE", and once they get a commit command, they persist the changes.
My information is incomplete.
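A minimal sketch of the two phases, assuming a toy in-memory participant whose `log` list stands in for a durable write-ahead log. The names `Participant`, `prepare`, `commit`, `abort`, and `two_phase_commit` are illustrative and not taken from any particular database. In a real system, the staged writes and the PREPARED record would be forced to disk before the participant votes yes, so that it can still commit after a crash.

```python
class Participant:
    """Toy participant; `log` stands in for a durable write-ahead log."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []      # hypothetical durable log of (txid, status)
        self.data = {}     # committed state, visible to readers
        self.pending = {}  # staged writes, not yet visible

    def prepare(self, txid, writes):
        # Phase 1: stage the writes and record PREPARED, then vote.
        # After voting yes, the participant may not abort on its own.
        if not self.can_commit:
            self.log.append((txid, "ABORTED"))
            return False
        self.pending[txid] = dict(writes)
        self.log.append((txid, "PREPARED"))
        return True

    def commit(self, txid):
        # Phase 2: make the staged writes visible and record COMMITTED.
        self.data.update(self.pending.pop(txid))
        self.log.append((txid, "COMMITTED"))

    def abort(self, txid):
        # Discard staged writes; harmless if nothing was staged.
        self.pending.pop(txid, None)
        if (txid, "ABORTED") not in self.log:
            self.log.append((txid, "ABORTED"))


def two_phase_commit(txid, work):
    """Coordinator: commit only if every participant votes yes.

    `work` is a list of (participant, writes) pairs.
    """
    votes = [p.prepare(txid, writes) for p, writes in work]
    decision = "COMMITTED" if all(votes) else "ABORTED"
    for p, _ in work:
        if decision == "COMMITTED":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision


a, b = Participant("a"), Participant("b")
first = two_phase_commit("t1", [(a, {"x": 1}), (b, {"y": 2})])
c = Participant("c", can_commit=False)
second = two_phase_commit("t2", [(a, {"x": 5}), (c, {"z": 9})])
print(first, second, a.data)
```

In the second transaction, participant `a` had already staged `x = 5` and logged PREPARED, but because `c` voted no, the coordinator's decision was to abort and `a` discarded the staged write.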
Hi folks,
I am currently an SDE and want to learn distributed systems. Completing the dslabs seems like a great way to do so, but I am stuck and need the solutions to figure things out. Has anyone here finished it?
I'm interested in learning about P2P networks, but I noticed that there are not many books out there. I would like recommendations on this topic.
Thanks!
I’ve been reflecting on my Master’s thesis topic, but I’m unsure what to choose. Many of my peers have selected various areas in machine learning, while I initially considered focusing on cryptography. However, I’m starting to think post-quantum cryptography might be too complex. Now, I’m leaning towards exploring the intersection of machine learning/AI, cryptography, and distributed systems, but I’m open to any suggestions.
I've been thinking about getting a master's in distributed systems/computing, as that's a role I'd like to settle into for the long term. But taking a two-year career break for a master's does not really make sense to me. What do you all think? How do you think I can get into this type of role? Any advice is welcome.
A little context: I recently transitioned from native Android dev to DevOps/cloud
Looking for a resource to understand various consensus concepts and algorithms (Paxos, Raft, etc.).
I'm finding it difficult to understand these concepts and am looking for your favorite articles/resources!
I'm seeking ideas for BOINC projects that have a broad positive impact, such as a distributed chatbot (even though I understand that a fully distributed AI may not be practical with current CPUs/GPUs). Specifically looking for ideas that directly benefit anyone, not just researchers. Thank you!
Hey gang,
I’m looking into ways for my machine to generate revenue whilst idle. Just checking if people would be interested in borrowing my machine for a network or startup they are building.
Apologies for my terrible terminology; I’m still learning the lingo.
Project: Jetmaker
It is a framework for Python developers to connect multiple distributed nodes into a single system, so that distributed apps can access one another's data and services. It also provides tools to synchronize all the nodes, much as you would in multithreading and multiprocessing.
Github link: https://github.com/gavinwei121/Jetmaker
Hi!
I have written a framework for distributed computing, which is free for non-commercial use. I would like to know how to classify the framework: how it is correctly described and which frameworks it competes with. I would also be interested to know what you think of it, what is still missing, and what should be addressed next.
The framework is called nyssr.net and is written in Java. nyssr.net is a network of interconnected Java nodes using TCP channels to facilitate message exchange. Messages are routed through these channels, avoiding the need to establish new connections dynamically.
Each node is built around a nimble and quick-loading micro-kernel. This micro-kernel loads additional functionalities in the form of plugins during startup. Remarkably, even essential features like TCP or the transport layer are loaded as plugins, alongside various services and applications.
A range of services now exists based on these characteristics:
and much more
You can find the framework at sillysky.net.
Many greetings,
Michael Hoppe
michael.hoppe@sillysky.net
I have a basic machine translation transformer model that worked well on a single GPU. However, when I tried running it on an 8-GPU setup using DDP, I initially encountered many crashes due to data not being properly transferred to the correct GPUs. I believe I've resolved those issues, and the model now runs, but only up to a certain point.
I put a lot of prints along the way; it runs and then just freezes at some point.
If I run it under a debugger, it keeps going without any problem.
Is there anyone here fluent in DDP and PyTorch who can help me? I'm feeling pretty desperate.
Here is my training function:
def train(rank, world_size):
    ddp_setup(rank, world_size)
    torch.manual_seed(0)
    SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
    TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
    EMB_SIZE = 512
    NHEAD = 8
    FFN_HID_DIM = 1024
    BATCH_SIZE = 128
    NUM_ENCODER_LAYERS = 3
    NUM_DECODER_LAYERS = 3
    LOAD_MODEL = False
    if LOAD_MODEL:
        transformer = torch.load("model/_transformer_model")
    else:
        transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                         NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
        # Xavier-initialize the weights of the freshly created model
        for p in transformer.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    transformer.move_positional_encoding_to_rank(rank)  # moving positional_encoding into the current GPU
    # Create the dataset
    train_dataset = SrcTgtDatasetFromFiles(SRC_TRAIN_BASE, TGT_TRAIN_BASE, FILES_COUNT_TRAIN)
    # create a DistributedSampler for data loading
    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    # create a DataLoader with the DistributedSampler
    # train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler)
    train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, pin_memory=True,
                                  collate_fn=collate_fn, sampler=train_sampler)
    # create the model and move it to the GPU with the device ID
    model = transformer.to(rank)
    model.train()  # set the model into training mode with dropout etc.
    # wrap the model with DistributedDataParallel
    model = DDP(model, device_ids=[rank])
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(rank)
    #######################################
    EPOCHS_NUM = 2
    for epoch in range(EPOCHS_NUM):
        epoch_start_time = int(timer())
        print("\n\nepoch number: " + str(epoch + 1) + " Rank: " + str(rank))
        losses = 0.0
        idx = 0
        start_time = int(timer())
        for src, tgt in train_dataloader:
            if rank == 0:
                print("rank=" + str(rank) + " idx=" + str(idx))
            src = src.to(rank)
            tgt = tgt.to(rank)
            tgt_input = tgt[:-1, :]
            if IS_DEBUG:
                print("rank", rank, "idx", idx, "before create_mask")
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input, rank)
            if IS_DEBUG:
                print("rank", rank, "idx", idx, "after create_mask")
            if IS_DEBUG:
                print("rank", rank, "idx", idx, "before model")
            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
            if IS_DEBUG:
                print("rank", rank, "idx", idx, "after model")
            try:
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "before zero_grad")
                optimizer.zero_grad()
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "after zero_grad")
                tgt_out = tgt[1:, :].long()
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "before loss_fn")
                loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "after loss_fn")
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "before backward")
                loss.backward()
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "after backward")
                # Delete variables that are no longer needed after the backward pass
                del src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, logits, tgt_out
                torch.cuda.empty_cache()  # Clear cache after deleting variables
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "before step")
                optimizer.step()
                if IS_DEBUG:
                    print("rank", rank, "idx", idx, "after step")
                losses += loss.item()
                #######################################
                # print(999,rank,loss)
                # Free GPU memory
                del loss
                torch.cuda.empty_cache()  # Clear cache after each batch
            except Exception as e:
                print("An error occurred: rank=" + str(rank) + " idx=" + str(idx))
                print("Error message: ", str(e))
            idx += 1
            if rank == 0 and idx % 10000 == 0:
                torch.save(model.module.state_dict(), "model/_transformer_model")
                end_time = int(timer())
                try:
                    my_test(model.module, rank, SRC_TEST_BASE, TGT_TEST_BASE, FILES_COUNT_TEST, epoch, 0,
                            0, int((end_time - start_time) / 60), epoch_start_time)
                except:
                    print("error occurred test")
                start_time = int(timer())
        # Synchronize training across all GPUs
        torch.distributed.barrier()
        if rank == 0:
            epoch_end_time = int(timer())
            try:
                my_test_and_save_to_file(model.module, rank, SRC_TEST_BASE, FILES_COUNT_TEST, epoch)
                loss = evaluate(model.module, rank, SRC_VAL_BASE, TGT_VAL_BASE, FILES_COUNT_VAL, BATCH_SIZE,
                                loss_fn)
                print("EPOCH NO." + str(epoch) + " Time: " + str(int((epoch_end_time - epoch_start_time) / 60)) +
                      " LOSS:" + str(loss))
            except:
                print("error occurred evaluation")
    destroy_process_group()
Here is part of the output:
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
Let's use 8 GPUs!
epoch number: 1 Rank: 0
epoch number: 1 Rank: 1
epoch number: 1 Rank: 2
epoch number: 1 Rank: 3
epoch number: 1 Rank: 4
epoch number: 1 Rank: 7
epoch number: 1 Rank: 6
epoch number: 1 Rank: 5
rank=0 idx=0
rank 0 idx 0 before src
rank 0 idx 0 after src
rank 0 idx 0 before tgt
rank 0 idx 0 after tgt
rank 0 idx 0 before create_mask
rank 0 idx 0 after create_mask
rank 0 idx 0 before model
rank 1 idx 0 before src
rank 1 idx 0 after src
rank 1 idx 0 before tgt
rank 1 idx 0 after tgt
rank 1 idx 0 before create_mask
rank 1 idx 0 after create_mask
rank 1 idx 0 before model
rank 4 idx 0 before src
rank 4 idx 0 after src
...
rank 0 idx 1 after tgt
rank 0 idx 1 before create_mask
rank 0 idx 1 after create_mask
rank 0 idx 1 before model
rank 0 idx 1 after model
rank 0 idx 1 before zero_grad
rank 0 idx 1 after zero_grad
rank 0 idx 1 before loss_fn
rank 0 idx 1 after loss_fn
rank 0 idx 1 before backward
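One common cause of a freeze like this, not confirmed for the code above, is that one rank skips a collective operation (for example because an exception was caught and swallowed) while the other ranks wait for it inside the gradient all-reduce that DDP runs during `backward()`. The sketch below imitates that situation with plain threads; `threading.Barrier` is only a stand-in for the collective, and `worker`, `WORLD_SIZE`, and `failing_rank` are hypothetical names. It does not use PyTorch at all.

```python
import threading

WORLD_SIZE = 4
# Stand-in for a collective such as a gradient all-reduce: every
# "rank" must arrive before any of them may continue.
collective = threading.Barrier(WORLD_SIZE)
results = {}


def worker(rank, failing_rank):
    try:
        if rank == failing_rank:
            raise RuntimeError("simulated error before backward")
        # The timeout only exists so that this demo terminates;
        # without it the waiting ranks would block forever.
        collective.wait(timeout=0.5)
        results[rank] = "finished step"
    except threading.BrokenBarrierError:
        results[rank] = "stuck waiting for a missing rank"
    except RuntimeError:
        # Swallowing the error means this rank never joins the collective.
        results[rank] = "skipped the collective"


threads = [threading.Thread(target=worker, args=(r, 2)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

If this is what is happening, the printed error message from the failing rank should appear somewhere in the combined output shortly before the other ranks stop at "before backward".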
I gave a presentation on using Redis as middleware for distributed processing at EuroTcl/OpenACS 2024. I think this is a simple but powerful technique, combining communication between multiple client and server instances with caching. Clients and servers connect over the net to a Redis instance, using Redis lists as queues of requests and responses. For the talk I did a demonstration using servers and Redis in different countries, but normally these would all be on a local network, with much less latency.
The implementation is in Tcl, but the same approach could be implemented in any language with a Redis interface. The video is at https://learn.wu.ac.at/eurotcl2024/lecturecasts/729149172?m=delivery and the slides are at https://openacs.org/conf2024/info/download/file/DisTcl.pdf . The code for the demonstration can be found at https://cmacleod.me.uk/tcl/mand/ . Some background info can be found at https://wiki.tcl-lang.org/page/DisTcl and https://colin-macleod.blogspot.com/2020/11/fundis-functional-distributed-processing.html .
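The request/response pattern described above can be sketched in Python. To keep the example self-contained, it uses a small in-memory stand-in (`TinyListStore`, a hypothetical class) instead of a real Redis connection; its `lpush`, `rpop`, `get`, and `set` methods only mirror the behaviour of the Redis commands of the same names. The queue names and the squaring workload are assumptions for illustration and are not taken from the talk.

```python
import json
from collections import defaultdict, deque


class TinyListStore:
    """In-memory stand-in for the few Redis operations used here."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.strings = {}

    def lpush(self, key, value):
        self.lists[key].appendleft(value)

    def rpop(self, key):
        q = self.lists[key]
        return q.pop() if q else None

    def get(self, key):
        return self.strings.get(key)

    def set(self, key, value):
        self.strings[key] = value


store = TinyListStore()
computed = []  # records which inputs were actually computed (not cached)


def client_send(request_id, n):
    # The client pushes a request naming the list on which it expects the reply.
    request = {"id": request_id, "n": n, "reply_to": f"reply:{request_id}"}
    store.lpush("requests", json.dumps(request))


def server_step():
    # The server pops the oldest request, reuses a cached result if one
    # exists, and pushes the response onto the client's reply list.
    raw = store.rpop("requests")
    if raw is None:
        return False
    request = json.loads(raw)
    cache_key = f"square:{request['n']}"
    result = store.get(cache_key)
    if result is None:
        result = str(request["n"] ** 2)
        store.set(cache_key, result)
        computed.append(request["n"])
    store.lpush(request["reply_to"], result)
    return True


client_send("r1", 7)
client_send("r2", 7)
while server_step():
    pass
reply1 = store.rpop("reply:r1")
reply2 = store.rpop("reply:r2")
print(reply1, reply2, computed)
```

Pushing on the left and popping on the right makes the list behave as a first-in, first-out queue. With a real Redis instance, a blocking pop would let servers wait for work instead of polling.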
I'm currently developing a project of a distributed key value database using RAFT.
My questions are:
PS: I'm using Apache Ratis for the Raft implementation, and Java and Maven for the development part.
Is there a straightforward way of combining several Ubuntu workstations into what appears to the user to be a single workstation? My use case is running physical simulations in Python and/or C++ using multiple CPU cores, and it would be great to extend these across multiple machines (but without spending too much time setting it up or adapting code). Is there any easy solution?
So, our third year of engineering has just started, and we have a subject named Cloud and Distributed Systems.
Our professor asked a few questions about who knows about the cloud, what you all know, who has used it, etc. I was the only one who had used AWS EC2 and S3 instances, so the professor was impressed. After the lecture, she asked me if I would like to do a project on the cloud where we can use university machines to create a cloud environment for our class. The environment would have features such as storage, instances like EC2, etc.
She told me to learn about OpenStack software, how it functions, and the system requirements so we could ask the university.
Long story short, can you all suggest what I can do, how I can start, and if this project is scalable?
Also, a small P.S.: I am interested in making my career in cyber/cloud security. How can I integrate a security aspect into this project?
I'm looking to create a distributed computing setup using two Windows 10 computers, where applications can run on either computer based on resource availability. The goal is for users to interact with the system as if it's a single workstation, with the applications being processed on the computer with the most available resources. The GUI should be displayed on the main computer regardless of where the application is processed.
Requirements:
Seamless User Experience: Users should interact with the system through a single Windows 10 interface. When an application is launched, it should run on the computer with the most available resources without requiring manual intervention.
Shared Storage: Applications and data should be stored on a shared storage accessible by both computers.
Resource Monitoring: The system should monitor resources (CPU, memory, disk usage) on both computers to decide where to run applications.
Remote Execution: Applications running on the secondary computer should have their GUI displayed on the main computer, ensuring a seamless user experience.
Questions:
Feasibility: Is this setup feasible for achieving a seamless user experience? Are there potential issues such as network latency, security concerns, or compatibility problems that I should be aware of?
Best Tools and Methods: What are the best tools or methods for achieving this setup on Windows 10? Are there specific software solutions or configurations that can facilitate remote execution with GUI redirection to the main computer?
Resource Monitoring and Load Balancing: How can I effectively monitor resources and balance the load between the two computers? Are there existing tools that can handle this automatically?
Security Best Practices: What are the best practices for securing remote execution and shared storage in this setup to ensure system security and data integrity?
I appreciate any insights or suggestions on how to best achieve this setup and address any potential challenges.
I was studying distributed systems and came across this question online. Can you help me with it?
Note: the question is part of a homework assignment given to students in 2009 at the University of California San Diego; here is the link: https://cseweb.ucsd.edu/classes/fa09/cse91/resources/cse91hw2.pdf
Suppose you want your ATM to give you $100. The ATM has two separate processing steps: it must record a debit for $100, and it must give you the cash.
By the two generals problem, it cannot do both at the same time. It can do these steps in either order, and a crash can occur at any time.
– Suppose it gives you the cash first. What can go wrong?
– Suppose it does the debit first. What can go wrong? How might the problem eventually be fixed?
– Based on your analysis, which option would banks choose?
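One way to reason about the two orderings is to enumerate what state is left behind if the machine crashes between the first and the second step. The sketch below does that; `run_atm` and its parameters are hypothetical names, and the reading given afterwards is one interpretation of the question, not an official answer.

```python
def run_atm(order, crash_after_first_step):
    """Return (cash_given, debit_recorded) after a withdrawal attempt."""
    state = {"cash": False, "debit": False}
    for i, step in enumerate(order):
        state[step] = True
        if i == 0 and crash_after_first_step:
            break  # the machine dies before the second step runs
    return state["cash"], state["debit"]


outcomes = {}
for order in (("cash", "debit"), ("debit", "cash")):
    outcomes[order] = run_atm(order, crash_after_first_step=True)
    cash, debit = outcomes[order]
    print(order, "-> cash given:", cash, "| debit recorded:", debit)
```

With cash first, a crash leaves the customer holding $100 that was never debited, and the bank has no record from which to recover it. With the debit first, a crash leaves a recorded debit without cash, which the customer can dispute and the bank can correct later from its records.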
In my PhD thesis, I proposed a novel fault-tolerant, self-configurable, scalable, secure, decentralized, and high-performance distributed database replication architecture, named “Parallel Committees”.
I utilized an innovative sharding technique to enable the use of Byzantine Fault Tolerance (BFT) consensus mechanisms in very large-scale networks.
With this innovative full sharding approach supporting both processing sharding and storage sharding, as more processors and replicas join the network, the system computing power and storage capacity increase unlimitedly, while a classic BFT consensus is utilized.
My approach also allows an unlimited number of clients to join the system simultaneously without reducing system performance and transactional throughput.
I introduced several innovative techniques: for distributing nodes between shards, processing transactions across shards, improving security and scalability of the system, proactively circulating committee members, and forming new committees automatically.
I introduced an innovative and novel approach to distributing nodes between shards, using a public key generation process, called “KeyChallenge”, that simultaneously mitigates Sybil attacks and serves as a proof-of-work. The “KeyChallenge” idea is published in the peer-reviewed conference proceedings of ACM ICCTA 2024, Vienna, Austria.
In this regard, I proved that it is not straightforward for an attacker to generate a public key so that all characters of the key match the ranges set by the system. I explained how to automatically form new committees based on the rate of candidate processor nodes.
The purpose of this technique is to make optimal use of all network capacity, so that surplus processors sitting idle in a committee's queue are employed in the new committee and help increase the throughput and efficiency of the system.
It maximizes the utilization of processor nodes and of the network's computation and storage capacity, increasing both processing sharding and storage sharding as much as possible.
In the proposed architecture, members of each committee are proactively and alternately replaced with backup processors. This technique of proactively circulating committee members has three main results:
This procedure can improve the fault tolerance threshold of the consensus mechanism. I also elucidated strategies to thwart the malicious action of “Key-Withholding”, where previously generated public keys are prevented from future shard access. The approach involves periodically altering the acceptable ranges for each character of the public key. The proposed architecture effectively reduces the number of undesirable cross-shard transactions, which are more complex and costly to process than intra-shard transactions.
I compared the proposed idea with other sharding-based data replication systems and mentioned the main differences, which are detailed in Section 4.7 of my dissertation.
The proposed architecture not only opens the door to a new world for further research in this field but also represents a significant step forward in enhancing distributed databases and data replication systems.
The proposed idea has been published in the peer-reviewed conference proceedings of IEEE BCCA 2023.
Additionally, I provided an explanation for the decision not to employ a blockchain structure in the proposed architecture, an issue that is discussed in great detail in Chapter 5 of my dissertation.
The complete version of my dissertation is accessible via the following link: https://www.researchgate.net/publication/379148513_Novel_Fault-Tolerant_Self-Configurable_Scalable_Secure_Decentralized_and_High-Performance_Distributed_Database_Replication_Architecture_Using_Innovative_Sharding_to_Enable_the_Use_of_BFT_Consensus_Mec
I compared my proposed database architecture with various distributed databases and data replication systems in Section 4.7 of my dissertation. This comparison included Apache Cassandra, Amazon DynamoDB, Google Bigtable, Google Spanner, and ScyllaDB. I strongly recommend reviewing that section for better clarity and understanding.
The main problem is as follows:
Classic consensus mechanisms such as Paxos or PBFT provide strong and strict consistency in distributed databases. However, due to their low scalability, they are not commonly used. Instead, methods such as eventual consistency are employed, which, while not providing strong consistency, offer much higher performance compared to classic consensus mechanisms. The primary reason for the low scalability of classic consensus mechanisms is their high time complexity and message complexity.
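To give a feel for the message-complexity argument, here is a rough back-of-the-envelope sketch in Python. It assumes the usual normal-case PBFT pattern (one pre-prepare broadcast followed by two all-to-all phases) and compares a single group of n replicas with the same replicas split into independent committees. The function names and the committee size of 16 are assumptions for illustration; cross-shard traffic is ignored, and this is not the analysis from the dissertation.

```python
def pbft_messages(n):
    """Rough normal-case message count for one PBFT instance with n replicas:
    pre-prepare (n - 1) plus two all-to-all phases of about n * (n - 1) each."""
    return (n - 1) + 2 * n * (n - 1)


def sharded_messages(n, committee_size):
    """Total messages if n replicas are split into equal committees, each
    running its own independent instance (cross-shard traffic ignored)."""
    committees = n // committee_size
    return committees * pbft_messages(committee_size)


for n in (16, 64, 256, 1024):
    print(n, pbft_messages(n), sharded_messages(n, 16))
```

Under these assumptions the single-group cost grows quadratically with n, while the sharded total grows linearly once the committee size is fixed.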
I recommend watching the following video explaining this matter:
https://www.college-de-france.fr/fr/agenda/colloque/taking-stock-of-distributed-computing/living-without-consensus
My proposed architecture enables the use of classic consensus mechanisms such as Paxos, PBFT, etc., in very large and high-scale networks, while providing very high transactional throughput. This ensures both strict consistency and high performance in a highly scalable network. This is achievable through an innovative approach of parallelization and sharding in my proposed architecture.
If needed, I can provide more detailed explanations of the problem and the proposed solution.
I would greatly appreciate feedback and comments on the distributed database architecture proposed in my PhD dissertation. Your insights and opinions are invaluable, so please feel free to share them without hesitation.
I get this error when I try to deploy a TinyKV cluster as shown in the repo talent-plan/tinykv: A course to build distributed key-value service based on TiKV model (github.com)
mkdir -p data
./tinyscheduler-server
./tinykv-server -path=data
./tinysql-server --store=tikv --path="127.0.0.1:2379"
mysql -u root -h 127.0.0.1 -P 4000
You can find the implementations here:
sakura-ysy/TinyKV-2022-doc: TinyKV-2022, personal code and documentation; final project score 98.46. (github.com)
RinChanNOWWW/tinysql-impl: Implementation of https://github.com/tidb-incubator/tinysql
[2024/04/05 20:17:19.026 +00:00] [WARN] [session.go:539] ["run statement failed"] [schemaVersion=0] [error="[schema:1049]Unknown database 'mysql'"] [errorVerbose="[schema:1049]Unknown database 'mysql'\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174\ngithub.com/pingcap/tidb/parser/terror.(*Error.GenWithStackByArgs\n\t/go/tinysql/parser/terror/terror.go:243\ngithub.com/pingcap/tidb/executor.(*SimpleExec.executeUse\n\t/go/tinysql/executor/simple.go:66\ngithub.com/pingcap/tidb/executor.(*SimpleExec.Next\n\t/go/tinysql/executor/simple.go:49\ngithub.com/pingcap/tidb/executor.Next\n\t/go/tinysql/executor/executor.go:161\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelayExecutor\n\t/go/tinysql/executor/adapter.go:227\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelay\n\t/go/tinysql/executor/adapter.go:214\ngithub.com/pingcap/tidb/executor.(*ExecStmt.Exec\n\t/go/tinysql/executor/adapter.go:190\ngithub.com/pingcap/tidb/session.runStmt\n\t/go/tinysql/session/tidb.go:219\ngithub.com/pingcap/tidb/session.(*session.executeStatement\n\t/go/tinysql/session/session.go:536\ngithub.com/pingcap/tidb/session.(*session.execute\n\t/go/tinysql/session/session.go:615\ngithub.com/pingcap/tidb/session.(*session.Execute\n\t/go/tinysql/session/session.go:563\ngithub.com/pingcap/tidb/session.checkBootstrapped\n\t/go/tinysql/session/bootstrap.go:162\ngithub.com/pingcap/tidb/session.bootstrap\n\t/go/tinysql/session/bootstrap.go:130\ngithub.com/pingcap/tidb/session.runInBootstrapSession\n\t/go/tinysql/session/session.go:792\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/go/tinysql/session/session.go:753\nmain.createStoreAndDomain\n\t/go/tinysql/tidb-server/main.go:133\nmain.main\n\t/go/tinysql/tidb-server/main.go:105\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"] [session="{\n \"currDBName\": \"\",\n \"id\": 0,\n \"status\": 
2,\n \"strictMode\": true,\n \"user\": \"\"\n}"]
[2024/04/05 20:17:19.026 +00:00] [WARN] [session.go:606] ["compile SQL failed"] [error="[schema:1146]Table 'mysql.tidb' doesn't exist"] [errorVerbose="[schema:1146]Table 'mysql.tidb' doesn't exist\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174\ngithub.com/pingcap/tidb/parser/terror.(*Error.GenWithStackByArgs\n\t/go/tinysql/parser/terror/terror.go:243\ngithub.com/pingcap/tidb/infoschema.(*infoSchema.TableByName\n\t/go/tinysql/infoschema/infoschema.go:169\ngithub.com/pingcap/tidb/planner/core.(*preprocessor.handleTableName\n\t/go/tinysql/planner/core/preprocess.go:517\ngithub.com/pingcap/tidb/planner/core.(*preprocessor.Leave\n\t/go/tinysql/planner/core/preprocess.go:118\ngithub.com/pingcap/tidb/parser/ast.(*TableName.Accept\n\t/go/tinysql/parser/ast/dml.go:147\ngithub.com/pingcap/tidb/parser/ast.(*TableSource.Accept\n\t/go/tinysql/parser/ast/dml.go:191\ngithub.com/pingcap/tidb/parser/ast.(*Join.Accept\n\t/go/tinysql/parser/ast/dml.go:76\ngithub.com/pingcap/tidb/parser/ast.(*TableRefsClause.Accept\n\t/go/tinysql/parser/ast/dml.go:292\ngithub.com/pingcap/tidb/parser/ast.(*SelectStmt.Accept\n\t/go/tinysql/parser/ast/dml.go:449\ngithub.com/pingcap/tidb/planner/core.Preprocess\n\t/go/tinysql/planner/core/preprocess.go:42\ngithub.com/pingcap/tidb/executor.(*Compiler.Compile\n\t/go/tinysql/executor/compiler.go:34\ngithub.com/pingcap/tidb/session.(*session.execute\n\t/go/tinysql/session/session.go:603\ngithub.com/pingcap/tidb/session.(*session.Execute\n\t/go/tinysql/session/session.go:563\ngithub.com/pingcap/tidb/session.getTiDBVar\n\t/go/tinysql/session/bootstrap.go:191\ngithub.com/pingcap/tidb/session.checkBootstrapped\n\t/go/tinysql/session/bootstrap.go:168\ngithub.com/pingcap/tidb/session.bootstrap\n\t/go/tinysql/session/bootstrap.go:130\ngithub.com/pingcap/tidb/session.runInBootstrapSession\n\t/go/tinysql/session/session.go:792\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/go/tinys
ql/session/session.go:753\nmain.createStoreAndDomain\n\t/go/tinysql/tidb-server/main.go:133\nmain.main\n\t/go/tinysql/tidb-server/main.go:105\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"] [SQL="SELECT HIGH_PRIORITY VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME=\"bootstrapped\""]
[2024/04/05 20:17:19.037 +00:00] [INFO] [region_cache.go:976] ["mark store's regions need be refill"] [store=127.0.0.1:20160]
[2024/04/05 20:17:19.037 +00:00] [INFO] [region_cache.go:402] ["switch region peer to next due to send request fail"] [current="region ID: 2, meta: id:2 region_epoch:<conf_ver:1 version:1 > peers:<id:3 store_id:1 > , peer: id:3 store_id:1 , addr: 127.0.0.1:20160, idx: 0"] [needReload=true] [error="rpc error: code = Unknown desc = responses count 1 is not equal to requests count 2"] [errorVerbose="rpc error: code = Unknown desc = responses count 1 is not equal to requests count 2\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv/tikvrpc.CallRPC\n\t/go/tinysql/store/tikv/tikvrpc/tikvrpc.go:319\ngithub.com/pingcap/tidb/store/tikv.(*rpcClient.SendRequest\n\t/go/tinysql/store/tikv/client.go:225\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.sendReqToRegion\n\t/go/tinysql/store/tikv/region_request.go:142\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.SendReqCtx\n\t/go/tinysql/store/tikv/region_request.go:112\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.SendReq\n\t/go/tinysql/store/tikv/region_request.go:70\ngithub.com/pingcap/tidb/store/tikv.(*tikvStore.SendReq\n\t/go/tinysql/store/tikv/kv.go:312\ngithub.com/pingcap/tidb/store/tikv.actionPrewrite.handleSingleBatch\n\t/go/tinysql/store/tikv/2pc.go:367\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnBatches\n\t/go/tinysql/store/tikv/2pc.go:313\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnKeys\n\t/go/tinysql/store/tikv/2pc.go:301\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.prewriteKeys\n\t/go/tinysql/store/tikv/2pc.go:533\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.execute\n\t/go/tinysql/store/tikv/2pc.go:572\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn.Commit\n\t/go/tinysql/store/tikv/
txn.go:188\ngithub.com/pingcap/tidb/kv.RunInNewTxn\n\t/go/tinysql/kv/txn.go:61\ngithub.com/pingcap/tidb/ddl.(*ddl.genGlobalIDs\n\t/go/tinysql/ddl/ddl.go:370\ngithub.com/pingcap/tidb/ddl.(*ddl.CreateSchema\n\t/go/tinysql/ddl/ddl_api.go:53\ngithub.com/pingcap/tidb/executor.(*DDLExec.executeCreateDatabase\n\t/go/tinysql/executor/ddl.go:124\ngithub.com/pingcap/tidb/executor.(*DDLExec.Next\n\t/go/tinysql/executor/ddl.go:79\ngithub.com/pingcap/tidb/executor.Next\n\t/go/tinysql/executor/executor.go:161\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelayExecutor\n\t/go/tinysql/executor/adapter.go:227\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelay\n\t/go/tinysql/executor/adapter.go:214\ngithub.com/pingcap/tidb/executor.(*ExecStmt.Exec\n\t/go/tinysql/executor/adapter.go:190\ngithub.com/pingcap/tidb/session.runStmt\n\t/go/tinysql/session/tidb.go:219\ngithub.com/pingcap/tidb/session.(*session.executeStatement\n\t/go/tinysql/session/session.go:536\ngithub.com/pingcap/tidb/session.(*session.execute\n\t/go/tinysql/session/session.go:615\ngithub.com/pingcap/tidb/session.(*session.Execute\n\t/go/tinysql/session/session.go:563\ngithub.com/pingcap/tidb/session.mustExecute\n\t/go/tinysql/session/bootstrap.go:280\ngithub.com/pingcap/tidb/session.doDDLWorks\n\t/go/tinysql/session/bootstrap.go:215\ngithub.com/pingcap/tidb/session.bootstrap\n\t/go/tinysql/session/bootstrap.go:138\ngithub.com/pingcap/tidb/session.runInBootstrapSession\n\t/go/tinysql/session/session.go:792\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/go/tinysql/session/session.go:753"]
[2024/04/05 20:17:19.115 +00:00] [INFO] [region_cache.go:308] ["invalidate current region, because others failed on same store"] [region=2] [store=127.0.0.1:20160]
[2024/04/05 20:17:39.127 +00:00] [INFO] [region_cache.go:976] ["mark store's regions need be refill"] [store=127.0.0.1:20160]
[2024/04/05 20:17:39.127 +00:00] [INFO] [region_cache.go:402] ["switch region peer to next due to send request fail"] [current="region ID: 2, meta: id:2 region_epoch:<conf_ver:1 version:1 > peers:<id:3 store_id:1 > , peer: id:3 store_id:1 , addr: 127.0.0.1:20160, idx: 0"] [needReload=true] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20190809092503-95897b64e011/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv/tikvrpc.CallRPC\n\t/go/tinysql/store/tikv/tikvrpc/tikvrpc.go:319\ngithub.com/pingcap/tidb/store/tikv.(*rpcClient.SendRequest\n\t/go/tinysql/store/tikv/client.go:225\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.sendReqToRegion\n\t/go/tinysql/store/tikv/region_request.go:142\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.SendReqCtx\n\t/go/tinysql/store/tikv/region_request.go:112\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender.SendReq\n\t/go/tinysql/store/tikv/region_request.go:70\ngithub.com/pingcap/tidb/store/tikv.(*tikvStore.SendReq\n\t/go/tinysql/store/tikv/kv.go:312\ngithub.com/pingcap/tidb/store/tikv.(*LockResolver.resolveLock\n\t/go/tinysql/store/tikv/lock_resolver.go:352\ngithub.com/pingcap/tidb/store/tikv.(*LockResolver.ResolveLocks\n\t/go/tinysql/store/tikv/lock_resolver.go:194\ngithub.com/pingcap/tidb/store/tikv.actionPrewrite.handleSingleBatch\n\t/go/tinysql/store/tikv/2pc.go:404\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnBatches\n\t/go/tinysql/store/tikv/2pc.go:313\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnKeys\n\t/go/tinysql/store/tikv/2pc.go:301\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.prewriteKeys\n\t/go/tinysql/store/tikv/2p
c.go:533\ngithub.com/pingcap/tidb/store/tikv.actionPrewrite.handleSingleBatch\n\t/go/tinysql/store/tikv/2pc.go:380\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnBatches\n\t/go/tinysql/store/tikv/2pc.go:313\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.doActionOnKeys\n\t/go/tinysql/store/tikv/2pc.go:301\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.prewriteKeys\n\t/go/tinysql/store/tikv/2pc.go:533\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter.execute\n\t/go/tinysql/store/tikv/2pc.go:572\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn.Commit\n\t/go/tinysql/store/tikv/txn.go:188\ngithub.com/pingcap/tidb/kv.RunInNewTxn\n\t/go/tinysql/kv/txn.go:61\ngithub.com/pingcap/tidb/ddl.(*ddl.genGlobalIDs\n\t/go/tinysql/ddl/ddl.go:370\ngithub.com/pingcap/tidb/ddl.(*ddl.CreateSchema\n\t/go/tinysql/ddl/ddl_api.go:53\ngithub.com/pingcap/tidb/executor.(*DDLExec.executeCreateDatabase\n\t/go/tinysql/executor/ddl.go:124\ngithub.com/pingcap/tidb/executor.(*DDLExec.Next\n\t/go/tinysql/executor/ddl.go:79\ngithub.com/pingcap/tidb/executor.Next\n\t/go/tinysql/executor/executor.go:161\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelayExecutor\n\t/go/tinysql/executor/adapter.go:227\ngithub.com/pingcap/tidb/executor.(*ExecStmt.handleNoDelay\n\t/go/tinysql/executor/adapter.go:214\ngithub.com/pingcap/tidb/executor.(*ExecStmt.Exec\n\t/go/tinysql/executor/adapter.go:190\ngithub.com/pingcap/tidb/session.runStmt\n\t/go/tinysql/session/tidb.go:219\ngithub.com/pingcap/tidb/session.(*session.executeStatement\n\t/go/tinysql/session/session.go:536\ngithub.com/pingcap/tidb/session.(*session.execute\n\t/go/tinysql/session/session.go:615"]
[2024/04/05 20:17:39.241 +00:00] [INFO] [region_cache.go:308] ["invalidate current region, because others failed on same store"] [region=2] [store=127.0.0.1:20160]
[2024/04/05 20:17:41.532 +00:00] [INFO] [domain.go:126] ["full load InfoSchema success"] [usedSchemaVersion=0] [neededSchemaVersion=0] ["start time"=2.059864ms]
Hey! Can somebody share a sample implementation of model parallelism using PyTorch and sockets? I have a project presentation coming up and it's a tight schedule :/
Thanks in advance.
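For anyone after the same thing, here is a minimal sketch of the idea rather than a full PyTorch implementation: the model's layers are split between two workers, and the activations of the first layer are shipped over a socket to the worker that owns the second layer. Everything here is an assumption made for illustration: plain Python lists stand in for tensors, the weights `W1`/`W2` are made up, and a thread plus `socket.socketpair()` stands in for a second machine.

```python
import pickle
import socket
import struct
import threading


def send_msg(sock, obj):
    """Send one pickled object with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one length-prefixed pickled object.

    Note: unpickling is only safe when the peer is trusted.
    """
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))


def linear(x, weights, bias):
    """y = W x + b on plain lists (stand-in for a tensor op)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


# Hypothetical two-layer model, split so each worker owns one layer.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]
W2, B2 = [[2.0, 1.0]], [-1.0]


def stage_two(sock):
    """Second worker: receive activations, apply layer 2, send the result."""
    hidden = recv_msg(sock)
    send_msg(sock, linear(hidden, W2, B2))


def run_pipeline(x):
    """First worker: apply layer 1, ship activations, wait for the output."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=stage_two, args=(b,))
    worker.start()
    try:
        send_msg(a, relu(linear(x, W1, B1)))
        return recv_msg(a)
    finally:
        a.close()      # unblocks the worker if something failed above
        worker.join()
        b.close()


if __name__ == "__main__":
    print(run_pipeline([3.0, 1.0]))
```

In a real setup each stage would presumably run as its own process on its own host, connected with an ordinary TCP socket, and the lists would be tensors with each stage holding its own slice of the model. The length-prefixed framing is the part that carries over unchanged.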
Hey Fellas!
My course project involves building a framework that uses each of our phones to distribute the running of an LLM. The motive is to eliminate the dependency on a central server (the way most APIs work). How can I achieve this? Using sockets, Open MPI, etc.?
Can you help me with the project architecture too, please? (P2P or master-slave? Algorithms like Chord?)
I'm new to this and would be grateful for any suggestions.
It's a great way to contribute to projects focused on cancer research, long COVID, & Cyclone weather modeling. The app is available for iOS & Android. You can also run it on Windows 11 & Linux distros with emulators. There's also an unofficial subreddit, which has been running for a couple of years now.
At my company we are changing our main workflow from one machine to a DisCo setup with 5 machines: one UI and 4 backend nodes, all running Windows. The workflow can be simplified as follows: the UI needs to launch a Java program on the nodes, and when the calculation is complete the UI needs to know. Before, with only one backend machine, we ran the workflow using PsExec. Now that we are enhancing the system, we want to move away from it and implement the best approach. The problem is that most of the seniors are Windows-oriented and want to use PowerShell RM instead of a REST API. Am I too biased for preferring the latter? I don't want to be tied to Windows, but from a security and reliability point of view, is PS actually the better of the two?
Thanks in advance for your insights!
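To make the REST option concrete, here is a rough sketch of what a node-side service could look like, using only the Python standard library. It is one possible shape, not a recommendation for production: the `/jobs` routes, the port, and the `COMMAND` placeholder (a real node would presumably run something like `java -jar ...`) are all assumptions, and there is no authentication or TLS, which any real deployment would need.

```python
import json
import subprocess
import sys
import threading
import uuid
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

JOBS = {}  # job id -> subprocess.Popen
JOBS_LOCK = threading.Lock()

# Hypothetical placeholder; a real node might run ["java", "-jar", "calc.jar"].
COMMAND = [sys.executable, "-c", "print('calculation done')"]


def start_job(command=COMMAND):
    """Launch the calculation as a child process and return a job id."""
    job_id = uuid.uuid4().hex
    proc = subprocess.Popen(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    with JOBS_LOCK:
        JOBS[job_id] = proc
    return job_id


def job_status(job_id):
    """Return the job's state as a dict, or None for an unknown id."""
    with JOBS_LOCK:
        proc = JOBS.get(job_id)
    if proc is None:
        return None
    code = proc.poll()  # None while the process is still running
    if code is None:
        return {"state": "running"}
    return {"state": "finished", "exit_code": code}


class NodeHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_POST(self):
        # POST /jobs starts a calculation and returns its id immediately.
        if self.path == "/jobs":
            self._reply(202, {"id": start_job()})
        else:
            self._reply(404, {"error": "not found"})

    def do_GET(self):
        # GET /jobs/<id> lets the UI poll until the job has finished.
        if self.path.startswith("/jobs/"):
            status = job_status(self.path[len("/jobs/"):])
            if status is not None:
                self._reply(200, status)
                return
        self._reply(404, {"error": "not found"})

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def serve(port=8080):
    ThreadingHTTPServer(("127.0.0.1", port), NodeHandler).serve_forever()


if __name__ == "__main__":
    serve()
```

The UI would POST to `/jobs` on each node and then poll `GET /jobs/<id>` until the state is `finished`. A callback from the node to the UI would avoid polling, at the cost of the UI also having to expose an endpoint.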
Hi folks, I have been tasked with creating a fintech system. This means the system needs to be
Any references would help, books, courses, blogs etc.
I wanted to share the talks from last month’s Dask Demo Day, where folks from the Dask community give short demos to show off ongoing work. Hopefully this helps elevate some of the great work people are doing.
Last month’s talks:
Recording on YouTube: https://www.youtube.com/watch?v=07e1JL83ur8
Join the next one this Thursday, March 21st, 11am ET https://github.com/dask/community/issues/307